Models and Methods for Speech Separation in Digital Systems

dc.citation.epage127
dc.citation.issue2
dc.citation.journalTitleДосягнення у кіберфізичних системах
dc.citation.spage121
dc.citation.volume9
dc.contributor.affiliationIvan Franko National University of Lviv
dc.contributor.authorTsemko, Andrii
dc.contributor.authorKarbovnyk, Ivan
dc.coverage.placenameЛьвів
dc.coverage.placenameLviv
dc.date.accessioned2025-11-06T08:48:13Z
dc.date.created2024-02-27
dc.date.issued2024-02-27
dc.description.abstractThe main purpose of the article is to describe state-of-the-art approaches to speech separation and to demonstrate the structures and challenges of building and training such systems. Designing an efficient, optimized neural network model for speech separation requires an encoder-decoder structure with a mask-estimation flow. The fully convolutional SuDoRM-Rf model demonstrates high efficiency with a relatively small number of parameters and can be accelerated by hardware that supports convolutional operations. The highest separation performance is shown by the SepTDA model, reaching 24 dB SI-SNR with 21.2 million trainable parameters, while SuDoRM-Rf, with only 2.66 million parameters, has demonstrated 12.02 dB. Other transformer-based neural network approaches demonstrate almost the same performance as SepTDA but require more trainable parameters.
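For reference, the SI-SNR values quoted in the abstract are scale-invariant signal-to-noise ratios, closely related to the SI-SDR discussed in [11]. For an estimated source $\hat{s}$ and a reference source $s$, the commonly used definition is

\[
s_{\text{target}} = \frac{\langle \hat{s}, s \rangle}{\lVert s \rVert^{2}}\, s, \qquad
e_{\text{noise}} = \hat{s} - s_{\text{target}}, \qquad
\text{SI-SNR} = 10 \log_{10} \frac{\lVert s_{\text{target}} \rVert^{2}}{\lVert e_{\text{noise}} \rVert^{2}} \ \text{dB}.
\]

The encoder-decoder structure with a mask-estimation flow can be illustrated with a minimal sketch. The sketch below is not the authors' model: the layer sizes and the single-layer masker are placeholder assumptions, and it only shows the general flow used by time-domain separators such as Conv-TasNet [15] and SuDoRM-RF [16], namely a learned 1-D convolutional encoder, per-source mask estimation in the latent space, and a transposed-convolution decoder.

import torch
import torch.nn as nn

class MaskingSeparator(nn.Module):
    """Illustrative encoder / mask-estimation / decoder flow; sizes are hypothetical."""
    def __init__(self, n_src=2, n_filters=512, kernel_size=16, stride=8):
        super().__init__()
        # learned 1-D convolutional encoder: waveform -> latent representation
        self.encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride, bias=False)
        # placeholder masker; real separators stack convolutional or transformer blocks here
        self.masker = nn.Sequential(nn.Conv1d(n_filters, n_src * n_filters, 1), nn.Sigmoid())
        # transposed-convolution decoder: masked latent representation -> waveform per source
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size, stride=stride, bias=False)
        self.n_src, self.n_filters = n_src, n_filters

    def forward(self, mixture):                        # mixture: (batch, 1, time)
        latent = self.encoder(mixture)                 # (batch, n_filters, frames)
        masks = self.masker(latent)                    # one bounded mask per source
        masks = masks.view(-1, self.n_src, self.n_filters, masks.shape[-1])
        masked = masks * latent.unsqueeze(1)           # apply the masks in the latent space
        return torch.stack([self.decoder(masked[:, i]) for i in range(self.n_src)], dim=1)

For example, MaskingSeparator()(torch.randn(1, 1, 16000)) returns a (1, 2, 1, 16000) tensor holding two estimated source waveforms for a one-second 16 kHz mixture.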
dc.format.extent121-127
dc.format.pages7
dc.identifier.citationTsemko A. Models and Methods for Speech Separation in Digital Systems / Andrii Tsemko, Ivan Karbovnyk // Advances in Cyber-Physical Systems. — Lviv : Lviv Politechnic Publishing House, 2024. — Vol 9. — No 2. — P. 121–127.
dc.identifier.doihttps://doi.org/10.23939/acps2024.02.121
dc.identifier.urihttps://ena.lpnu.ua/handle/ntb/117385
dc.language.isoen
dc.publisherВидавництво Львівської політехніки
dc.publisherLviv Politechnic Publishing House
dc.relation.ispartofДосягнення у кіберфізичних системах, 2 (9), 2024
dc.relation.ispartofAdvances in Cyber-Physical Systems, 2 (9), 2024
dc.relation.references[1] M. Lichouri, K. Lounnas, R. Djeradi & A. Djeradi (2022). Performance of End-to-End vs Pipeline Spoken Language Understanding Models on Multilingual Synthetic Voice. In 2022 International Conference on Advanced Aspects of Software Engineering (pp. 1–6). ICAASE. DOI: https://doi.org/10.1109/icaase56196.2022.9931594
dc.relation.references[2] Z. -Q. Wang, J. L. Roux & J. R. Hershey (2018). Alternative Objective Functions for Deep Clustering. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 686–690). ICASSP. DOI: https://doi.org/10.1109/icassp.2018.8462507
dc.relation.references[3] E. Tzinis, S. Venkataramani, Z. Wang, C. Subakan & P. Smaragdis (2020). Two-Step Sound Source Separation: Training On Learned Latent Targets. In ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 31–35). ICASSP. DOI: https://doi.org/10.1109/icassp40776.2020.9054172
dc.relation.references[4] Y. Luo, Z. Chen & T. Yoshioka (2020). Dual-Path RNN: Efficient Long Sequence Modeling for Time-Domain Single-Channel Speech Separation. In ICASSP 2020 – 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 46–50). ICASSP. DOI: https://doi.org/10.1109/icassp40776.2020.9054266
dc.relation.references[5] J. Q. Yip et al. (2024). SPGM: Prioritizing Local Features for Enhanced Speech Separation Performance. In ICASSP 2024 – 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 326–330). ICASSP. DOI: https://doi.org/10.1109/icassp48485.2024.10447030
dc.relation.references[6] H. Pu, C. Cai, M. Hu, T. Deng, R. Zheng & J. Luo (2021). Towards Robust Multiple Blind Source Localization Using Source Separation and Beamforming. In Sensors. DOI: https://doi.org/10.3390/s21020532
dc.relation.references[7] D. Wang & J. Chen (2018, October). Supervised Speech Separation Based on Deep Learning: An Overview. In IEEE/ACM Transactions on Audio, Speech, and Language Processing (vol. 26, no. 10, pp. 1702–1726). IEEE. DOI: https://doi.org/10.1109/taslp.2018.2842159
dc.relation.references[8] Y. Luo & N. Mesgarani (2018). TasNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 696–700). ICASSP. DOI: https://doi.org/10.1109/icassp.2018.8462116
dc.relation.references[9] Tzinis, E., Wang, Z., Jiang, X. et al. (2021). Compute and Memory Efficient Universal Sound Source Separation. In Journal of Signal Processing Systems 94 (pp. 245–259). DOI: https://doi.org/10.1007/s11265-021-01683-x
dc.relation.references[10] J. R. Hershey, Z. Chen, J. Le Roux & S. Watanabe (2016). Deep clustering: Discriminative embeddings for segmentation and separation. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 31–35). ICASSP. DOI: https://doi.org/10.1109/icassp.2016.7471631
dc.relation.references[11] J. L. Roux, S. Wisdom, H. Erdogan & J. R. Hershey (2019). SDR – Half-baked or Well Done? In ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 626–630). ICASSP. DOI: https://doi.org/10.1109/icassp.2019.8683855
dc.relation.references[12] M. Kolbæk, D. Yu, Z.-H. Tan & J. Jensen (2017). Multitalker Speech Separation with Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks. In IEEE/ACM Transactions on Audio, Speech, and Language Processing (vol. 25, no. 10, pp. 1901–1913, Oct. 2017). DOI: https://doi.org/10.1109/taslp.2017.2726762
dc.relation.references[13] J. Cosentino et al. (2020). LibriMix: An Open-Source Dataset for Generalizable Speech Separation. In arXiv: Audio and Speech Processing. DOI: https://doi.org/10.48550/arxiv.2005.11262
dc.relation.references[14] Y. Dauphin et al. (2016). Language Modeling with Gated Convolutional Networks. In International Conference on Machine Learning. DOI: https://doi.org/10.48550/arxiv.1612.08083
dc.relation.references[15] Y. Luo & N. Mesgarani (2019, August). Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation. In IEEE/ACM Transactions on Audio, Speech, and Language Processing (vol. 27, no. 8, pp. 1256–1266). DOI: https://doi.org/10.1109/taslp.2019.2915167
dc.relation.references[16] E. Tzinis, Z. Wang & P. Smaragdis (2020). Sudo RM-RF: Efficient Networks for Universal Audio Source Separation. In 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (pp. 1–6). MLSP. DOI: https://doi.org/10.1109/mlsp49062.2020.9231900
dc.relation.references[17] Y. Liu & D. Wang (2019, December). Divide and Conquer: A Deep CASA Approach to Talker-Independent Monaural Speaker Separation. In IEEE/ACM Transactions on Audio, Speech, and Language Processing (vol. 27, no. 12, pp. 2092–2102). DOI: https://doi.org/10.1109/taslp.2019.2941148
dc.relation.references[18] N. Zeghidour & D. Grangier (2021). Wavesplit: End-to-End Speech Separation by Speaker Clustering. In IEEE/ACM Transactions on Audio, Speech, and Language Processing (vol. 29, pp. 2840–2849). DOI: https://doi.org/10.1109/taslp.2021.3099291
dc.relation.references[19] C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi & J. Zhong (2021). Attention Is All You Need In Speech Separation. In ICASSP 2021 – 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 21–25). ICASSP. DOI: https://doi.org/10.1109/icassp39728.2021.9413901
dc.relation.references[20] S. Zhao & B. Ma (2023). MossFormer: Pushing the Performance Limit of Monaural Speech Separation Using Gated Single-Head Transformer with Convolution-Augmented Joint Self-Attentions. In ICASSP 2023 – 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 1–5). ICASSP. DOI: https://doi.org/10.1109/icassp49357.2023.10096646
dc.relation.references[21] S. Lutati et al. (2023). Separate and Diffuse: Using a Pretrained Diffusion Model for Improving Source Separation. In ArXiv abs/2301.10752. DOI: https://doi.org/10.48550/arxiv.2301.10752
dc.relation.references[22] S. Zhao et al. (2024). MossFormer2: Combining Transformer and RNN-Free Recurrent Network for Enhanced Time-Domain Monaural Speech Separation. In ICASSP 2024 – 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 10356–10360). ICASSP. DOI: https://doi.org/10.1109/ICASSP48485.2024.10445985
dc.relation.references[23] Y. Lee, S. Choi, B. -Y. Kim, Z. -Q. Wang & S. Watanabe (2024). Boosting Unknown-Number Speaker Separation with Transformer Decoder-Based Attractor. In ICASSP 2024 – 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 446–450). ICASSP. DOI: https://doi.org/10.1109/icassp48485.2024.10446032
dc.relation.urihttps://doi.org/10.1109/icaase56196.2022.9931594
dc.relation.urihttps://doi.org/10.1109/icassp.2018.8462507
dc.relation.urihttps://doi.org/10.1109/icassp40776.2020.9054172
dc.relation.urihttps://doi.org/10.1109/icassp40776.2020.9054266
dc.relation.urihttps://doi.org/10.1109/icassp48485.2024.10447030
dc.relation.urihttps://doi.org/10.3390/s21020532
dc.relation.urihttps://doi.org/10.1109/taslp.2018.2842159
dc.relation.urihttps://doi.org/10.1007/s11265-021-01683-x
dc.relation.urihttps://doi.org/10.1109/icassp.2016.7471631
dc.relation.urihttps://doi.org/10.1109/icassp.2019.8683855
dc.relation.urihttps://doi.org/10.1109/taslp.2017.2726762
dc.relation.urihttps://doi.org/10.48550/arxiv.2005.11262
dc.relation.urihttps://doi.org/10.48550/arxiv.1612.08083
dc.relation.urihttps://doi.org/10.1109/taslp.2019.2915167
dc.relation.urihttps://doi.org/10.1109/mlsp49062.2020.9231900
dc.relation.urihttps://doi.org/10.1109/taslp.2019.2941148
dc.relation.urihttps://doi.org/10.1109/taslp.2021.3099291
dc.relation.urihttps://doi.org/10.1109/icassp39728.2021.9413901
dc.relation.urihttps://doi.org/10.1109/icassp49357.2023.10096646
dc.relation.urihttps://doi.org/10.48550/arxiv.2301.10752
dc.relation.urihttps://doi.org/10.1109/ICASSP48485.2024.10445985
dc.relation.urihttps://doi.org/10.1109/icassp48485.2024.10446032
dc.rights.holder© Національний університет “Львівська політехніка”, 2024
dc.rights.holder© Tsemko A., Karbovnyk I., 2024
dc.subjectSpeech Separation
dc.subjectSpeech Enhancement
dc.subjectAudio Processing
dc.subjectNeural Networks
dc.titleModels and Methods for Speech Separation in Digital Systems
dc.typeArticle

Files

Original bundle (2 files)

Name: 2024v9n2_Tsemko_A-Models_and_Methods_for_Speech_121-127.pdf
Size: 1.07 MB
Format: Adobe Portable Document Format

Name: 2024v9n2_Tsemko_A-Models_and_Methods_for_Speech_121-127__COVER.png
Size: 543.03 KB
Format: Portable Network Graphics

License bundle (1 file)

Name: license.txt
Size: 1.76 KB
Format: Plain Text