Improving Self-Supervised Learning-based MOS Prediction Networks

Paper title

Improving Self-Supervised Learning-based MOS Prediction Networks

Authors

Bálint Gyires-Tóth, Csaba Zainkó

Abstract

MOS (Mean Opinion Score) is a subjective method used to evaluate a system's quality. Telecommunications (for voice and video) and speech synthesis systems (for generated speech) are two of its many applications. While MOS tests are widely accepted, they are time-consuming and costly because human input is required. In addition, since the systems and subjects of the tests differ, the results are not really comparable. On the other hand, the large number of previous tests allows us to train machine learning models that are capable of predicting MOS values. By automatically predicting MOS values, both of the aforementioned issues can be resolved. The present work introduces data-, training-, and post-training-specific improvements to a previous self-supervised learning-based MOS prediction model. We used a wav2vec 2.0 model pre-trained on LibriSpeech, extended with LSTM and non-linear dense layers. We introduced transfer learning, target data preprocessing, two- and three-phase training methods with different batch formulations, dropout accumulation (for larger batch sizes), and quantization of the predictions. The methods are evaluated using the shared synthetic speech dataset of the first VoiceMOS Challenge.
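
The abstract does not say how the predictions are quantized. As a minimal sketch of one possible reading, the Python snippet below snaps each predicted score to the nearest point on a fixed grid inside the valid MOS range. The function name `quantize_predictions`, the grid step of 0.125, and the 1 to 5 range are illustrative assumptions, not details taken from the paper.

```python
def quantize_predictions(preds, step=0.125, lo=1.0, hi=5.0):
    """Snap each predicted MOS to the nearest grid point in [lo, hi].

    `step`, `lo`, and `hi` are hypothetical defaults chosen for
    illustration; the paper's actual quantization scheme may differ.
    """
    quantized = []
    for p in preds:
        # Clip to the valid range first so out-of-range outputs stay legal.
        clipped = min(max(p, lo), hi)
        # Count whole grid steps from the lower bound, then map back.
        k = round((clipped - lo) / step)
        quantized.append(lo + k * step)
    return quantized


raw = [3.07, 0.4, 5.3, 2.5]
print(quantize_predictions(raw))
```

A grid like this would match the set of averages that a fixed number of integer ratings can produce, which is one reason such a post-processing step might help. Whether that is the authors' motivation is not stated in the abstract.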
