Paper title
Improving Audio Captioning Using Semantic Similarity Metrics
Paper authors
Paper abstract
Audio captioning quality metrics, which are typically borrowed from the machine translation and image captioning areas, measure the degree of overlap between predicted tokens and gold reference tokens. In this work, we consider a metric that measures the semantic similarity between predicted and reference captions rather than exact word overlap. We first evaluate its ability to capture similarities among captions corresponding to the same audio file and compare it to other established metrics. We then propose a fine-tuning method that directly optimizes the metric by backpropagating through a sentence embedding extractor and the audio captioning network. Such fine-tuning improves predicted captions as measured by both traditional metrics and the proposed semantic similarity captioning metric.
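As a rough illustration of the idea in the abstract, the sketch below scores a predicted caption against its reference captions by comparing sentence embeddings. The abstract does not specify the embedding extractor or the similarity function, so several things here are assumptions: cosine similarity as the measure, averaging over references, a loss of one minus the score, and the function names `cosine_similarity`, `semantic_similarity_score`, and `similarity_loss`, which are hypothetical. The embeddings are toy vectors standing in for the output of a sentence embedding extractor.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def semantic_similarity_score(
    pred_emb: Sequence[float], ref_embs: Sequence[Sequence[float]]
) -> float:
    """Mean similarity between one predicted caption and all its references.

    Averaging over references is an assumption made for this sketch.
    """
    if not ref_embs:
        raise ValueError("at least one reference embedding is required")
    total = sum(cosine_similarity(pred_emb, ref) for ref in ref_embs)
    return total / len(ref_embs)


def similarity_loss(
    pred_emb: Sequence[float], ref_embs: Sequence[Sequence[float]]
) -> float:
    """A quantity to minimize during fine-tuning: one minus the score (assumed form)."""
    return 1.0 - semantic_similarity_score(pred_emb, ref_embs)


if __name__ == "__main__":
    # Toy embeddings; a real system would obtain these from a sentence encoder.
    predicted = [0.9, 0.1, 0.0]
    references = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.1]]
    print("score:", semantic_similarity_score(predicted, references))
    print("loss:", similarity_loss(predicted, references))
```

In an actual fine-tuning setup, the same computation would be written in an automatic differentiation framework so that the gradient of the loss can flow back through the sentence embedding extractor into the captioning network, as the abstract describes. This plain Python version only shows how the score itself is formed.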