International Journal of Computational Linguistics & Chinese Language Processing
Vol. 26, No. 2, December 2021



Title:
Using Transfer Learning to Improve Deep Neural Networks for Lyrics Emotion Recognition in Chinese

Author:
Jia-Yi Liao, Ya-Hsuan Lin, Kuan-Cheng Lin, and Jia-Wei Chang

Abstract:
Emotion is an important attribute in music information retrieval. Deep learning methods have been widely used for automatic music emotion recognition, but most studies focus on audio data, and the role of lyrics in music emotion classification remains under-appreciated. Owing to the richness of English language resources, most previous studies were based on English lyrics, and few addressed Chinese. This study proposes an approach to the Chinese lyrics emotion classification task that does not rely on task-specific training from scratch: using transfer learning to improve a deep neural network, the pre-trained BERT model, for emotion classification of Chinese lyrics. The experimental results show that directly using BERT to build an emotion classification model on CVAT reaches only 50% classification accuracy, whereas using BERT with transfer learning from CVAW and CVAP to CVAT achieves 71% classification accuracy.

Keywords: Natural Language Processing, Music Emotion Recognition, Transfer Learning, Chinese Lyrics
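
A minimal sketch of the sequential transfer-learning pipeline the abstract describes, using the Hugging Face transformers library: bert-base-chinese is fine-tuned on CVAW, then CVAP, then CVAT, with each stage starting from the previous stage's weights. The load_split() loader, the number of emotion classes, and the training hyperparameters are illustrative assumptions, not the authors' settings.

```python
# Sketch of sequential transfer learning: CVAW -> CVAP -> CVAT (assumptions noted above).
from transformers import (BertTokenizerFast, BertForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese",
                                                      num_labels=4)  # assumed class count

def fine_tune(model, train_dataset, output_dir):
    """Fine-tune the current model on one corpus and return it for the next stage."""
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=3,
                             per_device_train_batch_size=16, learning_rate=2e-5)
    Trainer(model=model, args=args, train_dataset=train_dataset).train()
    return model

# Each stage reuses the weights produced by the previous corpus.
for corpus in ("CVAW", "CVAP", "CVAT"):
    model = fine_tune(model, load_split(corpus, tokenizer),  # load_split: hypothetical data loader
                      f"./ckpt-{corpus}")
```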


Title:
A Pretrained YouTuber Embeddings for Improving Sentiment Classification of YouTube Comments

Author:
Ching-Wen Hsu, Hsuan Liu, and Jheng-Long Wu

Abstract:
Technology is changing the way we consume information and entertainment. The YouTube streaming video service provides a discussion function that lets video publishers learn what matters most to the audiences they want to love their brand. Through comments, video publishers can better understand the audience's thoughts and even improve their video quality. We propose a classifier based on machine learning and BERT to automatically detect YouTuber preferences, video preferences, and excitement levels. To improve model performance, we use pretrained YouTuber embeddings, which are trained in advance on roughly 175,000 video comments that contain YouTubers' names. YouTuber embeddings can capture some of the semantics and the character of the relations between YouTubers. Experimental results show that machine learning-based models with YouTuber embeddings improve overall accuracy and F1-score on all sentiment classification tasks. This result validates that pretraining YouTuber embeddings is significantly helpful when detecting audience sentiment towards YouTubers. In contrast, the BERT model does not handle the polarity classification tasks well when using YouTuber embeddings. However, the BERT model is more suitable for multi-dimensional classification tasks, such as the five-label classification task used in this study. In conclusion, comprehensive sentiment detection on the YouTube video streaming service platform can be improved by the proposed multi-dimensional sentiment indicators and our modifications to the classifier structure.

Keywords:
YouTuber Embeddings, Sentiment Classification, Deep Learning, Pretrained Model
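
A minimal sketch, under assumptions, of the general idea of pretraining YouTuber embeddings from comment text and using them as features for a sentiment classifier. It uses gensim's Word2Vec as a stand-in for the paper's embedding training; the toy comments, the averaging-based comment features, and the logistic-regression classifier are illustrative, not the authors' pipeline.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Tokenized comments; YouTuber names are kept as single tokens ("YouTuberA", "YouTuberB")
# so that each name receives its own embedding vector.
comments = [["love", "the", "new", "video", "YouTuberA"],
            ["YouTuberB", "collab", "was", "boring"]]
labels = [1, 0]  # toy polarity labels for illustration only

w2v = Word2Vec(sentences=comments, vector_size=100, window=5, min_count=1, epochs=20)

def comment_vector(tokens):
    """Average the embeddings of in-vocabulary tokens as a simple comment-level feature."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.stack([comment_vector(c) for c in comments])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))
```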


Title:
Employing Low-Pass Filtered Temporal Speech Features for the Training of Ideal Ratio Mask in Speech Enhancement

Author:
Yan-Tong Chen and Jeih-weih Hung

Abstract:
The masking-based speech enhancement method pursues a multiplicative mask that is applied to the spectrogram of the input noise-corrupted utterance, and a deep neural network (DNN) is often used to learn the mask. In particular, the features commonly used for automatic speech recognition can serve as the input of the DNN to learn a well-behaved mask that significantly reduces the noise distortion of the processed utterances. This study proposes to preprocess the input speech features for the ideal ratio mask (IRM)-based DNN by lowpass filtering in order to alleviate the noise components. In particular, we employ the discrete wavelet transform (DWT) to decompose the temporal speech feature sequence and scale down the detail coefficients, which correspond to the high-pass portion of the sequence. Preliminary experiments conducted on a subset of the TIMIT corpus reveal that, compared with the original IRM, the proposed method makes the resulting IRM achieve higher speech quality and intelligibility for babble noise-corrupted signals, indicating that the lowpass-filtered temporal feature sequences can be used to learn a superior IRM network for speech enhancement.

Keywords:
Speech Enhancement, Temporal Feature Sequence, Lowpass Filtering, Ideal Ratio Mask, Wavelet Transform
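
A minimal sketch, assuming the PyWavelets library, of the preprocessing step the abstract describes: each temporal feature sequence is decomposed with the DWT and its detail (high-pass) coefficients are scaled down before reconstruction. The wavelet, decomposition level, and scaling factor here are illustrative choices, not necessarily those used in the paper.

```python
import numpy as np
import pywt

def lowpass_filter_features(feats, wavelet="db4", level=2, detail_scale=0.5):
    """feats: (num_frames, num_dims) speech feature matrix; filter along the time axis."""
    out = np.empty_like(feats)
    for d in range(feats.shape[1]):
        coeffs = pywt.wavedec(feats[:, d], wavelet, level=level)       # [cA_L, cD_L, ..., cD_1]
        coeffs = [coeffs[0]] + [detail_scale * c for c in coeffs[1:]]  # shrink high-pass parts
        rec = pywt.waverec(coeffs, wavelet)
        out[:, d] = rec[: feats.shape[0]]                              # waverec may pad one sample
    return out

# Example: 300 frames of 40-dimensional speech features.
filtered = lowpass_filter_features(np.random.randn(300, 40))
```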


Title:
Incorporating Speaker Embedding and Post-Filter Network for Improving Speaker Similarity of Personalized Speech Synthesis System

Author:
Sheng-Yao Wang and Yi-Chin Huang

Abstract:
In recent years, speech synthesis systems have become able to generate speech with high quality. However, multi-speaker text-to-speech (TTS) systems still require a large amount of speech data for each target speaker. In this study, we construct a multi-speaker TTS system that alleviates this problem by incorporating two sub-modules into an artificial neural network-based speech synthesis system. The first module adds a speaker embedding to the encoding module of the end-to-end TTS framework while using only a small amount of speech data from the training speakers. Two speaker embedding methods, namely speaker verification embedding and voice conversion embedding, are compared to decide which is more suitable for the personalized TTS system. In addition, we substitute the post-net module, which is conventionally adopted to enhance the output spectrum sequence, with a post-filter network, which further improves the quality of the generated speech. Experimental results show that adding the speaker embedding to the encoding module is useful, and the resulting speech utterances are indeed perceived as the target speaker. Moreover, the post-filter network not only improves speech quality but also enhances the speaker similarity of the generated speech utterances. The constructed TTS system can generate a speech utterance of the target speaker in less than 2 seconds. In the future, other features such as prosody information will be incorporated to further improve the performance of the TTS framework.

Keywords:
Multi-speaker Text-to-Speech, Voice Conversion, Speaker Verification, Zero-Shot, Post-Filter
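
A minimal PyTorch sketch of one way to condition an end-to-end TTS encoder on a speaker embedding, broadcasting the embedding over time and fusing it with the encoder outputs through a linear projection. This illustrates the general mechanism only; the dimensions and the fusion layer are assumptions, not the authors' architecture, and the post-filter network is not shown.

```python
import torch
import torch.nn as nn

class SpeakerConditionedEncoder(nn.Module):
    """Fuses text encoder outputs with a per-utterance speaker embedding."""
    def __init__(self, text_dim=512, spk_dim=256, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(text_dim + spk_dim, out_dim)  # joint text/speaker projection

    def forward(self, encoder_out, spk_emb):
        # encoder_out: (batch, time, text_dim); spk_emb: (batch, spk_dim)
        spk = spk_emb.unsqueeze(1).expand(-1, encoder_out.size(1), -1)  # repeat over time
        return self.proj(torch.cat([encoder_out, spk], dim=-1))

enc = SpeakerConditionedEncoder()
fused = enc(torch.randn(2, 80, 512), torch.randn(2, 256))  # -> (2, 80, 512)
```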


Title:
Answering Chinese Elementary School Social Studies Multiple Choice Questions

Author:
Chao-Chun Liang, Daniel Lee, Meng-Tse Wu, Hsin-Min Wang, and Keh-Yih Su

Abstract:
We present several novel approaches to answer Chinese elementary school social studies multiple choice questions. Although BERT shows excellent performance on various reading comprehension tasks, it handles some kinds of questions poorly, in particular negation, all-of-the-above, and none-of-the-above questions. We thus propose a novel framework to cascade BERT with preprocessor and answer-picker/selector modules to address these cases. Experimental results show the proposed approaches effectively improve the performance of BERT, and thus demonstrate the feasibility of supplementing BERT with additional modules.

Keywords:
Natural Language Inference, Machine Reading Comprehension, Multiple Choice Question, Question and Answering
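
A minimal sketch of the kind of cascade the abstract outlines: a preprocessor flags negation and catch-all options, a BERT-based scorer rates each (question, option) pair, and an answer picker interprets the scores accordingly. Here bert_score is a placeholder callable standing in for a fine-tuned relevance model, and the keyword lists and thresholds are illustrative assumptions, not the paper's rules.

```python
# Hypothetical cascade: preprocessor -> BERT scorer -> answer picker.
NEGATION_WORDS = ("不", "非", "錯誤")   # e.g., questions asking which option is incorrect
ALL_OF_THE_ABOVE = "以上皆是"           # "all of the above"

def answer(question, options, bert_score):
    """Return the index of the predicted option.

    bert_score(question, option) -> support score in [0, 1] from a fine-tuned BERT model.
    """
    scores = [bert_score(question, opt) for opt in options]
    # If every ordinary option is well supported, prefer an "all of the above" option.
    for i, opt in enumerate(options):
        if ALL_OF_THE_ABOVE in opt and all(s > 0.5 for j, s in enumerate(scores) if j != i):
            return i
    # Negated questions ask for the least supported option; otherwise take the best one.
    pick = min if any(w in question for w in NEGATION_WORDS) else max
    return pick(range(len(options)), key=scores.__getitem__)
```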