SLT 2022 Notes
Attending an Offline Conference Again
I was fortunate to have the opportunity to attend the Spoken Language Technology (SLT) Workshop 2022, held from 9th to 12th January 2023 in Doha, Qatar. I presented my final PhD work at the conference. Going back to an (almost) in-person conference was so refreshing, and I learned so many things.
In this post, I'll summarise my takeaways from the SLT 2022 conference. The photo below shows the conference venue, Marsa Malaz Kempinski.
Human Speech Processing and Neural Networks
Ever since I was a linguistics student (back in 2012), I have been keen to follow studies on the relevance of human language processing to computer speech processing. On this topic, I remember one of the keynote presentations and one poster.
The first keynote talk, by Nima Mesgarani, was interesting: he mentioned that the shallow layers of an RNN-T encoder learn acoustic and phonetic information occurring within short time frames, while the later layers capture lexical and semantic information spanning longer time frames. This apparently corresponds to how the human brain processes speech. Coincidentally, the poster I saw, "Phoneme Segmentation using Self-supervised Speech Models" by Strgar and Harwath, made a similar observation: for phone processing, more attention weight goes to the shallow layers of wav2vec2.0 and HuBERT.
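As a rough illustration of how such layer-wise behaviour can be inspected (this is my own minimal sketch, not the setup used in either work), the snippet below pulls out the per-layer hidden states of a pretrained wav2vec 2.0 model; training a frame-level phone probe on each layer separately is one common way to check where phonetic detail lives.

```python
# Minimal sketch (my own, not the papers' setup): extract layer-wise hidden
# states from a pretrained wav2vec 2.0 model so each layer can be probed
# separately, e.g. with a frame-level phone classifier.
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

model_name = "facebook/wav2vec2-base"  # any pretrained checkpoint would do
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name, output_hidden_states=True)
model.eval()

# One second of dummy 16 kHz audio stands in for a real utterance.
waveform = np.random.randn(16000).astype(np.float32)
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# hidden_states holds the CNN feature-encoder output followed by the output
# of every transformer layer (13 tensors for the base model).
for i, layer_output in enumerate(outputs.hidden_states):
    print(f"layer {i}: {tuple(layer_output.shape)}")
```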
Multimodality
It was nice to see many works on multimodality. I found the following works very interesting.
- multimodal sentiment analysis by Ando et al. (NTT)
- AVSE challenge by Lorena et al. (Edinburgh)
- visual text-to-speech by Nakano et al. (U of Tokyo)
- SpeechCLIP: visually augmented speech retrieval by Shih et al. (National Taiwan University)
Speech and NLP models are getting better and better thanks to humongous amounts of data, but unlike humans they still lack cross-modal understanding. I wonder what the real potential of multimodal processing would be if more data were available.
Data Domain Shift and ASR
ASR systems are often trained and evaluated on the same data domain, but real-world ASR applications can face the "unexpected", including accents, background noise, and other obstacles which the system was not trained on.
Related to that, my favourite poster at the conference was "How does pre-trained wav2vec2.0 perform on domain-shifted ASR? An extensive benchmark on air traffic control communications" by Zuluaga-Gomez et al. They investigated the effects of data domain shift on a Kaldi-based system and a wav2vec2.0-CTC system, and found that just about 10 hours of air traffic control speech was sufficient to fine-tune the wav2vec2.0 system into producing decent-quality transcripts.
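To give a flavour of what such fine-tuning can look like, here is a minimal, hedged sketch of a single CTC training step with Hugging Face transformers. It is my own illustration, not the authors' recipe; the checkpoint name, learning rate, and the dummy audio/transcript pair are placeholders.

```python
# Hedged sketch, not the paper's recipe: one step of fine-tuning a pretrained
# wav2vec 2.0 CTC model on in-domain speech (the poster reports that roughly
# 10 hours of air traffic control audio was enough for decent transcripts).
import numpy as np
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "facebook/wav2vec2-base-960h"  # an English CTC checkpoint
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
model.freeze_feature_encoder()  # often done when the adaptation set is small
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Dummy batch; in practice this would be in-domain audio plus its transcript.
audio = np.random.randn(16000).astype(np.float32)    # 1 s of 16 kHz audio
transcript = "CLEARED FOR TAKEOFF RUNWAY TWO SEVEN"   # placeholder text

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer(transcript, return_tensors="pt").input_ids

outputs = model(input_values=inputs.input_values, labels=labels)
outputs.loss.backward()  # CTC loss over the character vocabulary
optimizer.step()
optimizer.zero_grad()
print(f"CTC loss: {outputs.loss.item():.3f}")
```

Freezing the convolutional feature encoder is a common choice with small adaptation sets, since it leaves only the transformer layers and the CTC head to adapt to the new domain.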
Doha
The SLT conference happened right after the World Cup 2022 in Qatar, and Doha was ready to accommodate visitors. I had no problem finding English-speaking people for help. The place I'd love to visit again is Souq Waqif, the local market, which has literally everything one can imagine.
The left photo was taken in the West Bay district and the right one at the Museum of Islamic Art.