Distant Supervision for Cross-Language Speech Adaptation
Abstract: Direct supervision of ASR and TTS is provided by paired examples of speech and text. Distant supervision is provided by labels that are correlated with text, but that don’t unambiguously specify the text. First, consider the supervision that can be acquired from the users of a dialog system. The University of Illinois HealthEdvisor is an animated physician (Edna) who gives instructions to patients, then asks the patients to repeat each instruction in their own words (a method called “teach-back”). Teach-back helps patients (it improves memory and understanding), but now that we have teach-back working, we are exploring the possibility of using teach-back to adapt the system: if patients are really trying to correctly repeat the doctor’s instructions, then their utterances should be correlated, in some way, with the correct answer. Second, consider the distant supervision provided in field linguistic corpora, e.g., speech2translation and image2speech corpora. Speech2translation corpora consist of speech samples, possibly in unwritten languages, each paired with its text translation into some other language. Image2speech corpora consist of (image,speech) pairs, where each spoken utterance describes the content of the image. A convolutional network maps an image into a grid of feature vectors; by raster-scanning the grid, a sequence-to-sequence model with attention can learn to generate a spoken description. Hidden states of the sequence-to-sequence model can be clustered to form phone-like units, or segmented to form word-like units. But then, how do we figure out which of those phone-like units are actually phonemes? Consider a third type of distant supervision: adaptation from a well-resourced language, using methods modeled on human perceptual learning behaviors. 
Both human listeners and neural networks (CTC and TDNN) rapidly adapt to speech with a consistent pattern of phoneme distortions: after only four or five presentations, the listener learns a speaker-dependent phoneme category shift, and subsequent presentations are used to refine the boundary. Suppose we train a neural network in a well-resourced language with two training criteria: a CTC criterion enforcing generation of the correct phone sequence, and a speller criterion enforcing generation of the correct letter sequence. The mapping from phones to letters is similar to the task of grapheme-to-phoneme (G2P) transduction; could we use a neural G2P to bootstrap sequence-to-sequence ASR in a new language? Phonemes, words, conversations: all three types of distant supervision are used to teach human learners a new language. The ability of neural networks to draw knowledge from a wide variety of distant supervision encourages experiments, and promises the possibility of rapid cross-language speech technology adaptation in the very near future.
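The raster-scanning step mentioned in this abstract can be illustrated with a minimal sketch. It assumes the convolutional network's output is held in a NumPy array of shape (rows, cols, dim); the function name `raster_scan` is hypothetical and is not taken from the speaker's system.

```python
import numpy as np

def raster_scan(feature_grid):
    """Flatten a (rows, cols, dim) grid of feature vectors into a
    (rows * cols, dim) sequence, reading left to right, top to bottom."""
    rows, cols, dim = feature_grid.shape
    # Row-major (C-order) reshape visits every column of row 0, then row 1, etc.
    return feature_grid.reshape(rows * cols, dim)

# Toy grid: 2 rows x 3 columns of 4-dimensional feature vectors.
grid = np.arange(2 * 3 * 4).reshape(2, 3, 4)
sequence = raster_scan(grid)
print(sequence.shape)
```

The resulting sequence is what an attention-based sequence-to-sequence model would consume as its encoder input; how the actual system orders or pools the grid is not specified in the abstract.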
Biodata: Mark Hasegawa-Johnson has been on the faculty at the University of Illinois since 1999, where he is currently a Professor of Electrical and Computer Engineering. He received his Ph.D. in 1996 at MIT, with a thesis titled “Formant and Burst Spectral Measures with Quantitative Error Models for Speech Sound Classification,” after which he was a post-doc at UCLA from 1996-1999. Prof. Hasegawa-Johnson is a Fellow of the Acoustical Society of America, and a Senior Member of IEEE and ACM. He is currently Treasurer of ISCA, and Senior Area Editor of the IEEE Transactions on Audio, Speech and Language. He has published 280 peer-reviewed journal articles and conference papers in the general area of automatic speech analysis, including machine learning models of articulatory and acoustic phonetics, prosody, dysarthria, non-speech acoustic events, audio source separation, and under-resourced languages.
What Makes a Speaker Charismatic? Producing and Perceiving Charismatic Speech
Abstract: Charisma is defined by Max Weber as “a certain quality of an individual personality, by virtue of which he is set apart from ordinary men and treated as endowed with supernatural, superhuman, or at least specifically exceptional powers or qualities … not accessible to the ordinary person, but … regarded as of divine origin or as exemplary” on which basis “the individual concerned is treated as a leader” (Weber ‘47). In prior work we examined individual differences in both the production and perception of charismatic speech, finding some common trends but also some striking differences across cultures and political leanings. We are currently studying how production and perception vary with other factors, including gender, level of education, personality traits, and how a speaker’s own speech influences their perception of charisma. Such studies are useful not only for text-to-speech synthesis and voice conversion but also for broader issues of understanding political events, as well as for helping speakers improve their own charismatic production.
Biodata: Julia Hirschberg is Percy K. and Vida L. W. Hudson Professor of Computer Science and Chair of the Computer Science Department at Columbia University. She worked at Bell Laboratories and AT&T Laboratories — Research from 1985-2003 as a Member of Technical Staff and a Department Head, creating the Human-Computer Interface Research Department in 1994. She served as editor-in-chief of Computational Linguistics from 1993-2003 and co-editor-in-chief of Speech Communication from 2003-2006. She served on the Executive Board of the Association for Computational Linguistics (ACL) from 1993-2003, on the Permanent Council of International Conference on Spoken Language Processing (ICSLP) since 1996, and on the board of the International Speech Communication Association (ISCA) from 1999-2007 (as President 2005-2007); she has served on the CRA Executive Board (2013-14), the Association for the Advancement of Artificial Intelligence (AAAI) Council (2012-15), the Executive Board of the North American ACL (2012-15), the IEEE Speech and Language Processing Technical Committee (2011–), and the board of the CRA-W (2009–). She has been an AAAI fellow since 1994, an ISCA Fellow since 2008, and a (founding) ACL Fellow since 2011, was elected to the American Philosophical Society in 2014, and was selected ACM Fellow in 2016. She is a winner of the IEEE James L. Flanagan Speech and Audio Processing Award (2011) and the ISCA Medal for Scientific Achievement (2011).
Biosignal-based Spoken Communication
Abstract: Speech is a complex process emitting a wide range of biosignals, including, but not limited to, acoustics. These biosignals – stemming from the articulators, the articulator muscle activities, the neural pathways, and the brain itself – can be used to circumvent limitations of conventional speech processing in particular, and to gain insights into the process of speech production in general. In my talk I will present ongoing research at the Cognitive Systems Lab (CSL), where we explore speech-related muscle and brain activities based on machine learning methods with the goal of creating biosignal-based processing devices for spoken communication applications in everyday situations. Several applications will be described such as Silent Speech Interfaces that rely on articulatory muscle movement captured by electromyography to recognize and synthesize silently produced speech, Brain-to-text interfaces that use brain activity captured by electrocorticography to recognize speech, and Brain-to-speech interfaces that directly convert electrocortical signals into audible speech. I will also describe initial experiments toward electrical stimulation of articulatory muscles driven by brain activity.
Biodata: Tanja Schultz received her diploma and doctoral degree in Informatics from University of Karlsruhe, Germany, in 1995 and 2000. Prior to these degrees she completed her Master's degree in Mathematics, Sports, Physical and Educational Science at Heidelberg University, Germany, in 1989. Dr. Schultz is the Professor for Cognitive Systems at the University of Bremen, Germany and adjunct Research Professor at the Language Technologies Institute of Carnegie Mellon, PA USA. Since 2007, she has directed the Cognitive Systems Lab, where her research activities include multilingual speech recognition and the processing, recognition, and interpretation of biosignals for human-centered technologies and applications. Since 2019 she has been the spokesperson of the University of Bremen high-profile area “Minds, Media, Machines”. Prior to joining University of Bremen, she was a Research Scientist at Carnegie Mellon (2000-2007) and a Full Professor at Karlsruhe Institute of Technology in Germany (2007-2015). Dr. Schultz has been an Associate Editor of ACM Transactions on Asian Language Information Processing since 2010, has served on the Editorial Board of Speech Communication since 2004, and was Associate Editor of IEEE Transactions on Speech and Audio Processing (2002-2004). She was President (2014-2015) and elected Board Member (2006-2013) of ISCA, and a General Co-Chair of Interspeech 2006. She was elevated to Fellow of ISCA (2016) and to member of the European Academy of Sciences and Arts (2017). Dr. Schultz was the recipient of the Otto Haxel Award in 2013, the Alcatel Lucent Award for Technical Communication in 2012, the PLUX Wireless Biosignals Award in 2011, and the Allen Newell Medal for Research Excellence in 2002, and received the ISCA / EURASIP Speech Communication Best Paper Awards in 2001 and 2015.
Towards Better Understanding Generalization in Deep Learning
Abstract: Deep learning has shown incredible successes in the past few years, but much work remains to understand some of these successes. Why do such over-parameterized models still generalize so well? In this presentation, I will cover recent work empirically showing interesting relations between learned internal representations and generalization.
Biodata: Samy Bengio (PhD in computer science, University of Montreal, 1993) has been a research scientist at Google since 2007. He currently leads a group of research scientists in the Google Brain team, conducting research in many areas of machine learning such as deep architectures, representation learning, sequence processing, speech recognition, image understanding, large-scale problems, and adversarial settings. He was the general chair for Neural Information Processing Systems (NeurIPS) 2018, the main conference venue for machine learning, was the program chair for NeurIPS in 2017, is action editor of the Journal of Machine Learning Research and on the editorial board of the Machine Learning Journal, was program chair of the International Conference on Learning Representations (ICLR 2015, 2016), general chair of BayLearn (2012-2015) and the Workshops on Machine Learning for Multimodal Interactions (MLMI’2004-2006), as well as the IEEE Workshop on Neural Networks for Signal Processing (NNSP’2002), and has served on the program committee of several international conferences such as NIPS, ICML, ICLR, ECML and IJCAI. More information can be found on his website.
A Deep CASA Approach to Talker-independent Speaker Separation
Abstract: We address the challenge of talker-independent monaural speaker separation from the perspectives of deep learning and computational auditory scene analysis (CASA). Specifically, we approach multi-speaker separation in two stages: simultaneous grouping and sequential grouping. Simultaneous grouping is first performed in each time frame by separating the spectra of different speakers using a deep neural network with permutation-invariant training. In the second stage, frame-level separated spectra are sequentially grouped to different speakers by a clustering network. The deep CASA approach optimizes frame-level separation and speaker tracking in turn, and produces excellent results for both tasks. Systematic evaluation on the benchmark WSJ0 database of 2-speaker and 3-speaker mixtures shows that our approach achieves state-of-the-art results. The approach has also been successfully extended to perform speaker separation in reverberant conditions. We believe that the development of the deep CASA approach represents a major step towards solving the cocktail party problem.
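To make the frame-level permutation-invariant training idea concrete, here is a minimal sketch of such a loss. It is not the authors' implementation: it assumes a squared-error criterion on spectra stored as NumPy arrays of shape (speakers, frames, bins), and the function name `frame_level_pit_loss` is hypothetical.

```python
from itertools import permutations

import numpy as np

def frame_level_pit_loss(estimates, targets):
    """Frame-level permutation-invariant loss (squared error assumed).

    estimates, targets: arrays of shape (speakers, frames, bins).
    Returns the mean per-frame minimum loss and, for every frame, the
    output-to-target assignment that achieved it.
    """
    n_spk = estimates.shape[0]
    perms = list(permutations(range(n_spk)))
    # cost[p, t]: error at frame t when output i is matched to target perms[p][i].
    cost = np.stack([
        sum(((estimates[i] - targets[p[i]]) ** 2).sum(axis=-1) for i in range(n_spk))
        for p in perms
    ])
    best = cost.argmin(axis=0)          # best permutation index for each frame
    loss = cost.min(axis=0).mean()      # average of the per-frame minima
    return loss, [perms[b] for b in best]

# Toy case: outputs are perfect, but the speakers swap places in frames 2 and 3.
targets = np.arange(24, dtype=float).reshape(2, 4, 3)
estimates = targets.copy()
estimates[:, 2:] = targets[::-1, 2:]
loss, assignment = frame_level_pit_loss(estimates, targets)
print(loss, assignment)
```

Because the best assignment is chosen independently in each frame, the outputs may switch speakers over time, which is why a second, sequential grouping stage is needed to track each speaker across frames.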
Biodata: DeLiang Wang received the B.S. degree in 1983 and the M.S. degree in 1986 from Peking (Beijing) University, Beijing, China, and the Ph.D. degree in 1991 from the University of Southern California, Los Angeles, CA, all in computer science. From July 1986 to December 1987 he was with the Institute of Computing Technology, Academia Sinica, Beijing. Since 1991, he has been with the Department of Computer Science and Engineering and the Center for Cognitive Science at The Ohio State University, Columbus, OH, where he is currently a Professor. From October 1998 to September 1999, he was a visiting scholar in the Department of Psychology at Harvard University, Cambridge, MA. From October 2006 to June 2007, he was a visiting scholar at Oticon A/S, Copenhagen, Denmark. DeLiang Wang received the NSF Research Initiation Award in 1992 and the ONR Young Investigator Award in 1996. He received the OSU College of Engineering Lumley Research Award in 1996, 2000, 2005, and 2010. His 2005 paper, “The time dimension for scene analysis”, received the IEEE Transactions on Neural Networks Outstanding Paper Award from the IEEE Computational Intelligence Society. He also received the 2008 Helmholtz Award from the International Neural Network Society, and was named a University Distinguished Scholar in 2014. He was an IEEE Distinguished Lecturer (2010-2012), and is an IEEE Fellow. He is Co-Editor-In-Chief of Neural Networks, which is a premier journal published by Elsevier. In addition, he serves on the editorial/advisory boards of Cognitive Computation, Cognitive Neurodynamics, Neural Computing and Applications, and IEEE/ACM Transactions on Audio, Speech, & Language Processing. He served as President of the International Neural Network Society in 2006, and currently serves on its governing board.