SANE 2017 - Speech and Audio in the Northeast

October 19, 2017

New York City from Google NY Offices

SANE 2017, a one-day event gathering researchers and students in speech and audio from the Northeast of the American continent, will be held on Thursday October 19, 2017 at Google, in New York, NY.

It is a follow-up to SANE 2012, held at Mitsubishi Electric Research Labs (MERL), SANE 2013, held at Columbia University, SANE 2014, held at MIT CSAIL, SANE 2015, (already!) held at Google NY, and SANE 2016, held at MIT's McGovern Institute for Brain Research. Since the first edition, the audience has steadily grown, gathering over 100 researchers and students in recent editions.

As in 2013 and 2015, this year's SANE will be held in conjunction with the WASPAA workshop, held October 15-18 in upstate New York. WASPAA attendees are welcome and encouraged to attend SANE.

SANE 2017 will feature invited talks by leading researchers from the Northeast, as well as from the international community. It will also feature a lively poster session during lunch time, open to both students and researchers.


  • Date: Thursday, October 19, 2017
  • Venue: Google, New York City, NY

Schedule — Thursday, October 19


8:30-8:55 Registration and Breakfast
9:00-9:45 Sacha Krstulović (Audio Analytic)
Sound is not speech
9:45-10:30 Yusuf Aytar (Google DeepMind)
Deep cross-modal alignment and transfer for ambient sound understanding
10:30-11:00 Coffee break
11:00-11:45 Florian Metze (CMU)
Open-Domain Audio-Visual Speech Recognition
11:45-12:30 Gunnar Evermann (Apple)
Developing Siri — Speech Recognition at Apple
12:30-14:30 Lunch / Poster Session
14:30-15:15 Live demos
15:15-16:00 Eric Humphrey (Spotify)
Advances in Processing Singing Voice in Recorded Music
16:00-16:30 Coffee break
16:30-17:15 Aaron Courville (University of Montreal)
Progress on the road to end-to-end speech synthesis
17:15-18:00 Aäron van den Oord (Google DeepMind)
Neural discrete representation learning
18:00-18:15 Closing remarks
18:15-... Drinks somewhere nearby


We have reached capacity, but you can ask to be put on the waiting list by sending an email to with your name and affiliation. SANE is a free event.


The workshop will be hosted at Google, in New York City, NY. Google NY is located at 111 8th Ave, and the closest subway stop is the A, C, and E lines' 14 St station. The entrance is the one to the LEFT of the apparel shop and the Google logo on this Street View shot. (Note: the entrance is different from the one used in 2015)

Organizing Committee

Google · MERL





Sound is not speech

Sacha Krstulović

Audio Analytic

The recognition of audio events is a relatively new field of research compared to speech and music recognition. While it started from recipes borrowed from those fields, 24/7 sound recognition in fact defines a range of research problems distinct from speech and music. After reviewing the constraints of running sound recognition successfully on real-world consumer products deployed across thousands of homes, the talk discusses the nature of some of sound recognition’s distinctive problems, such as open-set recognition and the modelling of interrupted sequences. This is put in context with the most recent advances in the field, as reflected publicly in competitive evaluations such as the DCASE challenge, to assess which of sound recognition’s distinctive problems are currently being addressed by state-of-the-art methods, and which deserve more attention.

Sacha Krstulović

Sacha Krstulović is the Director of AALabs, which is the R&D division of Audio Analytic in Cambridge, UK, the world leader in artificial audio intelligence. Audio Analytic’s ai3™ software provides products in the smart home with the sense of hearing, enabling technology to help people by reacting to the world around them. Before joining Audio Analytic, Sacha was a Senior Research Engineer at Nuance’s Advanced Speech Group (Nuance ASG), where he worked on pushing the limits of large scale speech recognition services such as Voicemail-to-Text and Voice-Based Mobile Assistants (Apple Siri type services). Prior to that, he was a Research Engineer at Toshiba Research Europe Ltd., developing novel Text-To-Speech synthesis approaches able to learn from data. His 20 years of experience in machine learning for audio also spans speaker recognition and sparse signal decompositions. He is the author and co-author of three book chapters, several international patents and several articles in international journals and conferences. Sacha is using his extensive audio analysis expertise to drive forward Audio Analytic’s technology. He is passionate about researching and developing automatic recognition of sound where Audio Analytic is building significant leadership.


Deep cross-modal alignment and transfer for ambient sound understanding

Yusuf Aytar

Google DeepMind

Can you “see”, “hear" or “draw” the scene depicted in the sentence below?
“The child tosses the stone into the lake.”
Although you have only read about this situation, you can likely imagine it in other modalities, such as visually or aurally. For example, you might picture a boy near a lake, draw the circular ripples spreading around the stone, and imagine the sound of its splash. This transfer is possible partly due to cross-modal perception: the human ability to perceive concepts independently of modality. Establishing a similar cross-modal understanding in machines would allow them to operate more reliably and to transfer knowledge and skills across modalities. This talk will discuss the challenges of cross-modal alignment and transfer, focusing particularly on the relations between ambient sound and visual perception.
Deep learning is setting new records in many fields of AI, such as language, vision, and speech understanding. One major weakness of deep models is that with limited data one can only train small networks, which often perform poorly. In this talk we address learning rich natural sound representations by capitalizing on large amounts of unlabeled sound data collected in the wild. We leverage the natural synchronization between vision and sound to learn an acoustic representation from two million unlabeled videos, and propose a student-teacher training procedure that transfers discriminative visual knowledge from well-established visual recognition models into the sound modality, using unlabeled video as a bridge.
Furthermore, we expand the model by learning an alignment between three major modalities of perception: language, sound, and vision. By leveraging unlabeled videos and millions of sentences paired with images, we jointly train a deep convolutional network for aligned representation learning. Although our network is trained only with image+text and image+sound pairs, it can also transfer between text and sound, a transfer it never observed during training. Our sound representations yield significant improvements over state-of-the-art results on standard benchmarks for acoustic scene/object classification. Visualizations suggest that some high-level semantics emerge automatically in the network, even though it is trained without ground-truth labels.
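The student-teacher transfer described in the abstract amounts to training an audio network to match the soft predictions a pretrained visual network produces on synchronized video. A minimal numpy sketch of the distillation objective is below; the networks, logits, and class count are hypothetical stand-ins, not the actual models used in the talk.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kl_distillation_loss(teacher_logits, student_logits):
    """KL(teacher || student), averaged over the batch.

    The visual teacher's softmax output serves as a soft label
    for the audio student viewing the same (synchronized) clip.
    """
    p = softmax(teacher_logits)  # teacher soft labels
    q = softmax(student_logits)  # student predictions
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return float(np.mean(kl))

# Toy batch: 4 synchronized video/audio clips, 10 object/scene classes.
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 10))  # from a pretrained visual net (assumed)
student_logits = rng.normal(size=(4, 10))  # from the audio net being trained

loss = kl_distillation_loss(teacher_logits, student_logits)
matched = kl_distillation_loss(teacher_logits, teacher_logits)
assert matched < loss  # the loss vanishes when the student matches the teacher
```

In actual training this scalar would be minimized with respect to the audio network's parameters by backpropagation; no labels are needed, since the supervision comes entirely from the visual model.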

Yusuf Aytar

Yusuf Aytar is a Research Scientist at DeepMind. Previously he worked as a post-doctoral research associate at CSAIL, MIT, collaborating with Prof. Antonio Torralba. He obtained his PhD (2014) from the Visual Geometry Group at the University of Oxford under the supervision of Prof. Andrew Zisserman. As a Fulbright scholar, he obtained his MSc (2008) from the Computer Vision Lab at the University of Central Florida under the supervision of Prof. Mubarak Shah, and his B.E. in Computer Engineering (2005) at Ege University. His research concentrates on cross-modal learning, transfer learning, image/video understanding, object detection, and deep learning in general. The outcomes of his research have been published in major machine learning and computer vision conferences such as NIPS, BMVC, CVPR, and ICCV.


Open-Domain Audio-Visual Speech Recognition

Florian Metze

Carnegie Mellon University

Audio-visual speech recognition has long been an active area of research, mostly focusing on improving ASR performance using “lip-reading”. We present “open-domain audio-visual speech recognition”, where we incorporate the semantic context of the speech using object, scene, and action recognition in open-domain videos.
We show how all-neural approaches greatly simplify and improve upon our earlier work on adapting the acoustic and language models of a speech recognizer, and investigate several ways to adapt end-to-end models to this task. Working on a corpus of “how-to” videos from the web, an object that can be seen (“car”) or a scene that is detected (“kitchen”) can be used to condition models on the “context” of the recording, thereby reducing perplexity and improving transcription.
We achieve good improvements in all cases, and compare and analyze the respective reductions in word errors to a conventional baseline system. We hope that this work might serve to ultimately unite speech-to-text and image-to-text, in order to eventually achieve something like “video-to-meaning” or multi-media summarization systems.

Florian Metze

Florian Metze is an Associate Research Professor at Carnegie Mellon University, in the School of Computer Science’s Language Technologies Institute. His work covers many areas of speech recognition and multimedia analysis, with a focus on end-to-end deep learning. He has also worked on low-resource and multilingual speech processing, speech recognition with articulatory features, large-scale multimedia retrieval and summarization, analysis of doctor-patient conversations, and recognition of personality and similar metadata from speech.
He is the founder of the “Speech Recognition Virtual Kitchen” project, which strives to make state-of-the-art speech processing techniques usable by non-experts in the field, and started the “Query by Example Search on Speech” task at MediaEval. He was Co-PI and PI of the CMU team in the IARPA Aladdin and Babel projects. Most recently, his group released the “Eesen” toolkit for end-to-end speech recognition using recurrent neural networks and connectionist temporal classification.
He received his PhD from the Universität Karlsruhe (TH) in 2005 for a thesis on “Articulatory Features for Conversational Speech Recognition”. From 2006 to 2009 he worked at Deutsche Telekom Laboratories (T-Labs), where he led research and development projects involving language technologies in the customer care and mobile services area. In 2009, he joined Carnegie Mellon University, where he is also the associate director of the InterACT center. He has served on the committees of multiple conferences and journals, and has been an elected member of the IEEE Speech and Language Technical Committee since 2011.


Developing Siri — Speech Recognition at Apple

Gunnar Evermann

Apple


Advances in Processing Singing Voice in Recorded Music

Eric Humphrey

Spotify

Modeling and understanding the human voice remains one of the enticing unsolved challenges in audio signal processing and machine listening. This challenge is amplified in the context of recorded music, where often many sound sources are intentionally correlated in both time and frequency. In this talk, we present recent advances in the state of the art for detecting, separating, and describing vocals in popular music audio recordings, leveraging semi-supervised datasets mined from a large commercial music catalog.

Eric Humphrey

Eric Humphrey is a research scientist at Spotify, and acting Secretary on the board of the International Society for Music Information Retrieval (ISMIR). Previously, he worked or consulted in a research capacity for various companies, notably THX and MuseAmi, and he is a contributing organizer of a monthly Music Hackathon series in NYC. He earned his Ph.D. at New York University in Steinhardt's Music Technology Department under the direction of Juan Pablo Bello, Yann LeCun, and Panayotis Mavromatis, exploring the application of deep learning to the domains of audio signal processing and music informatics. When not trying to help machines understand music, you can find him running the streets of Brooklyn or hiding out in his music studio.


Progress on the road to end-to-end speech synthesis

Aaron Courville

University of Montreal

In this talk I will present Char2Wav, a recent end-to-end model for speech synthesis. Char2Wav has two components: a reader and a neural vocoder. The reader is an encoder-decoder model with attention: the encoder is a bidirectional recurrent neural network that accepts text or phonemes as input, while the decoder is a recurrent neural network (RNN) with attention that produces vocoder acoustic features. The neural vocoder is a conditional extension of SampleRNN that generates raw waveform samples from these intermediate representations. Unlike traditional models for speech synthesis, Char2Wav learns to produce audio directly from text. I will conclude with some prospects for future work in this area.

Aaron Courville

Aaron Courville is an Assistant Professor in the Department of Computer Science and Operations Research (DIRO) at the University of Montreal, and a member of MILA (the Montreal Institute for Learning Algorithms). He received his PhD from Carnegie Mellon University in 2006. His current research interests focus on the development of deep learning models and methods, with applications to computer vision, natural language processing, and speech synthesis, as well as other artificial-intelligence-related tasks. He is particularly interested in the development of generative models and models of task-oriented dialogue. He is a CIFAR Fellow in the Learning in Machines and Brains program.


Neural discrete representation learning

Aäron van den Oord

Google DeepMind

Learning useful representations without supervision remains a key challenge in machine learning. In this talk, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of “posterior collapse” — where the latents are ignored when they are paired with a powerful autoregressive decoder — typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion, providing further evidence of the utility of the learnt representations.
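The core of the VQ-VAE's discrete bottleneck is a nearest-neighbour lookup: each continuous encoder output is snapped to its closest entry in a learnt codebook. The numpy sketch below shows only this quantisation step, with illustrative shapes; the straight-through gradient trick and codebook learning used in training are omitted.

```python
import numpy as np

def vq_quantize(z_e, codebook):
    """Map each encoder output vector to its nearest codebook entry.

    z_e:      (N, D) continuous encoder outputs
    codebook: (K, D) embedding vectors
    Returns (indices, z_q), where z_q[i] == codebook[indices[i]].
    """
    # Squared Euclidean distance between every output and every code: (N, K).
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d.argmin(axis=1)   # the discrete latent codes
    z_q = codebook[indices]      # the quantised representation fed to the decoder
    return indices, z_q

# Toy setup: K=8 codes of dimension D=4; two encoder outputs that sit
# near codes 2 and 5, so the lookup should recover those indices.
rng = np.random.default_rng(1)
codebook = rng.normal(size=(8, 4))
z_e = codebook[[2, 5]] + 0.01 * rng.normal(size=(2, 4))
indices, z_q = vq_quantize(z_e, codebook)
assert list(indices) == [2, 5]
```

The decoder then sees only `z_q`, and an autoregressive prior can be fit over the integer `indices`, which is what enables the generation results described in the abstract.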

Aäron van den Oord

Aäron van den Oord completed his PhD at Ghent University in Belgium under the supervision of Dr. Benjamin Schrauwen. During his PhD he worked on generative models, image compression, and music recommendation. Since joining DeepMind in 2015, he has made important contributions to generative models, including PixelRNN, PixelCNN, and WaveNet.



Poster Session

  • "The roles of listeners’ head orientations and eye gazes in multi-talker real-world conversations"
    Jordan Beim, Karim Helwani, Dean Meyer, Tao Zhang (University of Minnesota, Starkey Technologies)
    • In a real-world multiple-talker conversation, a listener typically focuses on one target talker at a time. To enhance the target speech, it is important to know where the target talker is in such a scenario. However, it is often challenging to determine this using microphone inputs alone. It is well known that the listener tends to look at the target talker in a real-world conversation. Based on this observation, researchers have investigated how listeners use their eye gaze in laboratory environments (e.g., Kidd et al., 2013). However, it is unknown how listeners use their eye gaze in real-world conversations. We carefully conduct an experiment to study this behavior in a natural real-world conversation. Specifically, we simultaneously monitor multiple listeners’ eye-gaze behavior in natural conversations in a large and busy cafeteria using a camera, head-tracking sensors, and electrooculography (EOG) electrodes, in addition to microphones. We use the camera recordings to calibrate and validate the EOG recordings, the head-tracking recordings to determine head orientation, and the microphone signals to determine when and where the target speech occurs. The eye-gaze data are then analyzed relative to the estimated head orientation and target-speech occurrence at both the individual and group levels. Implications for applications in hearing devices and future research directions are discussed.
  • "Improving Neural Net Autoencoders for Music Synthesis"
    Joseph Colonel, Christopher Curro, Sam Keene (The Cooper Union)
    • We present a novel architecture for a synthesizer based on an autoencoder that compresses and reconstructs magnitude short time Fourier transform frames. This architecture outperforms previous topologies by using improved regularization, employing several activation functions, creating a focused training corpus, and implementing the Adam learning method. By multiplying gains to the hidden layer, users can alter the autoencoder’s output, which opens up a palette of sounds unavailable to additive/subtractive synthesizers. Furthermore, our architecture can be quickly re-trained on any sound domain, making it flexible for music synthesis applications. Samples of the autoencoder’s outputs can be found at, and the code used to generate and train the autoencoder is open source, hosted at
  • "High‐Quality Spectrogram Inversion in Real Time"
    Zdeněk Průša, Pavel Rajmic (Brno University of Technology)
    • An efficient algorithm for real‐time signal reconstruction from the magnitude of the short‐time Fourier transform (STFT) has been recently presented in IEEE Signal Processing Letters (, open access article). The proposed approach combines the strengths of two previously published algorithms: the real‐time phase gradient heap integration and the Gnann and Spiertz's real‐time iterative spectrogram inversion with look‐ahead. The demonstration would explain the background of the approach using a poster, and offer a listening proof of the method’s superiority over the state‐of‐the‐art. The sound excerpts would be similar to those on our web page .
  • "Student’s t Source and Mixing Models for Multichannel Audio Source Separation"
    Simon Leglaive, Roland Badeau, Gaël Richard (LTCI, Télécom ParisTech, Université Paris-Saclay)
    • This work introduces a Bayesian framework for under-determined audio source separation in multichannel reverberant mixtures. We model the source signals as Student's t latent random variables in a time-frequency domain. The specific structure of musical signals in this domain is exploited by means of a non-negative matrix factorization model. Conversely, we design the mixing model in the time domain. In addition to leading to an exact representation of the convolutive mixing process, this approach allows us to develop simple probabilistic priors for the mixing filters. Indeed, as those filters correspond to room responses they exhibit a simple characteristic structure in the time domain that can be used to guide their estimation. We also rely on the Student's t distribution for modeling the impulse response of the mixing filters. From this model, we develop a variational inference algorithm in order to perform source separation. The experimental evaluation demonstrates the potential of this approach for separating multichannel reverberant mixtures.
  • "Multichannel Processing for Augmented Listening Devices"
    Ryan M. Corey, Andrew C. Singer (University of Illinois at Urbana-Champaign)
    • Augmented listening devices, such as digital hearing aids and so-called “hearables”, enhance everyday listening by processing environmental sounds before playing them back to the listener. Though augmented listening devices promise to improve human hearing by attenuating annoying background noise and amplifying conversation partners, most perform poorly in noisy and reverberant real-world environments. To reliably separate speech and noise in complex listening environments, we propose using large microphone arrays to spatially separate, process, and remix audio signals. Though increasingly popular in machine listening applications, such as personal voice assistants, large arrays have rarely been used in real-time listening enhancement. In this poster, we will outline the benefits and challenges of multichannel audio enhancement for listening devices, emphasizing the ways in which the problem differs from better-studied machine listening applications. In particular, we will highlight our recent results and ongoing work on delay constraints for real-time listening, preserving the spatial awareness of the listener, and nonlinear speech enhancement methods in multichannel systems.
  • "A Recurrent Encoder-Decoder Approach With Skip-Filtering Connections For Monaural Singing Voice Separation"
    Stylianos Ioannis Mimilakis, Konstantinos Drossos, Tuomas Virtanen, Gerald Schuller (Fraunhofer IDMT, Tampere University of Technology, Technical University of Ilmenau)
    • The objective of deep learning methods based on encoder-decoder architectures for music source separation is to approximate either ideal time-frequency masks or spectral representations of the target music source(s). The spectral representations are then used to derive time-frequency masks. In this work we introduce a method to directly learn time-frequency masks from an observed mixture magnitude spectrum. We employ recurrent neural networks and train them using prior knowledge only for the magnitude spectrum of the target source. To assess the performance of the proposed method, we focus on the task of singing voice separation. The results from an objective evaluation show that our proposed method provides comparable results to deep learning based methods which operate over complicated signal representations. Compared to previous methods that approximate time-frequency masks, our method has increased performance of signal to distortion ratio by an average of 3.8 dB.
  • "Semantic Decomposition of Applause-Like Signals and Applications"
    Alexander Adami, Jürgen Herre (International Audio Laboratories Erlangen (AudioLabs))
    • Applause sounds result from the superposition of many individual people clapping their hands. Applause can be considered as a mixture of individually perceivable transient foreground events and a more noise-like background. Due to the high number of densely packed transient events, applause and comparable sounds (like rain drops, many galloping horses, etc.) form a special signal class which often needs a dedicated processing to cope with the impulsiveness of these sounds. This demonstration presents a semantic decomposition method of applause sounds into a foreground component corresponding to individually perceivable transient events and a residual more noise-like background component. A selection of applications of this generic decomposition is discussed including measurement of applause characteristics, blind upmix of applause signals and applause restoration / perceptual coding enhancement. Sound examples illustrate the capabilities of the scheme.
  • "Large-Scale Audio Event Discovery in One Million YouTube Videos"
    Aren Jansen, Jort F. Gemmeke, Daniel P. W. Ellis, Xiaofeng Liu, Wade Lawrence, Dylan Freedman (Google)
    • Internet videos provide a virtually boundless source of audio with a conspicuous lack of localized annotations, presenting an ideal setting for unsupervised methods. With this motivation, we perform an unprecedented exploration into the large-scale discovery of recurring audio events in a diverse corpus of one million YouTube videos (45K hours of audio). Our approach is to apply a streaming, nonparametric clustering algorithm to both spectral features and out-of-domain neural audio embeddings, and use a small portion of manually annotated audio events to quantitatively estimate the intrinsic clustering performance. In addition to providing a useful mechanism for unsupervised active learning, we demonstrate the effectiveness of the discovered audio event clusters in two downstream applications. The first is weakly-supervised learning, where we exploit the association of video-level metadata and cluster occurrences to temporally localize audio events. The second is informative activity detection, an unsupervised method for semantic saliency based on the corpus statistics of the discovered event clusters.
  • "Multi-Scale Multi-Band DenseNets for Audio Source Separation"
    Naoya Takahashi (Sony)
    • This work deals with the problem of audio source separation. To handle its complex and ill-posed nature, current state-of-the-art approaches employ deep neural networks to obtain instrumental spectra from a mixture. In this study, we propose a novel network architecture that extends the recently developed densely connected convolutional network (DenseNet), which has shown excellent results on image classification tasks. To deal with the specific problem of audio source separation, an up-sampling layer, block skip connections, and band-dedicated dense blocks are incorporated on top of DenseNet. The proposed approach takes advantage of long contextual information and outperforms the state of the art on the SiSEC 2016 competition by a large margin in terms of signal-to-distortion ratio. Moreover, the proposed architecture requires significantly fewer parameters and considerably less training time than competing methods.
  • "DNN-Based Turbo Fusion of Multiple Audio Streams for Phoneme Recognition"
    Timo Lohrenz and Tim Fingscheidt (Technische Universität Braunschweig)
    • In order to increase automatic speech recognition (ASR) performance, we present so-called turbo fusion for effective and competitive fusion of multiple audio (feature) streams. Inspired by turbo codes in digital communications, turbo fusion is an iterative approach that exchanges state posteriors between multiple (distributed) ASR systems. We exploit the complementary information of standard Mel- and Gabor-filterbank features by fusing their respective deep neural network (DNN) posterior outputs, using a new dynamic range limitation to balance the information exchanged between recognizers. While turbo ASR has been validated in the past on (small-vocabulary) word models, we take here a first step towards LVCSR by conducting turbo fusion on the TIMIT phoneme recognition task and comparing it to classical benchmarks. We show that turbo fusion outperforms these single-feature baselines as well as other well-known fusion approaches.
  • "Regression vs. Classification in DNN-Based Artificial Speech Bandwidth Extension"
    Johannes Abel and Tim Fingscheidt (Technische Universität Braunschweig)
    • Artificial speech bandwidth extension (ABE) is typically based on a statistical model for estimating the missing frequency components of telephone speech between 4 and 8 kHz, i.e., the upper band (UB). Many researchers in the field use a codebook of UB spectral envelopes in combination with a statistical model to select the optimal codebook entry. The baseline statistical model used to classify codebook entries is the classical hidden Markov model (HMM) employing a Gaussian mixture model as acoustic model. We compare this baseline to an HMM with a deep neural network (DNN) as acoustic model, as well as to a statistical model consisting purely of a DNN. Finally, we investigate a DNN which directly estimates UB envelopes without using a codebook (regression). Instrumental speech quality measures show that the regression approach outperforms all other investigated variants. Furthermore, in a (subjective) CCR listening test, the regression-based ABE was found to improve coded narrowband speech by over 1.3 CMOS points.
  • "Acoustic data-driven lexicon learning based on a greedy pronunciation selection framework"
    Xiaohui Zhang, Vimal Manohar, Dan Povey, Sanjeev Khudanpur (Johns Hopkins University)
    • Speech recognition systems for irregularly spelled languages like English normally require hand-written pronunciations. In this work, we describe a system for automatically obtaining pronunciations of words for which pronunciations are not available, but for which transcribed data exists. Our method integrates information from the letter sequence and from the acoustic evidence. The novel aspect we address is how to prune entries from such a lexicon, since, empirically, lexicons with too many entries do not tend to be good for ASR performance. Experiments on various ASR tasks show that, with the proposed framework, starting with an initial lexicon of several thousand words, we are able to learn a lexicon which performs close to a full expert lexicon in terms of WER on test data, and better than lexicons built using G2P alone or with a pruning criterion based on pronunciation probability.
  • "Signal Processing, Psychoacoustic Engineering and Digital Worlds: Interdisciplinary Audio Research at the University of Surrey"
    Mark D. Plumbley, Adrian Hilton, Philip J. B. Jackson, Wenwu Wang, Tim Brookes, Philip Coleman, Russell Mason, David Frohlich, Carla Bonina and David Plans (University of Surrey)
    • At the University of Surrey (Guildford, UK), we have brought together research groups in different disciplines, with a shared interest in audio, to work on a range of collaborative research projects. In the Centre for Vision, Speech and Signal Processing (CVSSP) we focus on technologies for machine perception of audio scenes; in the Institute of Sound Recording (IoSR) we focus on research into human perception of audio quality; the Digital World Research Centre (DWRC) focusses on the design of digital technologies; while the Centre for Digital Economy (CoDE) focusses on new business models enabled by digital technology. This interdisciplinary view, across different traditional academic departments and faculties, allows us to undertake projects which would be impossible for a single research group. In this poster we will present an overview of some of these interdisciplinary projects, including projects in spatial audio, sound scene and event analysis, and creative commons audio.
  • "Fully Convolutional Neural Networks for Anger Detection in Spontaneous Speech"
    Mohamed Ezz and Taniya Mishra (Affectiva)
    • Convolutional neural networks (CNNs) have been shown to be effective in end-to-end speech modeling tasks, such as acoustic scene/object classification and automatic speech recognition. In this work, we train fully convolutional networks to detect anger in speech. Since training these deep architectures requires large amounts of speech data, and the size of emotion datasets is relatively small, we also explore a transfer learning based approach. Specifically, we train a model to detect anger in speech by fine-tuning Soundnet. Soundnet is a fully convolutional neural network trained multimodally to classify natural objects and environmental scenes, with ground truth generated by vision-based classifiers. In our experiments, we use acted, elicited, and natural emotional speech datasets. We also examine the language dependence of anger display in speech by comparing training and testing in the same language to training with one language and testing on another, using US English and Mandarin Chinese datasets. We find that our models are able to effectively detect anger in speech under these conditions. Furthermore, our proposed system has low latency suitable for real-time applications, only requiring one second of audio to make a reliable classification. Further tuning of the model as more data becomes available and exploiting data augmentation is expected to improve performance without requiring the machinations of feature engineering.
  • "Gated Deep Recurrent Nonnegative Matrix Factorization Networks for Anger Detection in Speech"
    Scott Wisdom and Taniya Mishra (Affectiva)
    • Recurrent neural networks (RNNs) have been shown to be effective for audio processing tasks, such as speech enhancement, source separation, and acoustic scene classification. Another model that has also been effective for these tasks is sparse nonnegative matrix factorization (SNMF), which decomposes a nonnegative feature matrix (e.g., a spectrogram) into sparse coefficients using a dictionary of feature templates. We consider a recently proposed neural network that combines the advantages of RNNs and SNMF, the deep recurrent NMF (DR-NMF) network. DR-NMF is an RNN whose forward pass corresponds to the iterations of an inference algorithm for SNMF coefficients. Since DR-NMF is based on a principled statistical model, it can be initialized with the maximum likelihood parameters of the model and fine-tuned for any task, which has been shown to be beneficial when only a small amount of labeled training data is available. Specifically, we propose an extension to DR-NMF: gated DR-NMF (GDR-NMF), which adds GRU-like gates to the recurrent warm-start connections between time steps. We apply DR-NMF and GDR-NMF networks to the specific problem of detecting anger from speech, for which only a small amount of training data is available. We compare the performance of DR-NMF and GDR-NMF, both randomly initialized and pretrained with SNMF, to conventional state-of-the-art LSTMs and convolutional neural networks (CNNs) across a variety of realistic acted, elicited, and natural emotional speech data.
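As a rough illustration of the SNMF inference that DR-NMF unrolls, here is a minimal numpy sketch (all shapes, the iteration count, and the sparsity weight are arbitrary choices, not the poster's settings): multiplicative updates for sparse coefficients H given a nonnegative spectrogram V and dictionary W, where each iteration plays the role of one unrolled network layer.

```python
import numpy as np

def snmf_infer(V, W, n_iter=50, lam=0.1, eps=1e-8):
    """Infer sparse nonnegative coefficients H such that V ~= W @ H.

    Multiplicative updates for the Euclidean objective with an L1
    sparsity penalty `lam` on H. Each iteration is the computation
    that one layer of a deep recurrent NMF network would unroll.
    """
    K = W.shape[1]
    H = np.ones((K, V.shape[1]))  # warm start; DR-NMF carries this across time steps
    for _ in range(n_iter):
        numer = W.T @ V
        denom = W.T @ (W @ H) + lam + eps
        H *= numer / denom        # stays nonnegative by construction
    return H

# usage: a random dictionary and a spectrogram it can represent exactly
rng = np.random.default_rng(0)
W = np.abs(rng.normal(size=(64, 10)))
H_true = np.abs(rng.normal(size=(10, 20)))
V = W @ H_true
H = snmf_infer(V, W, n_iter=200)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Pretraining corresponds to choosing W (and the update hyperparameters) by maximum likelihood; fine-tuning then treats each unrolled iteration's parameters as free network weights.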
  • "Large Scale Audio Event Classification using Weak Labels"
    Anurag Kumar (Carnegie Mellon University)
    • Sound event detection, or audio event detection (AED), has received a lot of attention in recent years. However, its scale has been limited by the lack of large-scale labeled datasets. We introduced weak-labeling approaches for audio event detection that can help scale AED in terms of both the number of sound events and the amount of data per event. In this work we describe some of our recent weak-label learning approaches based on deep convolutional neural networks (CNNs). We show that our proposed approach outperforms previous CNN-based work on large-scale AED, and that the proposed approaches are much more computationally efficient. We report results on AudioSet, currently the largest available dataset for audio event classification.
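A common device in weak-label learning for AED (sketched here in numpy; not necessarily the poster's exact formulation) is to pool segment-level CNN predictions up to the recording level, so that the loss can be computed against labels that only say whether an event occurs somewhere in the clip:

```python
import numpy as np

def weak_label_loss(segment_scores, labels, eps=1e-7):
    """Binary cross-entropy between pooled recording-level scores and
    weak (recording-level) labels.

    segment_scores: (n_segments, n_events) per-segment probabilities,
    e.g. from a CNN run on short chunks of the recording.
    labels: (n_events,) 0/1 weak labels for the whole recording.
    Max pooling turns "event somewhere in the clip" into one score.
    """
    rec = segment_scores.max(axis=0)
    rec = np.clip(rec, eps, 1 - eps)
    return -(labels * np.log(rec) + (1 - labels) * np.log(1 - rec)).mean()

# usage: three segments, two event classes, one weakly positive label
scores = np.array([[0.1, 0.9],
                   [0.2, 0.1],
                   [0.05, 0.2]])
labels = np.array([0.0, 1.0])
loss = weak_label_loss(scores, labels)
```

Because the pooling is differentiable, gradients flow back to whichever segments produced the pooled score, which is how segment-level detection emerges from recording-level supervision.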
  • "Multitask Learning for Fundamental Frequency Estimation in Music"
    Rachel Bittner, Brian McFee, Juan P. Bello (New York University)
    • Fundamental frequency estimation from polyphonic music includes the tasks of multiple-f0, melody, vocal, and bass line estimation. Historically these tasks have been approached separately, and little work has used learning-based approaches. We present a multi-task deep learning architecture that jointly predicts outputs for the multi-f0, melody, vocal, and bass line estimation tasks and is trained using a large, semi-automatically annotated dataset. We show that the multitask model outperforms its single-task counterparts, and that the addition of synthetically generated training data is beneficial.
  • "Enhanced Online IVA with Adaptive Learning for Speech Separation using Various Source Priors"
    Suleiman Erateb and Jonathon Chambers (Loughborough University, Newcastle University)
    • Independent vector analysis (IVA) is a frequency-domain blind source separation (FDBSS) technique that has proven efficient in separating independent speech signals from their convolutive mixtures. In particular, it addresses the problematic permutation ambiguity by using a multivariate source prior. The multivariate source prior models statistical interdependency across the frequency bins of each source, and the performance of the method depends upon the choice of source prior. The online form of IVA is suitable for practical real-time systems. Previous online algorithms use a learning rate that offers no robust way to control the learning as a function of the proximity to the target solution. In this work, we propose a new adaptive learning scheme to improve the convergence speed and steady-state separation performance. The experimental results confirm improved performance with real room impulse responses and real recorded speech signals, modelled by two different source priors.
  • "Motion-Informed Audio Source Separation"
    Sanjeel Parekh, Slim Essid, Alexey Ozerov, Ngoc Q. K. Duong, Patrick Perez, Gael Richard (Telecom ParisTech and Technicolor)
    • In this work we propose novel joint and sequential multimodal approaches for the task of single channel audio source separation in videos. This is done within the popular non-negative matrix factorization framework using information about the sounding object’s motion. Specifically, we present methods that utilize non-negative least squares formulation to couple motion and audio information. The proposed techniques generalize recent work carried out on NMF-based motion-informed source separation and easily extend to video data. Experiments with two distinct multimodal datasets of string instrument performance recordings illustrate their advantages over the existing methods.
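The coupling idea above can be illustrated with a small numpy/scipy sketch (all data and names are hypothetical, and this is only one of several possible couplings): fit each audio NMF component's temporal activation as a nonnegative combination of motion-feature trajectories via nonnegative least squares.

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical data: H (K x T) audio NMF activations, M (P x T) motion
# features (e.g. bow-velocity trajectories from video of the performer).
rng = np.random.default_rng(1)
T, K, P = 100, 4, 6
M = np.abs(rng.normal(size=(P, T)))
C_true = np.abs(rng.normal(size=(K, P)))
H = C_true @ M                    # toy case: activations driven by motion

# Couple audio to motion: for each NMF component k, solve
#   min_c ||M^T c - h_k||^2  subject to  c >= 0
# i.e. one nonnegative least-squares problem per component.
C = np.vstack([nnls(M.T, H[k])[0] for k in range(K)])
residual = np.linalg.norm(H - C @ M) / np.linalg.norm(H)
```

In a joint formulation the same coupling would appear as an extra term inside the NMF objective rather than as a post-hoc fit; the sequential version shown here is the simpler of the two.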
  • "Speech-LM: Leveraging Verbal as well as Nonverbal Information to Improve Language Modeling"
    Sayan Ghosh (University of Southern California)
    • Previous literature on spoken language interaction has shown that non-verbal or extra-linguistic content in speech plays an important role in conversation understanding and in disambiguating verbal content. However, the contribution of non-verbal information to language modeling has been understudied. In this paper we introduce Speech-LM, a language model which predicts the next word given not only the context words but also the non-verbal information contained in the context. Experiments show that test perplexity, as measured on a large conversational corpus, is lower than that of a baseline LSTM neural language model. Further, the embeddings learnt by the Speech-LM model are meaningful and correspond to extra-linguistic groupings.
  • "Trenton Makes Music: the sound of a city"
    Teresa Marrin Nakra and Kim Pearson (The College of New Jersey)
    • Trenton, New Jersey is a capital city with many challenges, but it is also the source of a remarkable musical playlist that has been heard around the world. The Trenton Makes Music project is a long-term partnership between faculty and students at TCNJ and our neighboring community to document the rich but largely understudied heritage of music making in our city. We have created a digital archive containing oral history interviews, podcasts, and extensive media artifacts to help us tell the stories of Trenton musicians and articulate the role of music as an important driver of cultural memory, identity, and economic development. As a result of this work, we have identified that Trenton's music is characterized by considerable cross-fertilization between musicians working across genres, communities, and neighborhoods. We have also documented the substantial investment in public music education that the city made over the past seventy-five years, producing a series of successful music professionals over that period. These observations have led us to pose a research question about whether there might be a 'Trenton sound,' in the way that certain cities are described as having a unique 'sound.' (For example: Nashville, Philadelphia, or Detroit within certain time periods and genres.) Given that Music Information Retrieval and Machine Learning tools are getting increasingly sophisticated at determining perceptually significant musical features, we hypothesize that it might now be possible to use such software tools to determine the exact mix of characteristics that defines a regional 'sound' or stylistic signature. We hope to discuss this idea with audio researchers who might point to specific algorithms or approaches (e.g., a kind of musical genomic analysis) to identify structural commonalities in the music produced by Trenton artists that can be linked to the physical, educational, or cultural environment in this city.
  • "Speaker Localization with Convolutional Neural Networks Trained using Synthesized Noise Signals"
    Soumitro Chakrabarty and Emanuël A. P. Habets (International Audio Laboratories Erlangen)
    • In microphone array processing, the location of a sound source in the acoustic environment is an important piece of information that is generally unavailable and must be estimated. A convolutional neural network (CNN) based supervised learning method for speaker localization is presented that can be trained using synthesized noise signals rather than actual speech. As input, the phase component of the STFT representation of the microphone signals is presented, such that the features required for localization can be learnt during training by exploiting the phase correlations of neighbouring microphones across the whole spectrum. The performance of the proposed method is evaluated using both simulated and measured data.
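A minimal numpy sketch of the input feature described above (the FFT size, hop, and array geometry here are arbitrary, not the poster's configuration): the STFT phase of each microphone channel, stacked into a frames-by-frequency-by-microphone array that a CNN could consume.

```python
import numpy as np

def phase_map(x, n_fft=256, hop=128):
    """Build a phase-map feature from multichannel audio.

    x: (n_mics, n_samples) time-domain signals.
    Returns: (n_frames, n_fft//2 + 1, n_mics) array of STFT phases.
    A plain loop-based numpy STFT is used purely for illustration.
    """
    n_mics, n_samples = x.shape
    win = np.hanning(n_fft)
    n_frames = 1 + (n_samples - n_fft) // hop
    out = np.empty((n_frames, n_fft // 2 + 1, n_mics))
    for m in range(n_mics):
        for t in range(n_frames):
            frame = x[m, t * hop : t * hop + n_fft] * win
            out[t, :, m] = np.angle(np.fft.rfft(frame))
    return out

# usage: a 4-mic array driven by white noise, matching the idea of
# training on synthesized noise rather than speech
rng = np.random.default_rng(2)
feat = phase_map(rng.normal(size=(4, 16000)))
```

Convolutions across the microphone axis then see exactly the neighbouring-microphone phase correlations that encode the direction of arrival.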
  • "Azimuthal source localization in binaural audio using neural nets with complex weights"
    Andy M. Sarroff and Michael A. Casey (Dartmouth College)
    • The human head's filter response provides many cues for localizing sound sources. Traditional models for computer-based binaural source localization do not take full advantage of inter- and intra-channel phase information. We investigate whether feedforward networks with complex weights are better than their real-valued counterparts in a direction-of-arrival classification task. The models are trained on musical sources that are convolved with 25 human heads at 24 azimuthal locations spaced evenly around the full plane. We show that the complex nets outperform real nets on a hold-out set of 13 heads. The complex model achieves 81.75% overall accuracy with full-band binaural complex STFT features, whereas the real net yields 80.92% accuracy. The complex model is determined to be significantly better than the real model (with 95% confidence using a Wilcoxon signed-rank test) at generalizing to unseen heads. We also show that complex nets outperform real nets on band-passed stimuli, noise stimuli, and when the input feature is a raw stereo waveform rather than its Fourier transform. The confusion matrices exhibit a "cone of confusion," to which humans are similarly prone. We conclude that complex nets should be considered when relative signal phase provides information useful to the task at hand.
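To make the "complex weights" idea concrete, here is a sketch of one complex-weighted feedforward layer in numpy (the magnitude nonlinearity and all shapes are illustrative assumptions; the poster's architecture may differ):

```python
import numpy as np

def complex_dense(x, W, b):
    """Forward pass of one complex-weighted dense layer.

    x: complex input vector (e.g. binaural STFT bins); W, b: complex
    weights and bias. Taking the magnitude of the pre-activation gives
    a real output that is sensitive to the *relative* phase of the
    inputs (through interference in W @ x) but, with b = 0, invariant
    to a global phase shift of x.
    """
    return np.abs(W @ x + b)

rng = np.random.default_rng(3)
x = rng.normal(size=8) + 1j * rng.normal(size=8)
W = rng.normal(size=(5, 8)) + 1j * rng.normal(size=(5, 8))
b = rng.normal(size=5) + 1j * rng.normal(size=5)
y = complex_dense(x, W, b)

# With a zero bias, rotating every input by the same phase leaves the
# output unchanged: |W (e^{i*theta} x)| = |e^{i*theta}| * |W x| = |W x|.
y0 = complex_dense(x, W, np.zeros(5))
y1 = complex_dense(np.exp(1j * 0.7) * x, W, np.zeros(5))
```

This phase-relative behaviour is exactly what a real-valued net acting on separate real/imaginary channels has to learn from scratch.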
  • "Robust automatic speech recognition using analysis-by-synthesis feature estimation"
    Min Ma and Michael I. Mandel (Graduate Center, City University of New York)
    • Spectral masking is a promising method for noise suppression in which regions of the spectrogram that are dominated by noise are attenuated while regions dominated by speech are preserved. It is not clear, however, how best to combine spectral masking with the non-linear processing necessary to compute automatic speech recognition features. We propose an analysis-by-synthesis approach to estimate MFCC features of clean speech using a spectral mask, and apply the method to robust automatic speech recognition of noisy speech.
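The non-linear pipeline the abstract alludes to can be sketched in a few lines of numpy/scipy (filterbank parameters and the mask here are illustrative, and this forward computation is the "synthesis" direction only, not the poster's full analysis-by-synthesis estimator): apply a mask to the power spectrogram, then compute MFCC-style features via mel energies, log, and a DCT.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular mel filterbank of shape (n_mels, n_fft//2 + 1)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fb[i, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[i, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb

def masked_mfcc(power_spec, mask, fb, n_ceps=13, eps=1e-10):
    """Mask the power spectrogram (mask in [0, 1], same shape), then
    compute MFCC-style features: mel energies -> log -> DCT-II."""
    mel = fb @ (mask * power_spec)
    return dct(np.log(mel + eps), axis=0, norm="ortho")[:n_ceps]

# usage: a hypothetical noisy power spectrogram with a random binary mask
rng = np.random.default_rng(5)
power_spec = rng.random((129, 50))
mask = (rng.random((129, 50)) > 0.5).astype(float)
fb = mel_filterbank(26, 256, 16000)
feats = masked_mfcc(power_spec, mask, fb)
```

The difficulty the abstract raises is visible here: the mask acts before the log and DCT, so masked (zeroed) regions distort the cepstral features non-linearly, which is what motivates estimating the clean-speech MFCCs by analysis-by-synthesis rather than by naive masking.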
  • "Real-time acoustic event detection system based on non-negative matrix factorization"
    Tatsuya Komatsu (NEC Corporation)
    • A real-time acoustic event detection system based on non-negative matrix factorization is presented. The system consists of two functions: “training”, to design classifiers from recorded data, and “detection”, to classify recorded sound into acoustic events. The system is implemented on a standard laptop PC and detects acoustic events in real time. On-site training/detection of acoustic events is performed in addition to detection using pre-trained classifiers.
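One way such an NMF-based detector can work (a numpy sketch under assumptions; event names, dictionary sizes, and the energy-sum decision rule are all illustrative, not necessarily this system's design): fit each incoming frame against the concatenated per-event dictionaries and pick the event whose atoms soak up the most activation energy.

```python
import numpy as np

def detect_frame(v, dicts, n_iter=100, eps=1e-8):
    """Classify one nonnegative feature frame v.

    dicts: {event_name: (F, K_event) nonnegative dictionary}, assumed
    learned offline in the "training" function. Activations h are fit
    with multiplicative NMF updates (dictionary held fixed), then the
    activation energy is summed per event.
    """
    W = np.hstack(list(dicts.values()))          # (F, K_total)
    h = np.ones(W.shape[1])
    for _ in range(n_iter):
        h *= (W.T @ v) / (W.T @ (W @ h) + eps)   # multiplicative update
    scores, k0 = {}, 0
    for name, Wd in dicts.items():
        k = Wd.shape[1]
        scores[name] = float(h[k0:k0 + k].sum())
        k0 += k
    return max(scores, key=scores.get), scores

# usage: toy dictionaries with disjoint spectral support so the outcome
# is unambiguous; real dictionaries would come from the training stage
rng = np.random.default_rng(4)
W_door = np.zeros((64, 3)); W_door[:32] = np.abs(rng.normal(size=(32, 3)))
W_glass = np.zeros((64, 3)); W_glass[32:] = np.abs(rng.normal(size=(32, 3)))
v = W_door @ np.array([1.0, 0.5, 2.0])
label, scores = detect_frame(v, {"door": W_door, "glass": W_glass})
```

Because only the activations are updated at detection time, each frame costs a handful of small matrix products, which is what makes real-time operation on a laptop plausible.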
  • "Amplitude and Phase Dereverberation of Harmonic Signals"
    Arthur Belhomme, Roland Badeau, Yves Grenier, Eric Humbert (Telecom ParisTech)
    • While most dereverberation methods focus on how to estimate the magnitude of an anechoic signal in the time-frequency domain, we propose a method which also takes the phase into account. By applying a harmonic model to the anechoic signal, we derive a formulation to compute the amplitude and phase of each harmonic. These parameters are then estimated by our method in the presence of reverberation. As we jointly estimate the amplitude and phase of the clean signal, we achieve a very strong dereverberation on synthetic harmonic signals, resulting in a significant improvement of standard dereverberation objective measures over the state-of-the-art.
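The harmonic model underlying this approach can be made concrete with a small numpy example on a clean synthetic tone (the reverberant case, which is the poster's actual contribution, is not handled here): write x(t) = Σ_k a_k cos(2π k f0 t + φ_k), linearize it via a_k cos φ_k and a_k sin φ_k, and recover the per-harmonic amplitudes and phases by least squares.

```python
import numpy as np

def harmonic_fit(x, f0, sr, n_harm):
    """Least-squares amplitude/phase estimates under the harmonic model.

    Uses a cos(theta + phi) = (a cos phi) cos(theta) - (a sin phi) sin(theta),
    so the model is linear in the pairs (a_k cos phi_k, a_k sin phi_k).
    """
    t = np.arange(len(x)) / sr
    cols = []
    for k in range(1, n_harm + 1):
        cols.append(np.cos(2 * np.pi * k * f0 * t))
        cols.append(-np.sin(2 * np.pi * k * f0 * t))
    A = np.stack(cols, axis=1)
    coef, *_ = np.linalg.lstsq(A, x, rcond=None)
    c, s = coef[0::2], coef[1::2]
    return np.hypot(c, s), np.arctan2(s, c)     # amplitudes, phases

# usage: a 3-harmonic tone with known parameters
sr, f0 = 8000, 200.0
t = np.arange(int(0.1 * sr)) / sr
amps_true = np.array([1.0, 0.5, 0.25])
phases_true = np.array([0.3, -1.0, 2.0])
x = sum(a * np.cos(2 * np.pi * (k + 1) * f0 * t + p)
        for k, (a, p) in enumerate(zip(amps_true, phases_true)))
amps, phases = harmonic_fit(x, f0, sr, 3)
```

On clean signals this recovers the parameters exactly; the poster's method addresses the harder problem of estimating the same amplitudes and phases when the observation is reverberant.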