Human beings are amazingly good at understanding speech even in adverse environments with reverberation, noise, and cross-talk. We are able to focus on a speaker and ignore most of the disturbances.
For automatic speech recognition systems, however, background noise and especially reverberation and cross-talk are very challenging. Without countermeasures, the perturbations of the audio signal usually lead traditional systems to nonsensical transcriptions and word error rates well above 50%. A system that gets every second word wrong is frustrating to use and no longer helpful.
So how do home assistants like the Google Home or the Amazon Echo, or your in-car voice control, manage to work in their designated environments? Read on to find out!
Multi-channel audio processing
Much like with optical signals, it is possible to focus on one direction with audio signals. One way to do so would be analogous to the optical case: using a directional microphone with a narrow main lobe. This is impractical for speech recognition, as it would require a mechanical system to steer the microphone towards the speaker. Thus, much like the human auditory system, we use another approach to steer the focus in a particular direction: we use multiple microphones and exploit the delays caused by the finite speed of sound to do beamforming.
If we observe an audio signal with multiple microphones, each direction of an impinging sound wave causes a different set of delays at the microphone array. If these delays are known, we can align the recorded channels such that signals from a specific direction add constructively. Depending on the geometry of the microphone array, this gives us a beam of a certain width in the direction of the speaker.
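This delay-and-sum idea can be sketched in a few lines of numpy. The example below is a toy illustration, not production code: it assumes a linear array with hypothetical spacing, a known arrival angle (so the delays are known exactly), and a pure sinusoidal source. Fractional delays are applied as phase shifts in the frequency domain.

```python
import numpy as np

def delay_and_sum(signals, delays, fs):
    """Advance each channel by its known delay (in seconds) via a
    frequency-domain phase shift, then average the aligned channels."""
    n = signals.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectra = np.fft.rfft(signals, axis=1)
    # Multiplying by e^{+j 2 pi f tau} advances channel c by tau_c,
    # so a wave from the steered direction adds constructively.
    aligned = spectra * np.exp(2j * np.pi * freqs * delays[:, None])
    return np.fft.irfft(aligned.mean(axis=0), n=n)

# Toy setup: 4-mic linear array, 5 cm spacing, source at 30 degrees
fs, c, d = 16000, 343.0, 0.05
mic_positions = np.arange(4) * d
delays = mic_positions * np.sin(np.deg2rad(30)) / c  # arrival delay per mic

t = np.arange(1024) / fs
source = np.sin(2 * np.pi * 500 * t)
# Each microphone observes the source delayed by its own tau
signals = np.stack([np.sin(2 * np.pi * 500 * (t - tau)) for tau in delays])

enhanced = delay_and_sum(signals, delays, fs)
```

With the delays known exactly, the aligned channels coincide and the output reproduces the source; uncorrelated noise on the individual channels would instead average down.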
The method just described works well in an environment without reverberation and if we get a good estimate of the delays from the target direction. In real-life applications, however, these two assumptions do not hold: there is usually a non-negligible amount of reverberation, and current methods for estimating the delay from the target direction (TDOA estimation) do not work reliably in adverse environments.
One way to circumvent this problem is statistical beamforming. Here, we estimate spatial correlations between the microphone channels (usually separately for the target and the distortion) and use them to design a complex-valued filter, our beamforming vector. To estimate these correlations, a masking-based approach is taken: the mask tells us where in the time-frequency plane the target or the distortion is predominant.
In my research I focus on neural networks to generate these masks. This approach has proven to be very effective in different environments (see publications).
Robust speech recognition
Although recent developments unify the two, a speech recognition system can be broadly split into two parts: the front-end and the back-end. The front-end processes the audio data, possibly enhancing it, and provides features for the back-end. One possible way to enhance the signal is the multi-channel processing described above.
The classical back-end itself consists of two models: an acoustic model and a language model. The former uses the features provided by the front-end and classifies each frame as belonging to a specific phoneme. The language model captures the statistical properties of higher-level units such as words: in a given word context, a few words are likely to follow while the vast majority of words are very unlikely. Finally, the decoder combines the evidence of both models and outputs the most likely sentence for a given audio sequence.
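The combination of the two models can be illustrated with a toy n-best rescoring step: the decoder adds the acoustic and (weighted) language model log-probabilities per hypothesis and keeps the best one. All scores and the weight below are hypothetical values chosen for illustration.

```python
def rescore(hypotheses, lm_weight=0.8):
    """Pick the hypothesis maximizing the combined log-probability
    am_logp + lm_weight * lm_logp (a toy stand-in for the decoder)."""
    return max(hypotheses, key=lambda h: h['am_logp'] + lm_weight * h['lm_logp'])

# Hypothetical 2-best list: the acoustic model slightly prefers a
# homophone string, but the language model deems it very unlikely.
nbest = [
    {'text': 'recognize speech', 'am_logp': -12.1, 'lm_logp': -4.0},
    {'text': 'wreck a nice beach', 'am_logp': -11.8, 'lm_logp': -9.5},
]
best = rescore(nbest)
```

In a real decoder the search runs over a lattice rather than a short list, but the principle of weighting the two knowledge sources is the same.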
My research in this area is primarily focused on the acoustic modelling part. Using recent architectural advances in neural networks, for example, enables one to build a very competitive model for robust speech recognition, as we were able to demonstrate in the recent CHiME challenge.
My current focus lies on so-called "End-to-End" speech recognition (robust, of course!). Here, one combines the front-end with the back-end and trains both jointly. This removes some heuristics from the training process and, in the special case of mask estimation, eliminates the need for parallel clean and noisy data. The models are trained with loss functions that do not need an explicit alignment (i.e. the information which frame corresponds to which part of a phoneme), such as Connectionist Temporal Classification or sequence-to-sequence models.
The "Echo project"
Research without any application is only half of the fun. The "Echo project" aims to build a smart home assistant like the Amazon Echo or the Google Home. It makes use of all the research topics described above and is especially designed for student project groups. So far we have built the following components: wake-up keyword detection, a German acoustic model, and an online decoder. Custom-built hardware for multi-channel recording is also available.