In several projects and collaborations with industrial partners, we work on the transcription of conversational speech, whether recorded in professional meetings or at casual gatherings among friends. Current solutions still fall far short of human recognition performance. There are two main reasons for this. First, the signal quality is poor due to environmental influences such as room reverberation. Second, especially in informal get-togethers, people often interrupt each other or parallel conversations develop among the participants, so that the signals of several speakers overlap.
We develop methods that not only transcribe what is said but also annotate who spoke when. The number of speakers and the amount of speaker overlap are unknown in advance and vary over time. This annotation task, known as "diarization", essentially involves two subtasks: segmenting the meeting into possibly overlapping time periods, in each of which a single speaker is active, and correctly identifying the speaker in each segment. While existing systems solve one or the other of the two subtasks satisfactorily, we are working on a method that provides both good segmentation and correct speaker recognition.
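To make the notion of "who spoke when" concrete, the following minimal sketch represents a diarization result as a list of (speaker, start, end) segments and measures how much of the recording contains overlapping speech. The tuple layout and the function name `overlapped_time` are illustrative assumptions, not the output format of any particular system.

```python
def overlapped_time(segments):
    """Return the total duration in which two or more segments are active.

    `segments` is an iterable of (speaker, start, end) tuples in seconds
    (a hypothetical layout chosen for this sketch).
    """
    events = []
    for _speaker, start, end in segments:
        events.append((start, 1))   # a segment begins
        events.append((end, -1))    # a segment ends
    # At equal timestamps, ends (-1) sort before starts (+1), so segments
    # that merely touch are not counted as overlapping.
    events.sort()

    active = 0        # number of segments active just before the current event
    prev_time = None
    total = 0.0
    for time, delta in events:
        if active >= 2:
            total += time - prev_time
        active += delta
        prev_time = time
    return total


# Three speakers, two short overlap regions (4.0-5.0 s and 8.5-9.0 s).
example = [("A", 0.0, 5.0), ("B", 4.0, 7.0), ("A", 7.0, 9.0), ("C", 8.5, 10.0)]
print(overlapped_time(example))
```

Note that this sketch counts any two simultaneously active segments as overlap, including two segments carrying the same speaker label.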
Microphone arrays are used to mitigate the degradation of signal quality caused by room reverberation and other acoustic disturbances. The spatial information they capture allows the system to focus on the direction of the target speaker and suppress interference from other directions in the room. While this was achieved with statistical approaches in the past, neural-network-based solutions are now increasingly used. We integrate both approaches to combine their strengths: the statistical models allow precise acoustic beamforming towards a specific spatial direction, while the neural networks estimate the spatial directions of the target speaker and of the interference.
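One widely used way to combine the two approaches is mask-based beamforming, sketched below for a single frequency bin. This is an illustration under stated assumptions and not necessarily the method used in our systems: the time-frequency masks are assumed to come from a neural network (here they are simply function arguments), and the beamformer is an MVDR beamformer in the reference-channel formulation. The function names `masked_covariance`, `mvdr_weights`, and `beamform` are hypothetical.

```python
import numpy as np


def masked_covariance(Y, mask):
    """Mask-weighted spatial covariance matrix for one frequency bin.

    Y:    complex STFT coefficients, shape (channels, frames)
    mask: values in [0, 1], shape (frames,); assumed to be produced by a
          neural network that marks frames dominated by speech (or noise)
    """
    weighted = Y * mask  # the mask broadcasts over the channel axis
    return weighted @ Y.conj().T / max(mask.sum(), 1e-10)


def mvdr_weights(phi_speech, phi_noise, ref_channel=0):
    """MVDR weights: w = (Phi_n^-1 Phi_s) u_ref / trace(Phi_n^-1 Phi_s)."""
    numerator = np.linalg.solve(phi_noise, phi_speech)  # Phi_n^-1 Phi_s
    return numerator[:, ref_channel] / np.trace(numerator)


def beamform(Y, speech_mask, noise_mask, ref_channel=0):
    """Estimate statistics from the masks, then apply w^H y to every frame."""
    phi_speech = masked_covariance(Y, speech_mask)
    phi_noise = masked_covariance(Y, noise_mask)
    w = mvdr_weights(phi_speech, phi_noise, ref_channel)
    return w.conj() @ Y  # shape (frames,)
```

In this division of labour, the network only has to decide where speech and interference dominate, while the statistical beamformer turns those decisions into a spatial filter. In practice the noise covariance matrix is often regularised, for example by diagonal loading, before it is inverted; the sketch omits this for brevity.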
Processing conversational speech, in which multiple speakers are active and their speech may overlap, requires pre-processing steps that detect speech activity and remove speaker overlap so that robust transcription becomes possible. For this purpose, we have developed a method that maps the multi-speaker signal to two output channels in such a way that no speaker overlap remains on either channel, regardless of the number of speakers present in the meeting.
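The target layout of such a two-channel output can be illustrated with a small scheduling sketch. It does not perform any signal separation; it only shows how utterances from an arbitrary number of speakers can be placed on two channels without overlap, under the assumption that no more than two utterances are active at the same time. The `Utterance` type and the function `assign_to_channels` are hypothetical names introduced for this illustration.

```python
from typing import NamedTuple


class Utterance(NamedTuple):
    speaker: str
    start: float  # seconds
    end: float    # seconds


def assign_to_channels(utterances, num_channels=2):
    """Greedily place each utterance on the first channel that is free.

    Returns a list of (channel_index, utterance) pairs in order of start time.
    Raises ValueError if more than `num_channels` utterances overlap at once.
    """
    channel_free_at = [float("-inf")] * num_channels
    placement = []
    for utt in sorted(utterances, key=lambda u: u.start):
        for ch in range(num_channels):
            if channel_free_at[ch] <= utt.start:
                channel_free_at[ch] = utt.end
                placement.append((ch, utt))
                break
        else:
            raise ValueError(
                f"more than {num_channels} utterances overlap at {utt.start} s"
            )
    return placement


# Three speakers, but never more than two talking at once.
meeting = [
    Utterance("A", 0.0, 4.0),
    Utterance("B", 3.0, 6.0),
    Utterance("C", 5.0, 8.0),
    Utterance("A", 7.5, 9.0),
]
for channel, utt in assign_to_channels(meeting):
    print(channel, utt.speaker, utt.start, utt.end)
```

A channel in this layout is not tied to a speaker: the same speaker may appear on either channel at different times, which is why two channels can suffice for any number of participants.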