
2019

Joint Optimization of Neural Network-based WPE Dereverberation and Acoustic Model for Robust Online ASR

J. Heymann, L. Drude, R. Haeb-Umbach, K. Kinoshita, T. Nakatani, in: ICASSP 2019, Brighton, UK, 2019

Signal dereverberation using the Weighted Prediction Error (WPE) method has been proven to be an effective means to raise the accuracy of far-field speech recognition. First proposed as an iterative algorithm, follow-up works have reformulated it as a recursive least squares algorithm and therefore enabled its use in online applications. For this algorithm, the estimation of the power spectral density (PSD) of the anechoic signal plays an important role and strongly influences its performance. Recently, we showed that using a neural network PSD estimator leads to improved performance for online automatic speech recognition. This, however, comes at a price. To train the network, we require parallel data, i.e., utterances simultaneously available in clean and reverberated form. Here we propose to overcome this limitation by training the network jointly with the acoustic model of the speech recognizer. To be specific, the gradients computed from the cross-entropy loss between the target senone sequence and the acoustic model network output is backpropagated through the complex-valued dereverberation filter estimation to the neural network for PSD estimation. Evaluation on two databases demonstrates improved performance for on-line processing scenarios while imposing fewer requirements on the available training data and thus widening the range of applications.


Directional Statistics and Filtering Using libDirectional

G. Kurz, I. Gilitschenski, F. Pfaff, L. Drude, U.D. Hanebeck, R. Haeb-Umbach, R.Y. Siegwart, in: Journal of Statistical Software 89(4), 2019

In this paper, we present libDirectional, a MATLAB library for directional statistics and directional estimation. It supports a variety of commonly used distributions on the unit circle, such as the von Mises, wrapped normal, and wrapped Cauchy distributions. Furthermore, various distributions on higher-dimensional manifolds such as the unit hypersphere and the hypertorus are available. Based on these distributions, several recursive filtering algorithms in libDirectional allow estimation on these manifolds. The functionality is implemented in a clear, well-documented, and object-oriented structure that is both easy to use and easy to extend.
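libDirectional itself is a MATLAB library, but the core objects it provides are easy to illustrate. The following Python sketch (using `scipy.stats.vonmises`, not the library's own API) evaluates a von Mises density on the circle and computes a circular mean, the directional analogue of the sample average:

```python
import numpy as np
from scipy.stats import vonmises

# Von Mises density on the unit circle with mode mu and concentration kappa.
mu, kappa = 0.5, 4.0
theta = np.linspace(-np.pi, np.pi, 7)
pdf = vonmises.pdf(theta, kappa, loc=mu)

# Circular mean of samples: the angle of the mean resultant vector.
# A naive arithmetic mean of angles would fail at the +/- pi wrap-around.
samples = vonmises.rvs(kappa, loc=mu, size=5000,
                       random_state=np.random.default_rng(0))
circ_mean = np.angle(np.mean(np.exp(1j * samples)))
```

The wrap-around handling shown here is exactly the kind of detail the library's recursive filters take care of on higher-dimensional manifolds as well.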


Unsupervised training of neural mask-based beamforming

L. Drude, J. Heymann, R. Haeb-Umbach, in: INTERSPEECH 2019, Graz, Austria, 2019

We present an unsupervised training approach for a neural network-based mask estimator in an acoustic beamforming application. The network is trained to maximize a likelihood criterion derived from a spatial mixture model of the observations. It is trained from scratch without requiring any parallel data consisting of degraded input and clean training targets. Thus, training can be carried out on real recordings of noisy speech rather than simulated ones. In contrast to previous work on unsupervised training of neural mask estimators, our approach avoids the need for a possibly pre-trained teacher model entirely. We demonstrate the effectiveness of our approach by speech recognition experiments on two different datasets: one mainly deteriorated by noise (CHiME 4) and one by reverberation (REVERB). The results show that the performance of the proposed system is on par with a supervised system using oracle target masks for training and with a system trained using a model-based teacher.


All-neural Online Source Separation, Counting, and Diarization for Meeting Analysis

T. von Neumann, K. Kinoshita, M. Delcroix, S. Araki, T. Nakatani, R. Haeb-Umbach, in: ICASSP 2019, Brighton, UK, 2019

Automatic meeting analysis comprises the tasks of speaker counting, speaker diarization, and the separation of overlapped speech, followed by automatic speech recognition. This all has to be carried out on arbitrarily long sessions and, ideally, in an online or block-online manner. While significant progress has been made on individual tasks, this paper presents for the first time an all-neural approach to simultaneous speaker counting, diarization and source separation. The NN-based estimator operates in a block-online fashion and tracks speakers even if they remain silent for a number of time blocks, thus learning a stable output order for the separated sources. The neural network is recurrent over time as well as over the number of sources. The simulation experiments show that state-of-the-art separation performance is achieved, while at the same time delivering good diarization and source counting results. It even generalizes well to an unseen large number of blocks.


Unsupervised Training of a Deep Clustering Model for Multichannel Blind Source Separation

L. Drude, D. Hasenklever, R. Haeb-Umbach, in: ICASSP 2019, Brighton, UK, 2019

We propose a training scheme to train neural network-based source separation algorithms from scratch when parallel clean data is unavailable. In particular, we demonstrate that an unsupervised spatial clustering algorithm is sufficient to guide the training of a deep clustering system. We argue that previous work on deep clustering requires strong supervision and elaborate on why this is a limitation. We demonstrate that (a) the single-channel deep clustering system trained according to the proposed scheme alone is able to achieve a similar performance as the multi-channel teacher in terms of word error rates and (b) initializing the spatial clustering approach with the deep clustering result yields a relative word error rate reduction of 26% over the unsupervised teacher.
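At inference time, deep clustering systems like the one described above turn per-time-frequency-bin embeddings into source masks by clustering. A minimal, generic sketch of that step (hypothetical shapes, SciPy k-means standing in for any particular clustering model):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def embeddings_to_masks(E, num_sources=2, seed=0):
    """Cluster per-time-frequency-bin embeddings E of shape (N, K) with
    k-means and turn the hard assignments into binary source masks of
    shape (num_sources, N); N would be F*T for an STFT of shape (F, T)."""
    _, labels = kmeans2(E, num_sources, minit='++', seed=seed)
    return np.stack([(labels == s).astype(float)
                     for s in range(num_sources)])
```

The resulting masks can then initialize or guide a spatial clustering model, as in the paper's second experiment.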


2018

Insights into the Interplay of Sampling Rate Offsets and MVDR Beamforming

J. Schmalenstroeer, R. Haeb-Umbach, in: ITG 2018, Oldenburg, Germany, 2018

It has been experimentally verified that sampling rate offsets (SROs) between the input channels of an acoustic beamformer have a detrimental effect on the achievable SNR gains. In this paper we derive an analytic model to study the impact of SRO on the estimation of the spatial noise covariance matrix used in MVDR beamforming. It is shown that a perfect compensation of the SRO is impossible if the noise covariance matrix is estimated by time averaging, even if the SRO is perfectly known. The SRO should therefore be compensated for prior to beamformer coefficient estimation. We present a novel scheme where SRO compensation and beamforming closely interact, saving some computational effort compared to separate SRO adjustment followed by acoustic beamforming.


Machine learning techniques for semantic analysis of dysarthric speech: An experimental study

V. Despotovic, O. Walter, R. Haeb-Umbach, in: Speech Communication 99 (2018) 242-251, Elsevier B.V.

We present an experimental comparison of seven state-of-the-art machine learning algorithms for the task of semantic analysis of spoken input, with a special emphasis on applications for dysarthric speech. Dysarthria is a motor speech disorder, which is characterized by poor articulation of phonemes. In order to cater for these noncanonical phoneme realizations, we employed an unsupervised learning approach to estimate the acoustic models for speech recognition, which does not require a literal transcription of the training data. Even for the subsequent task of semantic analysis, only weak supervision is employed, whereby the training utterance is accompanied by a semantic label only, rather than a literal transcription. Results on two databases, one of them containing dysarthric speech, are presented showing that Markov logic networks and conditional random fields substantially outperform other machine learning approaches. Markov logic networks have proved to be especially robust to recognition errors, which are caused by imprecise articulation in dysarthric speech.


Front-End Processing for the CHiME-5 Dinner Party Scenario

C. Boeddeker, J. Heitkaemper, J. Schmalenstroeer, L. Drude, J. Heymann, R. Haeb-Umbach, in: INTERSPEECH 2018, Hyderabad, India, 2018

This contribution presents a speech enhancement system for the CHiME-5 Dinner Party Scenario. The front-end employs multi-channel linear time-variant filtering and achieves its gains without the use of a neural network. We present an adaptation of blind source separation techniques to the CHiME-5 database which we call Guided Source Separation (GSS). Using the baseline acoustic and language model, the combination of Weighted Prediction Error based dereverberation, guided source separation, and beamforming reduces the WER by 10.54% (relative) for the single array track and by 21.12% (relative) on the multiple array track.


Integrating neural network based beamforming and weighted prediction error dereverberation

L. Drude, C. Boeddeker, J. Heymann, K. Kinoshita, M. Delcroix, T. Nakatani, R. Haeb-Umbach, in: INTERSPEECH 2018, Hyderabad, India, 2018

The weighted prediction error (WPE) algorithm has proven to be a very successful dereverberation method for the REVERB challenge. Likewise, neural network based mask estimation for beamforming demonstrated very good noise suppression in the CHiME 3 and CHiME 4 challenges. Recently, it has been shown that this estimator can also be trained to perform dereverberation and denoising jointly. However, up to now a comparison of a neural beamformer and WPE is still missing, so is an investigation into a combination of the two. Therefore, we here provide an extensive evaluation of both and consequently propose variants to integrate deep neural network based beamforming with WPE. For these integrated variants we identify a consistent word error rate (WER) reduction on two distinct databases. In particular, our study shows that deep learning based beamforming benefits from a model-based dereverberation technique (i.e. WPE) and vice versa. Our key findings are: (a) Neural beamforming yields the lower WERs in comparison to WPE the more channels and noise are present. (b) Integration of WPE and a neural beamformer consistently outperforms all stand-alone systems.


Frame-Online DNN-WPE Dereverberation

J. Heymann, L. Drude, R. Haeb-Umbach, K. Kinoshita, T. Nakatani, in: IWAENC 2018, Tokyo, Japan, 2018

Signal dereverberation using the weighted prediction error (WPE) method has been proven to be an effective means to raise the accuracy of far-field speech recognition. But in its original formulation, WPE requires multiple iterations over a sufficiently long utterance, rendering it unsuitable for online low-latency applications. Recently, two methods have been proposed to overcome this limitation. One utilizes a neural network to estimate the power spectral density (PSD) of the target signal and works in a block-online fashion. The other method relies on a rather simple PSD estimation which smooths the observed PSD and utilizes a recursive formulation which enables it to work on a frame-by-frame basis. In this paper, we integrate a deep neural network (DNN) based estimator into the recursive frame-online formulation. We evaluate the performance of the recursive system with different PSD estimators in comparison to the block-online and offline variant on two distinct corpora: the REVERB challenge data, where the signal is mainly deteriorated by reverberation, and a database which combines WSJ and VoiceHome to also consider (directed) noise sources. The results show that although smoothing works surprisingly well, the more sophisticated DNN based estimator shows promising improvements and shortens the performance gap between online and offline processing.


NARA-WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing

L. Drude, J. Heymann, C. Boeddeker, R. Haeb-Umbach, in: ITG 2018, Oldenburg, Germany, 2018

NARA-WPE is a Python software package providing implementations of the weighted prediction error (WPE) dereverberation algorithm. WPE has been shown to be a highly effective tool for speech dereverberation, thus improving the perceptual quality of the signal and improving the recognition performance of downstream automatic speech recognition (ASR). It is suitable both for single-channel and multi-channel applications. The package consists of (1) a Numpy implementation which can easily be integrated into a custom Python toolchain, and (2) a TensorFlow implementation which allows integration into larger computational graphs and enables backpropagation through WPE to train more advanced front-ends. The package comprises an iterative offline (batch) version, a block-online version, and a frame-online version which can be used in moderately low latency applications, e.g. digital speech assistants.
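For orientation, the iterative offline variant can be sketched in plain NumPy for a single STFT frequency bin. This is an illustrative re-implementation of the textbook WPE iteration, not the package's actual API; see the package itself for the tested versions:

```python
import numpy as np

def wpe_single_bin(Y, taps=10, delay=3, iterations=3):
    """Iterative (offline) WPE for a single STFT frequency bin.

    Y: complex array of shape (D, T) with D channels and T frames.
    Returns the dereverberated signal X of the same shape.
    """
    D, T = Y.shape
    X = Y.copy()
    for _ in range(iterations):
        # PSD estimate from the current enhanced signal, averaged over
        # channels; a neural network estimator could be plugged in here.
        power = np.mean(np.abs(X) ** 2, axis=0)
        power = np.maximum(power, 1e-6 * power.max())       # floor, shape (T,)

        # Stack delayed observations: row block k holds Y shifted by delay+k.
        Y_tilde = np.zeros((D * taps, T), dtype=Y.dtype)
        for k in range(taps):
            shift = delay + k
            Y_tilde[k * D:(k + 1) * D, shift:] = Y[:, :T - shift]

        # Weighted correlation statistics and the prediction filter G.
        R = (Y_tilde / power) @ Y_tilde.conj().T            # (D*taps, D*taps)
        P = (Y_tilde / power) @ Y.conj().T                  # (D*taps, D)
        G = np.linalg.solve(R + 1e-10 * np.eye(D * taps), P)

        # Subtract the predicted late reverberation tail.
        X = Y - G.conj().T @ Y_tilde
    return X
```

The prediction delay keeps the direct path and early reflections untouched; only the late tail, predictable from frames at least `delay` steps in the past, is removed.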


Evaluation of Modulation-MFCC Features and DNN Classification for Acoustic Event Detection

J. Ebbers, A. Nelus, R. Martin, R. Haeb-Umbach, in: DAGA 2018, München, 2018

Acoustic event detection, i.e., the task of assigning a human interpretable label to a segment of audio, has only recently attracted increased interest in the research community. Driven by the DCASE challenges and the availability of large-scale audio datasets, the state-of-the-art has progressed rapidly with deep-learning-based classifiers dominating the field. Because several potential use cases favor a realization on distributed sensor nodes, e.g. ambient assisted living applications, habitat monitoring or surveillance, we are concerned with two issues here: firstly, the classification performance of such systems, and secondly, the computing resources required to achieve a certain performance considering node-level feature extraction. In this contribution we look at the balance between the two criteria by employing traditional techniques and different deep learning architectures, including convolutional and recurrent models, in the context of real-life everyday audio recordings in realistic, yet challenging, multi-source conditions.


Discrimination of Stationary from Moving Targets with Recurrent Neural Networks in Automotive Radar

C. Grimm, T. Breddermann, R. Farhoud, T. Fei, E. Warsitz, R. Haeb-Umbach, in: International Conference on Microwaves for Intelligent Mobility (ICMIM) 2018, 2018

In this paper, we present a neural network based classification algorithm for the discrimination of moving from stationary targets in the sight of an automotive radar sensor. Compared to existing algorithms, the proposed algorithm can take multiple local radar targets into account instead of performing classification on each target individually. This results in superior discrimination accuracy, which is especially beneficial for non-rigid objects, like pedestrians, which in general have a wide velocity spread when multiple targets are detected.


Benchmarking Neural Network Architectures for Acoustic Sensor Networks

J. Ebbers, J. Heitkaemper, J. Schmalenstroeer, R. Haeb-Umbach, in: ITG 2018, Oldenburg, Germany, 2018

Due to their distributed nature, wireless acoustic sensor networks offer great potential for improved signal acquisition, processing and classification for applications such as monitoring and surveillance, home automation, or hands-free telecommunication. To reduce the communication demand with a central server and to raise the privacy level it is desirable to perform processing at node level. The limited processing and memory capabilities on a sensor node, however, stand in contrast to the compute and memory intensive deep learning algorithms used in modern speech and audio processing. In this work, we perform benchmarking of commonly used convolutional and recurrent neural network architectures on a Raspberry Pi based acoustic sensor node. We show that it is possible to run medium-sized neural network topologies used for speech enhancement and speech recognition in real time. For acoustic event recognition, where predictions in a lower temporal resolution are sufficient, it is even possible to run current state-of-the-art deep convolutional models with a real-time factor of 0.11.


Smoothing along Frequency in Online Neural Network Supported Acoustic Beamforming

J. Heitkaemper, J. Heymann, R. Haeb-Umbach, in: ITG 2018, Oldenburg, Germany, 2018

We present a block-online multi-channel front end for automatic speech recognition in noisy and reverberated environments. It is an online version of our earlier proposed neural network supported acoustic beamformer, whose coefficients are calculated from noise and speech spatial covariance matrices which are estimated utilizing a neural mask estimator. However, the sparsity of speech in the STFT domain causes problems for the initial estimation of the beamformer coefficients in some frequency bins due to a lack of speech observations. We propose two methods to mitigate this issue. The first is to lower the frequency resolution of the STFT, which comes with the additional advantage of a reduced time window, thus lowering the latency introduced by block processing. The second approach is to smooth beamforming coefficients along the frequency axis, thus exploiting their high inter-frequency correlation. With both approaches the gap between offline and block-online beamformer performance, as measured by the word error rate achieved by a downstream speech recognizer, is significantly reduced. Experiments are carried out on two corpora, representing noisy (CHiME-4) and noisy reverberant (voiceHome) environments.
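The second approach, smoothing the beamforming coefficients along frequency, can be sketched as a simple moving average. This is illustrative only; the window width and any edge handling or normalization are free design choices not taken from the paper:

```python
import numpy as np

def smooth_along_frequency(w, width=5):
    """Moving-average smoothing of beamformer coefficients w of shape
    (F, D) along the frequency axis, exploiting their high
    inter-frequency correlation."""
    kernel = np.ones(width) / width
    w_smooth = np.empty_like(w)
    for d in range(w.shape[1]):
        # np.convolve handles complex coefficients; 'same' keeps length F.
        w_smooth[:, d] = np.convolve(w[:, d], kernel, mode='same')
    return w_smooth
```

In frequency bins with few speech observations, the averaged coefficients borrow information from well-estimated neighboring bins.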


Full Bayesian Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery

T. Glarner, P. Hanebrink, J. Ebbers, R. Haeb-Umbach, in: INTERSPEECH 2018, Hyderabad, India, 2018

The invention of the Variational Autoencoder enables the application of Neural Networks to a wide range of tasks in unsupervised learning, including the field of Acoustic Unit Discovery (AUD). The recently proposed Hidden Markov Model Variational Autoencoder (HMMVAE) allows a joint training of a neural network based feature extractor and a structured prior for the latent space given by a Hidden Markov Model. It has been shown that the HMMVAE significantly outperforms pure GMM-HMM based systems on the AUD task. However, the HMMVAE cannot autonomously infer the number of acoustic units and thus relies on the GMM-HMM system for initialization. This paper introduces the Bayesian Hidden Markov Model Variational Autoencoder (BHMMVAE) which solves these issues by embedding the HMMVAE in a Bayesian framework with a Dirichlet Process Prior for the distribution of the acoustic units, and diagonal or full-covariance Gaussians as emission distributions. Experiments on TIMIT and Xitsonga show that the BHMMVAE is able to autonomously infer a reasonable number of acoustic units, can be initialized without supervision by a GMM-HMM system, achieves computationally efficient stochastic variational inference by using natural gradient descent, and, additionally, improves the AUD performance over the HMMVAE.


Dual Frequency- and Block-Permutation Alignment for Deep Learning Based Block-Online Blind Source Separation

L. Drude, T. Higuchi, K. Kinoshita, T. Nakatani, R. Haeb-Umbach, in: ICASSP 2018, Calgary, Canada, 2018

Deep attractor networks (DANs) are a recently introduced method to blindly separate sources from spectral features of a monaural recording using bidirectional long short-term memory networks (BLSTMs). Due to the nature of BLSTMs, this is inherently not online-ready and resorting to operating on blocks yields a block permutation problem in that the index of each speaker may change between blocks. We here propose the joint modeling of spatial and spectral features to solve the block permutation problem and generalize DANs to multi-channel meeting recordings: The DAN acts as a spectral feature extractor for a subsequent model-based clustering approach. We first analyze different joint models in batch-processing scenarios and finally propose a block-online blind source separation algorithm. The efficacy of the proposed models is demonstrated on reverberant mixtures corrupted by real recordings of multi-channel background noise. We demonstrate that both the proposed batch-processing and the proposed block-online system outperform (a) a spatial-only model with a state-of-the-art frequency permutation solver and (b) a spectral-only model with an oracle block permutation solver in terms of signal to distortion ratio (SDR) gains.


Efficient Sampling Rate Offset Compensation - An Overlap-Save Based Approach

J. Schmalenstroeer, R. Haeb-Umbach, in: 26th European Signal Processing Conference (EUSIPCO 2018), 2018

Distributed sensor data acquisition usually encompasses data sampling by the individual devices, where each of them has its own oscillator driving the local sampling process, resulting in slightly different sampling rates at the individual sensor nodes. Nevertheless, for certain downstream signal processing tasks it is important to compensate even for small sampling rate offsets. Aligning the sampling rates of oscillators which differ only by a few parts-per-million, is, however, challenging and quite different from traditional multirate signal processing tasks. In this paper we propose to transfer a precise but computationally demanding time domain approach, inspired by the Nyquist-Shannon sampling theorem, to an efficient frequency domain implementation. To this end a buffer control is employed which compensates for sampling offsets which are multiples of the sampling period, while a digital filter, realized by the wellknown Overlap-Save method, handles the fractional part of the sampling phase offset. With experiments on artificially misaligned data we investigate the parametrization, the efficiency, and the induced distortions of the proposed resampling method. It is shown that a favorable compromise between residual distortion and computational complexity is achieved, compared to other sampling rate offset compensation techniques.
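The fractional part of the compensation can be illustrated with a frequency-domain linear phase ramp. The paper realizes this block-wise with the Overlap-Save method and pairs it with a buffer control for integer-sample offsets; the whole-signal sketch below only shows the underlying principle:

```python
import numpy as np

def fractional_delay_fft(x, tau):
    """Delay a signal x by a fractional number of samples tau using a
    linear phase ramp in the frequency domain (circular shift)."""
    n = len(x)
    freqs = np.fft.fftfreq(n)                 # normalized frequency, cycles/sample
    X = np.fft.fft(x)
    X *= np.exp(-2j * np.pi * freqs * tau)    # linear phase <=> time shift
    return np.fft.ifft(X).real
```

For a signal that is periodic within the transform window, this circular shift coincides exactly with the ideal band-limited fractional delay of the sampling theorem.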


The RWTH/UPB System Combination for the CHiME 2018 Workshop

M. Kitza, W. Michel, C. Boeddeker, J. Heitkaemper, T. Menne, R. Schlüter, H. Ney, J. Schmalenstroeer, L. Drude, J. Heymann, R. Haeb-Umbach, in: INTERSPEECH 2018, Hyderabad, India, 2018

This paper describes the systems for the single-array track and the multiple-array track of the 5th CHiME Challenge. The final system is a combination of multiple systems, using Confusion Network Combination (CNC). The different systems presented here are utilizing different front-ends and training sets for a Bidirectional Long Short-Term Memory (BLSTM) Acoustic Model (AM). The front-end was replaced by enhancements provided by Paderborn University [1]. The back-end has been implemented using RASR [2] and RETURNN [3]. Additionally, a system combination including the hypothesis word graphs from the system of the submission [1] has been performed, which results in the final best system.


Exploring Practical Aspects of Neural Mask-Based Beamforming for Far-Field Speech Recognition

C. Boeddeker, H. Erdogan, T. Yoshioka, R. Haeb-Umbach, in: ICASSP 2018, Calgary, Canada, 2018

This work examines acoustic beamformers employing neural networks (NNs) for mask prediction as front-end for automatic speech recognition (ASR) systems for practical scenarios like voice-enabled home devices. To test the versatility of the mask predicting network, the system is evaluated with different recording hardware, different microphone array designs, and different acoustic models of the downstream ASR system. Significant gains in recognition accuracy are obtained in all configurations despite the fact that the NN had been trained on mismatched data. Unlike previous work, the NN is trained on a feature level objective, which gives some performance advantage over a mask related criterion. Furthermore, different approaches for realizing online, or adaptive, NN-based beamforming are explored, where the online algorithms still show significant gains compared to the baseline performance.
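The core front-end operation examined here, estimating spatial covariance matrices with predicted masks and deriving an MVDR beamformer, can be sketched as follows. This is a simplified reference with the steering vector taken as the principal eigenvector of the speech covariance; practical systems add reference-microphone selection and further normalization:

```python
import numpy as np

def mask_based_mvdr(Y, speech_mask, noise_mask):
    """MVDR beamforming with mask-based covariance estimation.

    Y: complex STFT observations of shape (F, D, T);
    speech_mask, noise_mask: real masks of shape (F, T) in [0, 1].
    Returns the beamformed signal of shape (F, T).
    """
    F, D, T = Y.shape
    out = np.zeros((F, T), dtype=Y.dtype)
    for f in range(F):
        Yf = Y[f]                                              # (D, T)
        m_x, m_n = speech_mask[f], noise_mask[f]
        Phi_xx = (m_x * Yf) @ Yf.conj().T / max(m_x.sum(), 1e-10)
        Phi_nn = (m_n * Yf) @ Yf.conj().T / max(m_n.sum(), 1e-10)
        # Diagonal loading for numerical robustness.
        Phi_nn += 1e-6 * np.eye(D) * max(np.trace(Phi_nn).real / D, 1e-10)
        # Steering vector: principal eigenvector of the speech covariance.
        _, V = np.linalg.eigh(Phi_xx)
        d = V[:, -1]
        # MVDR: w = Phi_nn^{-1} d / (d^H Phi_nn^{-1} d).
        p = np.linalg.solve(Phi_nn, d)
        w = p / (d.conj() @ p)
        out[f] = w.conj() @ Yf
    return out
```

The distortionless constraint w^H d = 1 keeps the target component untouched while the noise covariance in the denominator suppresses interference.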


Deep Attractor Networks for Speaker Re-Identification and Blind Source Separation

L. Drude, T. von Neumann, R. Haeb-Umbach, in: ICASSP 2018, Calgary, Canada, 2018

Deep clustering (DC) and deep attractor networks (DANs) are data-driven approaches to monaural blind source separation. Both approaches provide astonishing single channel performance but have not yet been generalized to block-online processing. When separating speech in a continuous stream with a block-online algorithm, it needs to be determined in each block which of the output streams belongs to whom. In this contribution we solve this block permutation problem by introducing an additional speaker identification embedding to the DAN model structure. We motivate this model decision by analyzing the embedding topology of DC and DANs and show that DC and DANs themselves are not sufficient for speaker identification. This model structure (a) improves the signal to distortion ratio (SDR) over a DAN baseline and (b) provides up to 61% and up to 34% relative reduction in permutation error rate and re-identification error rate compared to an i-vector baseline, respectively.


2017

A Generic Neural Acoustic Beamforming Architecture for Robust Multi-Channel Speech Processing

J. Heymann, L. Drude, R. Haeb-Umbach, Computer Speech and Language (2017)

Acoustic beamforming can greatly improve the performance of Automatic Speech Recognition (ASR) and speech enhancement systems when multiple channels are available. We recently proposed a way to support the model-based Generalized Eigenvalue beamforming operation with a powerful neural network for spectral mask estimation. The enhancement system has a number of desirable properties. In particular, neither assumptions need to be made about the nature of the acoustic transfer function (e.g., being anechoic), nor does the array configuration need to be known. While the system has been originally developed to enhance speech in noisy environments, we show in this article that it is also effective in suppressing reverberation, thus leading to a generic trainable multi-channel speech enhancement system for robust speech processing. To support this claim, we consider two distinct datasets: The CHiME 3 challenge, which features challenging real-world noise distortions, and the Reverb challenge, which focuses on distortions caused by reverberation. We evaluate the system both with respect to a speech enhancement and a recognition task. For the first task we propose a new way to cope with the distortions introduced by the Generalized Eigenvalue beamformer by renormalizing the target energy for each frequency bin, and measure its effectiveness in terms of the PESQ score. For the latter we feed the enhanced signal to a strong DNN back-end and achieve state-of-the-art ASR results on both datasets. We further experiment with different network architectures for spectral mask estimation: One small feed-forward network with only one hidden layer, one Convolutional Neural Network and one bi-directional Long Short-Term Memory network, showing that even a small network is capable of delivering significant performance improvements.


Hypothesis Test for the Detection of Moving Targets in Automotive Radar

C. Grimm, T. Breddermann, R. Farhoud, T. Fei, E. Warsitz, R. Haeb-Umbach, in: IEEE International Conference on Microwaves, Communications, Antennas and Electronic Systems (COMCAS), 2017

In this paper, we present a hypothesis test for the classification of moving targets in the sight of an automotive radar sensor. For this purpose, a statistical model of the relative velocity between a stationary target and the radar sensor has been developed. With respect to the statistical properties a confidence interval is calculated and targets with relative velocity lying outside this interval are classified as moving targets. Compared to existing algorithms our approach is able to give robust classification independent of the number of observed moving targets and is characterized by an instantaneous classification, a simple parameterization of the model and an automatic calculation of the discriminating threshold.
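The decision rule described above can be sketched compactly. The sign convention for the radial velocity, the Gaussian measurement model, and the numeric values of sigma and z are assumptions for illustration, not parameters from the paper:

```python
import numpy as np

def classify_moving(theta, v_r, v_ego, sigma=0.1, z=3.0):
    """Flag radar targets as moving when the measured relative radial
    velocity v_r is inconsistent with the stationary-world expectation
    v_r = -v_ego * cos(theta), using a z*sigma confidence interval.

    theta: azimuth angles (rad); v_r: measured radial velocities (m/s);
    v_ego: ego-vehicle speed (m/s); sigma, z: hypothetical design values.
    """
    residual = v_r + v_ego * np.cos(theta)
    return np.abs(residual) > z * sigma
```

Because the threshold follows directly from the statistical model, no training data or per-scene tuning is needed, which matches the paper's claim of instantaneous classification with simple parameterization.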


BEAMNET: End-to-End Training of a Beamformer-Supported Multi-Channel ASR System

J. Heymann, L. Drude, C. Boeddeker, P. Hanebrink, R. Haeb-Umbach, in: Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2017

This paper presents an end-to-end training approach for a beamformer-supported multi-channel ASR system. A neural network which estimates masks for a statistically optimum beamformer is jointly trained with a network for acoustic modeling. To update its parameters, we propagate the gradients from the acoustic model all the way through feature extraction and the complex valued beamforming operation. Besides avoiding a mismatch between the front-end and the back-end, this approach also eliminates the need for stereo data, i.e., the parallel availability of clean and noisy versions of the signals. Instead, it can be trained with real noisy multichannel data only. Also, relying on the signal statistics for beamforming, the approach makes no assumptions on the configuration of the microphone array. We further observe a performance gain through joint training in terms of word error rate in an evaluation of the system on the CHiME 4 dataset.


Tight integration of spatial and spectral features for BSS with Deep Clustering embeddings

L. Drude, R. Haeb-Umbach, in: INTERSPEECH 2017, Stockholm, Sweden, 2017

Recent advances in discriminatively trained mask estimation networks, used to extract a single source with beamforming techniques, demonstrate that the integration of statistical models and deep neural networks (DNNs) is a promising approach for robust automatic speech recognition (ASR) applications. In this contribution we demonstrate how discriminatively trained embeddings on spectral features can be tightly integrated into statistical model-based source separation to separate and transcribe overlapping speech. Good generalization to unseen spatial configurations is achieved by estimating a statistical model at test time, while still leveraging discriminative training of deep clustering embeddings on a separate training set. We formulate an expectation maximization (EM) algorithm which jointly estimates a model for deep clustering embeddings and complex-valued spatial observations in the short time Fourier transform (STFT) domain at test time. Extensive simulations confirm that the integrated model outperforms (a) a deep clustering model with a subsequent beamforming step and (b) an EM-based model with a beamforming step alone in terms of signal to distortion ratio (SDR) and perceptually motivated metric (PESQ) gains. ASR results on a reverberated dataset further show that the aforementioned gains translate to reduced word error rates (WERs) even in reverberant environments.


Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery

J. Ebbers, J. Heymann, L. Drude, T. Glarner, R. Haeb-Umbach, B. Raj, in: INTERSPEECH 2017, Stockholm, Sweden, 2017

Variational Autoencoders (VAEs) have been shown to provide efficient neural-network-based approximate Bayesian inference for observation models for which exact inference is intractable. Its extension, the so-called Structured VAE (SVAE) allows inference in the presence of both discrete and continuous latent variables. Inspired by this extension, we developed a VAE with Hidden Markov Models (HMMs) as latent models. We applied the resulting HMM-VAE to the task of acoustic unit discovery in a zero resource scenario. Starting from an initial model based on variational inference in an HMM with Gaussian Mixture Model (GMM) emission probabilities, the accuracy of the acoustic unit discovery could be significantly improved by the HMM-VAE. In doing so we were able to demonstrate for an unsupervised learning task what is well-known in the supervised learning case: Neural networks provide superior modeling power compared to GMMs.


Detection of Moving Targets in Automotive Radar with Distorted Ego-Velocity Information

C. Grimm, R. Farhoud, T. Fei, E. Warsitz, R. Haeb-Umbach, in: IEEE Microwaves, Radar and Remote Sensing Symposium (MRRS), 2017

In this paper we present an algorithm for the detection of moving targets in sight of an automotive radar sensor which can handle distorted ego-velocity information. In situations where biased or no velocity information is provided by the ego-vehicle, the algorithm is able to estimate the ego-velocity with high accuracy based on previously detected stationary targets, and this estimate is subsequently used for the target classification. Compared to existing ego-velocity algorithms our approach provides fast and efficient inference without sacrificing practical classification accuracy. Beyond that, the algorithm is characterized by simple parameterization and few but appropriate model assumptions for highly accurate production automotive radar sensors.


Building or Enclosure Termination Closing and/or Opening Apparatus, and Method for Operating a Building or Enclosure Termination

F. Jacob, J. Schmalenstroeer. Building or Enclosure Termination Closing and/or Opening Apparatus, and Method for Operating a Building or Enclosure Termination, Patent WO2018/077610A. 2017.

The invention relates to a building or enclosure termination opening and/or closing apparatus whose communication is signed or encrypted by means of a key, and to a method for operating such an apparatus. To allow simple, convenient and secure use by exclusively authorised users, the apparatus comprises a first and a second user terminal, with secure forwarding of a time-limited key from the first to the second user terminal being possible. According to an alternative, individual keys are generated from a user identification and a secret device key.


On the Computation of Complex-valued Gradients with Application to Statistically Optimum Beamforming

C. Boeddeker, P. Hanebrink, L. Drude, J. Heymann, R. Haeb-Umbach, 2017

This report describes the computation of gradients by algorithmic differentiation for statistically optimum beamforming operations. In particular, the differentiation of complex-valued functions is a key component of this approach. Therefore, real-valued algorithmic differentiation is extended via the complex-valued chain rule. In addition to the basic mathematical operations, the derivative of the eigenvalue problem with complex-valued eigenvectors is one of the key results of this report. The potential of this approach is shown with experimental results on the CHiME-3 challenge database, where the beamforming task is used as a front-end for an ASR system. With the developed derivatives, a joint optimization of a speech enhancement and speech recognition system w.r.t. the recognition optimization criterion becomes possible.
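The complex-valued chain rule underlying this approach can be illustrated with a small numerical sketch (illustrative code, not taken from the report): for the quadratic form f(w) = wᴴAw with Hermitian A, which appears in beamforming criteria, the Wirtinger gradient with respect to the conjugate variable is Aw, and it can be checked against finite differences on the real and imaginary parts.

```python
import numpy as np

def quad_form(w, A):
    """Real-valued quadratic form w^H A w for Hermitian A."""
    return np.real(np.conj(w) @ A @ w)

def wirtinger_grad(w, A):
    """Analytic gradient w.r.t. the conjugate variable: df/dw* = A w."""
    return A @ w

rng = np.random.default_rng(0)
n = 4
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = B + B.conj().T                      # Hermitian matrix
w = rng.standard_normal(n) + 1j * rng.standard_normal(n)

g = wirtinger_grad(w, A)
eps = 1e-6
num = np.empty(n, dtype=complex)
for k in range(n):
    e = np.zeros(n, dtype=complex); e[k] = eps
    d_re = (quad_form(w + e, A) - quad_form(w - e, A)) / (2 * eps)
    d_im = (quad_form(w + 1j * e, A) - quad_form(w - 1j * e, A)) / (2 * eps)
    num[k] = 0.5 * (d_re + 1j * d_im)   # numerical df/dw* per Wirtinger calculus
print(np.max(np.abs(num - g)))
```

The finite-difference result agrees with the analytic gradient to numerical precision, which is exactly the kind of consistency check algorithmic differentiation frameworks rely on.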


A Study on Transfer Learning for Acoustic Event Detection in a Real Life Scenario

P. Arora, R. Haeb-Umbach, in: IEEE 19th International Workshop on Multimedia Signal Processing (MMSP), 2017

In this work, we address the limited availability of large annotated databases for real-life audio event detection by utilizing the concept of transfer learning. This technique aims to transfer knowledge from a source domain to a target domain, even if source and target have different feature distributions and label sets. We hypothesize that all acoustic events share the same inventory of basic acoustic building blocks and differ only in the temporal order of these acoustic units. We then construct a deep neural network with convolutional layers for extracting the acoustic units and a recurrent layer for capturing the temporal order. Under the above hypothesis, transfer learning from a source to a target domain with a different acoustic event inventory is realized by transferring the convolutional layers from the source to the target domain. The recurrent layer is, however, learnt directly from the target domain. Experiments on the transfer from a synthetic source database to the real-life target database of DCASE 2016 demonstrate that transfer learning leads to improved detection performance on average. However, the successful transfer to detect events which are very different from what was seen in the source domain could not be verified.


Optimizing Neural-Network Supported Acoustic Beamforming by Algorithmic Differentiation

C. Boeddeker, P. Hanebrink, L. Drude, J. Heymann, R. Haeb-Umbach, in: Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2017

In this paper we show how a neural network for spectral mask estimation for an acoustic beamformer can be optimized by algorithmic differentiation. Using the beamformer output SNR as the objective function to maximize, the gradient is propagated through the beamformer all the way to the neural network which provides the clean speech and noise masks from which the beamformer coefficients are estimated by eigenvalue decomposition. A key theoretical result is the derivative of an eigenvalue problem involving complex-valued eigenvectors. Experimental results on the CHiME-3 challenge database demonstrate the effectiveness of the approach. The tools developed in this paper are a key component for an end-to-end optimization of speech enhancement and speech recognition.
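The beamforming step described above can be sketched as follows; this is a toy illustration with a random steering vector and oracle-style masks (all simulation choices are assumptions, not the paper's setup). The max-SNR (GEV) beamformer is the principal generalized eigenvector of the speech and noise spatial PSD matrices, which are estimated from mask-weighted outer products.

```python
import numpy as np
from scipy.linalg import eigh

def mask_psd(Y, mask):
    """Mask-weighted spatial PSD estimate: sum_t m_t y_t y_t^H / sum_t m_t."""
    return (mask[None, :] * Y) @ Y.conj().T / mask.sum()

def gev_beamformer(Phi_xx, Phi_nn):
    """Principal generalized eigenvector of (Phi_xx, Phi_nn), i.e. max-SNR weights."""
    vals, vecs = eigh(Phi_xx, Phi_nn)   # eigenvalues in ascending order
    return vecs[:, -1]

rng = np.random.default_rng(1)
mics, frames = 4, 2000
d = np.exp(1j * 2 * np.pi * rng.random(mics))   # toy steering vector
s = rng.standard_normal(frames) + 1j * rng.standard_normal(frames)
n = 0.5 * (rng.standard_normal((mics, frames)) + 1j * rng.standard_normal((mics, frames)))
Y = d[:, None] * s[None, :] + n                 # noisy multichannel observation

speech_mask = np.abs(s) ** 2 / (np.abs(s) ** 2).mean()   # oracle-style mask
Phi_xx = mask_psd(d[:, None] * s[None, :], speech_mask)
Phi_nn = mask_psd(n, np.ones(frames))

w = gev_beamformer(Phi_xx, Phi_nn)
out_snr = np.abs(w.conj() @ Phi_xx @ w) / np.abs(w.conj() @ Phi_nn @ w)
ref_snr = Phi_xx[0, 0].real / Phi_nn[0, 0].real
print(out_snr > ref_snr)
```

By construction the GEV output SNR is at least the SNR of any single reference channel, which is the objective the paper backpropagates through.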


A Novel Target Separation Algorithm Applied to The Two-Dimensional Spectrum for FMCW Automotive Radar Systems

T. Fei, C. Grimm, R. Farhoud, T. Breddermann, E. Warsitz, R. Haeb-Umbach, in: IEEE International Conference on Microwaves, Communications, Antennas and Electronic Systems, 2017

In this paper, we apply a high-resolution approach, the matrix pencil method (MPM), to the FMCW automotive radar system to separate neighboring targets which share similar parameters, i.e. range, relative speed and azimuth angle, and cause overlapping in the radar spectrum. In order to adapt the 1D model of the MPM to the 2D range-velocity spectrum and simultaneously limit the computational cost, some preprocessing steps are proposed to construct a novel separation algorithm. Finally, this algorithm is evaluated on both simulated and real data, and the results indicate a promising performance.


Leveraging Text Data for Word Segmentation for Underresourced Languages

T. Glarner, B. Boenninghoff, O. Walter, R. Haeb-Umbach, in: INTERSPEECH 2017, Stockholm, Sweden, 2017

In this contribution we show how to exploit text data to support word discovery from audio input in an underresourced target language. Given audio, of which a certain amount is transcribed at the word level, and additional unrelated text data, the approach is able to learn a probabilistic mapping from acoustic units to characters and utilize it to segment the audio data into words without the need for a pronunciation dictionary. This is achieved by three components: an unsupervised acoustic unit discovery system, a supervisedly trained acoustic unit-to-grapheme converter, and a word discovery system, which is initialized with a language model trained on the text data. Experiments for multiple setups show that the initialization of the language model with text data improves the word segmentation performance by a large margin.


Multi-Stage Coherence Drift Based Sampling Rate Synchronization for Acoustic Beamforming

J. Schmalenstroeer, J. Heymann, L. Drude, C. Boeddeker, R. Haeb-Umbach, in: IEEE 19th International Workshop on Multimedia Signal Processing (MMSP), 2017

Multi-channel speech enhancement algorithms rely on a synchronous sampling of the microphone signals. This, however, cannot always be guaranteed, especially if the sensors are distributed in an environment. To avoid performance degradation the sampling rate offset needs to be estimated and compensated for. In this contribution we extend the recently proposed coherence drift based method in two important directions. First, the increasing phase shift in the short-time Fourier transform domain is estimated from the coherence drift in a matched-filter-like fashion, where intermediate estimates are weighted by their instantaneous SNR. Second, an observed bias is removed by iterating a couple of times between offset estimation and compensation by resampling. The effectiveness of the proposed method is demonstrated by speech recognition results on the output of a beamformer with and without sampling rate offset compensation between the input channels. We compare MVDR and maximum-SNR beamformers in reverberant environments and further show that both benefit from a novel phase normalization, which we also propose in this contribution.


A Generalized Log-Spectral Amplitude Estimator for Single-Channel Speech Enhancement

A. Chinaev, R. Haeb-Umbach, in: Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2017

The benefits of both a logarithmic spectral amplitude (LSA) estimation and a modeling in a generalized spectral domain (where short-time amplitudes are raised to a generalized power exponent, not restricted to magnitude or power spectrum) are combined in this contribution to achieve a better tradeoff between speech quality and noise suppression in single-channel speech enhancement. A novel gain function is derived to enhance the logarithmic generalized spectral amplitudes of noisy speech. Experiments on the CHiME-3 dataset show that it outperforms the famous minimum mean squared error (MMSE) LSA gain function of Ephraim and Malah in terms of noise suppression by 1.4 dB, while the good speech quality of the MMSE-LSA estimator is maintained.
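For orientation, the classical Ephraim-Malah MMSE-LSA gain function mentioned as the baseline can be written down directly (the paper's new generalized gain is not reproduced here; the demo values below are purely illustrative).

```python
import numpy as np
from scipy.special import exp1

def mmse_lsa_gain(xi, gamma):
    """Ephraim-Malah MMSE log-spectral amplitude gain G(xi, gamma).
    xi: a priori SNR, gamma: a posteriori SNR (both on a linear scale)."""
    xi = np.asarray(xi, dtype=float)
    v = xi * gamma / (1.0 + xi)
    return xi / (1.0 + xi) * np.exp(0.5 * exp1(v))

# enhance a toy noisy magnitude spectrum (illustrative numbers)
noisy_mag = np.array([1.0, 2.0, 0.5])
xi = np.array([1.0, 10.0, 0.1])      # assumed a priori SNRs
gamma = noisy_mag ** 2               # toy a posteriori SNR for unit noise PSD
enhanced = mmse_lsa_gain(xi, gamma) * noisy_mag
```

The exponential-integral term `exp1` is what distinguishes the log-spectral amplitude estimator from a plain Wiener gain, and it is this gain that the paper's generalized estimator improves upon in terms of noise suppression.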


2016

A Priori SNR Estimation Using a Generalized Decision Directed Approach

A. Chinaev, R. Haeb-Umbach, in: INTERSPEECH 2016, San Francisco, USA, 2016

In this contribution we investigate a priori signal-to-noise ratio (SNR) estimation, a crucial component of a single-channel speech enhancement system based on spectral subtraction. The majority of the state-of-the-art a priori SNR estimators work in the power spectral domain, which is, however, not confirmed to be the optimal domain for the estimation. Motivated by the generalized spectral subtraction rule, we show how the estimation of the a priori SNR can be formulated in the so-called generalized SNR domain. This formulation allows us to generalize the widely used decision-directed (DD) approach. An experimental investigation with different noise types reveals the superiority of the generalized DD approach over the conventional DD approach in terms of both the mean opinion score listening quality objective measure and the output global SNR in the medium to high input SNR regime, while we show that the power spectrum is the optimal domain for low SNR. We further develop a parameterization which adjusts the domain of estimation automatically according to the estimated input global SNR. Index terms: single-channel speech enhancement, a priori SNR estimation, generalized spectral subtraction
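The conventional power-spectral-domain DD estimator that this work generalizes can be sketched in a few lines; the Wiener gain used to update the clean-speech power is an assumption of this sketch, and the data are synthetic.

```python
import numpy as np

def decision_directed_snr(y_pow, noise_pow, alpha=0.98):
    """Conventional decision-directed a priori SNR estimate (power spectral domain).
    y_pow: noisy periodogram per frame, noise_pow: noise PSD estimate per frame."""
    xi = np.empty_like(y_pow)
    s_prev = 0.0                                    # previous clean-speech power estimate
    for l in range(len(y_pow)):
        gamma = y_pow[l] / noise_pow[l]             # a posteriori SNR
        xi[l] = alpha * s_prev / noise_pow[l] + (1 - alpha) * max(gamma - 1.0, 0.0)
        g = xi[l] / (1.0 + xi[l])                   # Wiener gain (sketch assumption)
        s_prev = (g ** 2) * y_pow[l]                # recursive clean-speech power update
    return xi

rng = np.random.default_rng(0)
y_pow = rng.standard_normal(200) ** 2 + 1.0         # toy noisy frame powers
xi_hat = decision_directed_snr(y_pow, np.ones(200))
```

The smoothing constant alpha close to one is what gives the DD approach its characteristic one-frame delay in tracking SNR changes; the paper's generalization replaces the power-spectral quantities by their counterparts in the generalized SNR domain.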


Factor Graph Decoding for Speech Presence Probability Estimation

T. Glarner, M. Mahdi Momenzadeh, L. Drude, R. Haeb-Umbach, in: 12. ITG Fachtagung Sprachkommunikation (ITG 2016), 2016

This paper is concerned with speech presence probability estimation employing an explicit model of the temporal and spectral correlations of speech. An undirected graphical model is introduced, based on a Factor Graph formulation. It is shown that this undirected model cures some of the theoretical issues of an earlier directed graphical model. Furthermore, we formulate a message passing inference scheme based on an approximate graph factorization, identify this inference scheme as a particular message passing schedule based on the turbo principle and suggest further alternative schedules. The experiments show an improved performance over speech presence probability estimation based on an IID assumption, and a slightly better performance of the turbo schedule over the alternatives.


Wide Residual BLSTM Network with Discriminative Speaker Adaptation for Robust Speech Recognition

J. Heymann, L. Drude, R. Haeb-Umbach, in: Computer Speech and Language, 2016

We present a system for the 4th CHiME challenge which significantly increases the performance for all three tracks with respect to the provided baseline system. The front-end uses a bi-directional Long Short-Term Memory (BLSTM)-based neural network to estimate signal statistics. These then steer a Generalized Eigenvalue beamformer. The back-end consists of a 22 layer deep Wide Residual Network and two extra BLSTM layers. Working on a whole utterance instead of frames allows us to refine Batch-Normalization. We also train our own BLSTM-based language model. Adding a discriminative speaker adaptation leads to further gains. The final system achieves a word error rate on the six channel real test data of 3.48%. For the two channel track we achieve 5.96% and for the one channel track 9.34%. This is the best reported performance on the challenge achieved by a single system, i.e., a configuration, which does not combine multiple systems. At the same time, our system is independent of the microphone configuration. We can thus use the same components for all three tracks.


On the Bias of Direction of Arrival Estimation Using Linear Microphone Arrays

F. Jacob, R. Haeb-Umbach, in: 12. ITG Fachtagung Sprachkommunikation (ITG 2016), 2016

This contribution investigates Direction of Arrival (DoA) estimation using linearly arranged microphone arrays. We develop a model for the DoA estimation error in a reverberant scenario and show the existence of a bias that is a consequence of the linear arrangement and the limited field of view (FoV): first, the limited FoV leads to a clipping of the measurements, and, second, the angular distribution of the signal energy of the reflections is non-uniform. Since both issues are a consequence of the linear arrangement of the sensors, the bias arises largely independently of the kind of DoA estimator. The experimental evaluation demonstrates the existence of the bias for a selected number of DoA estimation methods and proves that the prediction of the developed theoretical model matches the simulation results.


Acoustic Microphone Geometry Calibration: An overview and experimental evaluation of state-of-the-art algorithms

A. Plinge, F. Jacob, R. Haeb-Umbach, G.A. Fink, IEEE Signal Processing Magazine (2016), 33(4), pp. 14-29

Today, we are often surrounded by devices with one or more microphones, such as smartphones, laptops, and wireless microphones. If they are part of an acoustic sensor network, their distribution in the environment can be beneficially exploited for various speech processing tasks. However, applications like speaker localization, speaker tracking, and speech enhancement by beamforming avail themselves of the geometrical configuration of the sensors. Therefore, acoustic microphone geometry calibration has recently become a very active field of research. This article provides an application-oriented, comprehensive survey of existing methods for microphone position self-calibration, which will be categorized by the measurements they use and the scenarios they can calibrate. Selected methods will be evaluated comparatively with real-world recordings.


Unsupervised Word Discovery from Speech using Bayesian Hierarchical Models

O. Walter, R. Haeb-Umbach, in: 38th German Conference on Pattern Recognition (GCPR 2016), 2016

In this paper we demonstrate an algorithm to learn words from speech using non-parametric Bayesian hierarchical models in an unsupervised setting. We exploit the assumption of a hierarchical structure of speech, namely the formation of spoken words as a sequence of phonemes. We employ the Nested Hierarchical Pitman-Yor Language Model, which allows an a priori unknown and possibly unlimited number of words. We assume the n-gram probabilities of words, the m-gram probabilities of phoneme sequences in words and the phoneme sequences of the words themselves as latent variables to be learned. We evaluate the algorithm on a cross-language task, using an existing speech recognizer trained on English speech to decode speech in the Xitsonga language supplied for the 2015 ZeroSpeech challenge. We apply the learning algorithm to the resulting phoneme graphs and achieve the highest token precision and F-score compared to existing systems.



A Priori SNR Estimation Using Weibull Mixture Model

A. Chinaev, J. Heitkaemper, R. Haeb-Umbach, in: 12. ITG Fachtagung Sprachkommunikation (ITG 2016), 2016

This contribution introduces a novel causal a priori signal-to-noise ratio (SNR) estimator for single-channel speech enhancement. To exploit the advantages of the generalized spectral subtraction, a normalized α-order magnitude (NAOM) domain is introduced in which the a priori SNR estimation is carried out. In this domain, the NAOM coefficients of noise and clean speech signals are modeled by a Weibull distribution and a Weibull mixture model (WMM), respectively. While the parameters of the noise model are calculated from the noise power spectral density estimates, the speech WMM parameters are estimated from the noisy signal by applying a causal expectation-maximization algorithm. Furthermore, a maximum a posteriori estimate of the a priori SNR is developed. The experiments in different noisy environments show the superiority of the proposed estimator compared to the well-known decision-directed approach in terms of estimation error, estimator variance and speech quality of the enhanced signals when used for speech enhancement.


Noise-Presence-Probability-Based Noise PSD Estimation by Using DNNs

A. Chinaev, J. Heymann, L. Drude, R. Haeb-Umbach, in: 12. ITG Fachtagung Sprachkommunikation (ITG 2016), 2016

A noise power spectral density (PSD) estimate is an indispensable component of speech spectral enhancement systems. In this paper we present a noise PSD tracking algorithm which employs a noise presence probability estimate delivered by a deep neural network (DNN). The algorithm provides a causal noise PSD estimate and can thus be used in speech enhancement systems for communication purposes. An extensive performance comparison has been carried out with ten causal state-of-the-art noise tracking algorithms taken from the literature and categorized according to the applied techniques. The experiments showed that the proposed DNN-based noise PSD tracker outperforms all competing methods with respect to all tested performance measures, which include the noise tracking performance and the performance of a speech enhancement system employing the noise tracking component.



On the appropriateness of complex-valued neural networks for speech enhancement

L. Drude, B. Raj, R. Haeb-Umbach, in: INTERSPEECH 2016, San Francisco, USA, 2016

Although complex-valued neural networks (CVNNs), i.e. networks which can operate with complex arithmetic, have been around for a while, they have not been given reconsideration since the breakthrough of deep network architectures. This paper presents a critical assessment whether the novel tool set of deep neural networks (DNNs) should be extended to complex-valued arithmetic. Indeed, with DNNs making inroads in speech enhancement tasks, the use of complex-valued input data, specifically the short-time Fourier transform coefficients, is an obvious consideration. In particular when it comes to performing tasks that heavily rely on phase information, such as acoustic beamforming, complex-valued algorithms are omnipresent. In this contribution we recapitulate backpropagation in CVNNs, develop complex-valued network elements, such as the split-rectified non-linearity, and compare real- and complex-valued networks on a beamforming task. We find that CVNNs hardly provide a performance gain and conclude that the effort of developing the complex-valued counterparts of the building blocks of modern deep or recurrent neural networks can hardly be justified.
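The split-rectified non-linearity mentioned above applies the rectification to the real and imaginary parts separately; a minimal NumPy version (illustrative, not the paper's code) is:

```python
import numpy as np

def split_relu(z):
    """Split-rectified non-linearity: ReLU applied independently to Re(z) and Im(z)."""
    return np.maximum(z.real, 0.0) + 1j * np.maximum(z.imag, 0.0)

z = np.array([1.0 + 2.0j, -1.0 + 0.5j, -3.0 - 4.0j])
print(split_relu(z))
```

Because the function is separately piecewise-linear in the real and imaginary parts, its Wirtinger derivatives are straightforward, which is what makes backpropagation through such CVNN elements tractable.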


A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research

K. Kinoshita, M. Delcroix, S. Gannot, E.A.P. Habets, R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj, A. Sehr, T. Yoshioka, EURASIP Journal on Advances in Signal Processing (2016)


Investigations into Bluetooth Low Energy Localization Precision Limits

J. Schmalenstroeer, R. Haeb-Umbach, in: 24th European Signal Processing Conference (EUSIPCO 2016), 2016

In this paper we study the influence of directional radio patterns of Bluetooth low energy (BLE) beacons on smartphone localization accuracy and beacon network planning. A two-dimensional model of the power emission characteristic is derived from measurements of the radiation pattern of BLE beacons carried out in an RF chamber. The Cramer-Rao lower bound (CRLB) for position estimation is then derived for this directional power emission model. With this lower bound on the RMS positioning error the coverage of different beacon network configurations can be evaluated. For near-optimal network planning an evolutionary optimization algorithm for finding the best beacon placement is presented.


The RWTH/UPB/FORTH System Combination for the 4th CHiME Challenge Evaluation

T. Menne, J. Heymann, A. Alexandridis, K. Irie, A. Zeyer, M. Kitza, P. Golik, I. Kulikov, L. Drude, R. Schlüter, H. Ney, R. Haeb-Umbach, A. Mouchtaris, in: Computer Speech and Language, 2016

This paper describes automatic speech recognition (ASR) systems developed jointly by RWTH, UPB and FORTH for the 1ch, 2ch and 6ch track of the 4th CHiME Challenge. In the 2ch and 6ch tracks the final system output is obtained by a Confusion Network Combination (CNC) of multiple systems. The Acoustic Model (AM) is a deep neural network based on Bidirectional Long Short-Term Memory (BLSTM) units. The systems differ by front ends and training sets used for the acoustic training. The model for the 1ch track is trained without any preprocessing. For each front end we trained and evaluated individual acoustic models. We compare the ASR performance of different beamforming approaches: a conventional superdirective beamformer [1] and an MVDR beamformer as in [2], where the steering vector is estimated based on [3]. Furthermore we evaluated a BLSTM supported Generalized Eigenvalue beamformer using NN-GEV [4]. The back end is implemented using RWTH's open-source toolkits RASR [5], RETURNN [6] and rwthlm [7]. We rescore lattices with a Long Short-Term Memory (LSTM) based language model. The overall best results are obtained by a system combination that includes the lattices from the system of UPB's submission [8]. Our final submission scored second in each of the three tracks of the 4th CHiME Challenge.


2015

BLSTM supported GEV Beamformer Front-End for the 3RD CHiME Challenge

J. Heymann, L. Drude, A. Chinaev, R. Haeb-Umbach, in: Automatic Speech Recognition and Understanding Workshop (ASRU 2015), 2015


Lexicon Discovery for Language Preservation using Unsupervised Word Segmentation with Pitman-Yor Language Models (FGNT-2015-01)

O. Walter, R. Haeb-Umbach, J. Strunk, N. P. Himmelmann, 2015

In this paper we show that recently developed algorithms for unsupervised word segmentation can be a valuable tool for the documentation of endangered languages. We applied an unsupervised word segmentation algorithm based on a nested Pitman-Yor language model to two Austronesian languages, Wooi and Waima'a. The algorithm was then modified and parameterized to cater to the needs of linguists for high precision of lexical discovery: we obtained a lexicon precision of 69.2% and 67.5% for Wooi and Waima'a, respectively, if single-letter words and words found less than three times were discarded. A comparison with an English word segmentation task showed comparable performance, verifying that the assumptions underlying the Pitman-Yor language model, the universality of Zipf's law and the power of n-gram structures, also hold for languages as exotic as Wooi and Waima'a.


On Optimal Smoothing in Minimum Statistics Based Noise Tracking

A. Chinaev, R. Haeb-Umbach, in: Interspeech 2015, 2015, pp. 1785-1789

Noise tracking is an important component of speech enhancement algorithms. Of the many noise trackers proposed, Minimum Statistics (MS) is a particularly popular one due to its simple parameterization and at the same time excellent performance. In this paper we propose to further reduce the number of MS parameters by giving an alternative derivation of an optimal smoothing constant. At the same time the noise tracking performance is improved as is demonstrated by experiments employing speech degraded by various noise types and at different SNR values.
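The core of Minimum Statistics, recursive smoothing of the periodogram followed by a sliding minimum, can be sketched as below; the optimal time-varying smoothing and bias compensation that the actual method (and this paper's contribution) revolve around are deliberately omitted, and all parameter values are illustrative.

```python
import numpy as np

def minimum_statistics(y_pow, beta=0.85, win=64):
    """Bare-bones Minimum Statistics noise tracker: fixed recursive smoothing plus
    a sliding minimum. Bias compensation and optimal smoothing are omitted."""
    p = np.empty_like(y_pow)
    acc = y_pow[0]
    for l, y in enumerate(y_pow):
        acc = beta * acc + (1 - beta) * y          # recursively smoothed periodogram
        p[l] = acc
    # noise PSD estimate: running minimum over a window of past smoothed frames
    return np.array([p[max(0, l - win + 1):l + 1].min() for l in range(len(p))])

rng = np.random.default_rng(0)
noise_pow = rng.standard_normal(1000) ** 2         # unit-power white noise periodogram
noise_est = minimum_statistics(noise_pow)
```

The sliding minimum systematically underestimates the true noise power, which is precisely why the choice of the smoothing constant and the bias compensation discussed in the paper matter.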


Absolute Geometry Calibration of Distributed Microphone Arrays in an Audio-Visual Sensor Network

F. Jacob, R. Haeb-Umbach, ArXiv e-prints (2015)

Joint audio-visual speaker tracking requires that the locations of microphones and cameras are known and that they are given in a common coordinate system. Sensor self-localization algorithms, however, are usually developed separately for either the acoustic or the visual modality and return their positions in a modality-specific coordinate system, often with an unknown rotation, scaling and translation between the two. In this paper we propose two techniques to determine the positions of acoustic sensors in a common coordinate system, based on audio-visual correlates, i.e., events that are localized by both microphones and cameras. The first approach maps the output of an acoustic self-calibration algorithm to the visual coordinate system by estimating rotation, scale and translation, while the second solves a joint system of equations with acoustic and visual directions of arrival as input. The evaluation of the two strategies reveals that joint calibration outperforms the mapping approach and achieves an overall calibration error of 0.20 m even in reverberant environments.
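The mapping step of the first approach, estimating rotation, scale and translation between two coordinate systems from corresponding points, is a classical least-squares similarity alignment; the following Umeyama-style sketch is a generic version under that assumption, not the paper's code.

```python
import numpy as np

def similarity_transform(A, V):
    """Least-squares R (rotation), c (scale), t (translation) with V ≈ c R A + t.
    A, V: (n_points, dim) corresponding positions in the two coordinate systems."""
    mu_a, mu_v = A.mean(0), V.mean(0)
    Ac, Vc = A - mu_a, V - mu_v
    U, s, Wt = np.linalg.svd(Vc.T @ Ac / len(A))   # cross-covariance SVD
    d = np.sign(np.linalg.det(U @ Wt))
    D = np.diag([1.0] * (A.shape[1] - 1) + [d])    # guard against reflections
    R = U @ D @ Wt
    c = (s * np.diag(D)).sum() / Ac.var(0).sum()   # optimal scale
    t = mu_v - c * R @ mu_a
    return R, c, t
```

Given a handful of audio-visually localized events, this recovers the transform exactly in the noiseless case and in the least-squares sense otherwise.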


Aligning training models with smartphone properties in WiFi fingerprinting based indoor localization

M.K. Hoang, J. Schmalenstroeer, R. Haeb-Umbach, in: 40th International Conference on Acoustics, Speech and Signal Processing (ICASSP 2015), 2015


Semantic Analysis of Spoken Input using Markov Logic Networks

V. Despotovic, O. Walter, R. Haeb-Umbach, in: INTERSPEECH 2015, 2015

We present a semantic analysis technique for spoken input using Markov Logic Networks (MLNs). MLNs combine graphical models with first-order logic. They are particularly suitable for providing inference in the presence of inconsistent and incomplete data, which are typical of an automatic speech recognizer's (ASR) output in the presence of degraded speech. The target application is a speech interface to a home automation system to be operated by people with speech impairments, where the ASR output is particularly noisy. In order to cater for dysarthric speech with non-canonical phoneme realizations, acoustic representations of the input speech are learned in an unsupervised fashion. While training data transcripts are not required for the acoustic model training, the MLN training requires supervision, albeit at a rather loose and abstract level. Results on two databases, one of them for dysarthric speech, show that MLN-based semantic analysis clearly outperforms baseline approaches employing non-negative matrix factorization, multinomial naive Bayes models, or support vector machines.


DOA-Estimation based on a Complex Watson Kernel Method

L. Drude, F. Jacob, R. Haeb-Umbach, in: 23th European Signal Processing Conference (EUSIPCO 2015), 2015

This contribution presents a Direction of Arrival (DoA) estimation algorithm based on the complex Watson distribution to incorporate both phase and level differences of captured microphone array signals. The derived algorithm is reviewed in the context of the Generalized State Coherence Transform (GSCT) on the one hand and a kernel density estimation method on the other. A thorough simulative evaluation yields insight into parameter selection and provides details on the performance for both directional and omni-directional microphones. A comparison to the well-known Steered Response Power with Phase Transform (SRP-PHAT) algorithm and a state-of-the-art DoA estimator which explicitly accounts for aliasing shows in particular the advantages of the presented algorithm if inter-sensor level differences are indicative of the DoA, as with directional microphones.


Unsupervised adaptation of a denoising autoencoder by Bayesian Feature Enhancement for reverberant asr under mismatch conditions

J. Heymann, R. Haeb-Umbach, P. Golik, R. Schlueter, in: Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, 2015, pp. 5053-5057

The parametric Bayesian Feature Enhancement (BFE) and a data-driven Denoising Autoencoder (DA) both bring performance gains in severe single-channel speech recognition conditions. The former can be adjusted to different conditions by an appropriate parameter setting, while the latter needs to be trained on conditions similar to the ones expected at decoding time, making it vulnerable to a mismatch between training and test conditions. We use a DNN back-end and study reverberant ASR under three types of mismatch conditions: different room reverberation times, different speaker-to-microphone distances, and the difference between artificially reverberated data and recordings in a reverberant environment. We show that for these mismatch conditions BFE can provide the targets for a DA. This unsupervised adaptation provides a performance gain over the direct use of BFE and even enables compensation for the mismatch between real and simulated reverberant data.


Typicality and Emotion in the Voice of Children with Autism Spectrum Condition: Evidence Across Three Languages

E. Marchi, B. Schuller, S. Baron-Cohen, O. Golan, S. Boelte, P. Arora, R. Haeb-Umbach, in: INTERSPEECH 2015, 2015

Only a few studies exist on automatic emotion analysis of speech from children with Autism Spectrum Conditions (ASC). Of these, some preliminary studies have recently focused on comparing the relevance of selected prosodic features against large sets of acoustic, spectral, and cepstral features; however, no study so far has provided a comparison of performance across different languages. The present contribution aims to fill this gap in the literature and provide insight through extensive evaluations carried out on three databases of prompted phrases collected in English, Swedish, and Hebrew, inducing nine emotion categories embedded in short stories. The datasets contain speech of children with ASC and typically developing children under the same conditions. We evaluate automatic diagnosis and recognition of emotions in atypical children's voices over the nine categories, including binary valence/arousal discrimination.


Source Counting in Speech Mixtures by Nonparametric Bayesian Estimation of an infinite Gaussian Mixture Model

O. Walter, L. Drude, R. Haeb-Umbach, in: 40th International Conference on Acoustics, Speech and Signal Processing (ICASSP 2015), 2015

In this paper we present a source counting algorithm to determine the number of speakers in a speech mixture. In our proposed method, we model the histogram of estimated directions of arrival with a nonparametric Bayesian infinite Gaussian mixture model. As an alternative to classical model selection criteria, and to avoid specifying the maximum number of mixture components in advance, a Dirichlet process prior is employed over the mixture components. This allows the number of mixture components that most probably explain the observations to be determined automatically. We demonstrate by experiments that this model outperforms a parametric approach using a finite Gaussian mixture model with a Dirichlet distribution prior over the mixture weights.
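A Dirichlet-process mixture over one-dimensional DoA estimates can be sketched with scikit-learn's variational implementation, used here as an assumed stand-in for the paper's own estimator; the simulated DoAs and all parameter values are illustrative.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# toy DoA estimates (degrees) from three speakers plus estimation noise
doas = np.concatenate([rng.normal(m, 2.0, 400) for m in (-60.0, 10.0, 55.0)])

dpgmm = BayesianGaussianMixture(
    n_components=10,                                  # upper bound, not the answer
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=1e-2,                  # favors few active components
    max_iter=500, random_state=0,
).fit(doas.reshape(-1, 1))

# components with non-negligible posterior weight ~ number of active speakers
n_sources = int((dpgmm.weights_ > 0.05).sum())
print(n_sources)
```

The variational inference prunes superfluous components by driving their weights toward zero, so the effective number of components does not have to be fixed in advance, which is the key property the abstract describes.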



Autonomous Learning of Representations

O. Walter, R. Haeb-Umbach, B. Mokbel, B. Paassen, B. Hammer, KI - Kuenstliche Intelligenz (2015), pp. 1-13

Besides the core learning algorithm itself, one major question in machine learning is how to best encode given training data such that the learning technology can learn efficiently from it and generalize to novel data. While classical approaches often rely on a hand-coded data representation, the topic of autonomous representation or feature learning plays a major role in modern learning architectures. The goal of this contribution is to give an overview of different principles of autonomous feature learning, and to exemplify two principles based on two recent examples: autonomous metric learning for sequences, and autonomous learning of a deep representation for spoken language, respectively.


2014

Source Counting in Speech Mixtures Using a Variational EM Approach for Complex Watson Mixture Models

L. Drude, A. Chinaev, D.H. Tran Vu, R. Haeb-Umbach, in: 39th International Conference on Acoustics, Speech and Signal Processing (ICASSP 2014), 2014

In this contribution we derive a variational EM (VEM) algorithm for model selection in complex Watson mixture models, which have been recently proposed as a model of the distribution of normalized microphone array signals in the short-time Fourier transform domain. The VEM algorithm is applied to count the number of active sources in a speech mixture by iteratively estimating the mode vectors of the Watson distributions and suppressing the signals from the corresponding directions. A key theoretical contribution is the derivation of the MMSE estimate of a quadratic form involving the mode vector of the Watson distribution. The experimental results demonstrate the effectiveness of the source counting approach at moderately low SNR. It is further shown that the VEM algorithm is more robust with respect to the threshold values used.


Towards Online Source Counting in Speech Mixtures Applying a Variational EM for Complex Watson Mixture Models

L. Drude, A. Chinaev, D.H. Tran Vu, R. Haeb-Umbach, in: 14th International Workshop on Acoustic Signal Enhancement (IWAENC 2014), 2014, pp. 213-217

This contribution describes a step-wise source counting algorithm to determine the number of speakers in an offline scenario. Each speaker is identified by a variational expectation maximization (VEM) algorithm for complex Watson mixture models and therefore directly yields beamforming vectors for a subsequent speech separation process. An observation selection criterion is proposed which improves the robustness of the source counting in noise. The algorithm is compared to an alternative VEM approach with Gaussian mixture models based on directions of arrival and shown to deliver improved source counting accuracy. The article concludes by extending the offline algorithm towards a low-latency online estimation of the number of active sources from the streaming input data.


Spectral Noise Tracking for Improved Nonstationary Noise Robust ASR

A. Chinaev, M. Puels, R. Haeb-Umbach, in: 11. ITG Fachtagung Sprachkommunikation (ITG 2014), 2014

A method for nonstationary noise robust automatic speech recognition (ASR) is to first estimate the changing noise statistics and second clean up the features prior to recognition accordingly. Here, the first is accomplished by noise tracking in the spectral domain, while the second relies on Bayesian enhancement in the feature domain. In this way we take advantage of our recently proposed maximum a-posteriori based (MAP-B) noise power spectral density estimation algorithm, which is able to estimate the noise statistics even in time-frequency bins dominated by speech. We show that MAP-B noise tracking leads to an improved noise model estimate in the feature domain compared to estimating noise in speech absence periods only, if the bias resulting from the nonlinear transformation from the spectral to the feature domain is accounted for. Consequently, ASR results are improved, as is shown by experiments conducted on the Aurora IV database.


A New Observation Model in the Logarithmic Mel Power Spectral Domain for the Automatic Recognition of Noisy Reverberant Speech

V. Leutnant, A. Krueger, R. Haeb-Umbach, IEEE/ACM Transactions on Audio, Speech, and Language Processing (2014), 22(1), pp. 95-109

In this contribution we present a theoretical and experimental investigation into the effects of reverberation and noise on features in the logarithmic mel power spectral domain, an intermediate stage in the computation of the mel frequency cepstral coefficients, prevalent in automatic speech recognition (ASR). Gaining insight into the complex interaction between clean speech, noise, and noisy reverberant speech features is essential for any ASR system to be robust against noise and reverberation present in distant microphone input signals. The findings are gathered in a probabilistic formulation of an observation model which may be used in model-based feature compensation schemes. The proposed observation model extends previous models in three major directions: First, the contribution of additive background noise to the observation error is explicitly taken into account. Second, an energy compensation constant is introduced which ensures an unbiased estimate of the reverberant speech features, and, third, a recursive variant of the observation model is developed resulting in reduced computational complexity when used in model-based feature compensation. The experimental section is used to evaluate the accuracy of the model and to describe how its parameters can be determined from test data.


A Gossiping Approach to Sampling Clock Synchronization in Wireless Acoustic Sensor Networks

J. Schmalenstroeer, P. Jebramcik, R. Haeb-Umbach, in: 39th International Conference on Acoustics, Speech and Signal Processing (ICASSP 2014), 2014

In this paper we present an approach for synchronizing the sampling clocks of distributed microphones over a wireless network. The proposed system uses a two stage procedure. It first employs a two-way message exchange algorithm to estimate the clock phase and frequency difference between two nodes and then uses a gossiping algorithm to estimate a virtual master clock, to which all sensor nodes synchronize. Simulation results are presented for networks of different topology and size, showing the effectiveness of our approach.


Coordinate Mapping Between an Acoustic and Visual Sensor Network in the Shape Domain for a Joint Self-Calibrating Speaker Tracking

F. Jacob, R. Haeb-Umbach, in: 11. ITG Fachtagung Sprachkommunikation (ITG 2014), 2014

Several self-localization algorithms have been proposed that determine the positions of either acoustic or visual sensors autonomously. Usually these positions are given in a modality-specific coordinate system, with an unknown rotation, translation and scale between the different systems. For a joint audiovisual tracking, where the different modalities support each other, the two modalities need to be mapped into a common coordinate system. In this paper we propose to estimate this mapping based on audiovisual correlates, i.e., a speaker that can be localized separately both by a microphone network and by a camera network. The voice is tracked by a microphone network, which has first been calibrated by a self-localization algorithm, and the head is tracked by a calibrated camera network. Unlike existing Singular Value Decomposition based approaches to estimate the coordinate system mapping, we propose to perform the estimation in the shape domain, which turns out to be computationally more efficient. Simulations of the self-localization of an acoustic sensor network and a subsequent coordinate mapping for a joint speaker localization showed a significant improvement of the localization performance, since the modalities were able to support each other.


An Evaluation of Unsupervised Acoustic Model Training for a Dysarthric Speech Interface

O. Walter, V. Despotovic, R. Haeb-Umbach, J. Gemmeke, B. Ons, H. Van hamme, in: INTERSPEECH 2014, 2014

In this paper, we investigate unsupervised acoustic model training approaches for dysarthric-speech recognition. These models are, first, frame-based Gaussian posteriorgrams obtained from Vector Quantization (VQ); second, so-called Acoustic Unit Descriptors (AUDs), i.e., hidden Markov models of phone-like units trained in an unsupervised fashion; and, third, posteriorgrams computed on the AUDs. Experiments were carried out on a database collected from a home automation task and containing nine speakers, of which seven are considered to utter dysarthric speech. All unsupervised modeling approaches delivered significantly better recognition rates than a speaker-independent phoneme recognition baseline, showing the suitability of unsupervised acoustic model training for dysarthric speech. While the AUD models led to the most compact representation of an utterance for the subsequent semantic inference stage, posteriorgram-based representations resulted in higher recognition rates, with the Gaussian posteriorgram achieving the highest slot filling F-score of 97.02%. Index Terms: unsupervised learning, acoustic unit descriptors, dysarthric speech, non-negative matrix factorization


An Overview of Noise-Robust Automatic Speech Recognition

J. Li, L. Deng, Y. Gong, R. Haeb-Umbach, IEEE Transactions on Audio, Speech and Language Processing (2014), 22(4), pp. 745-777

New waves of consumer-centric applications, such as voice search and voice interaction with mobile devices and home entertainment systems, increasingly require automatic speech recognition (ASR) to be robust to the full range of real-world noise and other acoustic distorting conditions. Despite its practical importance, however, the inherent links between and distinctions among the myriad of methods for noise-robust ASR have yet to be carefully studied in order to advance the field further. To this end, it is critical to establish a solid, consistent, and common mathematical foundation for noise-robust ASR, which is lacking at present. This article is intended to fill this gap and to provide a thorough overview of modern noise-robust techniques for ASR developed over the past 30 years. We emphasize methods that are proven to be successful and that are likely to sustain or expand their future applicability. We distill key insights from our comprehensive overview in this field and take a fresh look at a few old problems, which nevertheless are still highly relevant today. Specifically, we have analyzed and categorized a wide range of noise-robust techniques using five different criteria: 1) feature-domain vs. model-domain processing, 2) the use of prior knowledge about the acoustic environment distortion, 3) the use of explicit environment-distortion models, 4) deterministic vs. uncertainty processing, and 5) the use of acoustic models trained jointly with the same feature enhancement or model adaptation process used in the testing stage. With this taxonomy-oriented review, we equip the reader with the insight to choose among techniques and with the awareness of the performance-complexity tradeoffs. The pros and cons of using different noise-robust ASR techniques in practical application scenarios are provided as a guide to interested practitioners. The current challenges and future research directions in this field are also carefully analyzed.


A combined hardware-software approach for acoustic sensor network synchronization

J. Schmalenstroeer, P. Jebramcik, R. Haeb-Umbach, Signal Processing (2014)

In this paper we present an approach for synchronizing a wireless acoustic sensor network using a two-stage procedure. First the clock frequency and phase differences between pairs of nodes are estimated employing a two-way message exchange protocol. The estimates are further improved in a Kalman filter with a dedicated observation error model. In the second stage network-wide synchronization is achieved by means of a gossiping algorithm which estimates the average clock frequency and phase of the sensor nodes. These averages are viewed as frequency and phase of a virtual master clock, to which the clocks of the sensor nodes have to be adjusted. The amount of adjustment is computed in a specific control loop. While these steps are done in software, the actual sampling rate correction is carried out in hardware by using an adjustable frequency synthesizer. Experimental results obtained from hardware devices and software simulations of large scale networks are presented.
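The first stage, the pairwise two-way message exchange, can be sketched as follows. The clock offset and the delay distribution below are assumptions made purely for the illustration, not values from the paper:

```python
import random

random.seed(1)

TRUE_OFFSET = 0.25   # slave clock ahead of master by 250 ms (assumed)
MEAN_DELAY = 0.01    # mean one-way network delay, symmetric (assumed)

def two_way_exchange():
    """One round trip: master stamps t1/t4, slave stamps t2/t3."""
    d1 = random.expovariate(1.0 / MEAN_DELAY)   # master -> slave delay
    d2 = random.expovariate(1.0 / MEAN_DELAY)   # slave -> master delay
    t1 = 0.0                        # master send time (master clock)
    t2 = t1 + d1 + TRUE_OFFSET      # slave receive time (slave clock)
    t3 = t2 + 0.001                 # slave reply time (slave clock)
    t4 = t3 - TRUE_OFFSET + d2      # master receive time (master clock)
    # Classic two-way offset estimate; exact when delays are symmetric,
    # otherwise biased by half the delay asymmetry (d1 - d2) / 2.
    return ((t2 - t1) + (t3 - t4)) / 2.0

# Averaging many exchanges suppresses the random delay asymmetry; smoothing
# these raw estimates is the role of the Kalman filter in the paper.
est = sum(two_way_exchange() for _ in range(2000)) / 2000
```

A single exchange yields the offset plus half the instantaneous delay asymmetry, which motivates the dedicated observation error model for the Kalman filter stage.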


Iterative Bayesian Word Segmentation for Unsupervised Vocabulary Discovery from Phoneme Lattices

J. Heymann, O. Walter, R. Haeb-Umbach, B. Raj, in: 39th International Conference on Acoustics, Speech and Signal Processing (ICASSP 2014), 2014

In this paper we present an algorithm for the unsupervised segmentation of a lattice produced by a phoneme recognizer into words. Using a lattice rather than a single phoneme string accounts for the uncertainty of the recognizer about the true label sequence. An example application is the discovery of lexical units from the output of an error-prone phoneme recognizer in a zero-resource setting, where neither the lexicon nor the language model (LM) is known. We propose a computationally efficient iterative approach, which alternates between the following two steps: First, the most probable string is extracted from the lattice using a phoneme LM learned on the segmentation result of the previous iteration. Second, word segmentation is performed on the extracted string using a word and phoneme LM which is learned alongside the new segmentation. We present results on lattices produced by a phoneme recognizer on the WSJCAM0 dataset. We show that our approach delivers segmentation performance superior to an earlier approach found in the literature, in particular for higher-order language models.


Online Observation Error Model Estimation for Acoustic Sensor Network Synchronization

J. Schmalenstroeer, W. Zhao, R. Haeb-Umbach, in: 11. ITG Fachtagung Sprachkommunikation (ITG 2014), 2014

Acoustic sensor network clock synchronization via time stamp exchange between the sensor nodes is not accurate enough for many acoustic signal processing tasks, such as speaker localization. To improve synchronization accuracy it has therefore been proposed to employ a Kalman filter to obtain improved frequency deviation and phase offset estimates. The estimation requires a statistical model of the errors of the measurements obtained from the time stamp exchange algorithm. These errors are caused by random transmission delays and hardware effects and are thus network specific. In this contribution we develop an algorithm to estimate the parameters of the measurement error model alongside the Kalman filter based sampling clock synchronization, employing the Expectation Maximization algorithm. Simulation results demonstrate that the online estimation of the error model parameters leads only to a small degradation of the synchronization performance compared to a perfectly known observation error model.


2013

MAP-based Estimation of the Parameters of a Gaussian Mixture Model in the Presence of Noisy Observations

A. Chinaev, R. Haeb-Umbach, in: 38th International Conference on Acoustics, Speech and Signal Processing (ICASSP 2013), 2013, pp. 3352-3356

In this contribution we derive the Maximum A-Posteriori (MAP) estimates of the parameters of a Gaussian Mixture Model (GMM) in the presence of noisy observations. We assume the distortion to be white Gaussian noise of known mean and variance. An approximate conjugate prior of the GMM parameters is derived allowing for a computationally efficient implementation in a sequential estimation framework. Simulations on artificially generated data demonstrate the superiority of the proposed method compared to the Maximum Likelihood technique and to the ordinary MAP approach, whose estimates are corrected by the known statistics of the distortion in a straightforward manner.



The reverb challenge: a common evaluation framework for dereverberation and recognition of reverberant speech

K. Kinoshita, M. Delcroix, T. Yoshioka, T. Nakatani, E. Habets, R. Haeb-Umbach, V. Leutnant, A. Sehr, W. Kellermann, R. Maas, S. Gannot, B. Raj, in: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2013, pp. 22-23

Recently, substantial progress has been made in the field of reverberant speech signal processing, including both single- and multichannel de-reverberation techniques, and automatic speech recognition (ASR) techniques robust to reverberation. To evaluate state-of-the-art algorithms and obtain new insights regarding potential future research directions, we propose a common evaluation framework including datasets, tasks, and evaluation metrics for both speech enhancement and ASR techniques. The proposed framework will be used as a common basis for the REVERB (REverberant Voice Enhancement and Recognition Benchmark) challenge. This paper describes the rationale behind the challenge, and provides a detailed description of the evaluation framework and benchmark results.


Sampling Rate Synchronisation in Acoustic Sensor Networks with a Pre-Trained Clock Skew Error Model

J. Schmalenstroeer, R. Haeb-Umbach, in: 21th European Signal Processing Conference (EUSIPCO 2013), 2013

In this paper we present a combined hardware/software approach for synchronizing the sampling clocks of an acoustic sensor network. A first clock frequency offset estimate is obtained by a time stamp exchange protocol with a low data rate and computational requirements. The estimate is then postprocessed by a Kalman filter which exploits the specific properties of the statistics of the frequency offset estimation error. In long term experiments the deviation between the sampling oscillators of two sensor nodes never exceeded half a sample with a wired and with a wireless link between the nodes. The achieved precision enables the estimation of time difference of arrival values across different hardware devices without sharing a common sampling hardware.


Blind Speech Separation Exploiting Temporal and Spectral Correlations Using Turbo Decoding of 2D-HMMs

D.H. Tran Vu, R. Haeb-Umbach, in: 21th European Signal Processing Conference (EUSIPCO 2013), 2013

We present a novel method to exploit correlations of adjacent time-frequency (TF)-slots for a sparseness-based blind speech separation (BSS) system. Usually, these correlations are exploited by some heuristic smoothing techniques in the post-processing of the estimated soft TF masks. We propose a different approach: Based on our previous work with one-dimensional (1D)-hidden Markov models (HMMs) along the time axis we extend the modeling to two-dimensional (2D)-HMMs to exploit both temporal and spectral correlations in the speech signal. Based on the principles of turbo decoding we solved the complex inference of 2D-HMMs by a modified forward-backward algorithm which operates alternatingly along the time and the frequency axis. Extrinsic information is exchanged between these steps such that increasingly better soft time-frequency masks are obtained, leading to improved speech separation performance in highly reverberant recording conditions.


Parameter estimation and classification of censored Gaussian data with application to WiFi indoor positioning

M.K. Hoang, R. Haeb-Umbach, in: 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2013), 2013, pp. 3721-3725

In this paper, we consider the Maximum Likelihood (ML) estimation of the parameters of a Gaussian distribution in the presence of censored, i.e., clipped data. We show that the resulting Expectation Maximization (EM) algorithm delivers virtually bias-free and efficient estimates, and we discuss its convergence properties. We also discuss optimal classification in the presence of censored data. Censored data are frequently encountered in wireless LAN positioning systems based on the fingerprinting method employing signal strength measurements, due to the limited sensitivity of the portable devices. Experiments both on simulated and real-world data demonstrate the effectiveness of the proposed algorithms.
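A minimal sketch of such an EM iteration for right-censored (clipped) Gaussian data, built from standard truncated-normal moments; the clipping threshold, sample size, and initialization below are assumptions for the illustration, not the paper's settings:

```python
import math
import random

def _phi(z):   # standard normal pdf
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _Phi(z):   # standard normal cdf
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def em_censored_gaussian(x, c, iters=50):
    """ML estimate of (mu, sigma) from samples right-clipped at threshold c."""
    obs = [v for v in x if v < c]          # uncensored observations
    m = len(x) - len(obs)                  # number of clipped samples
    n = len(x)
    mu, sig = sum(obs) / len(obs), 1.0     # crude initialization
    for _ in range(iters):
        # E-step: moments of N(mu, sig^2) truncated to (c, inf)
        a = (c - mu) / sig
        lam = _phi(a) / (1.0 - _Phi(a))    # inverse Mills ratio
        e1 = mu + sig * lam                # E[X | X > c]
        var = sig * sig * (1.0 + a * lam - lam * lam)
        # M-step: plug expected sufficient statistics into the ML formulas
        mu_new = (sum(obs) + m * e1) / n
        ss = sum((v - mu_new) ** 2 for v in obs)
        ss += m * (var + (e1 - mu_new) ** 2)
        mu, sig = mu_new, math.sqrt(ss / n)
    return mu, sig

random.seed(0)
# Simulate N(2, 1) RSSI-like data clipped at 3 by limited receiver sensitivity.
data = [min(random.gauss(2.0, 1.0), 3.0) for _ in range(20000)]
mu_hat, sig_hat = em_censored_gaussian(data, 3.0)
```

Simply averaging the clipped samples would underestimate the mean; the E-step replaces each clipped value by the conditional moments of the tail beyond the threshold, which removes that bias.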


Using the turbo principle for exploiting temporal and spectral correlations in speech presence probability estimation

D.H.T. Vu, R. Haeb-Umbach, in: 38th International Conference on Acoustics, Speech and Signal Processing (ICASSP 2013), 2013, pp. 863-867

In this paper we present a speech presence probability (SPP) estimation algorithm which exploits both temporal and spectral correlations of speech. To this end, the SPP estimation is formulated as the posterior probability estimation of the states of a two-dimensional (2D) Hidden Markov Model (HMM). We derive an iterative algorithm to decode the 2D-HMM which is based on the turbo principle. The experimental results show that indeed the SPP estimates improve from iteration to iteration, and further clearly outperform another state-of-the-art SPP estimation algorithm.



GMM-based significance decoding

A.H. Abdelaziz, S. Zeiler, D. Kolossa, V. Leutnant, R. Haeb-Umbach, in: Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, 2013, pp. 6827-6831

The accuracy of automatic speech recognition systems in noisy and reverberant environments can be improved notably by exploiting the uncertainty of the estimated speech features using so-called uncertainty-of-observation techniques. In this paper, we introduce a new Bayesian decision rule that can serve as a mathematical framework from which both known and new uncertainty-of-observation techniques can be either derived or approximated. The new decision rule in its direct form leads to the new significance decoding approach for Gaussian mixture models, which results in better performance compared to standard uncertainty-of-observation techniques in different additive and convolutive noise scenarios.


Improved Single-Channel Nonstationary Noise Tracking by an Optimized MAP-based Postprocessor

A. Chinaev, R. Haeb-Umbach, J. Taghia, R. Martin, in: 38th International Conference on Acoustics, Speech and Signal Processing (ICASSP 2013), 2013, pp. 7477-7481

In this paper we present an improved version of the recently proposed Maximum A-Posteriori (MAP) based noise power spectral density estimator. An empirical bias compensation and bandwidth adjustment reduce bias and variance of the noise variance estimates. The main advantage of the MAP-based postprocessor is its low estimation variance. The estimator is employed in the second stage of a two-stage single-channel speech enhancement system, where eight different state-of-the-art noise tracking algorithms were tested in the first stage. While the postprocessor hardly affects the results in stationary noise scenarios, it becomes increasingly effective the more nonstationary the noise is. The proposed postprocessor was able to improve all systems in babble noise with respect to the perceptual evaluation of speech quality (PESQ) performance.


A Hidden Markov Model for Indoor User Tracking Based on WiFi Fingerprinting and Step Detection

M.K. Hoang, J. Schmalenstroeer, C. Drueke, D.H. Tran Vu, R. Haeb-Umbach, in: 21th European Signal Processing Conference (EUSIPCO 2013), 2013

In this paper we present a modified hidden Markov model (HMM) for the fusion of received signal strength index (RSSI) information of WiFi access points and relative position information which is obtained from the inertial sensors of a smartphone for indoor positioning. Since the states of the HMM represent the potential user locations, their number determines the quantization error introduced by discretizing the allowable user positions through the use of the HMM. To reduce this quantization error we introduce "pseudo" states, whose emission probability, which models the RSSI measurements at this location, is synthesized from those of the neighboring states of which a Gaussian emission probability has been estimated during the training phase. The experimental results demonstrate the effectiveness of this approach. By introducing on average two pseudo states per original HMM state the positioning error could be significantly reduced without increasing the training effort.


Bayesian Feature Enhancement for Reverberation and Noise Robust Speech Recognition

V. Leutnant, A. Krueger, R. Haeb-Umbach, IEEE Transactions on Audio, Speech, and Language Processing (2013), 21(8), pp. 1640-1652

In this contribution we extend a previously proposed Bayesian approach for the enhancement of reverberant logarithmic mel power spectral coefficients for robust automatic speech recognition to the additional compensation of background noise. A recently proposed observation model is employed whose time-variant observation error statistics are obtained as a side product of the inference of the a posteriori probability density function of the clean speech feature vectors. Further, a reduction of the computational effort and the memory requirements is achieved by using a recursive formulation of the observation model. The performance of the proposed algorithms is first experimentally studied on a connected digits recognition task with artificially created noisy reverberant data. It is shown that the use of the time-variant observation error model leads to a significant error rate reduction at low signal-to-noise ratios compared to a time-invariant model. Further experiments were conducted on a 5000 word task recorded in a reverberant and noisy environment. A significant word error rate reduction was obtained demonstrating the effectiveness of the approach on real-world data.


On the Acoustic Channel Identification in Multi-Microphone Systems via Adaptive Blind Signal Enhancement Techniques

G. Enzner, D. Schmid, R. Haeb-Umbach, in: 21th European Signal Processing Conference (EUSIPCO 2013), 2013

Among the different configurations of multi-microphone systems, e.g., in applications of speech dereverberation or denoising, we consider the case without a priori information of the microphone-array geometry. This naturally invokes explicit or implicit identification of source-receiver transfer functions as an indirect description of the microphone-array configuration. However, this blind channel identification (BCI) has been difficult due to the lack of unique identifiability in the presence of observation noise or near-common channel zeros. In this paper, we study the implicit BCI performance of blind signal enhancement techniques such as the adaptive principal component analysis (PCA) or the iterative blind equalization and channel identification (BENCH). To this end, we make use of a recently proposed metric, the normalized filter-projection misalignment (NFPM), which is tailored for BCI evaluation in ill-conditioned (e.g., noisy) scenarios. The resulting understanding of implicit BCI performance can help to judge the behavior of multi-microphone speech enhancement systems and the suitability of implicit BCI to serve channel-based (i.e., channel-informed) enhancement.


Server based indoor navigation using RSSI and inertial sensor information

M.K. Hoang, S. Schmitz, C. Drueke, D.H.T. Vu, J. Schmalenstroeer, R. Haeb-Umbach, in: Positioning Navigation and Communication (WPNC), 2013 10th Workshop on, 2013, pp. 1-6

In this paper we present a system for indoor navigation based on received signal strength index information of Wireless-LAN access points and relative position estimates. The relative position information is gathered from inertial smartphone sensors using a step detection and an orientation estimate. Our map data is hosted on a server employing a map renderer and a SQL database. The database includes a complete multilevel office building, within which the user can navigate. During navigation, the client retrieves the position estimate from the server, together with the corresponding map tiles to visualize the user's position on the smartphone display.


DoA-Based Microphone Array Position Self-Calibration Using Circular Statistics

F. Jacob, J. Schmalenstroeer, R. Haeb-Umbach, in: 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2013), 2013, pp. 116-120

In this paper we propose an approach to retrieve the absolute geometry of an acoustic sensor network, consisting of spatially distributed microphone arrays, from reverberant speech input. The calibration relies on direction of arrival measurements of the individual arrays. The proposed calibration algorithm is derived from a maximum-likelihood approach employing circular statistics. Since a sensor node consists of a microphone array with known intra-array geometry, we are able to obtain an absolute geometry estimate, including angles and distances. Simulation results demonstrate the effectiveness of the approach.
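The circular-statistics ingredient can be illustrated by the maximum-likelihood mean direction of noisy DoA measurements under a von Mises model, which is simply the angle of the resultant vector. This is a generic sketch with assumed toy data, not the calibration algorithm itself:

```python
import math
import random

def mean_direction(angles):
    """ML estimate of the von Mises mean direction: the angle of the
    resultant vector. Correctly handles wrap-around at +/- pi."""
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return math.atan2(s, c)

random.seed(0)
true_doa = math.pi  # source located at +pi, right on the wrap-around point
angles = []
for _ in range(500):
    a = true_doa + random.gauss(0.0, 0.1)          # noisy DoA reading
    angles.append((a + math.pi) % (2.0 * math.pi) - math.pi)  # wrap to [-pi, pi)
est = mean_direction(angles)
```

A naive arithmetic mean of the wrapped readings lands near 0, the opposite direction, because samples straddle the +pi/-pi boundary; the resultant-vector estimate stays at the true direction, which is why circular statistics matter for DoA-based calibration.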


Unsupervised Word Discovery from Phonetic Input Using Nested Pitman-Yor Language Modeling

O. Walter, R. Haeb-Umbach, S. Chaudhuri, B. Raj, in: IEEE International Conference on Robotics and Automation (ICRA 2013), 2013

In this paper we consider the unsupervised word discovery from phonetic input. We employ a word segmentation algorithm which simultaneously develops a lexicon, i.e., the transcription of a word in terms of a phone sequence, learns an n-gram language model describing word and word sequence probabilities, and carries out the segmentation itself. The underlying statistical model is that of a Pitman-Yor process, a concept known from Bayesian non-parametrics, which allows for an a priori unknown and unlimited number of different words. Using a hierarchy of Pitman-Yor processes, language models of different order can be employed, and nesting it with another hierarchy of Pitman-Yor processes on the phone level allows for backing off unknown word unigrams by phone m-grams. We present results on a large-vocabulary task, assuming an error-free phone sequence is given. We finish by discussing options for coping with noisy phone sequences.


A Novel Initialization Method for Unsupervised Learning of Acoustic Patterns in Speech (FGNT-2013-01)

O. Walter, J. Schmalenstroeer, R. Haeb-Umbach, 2013

In this paper we present a novel initialization method for unsupervised learning of acoustic patterns in recordings of continuous speech. The pattern discovery task is solved by dynamic time warping, whose performance we improve by a smart starting point selection. This enables a more accurate discovery of patterns compared to conventional approaches. After graph-based clustering the patterns are employed for training hidden Markov models for unsupervised speech acquisition. By iterating between model training and decoding in an EM-like framework the word accuracy is continuously improved. On the TIDIGITS corpus we achieve a word error rate of about 13 percent by the proposed unsupervised pattern discovery approach, which neither assumes knowledge of the acoustic units nor of the labels of the training data.
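The dynamic time warping core on which such pattern discovery rests can be sketched as follows; the scalar features and toy sequences are assumptions for the illustration (real systems compare e.g. MFCC vectors, and the paper's contribution lies in the starting point selection, not in DTW itself):

```python
def dtw_distance(x, y):
    """Dynamic time warping distance between two feature sequences,
    computed by the standard dynamic-programming recursion."""
    inf = float("inf")
    n, m = len(x), len(y)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])  # local frame distance
            # Best of: insertion, deletion, match of the two frames.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

# Two "utterances" of the same pattern at different speaking rates:
# b repeats every frame of a, so DTW can align them at zero cost.
a = [0, 1, 2, 3, 2, 1, 0]
b = [0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 1, 1, 0, 0]
```

Because DTW absorbs tempo differences, recurring acoustic patterns score low pairwise distances, and thresholding those distances yields the candidate pattern pairs that are then clustered and used to train the HMMs.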


2012


Improved Noise Power Spectral Density Tracking by a MAP-based Postprocessor

A. Chinaev, A. Krueger, D.H. Tran Vu, R. Haeb-Umbach, in: 37th International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), 2012

In this paper we present a novel noise power spectral density tracking algorithm and its use in single-channel speech enhancement. It has the unique feature that it is able to track the noise statistics even if speech is dominant in a given time-frequency bin. As a consequence it can follow non-stationary noise superposed by speech, even in the critical case of rising noise power. The algorithm requires an initial estimate of the power spectrum of speech and is thus meant to be used as a postprocessor to a first speech enhancement stage. An experimental comparison with a state-of-the-art noise tracking algorithm demonstrates lower estimation errors under low SNR conditions and smaller fluctuations of the estimated values, resulting in improved speech quality as measured by PESQ scores.



Smartphone-Based Sensor Fusion for Improved Vehicular Navigation

O. Walter, J. Schmalenstroeer, A. Engler, R. Haeb-Umbach, in: 9th Workshop on Positioning Navigation and Communication (WPNC 2012), 2012

In this paper we present a system for car navigation by fusing sensor data on an Android smartphone. The key idea is to use both the internal sensors of the smartphone (e.g., gyroscope) and sensor data from the car (e.g., speed information) to support navigation via GPS. To this end we employ a CAN-Bus-to-Bluetooth adapter to establish a wireless connection between the smartphone and the CAN-Bus of the car. On the smartphone a strapdown algorithm and an error-state Kalman filter are used to fuse the different sensor data streams. The experimental results show that the system is able to maintain higher positioning accuracy during GPS dropouts, thus improving the availability and reliability, compared to GPS-only solutions.
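
The fusion idea can be illustrated with a deliberately simplified scalar Kalman filter; the function below is a hypothetical sketch (the paper uses a strapdown algorithm with an error-state filter, which is considerably more involved): dead reckoning with the car's speed in the predict step, a GPS correction in the update step, and graceful degradation during GPS dropouts.

```python
def kalman_step(x, P, u, z, q, r):
    """One predict/update cycle of a scalar Kalman filter.
    x, P : state estimate (position) and its variance
    u    : control input (e.g., odometry-based position increment)
    z    : measurement (e.g., GPS position); pass None during dropouts
    q, r : process and measurement noise variances
    """
    # Predict: dead-reckon with the car's speed information
    x, P = x + u, P + q
    if z is not None:            # Update only when GPS is available
        K = P / (P + r)          # Kalman gain
        x = x + K * (z - x)
        P = (1.0 - K) * P
    return x, P
```

During a dropout the filter simply keeps integrating the odometry, and the growing variance P reflects the accumulating dead-reckoning uncertainty.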


Investigations Into a Statistical Observation Model for Logarithmic Mel Power Spectral Density Features of Noisy Reverberant Speech

V. Leutnant, A. Krueger, R. Haeb-Umbach, in: Speech Communication; 10. ITG Symposium, 2012, pp. 1-4

In this contribution, a new observation model for the joint compensation of reverberation and noise in the logarithmic mel power spectral density domain will be considered. The proposed observation model relates the noisy reverberant feature to the underlying sequence of clean speech features and the feature of the noise. Nevertheless, due to the complex interaction of these variables in the target domain, the observation model cannot be applied to Bayesian feature enhancement directly, calling for approximations that eventually render the observation model useful. The performance of the approximated observation model will depend strongly on the capability of modeling the difference between the model and the noisy reverberant observation. A detailed analysis of this observation error will be provided in this work. Among others, it will point out the need to account for the instantaneous ratio of the reverberant speech power and the noise power. Index Terms: Bayesian feature enhancement, observation model for noisy reverberant speech


Bayesian Feature Enhancement for ASR of Noisy Reverberant Real-World Data

A. Krueger, O. Walter, V. Leutnant, R. Haeb-Umbach, in: Proc. Interspeech, 2012

In this contribution we investigate the effectiveness of Bayesian feature enhancement (BFE) on a medium-sized recognition task containing real-world recordings of noisy reverberant speech. BFE employs a very coarse model of the acoustic impulse response (AIR) from the source to the microphone, which has been shown to be effective if the speech to be recognized has been generated by artificially convolving nonreverberant speech with a constant AIR. Here we demonstrate that the model is also appropriate to be used in feature enhancement of true recordings of noisy reverberant speech. On the Multi-Channel Wall Street Journal Audio Visual corpus (MC-WSJ-AV) the word error rate is cut in half to 41.9 percent compared to the ETSI Standard Front-End using as input the signal of a single distant microphone with a single recognition pass.


Reverberant Speech Recognition

A. Krueger, R. Haeb-Umbach, in: Techniques for Noise Robustness in Automatic Speech Recognition, Wiley, 2012


A Statistical Observation Model For Noisy Reverberant Speech Features and its Application to Robust ASR

V. Leutnant, A. Krueger, R. Haeb-Umbach, in: Signal Processing, Communications and Computing (ICSPCC), 2012 IEEE International Conference on, 2012

In this work, an observation model for the joint compensation of noise and reverberation in the logarithmic mel power spectral density domain is considered. It relates the features of the noisy reverberant speech to those of the non-reverberant speech and the noise. In contrast to enhancement of features only corrupted by reverberation (reverberant features), enhancement of noisy reverberant features requires a more sophisticated model for the error introduced by the proposed observation model. In a first consideration, it will be shown that this error is highly dependent on the instantaneous ratio of the power of reverberant speech to the power of the noise and, moreover, sensitive to the phase between reverberant speech and noise in the short-time discrete Fourier domain. Afterwards, a statistically motivated approach will be presented allowing for the model of the observation error to be inferred from the error model previously used for the reverberation only case. Finally, the developed observation error model will be utilized in a Bayesian feature enhancement scheme, leading to improvements in word accuracy on the AURORA5 database.


Exploiting Temporal Correlations in Joint Multichannel Speech Separation and Noise Suppression using Hidden Markov Models

D.H. Tran Vu, R. Haeb-Umbach, in: International Workshop on Acoustic Signal Enhancement (IWAENC2012), 2012


Microphone Array Position Self-Calibration from Reverberant Speech Input

F. Jacob, J. Schmalenstroeer, R. Haeb-Umbach, in: International Workshop on Acoustic Signal Enhancement (IWAENC 2012), 2012

In this paper we propose an approach to retrieve the geometry of an acoustic sensor network consisting of spatially distributed microphone arrays from unconstrained speech input. The calibration relies on Direction of Arrival (DoA) measurements which do not require a clock synchronization among the sensor nodes. The calibration problem is formulated as a cost function optimization task, which minimizes the squared differences between measured and predicted observations and additionally avoids the existence of minima that correspond to mirrored versions of the actual sensor orientations. Further, outlier measurements caused by reverberation are mitigated by a Random Sample Consensus (RANSAC) approach. The experimental results show a mean positioning error of at most 25 cm even in highly reverberant environments.
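
The RANSAC idea of discarding reverberation-induced outlier measurements can be sketched on a toy problem; the function below robustly estimates a location parameter from contaminated samples and is only a hypothetical stand-in for the paper's geometry cost function optimization:

```python
import random

def ransac_mean(samples, n_iter=200, tol=0.5, seed=0):
    """Toy RANSAC: robustly estimate a location parameter despite outliers.
    Repeatedly pick a minimal sample set (here: one point), collect its
    consensus set, and refit on the largest consensus set found."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_iter):
        candidate = rng.choice(samples)
        inliers = [s for s in samples if abs(s - candidate) <= tol]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit the model on the consensus set only
    return sum(best_inliers) / len(best_inliers)
```

In the calibration task, "samples" would be DoA measurements and the model a candidate sensor geometry; the consensus step then suppresses measurements corrupted by strong reflections.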


2011

Investigations into Features for Robust Classification into Broad Acoustic Categories

J. Schmalenstroeer, M. Bartek, R. Haeb-Umbach, in: 37. Deutsche Jahrestagung fuer Akustik (DAGA 2011), 2011

In this paper we present our experimental results about classifying audio data into broad acoustic categories. The reverberated sound samples from indoor recordings are grouped into four classes, namely speech, music, acoustic events and noise. We investigated a total of 188 acoustic features and achieved for the best configuration a classification accuracy better than 98%. This was achieved by a 42-dimensional feature vector consisting of Mel-Frequency Cepstral Coefficients, an autocorrelation feature and so-called track features that measure the length of "traces" of high energy in the spectrogram. We also found a 4-feature configuration with a classification rate of about 90% allowing for broad acoustic category classification with low computational effort.


A Platform for efficient Supply Chain Management Support in Logistics

M. Bevermeier, S. Flanke, R. Haeb-Umbach, J. Stehr, in: International Workshop on Intelligent Transportation (WIT 2011), 2011


Unsupervised learning of acoustic events using dynamic time warping and hierarchical K-means++ clustering

J. Schmalenstroeer, M. Bartek, R. Haeb-Umbach, in: Interspeech 2011, 2011

In this paper we propose to jointly consider Segmental Dynamic Time Warping and distance clustering for the unsupervised learning of acoustic events. As a result, the computational complexity increases only linearly with the database size compared to a quadratic increase in a sequential setup, where all pairwise SDTW distances between segments are computed prior to clustering. Further, we discuss options for seed value selection for clustering and show that drawing seeds with a probability proportional to the distance from the already drawn seeds, known as K-means++ clustering, results in a significantly higher probability of finding representatives of each of the underlying classes, compared to the commonly used draws from a uniform distribution. Experiments are performed on an acoustic event classification and an isolated digit recognition task, where on the latter the final word accuracy approaches that of supervised training.
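
The distance-proportional seed drawing described above is the K-means++ seeding rule; a minimal sketch for scalar data (function name and data layout are illustrative, and squared distance is used as in the original K-means++):

```python
import random

def kmeanspp_seeds(points, k, rng=None):
    """K-means++ seeding: draw each new seed with probability
    proportional to its squared distance from the nearest seed so far."""
    rng = rng or random.Random(0)
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        # Squared distance of every point to its closest existing seed
        d2 = [min((p - s) ** 2 for s in seeds) for p in points]
        total = sum(d2)
        r = rng.uniform(0.0, total)
        acc = 0.0
        for p, w in zip(points, d2):  # inverse-CDF draw over d2 weights
            acc += w
            if acc >= r:
                seeds.append(p)
                break
    return seeds
```

Because already-covered points carry near-zero weight, the draw almost always lands in a cluster that has no representative yet, which is exactly the property exploited in the paper.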


Unsupervised Geometry Calibration of Acoustic Sensor Networks Using Source Correspondences

J. Schmalenstroeer, F. Jacob, R. Haeb-Umbach, M. Hennecke, G.A. Fink, in: Interspeech 2011, 2011

In this paper we propose a procedure for estimating the geometric configuration of an arbitrary acoustic sensor placement. It determines the position and the orientation of microphone arrays in 2D while locating a source by direction-of-arrival (DoA) estimation. Neither artificial calibration signals nor unnatural user activity are required. The problem of scale indeterminacy inherent to DoA-only observations is solved by adding time difference of arrival (TDOA) measurements. The geometry calibration method is numerically stable and delivers precise results in moderately reverberated rooms. Simulation results are confirmed by laboratory experiments.


On Initial Seed Selection for Frequency Domain Blind Speech Separation

D.H. Tran Vu, R. Haeb-Umbach, in: Interspeech 2011, 2011

In this paper we address the problem of initial seed selection for frequency domain iterative blind speech separation (BSS) algorithms. The derivation of the seeding algorithm is guided by the goal to select samples which are likely to be caused by source activity and not by noise and at the same time originate from different sources. The proposed algorithm has moderate computational complexity and finds better seed values than alternative schemes, as is demonstrated by experiments on the database of the SiSEC2010 challenge.


A versatile Gaussian splitting approach to non-linear state estimation and its application to noise-robust ASR

V. Leutnant, A. Krueger, R. Haeb-Umbach, in: Interspeech 2011, 2011

In this work, a splitting and weighting scheme that allows for splitting a Gaussian density into a Gaussian mixture density (GMM) is extended to allow the mixture components to be arranged along arbitrary directions. The parameters of the Gaussian mixture are chosen such that the GMM and the original Gaussian still exhibit equal central moments up to an order of four. The resulting mixtures' covariances will have eigenvalues that are smaller than those of the covariance of the original distribution, which is a desirable property in the context of non-linear state estimation, since the underlying assumptions of the extended Kalman filter are better justified in this case. Application to speech feature enhancement in the context of noise-robust automatic speech recognition reveals the beneficial properties of the proposed approach in terms of a reduced word error rate on the Aurora 2 recognition task.


Speech Enhancement With a GSC-Like Structure Employing Eigenvector-Based Transfer Function Ratios Estimation

A. Krueger, E. Warsitz, R. Haeb-Umbach, IEEE Transactions on Audio, Speech, and Language Processing (2011), 19(1), pp. 206-219

In this paper, we present a novel blocking matrix and fixed beamformer design for a generalized sidelobe canceler for speech enhancement in a reverberant enclosure. They are based on a new method for estimating the acoustical transfer function ratios in the presence of stationary noise. The estimation method relies on solving a generalized eigenvalue problem in each frequency bin. An adaptive eigenvector tracking utilizing the power iteration method is employed and shown to achieve a high convergence speed. Simulation results demonstrate that the proposed beamformer leads to better noise and interference reduction and reduced speech distortions compared to other blocking matrix designs from the literature.
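
The power iteration underlying the adaptive eigenvector tracking can be sketched as follows; this is the plain batch version for a single matrix, whereas the paper tracks the dominant eigenvector adaptively and per frequency bin:

```python
import numpy as np

def power_iteration(A, n_iter=100):
    """Power iteration: dominant eigenvector of a square matrix.
    Repeated multiplication amplifies the component along the
    largest-magnitude eigenvalue; normalization keeps it bounded."""
    v = np.ones(A.shape[0], dtype=A.dtype)
    for _ in range(n_iter):
        v = A @ v
        v /= np.linalg.norm(v)
    return v
```

Convergence is geometric in the ratio of the two largest eigenvalue magnitudes, which is why a few updates per frame suffice for tracking slowly varying statistics.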


A Model-Based Approach to Joint Compensation of Noise and Reverberation for Speech Recognition

A. Krueger, R. Haeb-Umbach, in: Robust Speech Recognition of Uncertain or Missing Data, Springer, 2011

Employing automatic speech recognition systems in hands-free communication applications is accompanied by performance degradation due to background noise and, in particular, due to reverberation. These two kinds of distortion alter the shape of the feature vector trajectory extracted from the microphone signal and consequently lead to a discrepancy between training and testing conditions for the recognizer. In this chapter we present a feature enhancement approach aiming at the joint compensation of noise and reverberation to improve the performance by restoring the training conditions. For the enhancement we concentrate on the logarithmic mel power spectral coefficients as features, which are computed at an intermediate stage to obtain the widely used mel frequency cepstral coefficients. The proposed technique is based on a Bayesian framework, which attempts to infer the posterior distribution of the clean features given the observation of all past corrupted features. It exploits information from a priori models describing the dynamics of clean speech and noise-only feature vector trajectories as well as from an observation model relating the reverberant noisy to the clean features. The observation model relies on a simplified stochastic model of the room impulse response (RIR) between the speaker and the microphone, having only two parameters, namely RIR energy and reverberation time, which can be estimated from the captured microphone signal. The performance of the proposed enhancement technique is finally experimentally studied by means of recognition accuracy obtained for a connected digits recognition task under different noise and reverberation conditions using the Aurora 5 database.


Uncertainty Decoding and Conditional Bayesian Estimation

R. Haeb-Umbach, in: Robust Speech Recognition of Uncertain or Missing Data, Springer, 2011

In this contribution classification rules for HMM-based speech recognition in the presence of a mismatch between training and test data are presented. The observed feature vectors are regarded as corrupted versions of underlying and unobservable clean feature vectors, which have the same statistics as the training data. Optimal classification then consists of two steps. First, the posterior density of the clean feature vector, given the observed feature vectors, has to be determined, and second, this posterior is employed in a modified classification rule, which accounts for imperfect estimates. We discuss different variants of the classification rule and further elaborate on the estimation of the clean speech feature posterior, using conditional Bayesian estimation. It is shown that this concept is fairly general and can be applied to different scenarios, such as noisy or reverberant speech recognition.


Conditional Bayesian Estimation Employing a Phase-Sensitive Observation Model for Noise Robust Speech Recognition

V. Leutnant, R. Haeb-Umbach, in: Robust Speech Recognition of Uncertain or Missing Data, Springer, 2011

In this contribution, conditional Bayesian estimation employing a phase-sensitive observation model for noise robust speech recognition will be studied. After a review of speech recognition under the presence of corrupted features, termed uncertainty decoding, the estimation of the posterior distribution of the uncorrupted (clean) feature vector will be shown to be a key element of noise robust speech recognition. The estimation process will be based on three major components: an a priori model of the unobservable data, an observation model relating the unobservable data to the corrupted observation and an inference algorithm, finally allowing for a computationally tractable solution. Special stress will be laid on a detailed derivation of the phase-sensitive observation model and the required moments of the phase factor distribution. Thereby, it will not only be proven analytically that the phase factor distribution is non-Gaussian but also that all central moments can (approximately) be computed solely based on the used mel filter bank, finally rendering the moments independent of noise type and signal-to-noise ratio. The phase-sensitive observation model will then be incorporated into a model-based feature enhancement scheme and recognition experiments will be carried out on the Aurora 2 and Aurora 4 databases. The importance of incorporating phase factor information into the enhancement scheme is pointed out by all recognition results. Application of the proposed scheme under the derived uncertainty decoding framework further leads to significant improvements in both recognition tasks, eventually reaching the performance achieved with the ETSI advanced front-end.



Können Computer sprechen und hören, sollen sie es überhaupt können? Sprachverarbeitung und ambiente Intelligenz (Can computers speak and hear, and should they be able to at all? Speech processing and ambient intelligence)

R. Haeb-Umbach, in: Baustelle Informationsgesellschaft und Universität heute, Ferdinand Schoeningh Verlag, Paderborn, 2011


Adaptive Systems for Unsupervised Speaker Tracking and Speech Recognition

T. Herbig, F. Gerl, W. Minker, R. Haeb-Umbach, Evolving Systems (2011), 2(3), pp. 199-214


MAP-based estimation of the parameters of non-stationary Gaussian processes from noisy observations

A. Krueger, R. Haeb-Umbach, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2011), 2011, pp. 3596-3599

The paper proposes a modification of the standard maximum a posteriori (MAP) method for the estimation of the parameters of a Gaussian process for cases where the process is superposed by additive Gaussian observation errors of known variance. Simulations on artificially generated data demonstrate the superiority of the proposed method. While it reduces to the ordinary MAP approach in the absence of observation noise, the improvement becomes more pronounced the larger the variance of the observation noise is. The method is further extended to track the parameters in case of non-stationary Gaussian processes.


2010

Barometric height estimation combined with map-matching in a loosely-coupled Kalman-filter

M. Bevermeier, O. Walter, S. Peschke, R. Haeb-Umbach, in: 7th Workshop on Positioning Navigation and Communication (WPNC 2010), 2010, pp. 128-134

In this paper we present a robust location estimation algorithm especially focused on the accuracy in vertical position. A loosely-coupled error state space Kalman filter, which fuses sensor data of an Inertial Measurement Unit and the output of a Global Positioning System device, is augmented by height information from an altitude measurement unit. This unit consists of a barometric altimeter whose output is fused with topographic map information by a Kalman filter to provide robust information about the current vertical user position. These data replace the less reliable vertical position information provided by the GPS device. It is shown that typical barometric errors like thermal divergences and fluctuations in the pressure due to changing weather conditions can be compensated by the topographic map information and the barometric error Kalman filter. The resulting height information is shown not only to be more reliable than height information provided by GPS. It also turns out that it leads to better attitude and thus better overall localization estimation accuracy due to the coupling of spatial orientations via the Direct Cosine Matrix. Results are presented both for artificially generated and field test data, where the user is moving by car.


On the Exploitation of Hidden Markov Models and Linear Dynamic Models in a Hybrid Decoder Architecture for Continuous Speech Recognition

V. Leutnant, R. Haeb-Umbach, in: Interspeech 2010, 2010

Linear dynamic models (LDMs) have been shown to be a viable alternative to hidden Markov models (HMMs) on small-vocabulary recognition tasks, such as phone classification. In this paper we investigate various statistical model combination approaches for a hybrid HMM-LDM recognizer, resulting in a phone classification performance that outperforms the best individual classifier. Further, we report on continuous speech recognition experiments on the AURORA4 corpus, where the model combination is carried out by word graph rescoring. While the hybrid system improves on the HMM system in the case of monophone HMMs, the performance of the triphone HMM system could not be improved by monophone LDMs, indicating the need to introduce context dependency into the LDM model inventory as well.


Model-Based Feature Enhancement for Reverberant Speech Recognition

A. Krueger, R. Haeb-Umbach, IEEE Transactions on Audio, Speech, and Language Processing (2010), 18(7), pp. 1692-1707

In this paper, we present a new technique for automatic speech recognition (ASR) in reverberant environments. Our approach is aimed at the enhancement of the logarithmic Mel power spectrum, which is computed at an intermediate stage to obtain the widely used Mel frequency cepstral coefficients (MFCCs). Given the reverberant logarithmic Mel power spectral coefficients (LMPSCs), a minimum mean square error estimate of the clean LMPSCs is computed by carrying out Bayesian inference. We employ switching linear dynamical models as an a priori model for the dynamics of the clean LMPSCs. Further, we derive a stochastic observation model which relates the clean to the reverberant LMPSCs through a simplified model of the room impulse response (RIR). This model requires only two parameters, namely RIR energy and reverberation time, which can be estimated from the captured microphone signal. The performance of the proposed enhancement technique is studied on the AURORA5 database and compared to that of constrained maximum-likelihood linear regression (CMLLR). It is shown by experimental results that our approach significantly outperforms CMLLR and that up to 80% of the errors caused by the reverberation are recovered. In addition to the fact that the approach is compatible with the standard MFCC feature vectors, it leaves the ASR back-end unchanged. It is of moderate computational complexity and suitable for real-time applications.


Online Diarization of Streaming Audio-Visual Data for Smart Environments

J. Schmalenstroeer, R. Haeb-Umbach, IEEE Journal of Selected Topics in Signal Processing (2010), 4(5), pp. 845-856

For an environment to be perceived as being smart, contextual information has to be gathered to adapt the system's behavior and its interface towards the user. Being a rich source of context information, speech can be acquired unobtrusively by microphone arrays and then processed to extract information about the user and his environment. In this paper, a system for joint temporal segmentation, speaker localization, and identification is presented, which is supported by face identification from video data obtained from a steerable camera. Special attention is paid to latency aspects and online processing capabilities, as they are important for the application under investigation, namely ambient communication. This term describes the vision of terminal-less, session-less and multi-modal telecommunication with remote partners, where the user can move freely within his home while the communication follows him. The speaker diarization serves as a context source, which has been integrated in a service-oriented middleware architecture and is provided to the application to select the most appropriate I/O device and to steer the camera towards the speaker during ambient communication.


An EM Approach to Integrated Multichannel Speech Separation and Noise Suppression

D.H. Tran Vu, R. Haeb-Umbach, in: International Workshop on Acoustic Echo and Noise Control (IWAENC 2010), 2010

In this contribution we provide a unified treatment of blind source separation (BSS) and noise suppression, two tasks which have traditionally been considered different and for which quite different techniques have been developed. Exploiting the sparseness of the sources in the short time frequency domain and using a probabilistic model which accounts for the presence of additive noise and which captures the spatial information of the multi-channel recording, a speech enhancement system is developed which suppresses noise and simultaneously separates speakers in case multiple speakers are active. Source activity estimation and model parameter estimation form the E-step and the M-step of the Expectation Maximization algorithm, respectively. Experimental results obtained on the dataset of the Signal Separation Evaluation Campaign 2010 demonstrate the effectiveness of the proposed system.
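
The E/M split described above (source activity posteriors in the E-step, model parameters in the M-step) can be illustrated on a scalar stand-in; the two-component unit-variance Gaussian mixture below is purely illustrative and far simpler than the paper's spatial noise model:

```python
import math

def em_gmm2(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture with unit variances.
    E-step: posterior component responsibilities per sample
    (analogous to per-bin source activity estimation);
    M-step: re-estimate weights and means (analogous to model parameters)."""
    mu = [min(x), max(x)]
    w = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibilities under unit-variance Gaussians
        resp = []
        for xi in x:
            p = [w[k] * math.exp(-0.5 * (xi - mu[k]) ** 2) for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: maximize the expected complete-data log-likelihood
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(x)
            mu[k] = sum(r[k] * xi for r, xi in zip(resp, x)) / nk
    return w, mu
```
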


Blind speech separation employing directional statistics in an Expectation Maximization framework

D.H. Tran Vu, R. Haeb-Umbach, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2010), 2010, pp. 241-244

In this paper we propose to employ directional statistics in a complex vector space to approach the problem of blind speech separation in the presence of spatially correlated noise. We interpret the values of the short time Fourier transform of the microphone signals to be draws from a mixture of complex Watson distributions, a probabilistic model which naturally accounts for spatial aliasing. The parameters of the density are related to the a priori source probabilities, the power of the sources and the transfer function ratios from sources to sensors. Estimation formulas are derived for these parameters by employing the Expectation Maximization (EM) algorithm. The E-step corresponds to the estimation of the source presence probabilities for each time-frequency bin, while the M-step leads to a maximum signal-to-noise ratio (MaxSNR) beamformer in the presence of uncertainty about the source activity. Experimental results are reported for an implementation in a generalized sidelobe canceller (GSC) like spatial beamforming configuration for 3 speech sources with significant coherent noise in reverberant environments, demonstrating the usefulness of the novel modeling framework.


Ungrounded Independent Non-Negative Factor Analysis

B. Raj, K.W. Wilson, A. Krueger, R. Haeb-Umbach, in: Interspeech 2010, 2010

We describe an algorithm that performs regularized non-negative matrix factorization (NMF) to find independent components in non-negative data. Previous techniques proposed for this purpose require the data to be grounded, with support that goes down to 0 along each dimension; in our work, this requirement is eliminated. Based on this, we present a technique that finds a low-dimensional decomposition of spectrograms by casting it as the discovery of independent non-negative components. Unlike other ICA algorithms, this algorithm computes the mixing matrix rather than an unmixing matrix. It provides a better decomposition than standard NMF when the underlying sources are independent, and it makes better use of additional observation streams than previous non-negative ICA algorithms.
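
The unregularized multiplicative-update NMF that the proposed algorithm builds on can be sketched as follows; the independence-promoting regularizer itself is not reproduced, and `nmf` with its arguments is an illustrative implementation of the standard Lee-Seung updates:

```python
import numpy as np

def nmf(V, r, n_iter=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates minimizing ||V - W H||_F^2.
    V : non-negative data matrix (e.g., a magnitude spectrogram)
    r : number of non-negative components."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps
    H = rng.random((r, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative by construction
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

The regularized variant in the paper adds a penalty term to this objective, which modifies the update rules but keeps the same multiplicative, non-negativity-preserving structure.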


Options for Modelling Temporal Statistical Dependencies in an Acoustic Model for ASR

V. Leutnant, R. Haeb-Umbach, in: 36. Deutsche Jahrestagung fuer Akustik (DAGA 2010), 2010

Traditionally, ASR systems are based on hidden Markov models with Gaussian mixtures modelling the state-conditioned feature distribution. The inherent assumption of conditional independence, stating that a feature's likelihood solely depends on the current HMM state, makes the search computationally tractable, but has also been identified as a major reason for the lack of robustness of such systems. Linear dynamic models have been proposed to overcome this weakness by employing a hidden dynamic state process underlying the observed features. Though performance of linear dynamic models on continuous speech/phone recognition tasks has been shown to be superior to that of equivalent static models, this approach still cannot compete with the established acoustic models. In this paper we consider the combination of hidden Markov models based on Gaussian mixture densities (GMM-HMMs) and linear dynamic models (LDMs) as the acoustic model for automatic speech recognition systems. In doing so, the individual strengths of both models, i.e. the modelling of long-term temporal dependencies by the GMM-HMM and the direct modelling of statistical dependencies between consecutive feature vectors by the LDM, are exploited. Phone classification experiments conducted on the TIMIT database indicate the prospective use of this approach for the application to continuous speech recognition.


2009

An analytic derivation of a phase-sensitive observation model for noise robust speech recognition

V. Leutnant, R. Haeb-Umbach, in: Interspeech 2009, 2009

In this paper we present an analytic derivation of the moments of the phase factor between clean speech and noise cepstral or log-mel-spectral feature vectors. The development shows, among others, that the probability density of the phase factor is of sub-Gaussian nature and that it is independent of the noise type and the signal-to-noise ratio, however dependent on the mel filter bank index. Further we show how to compute the contribution of the phase factor to both the mean and the variance of the noisy speech observation likelihood, which relates the speech and noise feature vectors to those of noisy speech. The resulting phase-sensitive observation model is then used in model-based speech feature enhancement, leading to significant improvements in word accuracy on the AURORA2 database.


On the Estimation and Use of Feature Reliability Information for Noise Robust Speech Recognition

V. Leutnant, R. Haeb-Umbach, in: International Conference on Acoustics (NAG/DAGA 2009), 2009

In this paper we present an Uncertainty Decoding rule which exploits feature reliability information and interframe correlation for noise robust speech recognition. The reliability information can be obtained either from conditional Bayesian estimation, where speech and noise feature vectors are tracked jointly, or by augmenting conventional point estimation methods with heuristics about the estimator's reliability. Experimental results on the AURORA2 database demonstrate on the one hand that Uncertainty Decoding improves recognition performance, while on the other hand the severe approximations needed to arrive at computationally tractable solutions have a noticeable impact on recognition performance.


Model based feature enhancement for automatic speech recognition in reverberant environments

A. Krueger, R. Haeb-Umbach, in: Interspeech 2009, 2009

In this paper we present a new feature space dereverberation technique for automatic speech recognition. We derive an expression for the dependence of the reverberant speech features in the log-mel spectral domain on the non-reverberant speech features and the room impulse response. The obtained observation model is used for a model based speech enhancement based on Kalman filtering. The performance of the proposed enhancement technique is studied on the AURORA5 database. In our currently best configuration, which includes uncertainty decoding, the number of recognition errors is approximately halved compared to the recognition of unprocessed speech.


Audio-Visual Data Processing for Ambient Communication

J. Schmalenstroeer, V. Leutnant, R. Haeb-Umbach, in: 1st International Workshop on Distributed Computing in Ambient Environments within 32nd Annual Conference on Artificial Intelligence, 2009


Robust vehicle localization based on multi-level sensor fusion and online parameter estimation

M. Bevermeier, S. Peschke, R. Haeb-Umbach, in: 6th Workshop on Positioning Navigation and Communication (WPNC 2009), 2009, pp. 235-242

In this paper we present a novel vehicle tracking algorithm, which is based on multi-level sensor fusion of GPS (global positioning system) with Inertial Measurement Unit sensor data. It is shown that the robustness of the system to temporary dropouts of the GPS signal, which may occur due to limited visibility of satellites in narrow street canyons or tunnels, is greatly improved by sensor fusion. We further demonstrate how the observation and state noise covariances of the employed Kalman filters can be estimated alongside the filtering by an application of the Expectation-Maximization algorithm. The proposed time-variant multi-level Kalman filter is shown to outperform an Interacting Multiple Model approach while at the same time being computationally less demanding.


A GPS positioning approach exploiting GSM velocity estimates

S. Peschke, M. Bevermeier, R. Haeb-Umbach, in: 6th Workshop on Positioning Navigation and Communication (WPNC 2009), 2009, pp. 195-202

A combination of GPS (global positioning system) and INS (inertial navigation system) is known to provide high precision and highly robust vehicle localization. Notably during times when the GPS signal has a poor quality, e.g. due to the lack of a sufficiently large number of visible satellites, the INS, which may consist of a gyroscope and an odometer, will lead to improved positioning accuracy. In this paper we show how velocity information obtained from GSM (global system for mobile communications) signalling, rather than from a tachometer, can be used together with a gyroscope sensor to support localization in the presence of temporarily unavailable GPS data. We propose a sensor fusion system architecture and present simulation results that show the effectiveness of this approach.


Approaches to Iterative Speech Feature Enhancement and Recognition

S. Windmann, R. Haeb-Umbach, IEEE Transactions on Audio, Speech, and Language Processing (2009), 17(5), pp. 974-984

In automatic speech recognition, hidden Markov models (HMMs) are commonly used for speech decoding, while switching linear dynamic models (SLDMs) can be employed for a preceding model-based speech feature enhancement. In this paper, these model types are combined in order to obtain a novel iterative speech feature enhancement and recognition architecture. It is shown that speech feature enhancement with SLDMs can be improved by feeding back information from the HMM to the enhancement stage. Two different feedback structures are derived. In the first, the posteriors of the HMM states are used to control the model probabilities of the SLDMs, while in the second they are employed to directly influence the estimate of the speech feature distribution. Both approaches lead to improvements in recognition accuracy both on the AURORA2 and AURORA4 databases compared to non-iterative speech feature enhancement with SLDMs. It is also shown that a combination with uncertainty decoding further enhances performance.


Joint Parameter Estimation and Tracking in a Multi-Stage Kalman Filter for Vehicle Positioning

M. Bevermeier, S. Peschke, R. Haeb-Umbach, in: IEEE 69th Vehicular Technology Conference (VTC 2009 Spring), 2009, pp. 1-5

In this paper we present a novel vehicle tracking method which is based on multi-stage Kalman filtering of GPS and IMU sensor data. After individual Kalman filtering of GPS and IMU measurements the estimates of the orientation of the vehicle are combined in an optimal manner to improve the robustness towards drift errors. The tracking algorithm incorporates the estimation of time-variant covariance parameters by using an iterative block Expectation-Maximization algorithm to account for time-variant driving conditions and measurement quality. The proposed system is compared to an interacting multiple model approach (IMM) and achieves improved localization accuracy at lower computational complexity. Furthermore we show how the joint parameter estimation and localization can be conducted with streaming input data to be able to track vehicles in a real driving environment.


A hierarchical approach to unsupervised shape calibration of microphone array networks

M. Hennecke, T. Ploetz, G.A. Fink, J. Schmalenstroeer, R. Haeb-Umbach, in: IEEE/SP 15th Workshop on Statistical Signal Processing (SSP 2009), 2009, pp. 257-260

Microphone arrays represent the basis for many challenging acoustic sensing tasks. The accuracy of techniques like beamforming directly depends on a precise knowledge of the relative positions of the sensors used. Unfortunately, for certain use cases manually measuring the geometry of an array is not feasible due to practical constraints. In this paper we present an approach to unsupervised shape calibration of microphone array networks. We developed a hierarchical procedure that first performs local shape calibration based on coherence analysis and then employs SRP-PHAT in a network calibration method. Practical experiments demonstrate the effectiveness of our approach especially for highly reverberant acoustic environments.
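SRP-PHAT, used above for network calibration, builds on the phase transform of the generalized cross-correlation. The following is a minimal single-microphone-pair GCC-PHAT delay estimator, one building block of such a procedure rather than the paper's hierarchical method; the signal, delay, and sample rate are made up for the check.

```python
import numpy as np

def gcc_phat(x1, x2, fs):
    """Delay of x2 relative to x1 in seconds, via GCC with phase transform."""
    n = len(x1) + len(x2)                        # zero-pad against circular wrap
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    G = np.conj(X1) * X2
    G /= np.maximum(np.abs(G), 1e-12)            # PHAT weighting: keep phase only
    cc = np.fft.irfft(G, n)
    cc = np.concatenate([cc[-(n // 2):], cc[:n // 2 + 1]])  # centre zero lag
    return (np.argmax(np.abs(cc)) - n // 2) / fs

# Synthetic check: a white-noise signal and a copy delayed by 12 samples.
fs = 16000
rng = np.random.default_rng(3)
s = rng.normal(size=4096)
delay = 12
x2 = np.concatenate([np.zeros(delay), s[:-delay]])
tau = gcc_phat(s, x2, fs)                        # expected: delay / fs
```

The phase transform discards magnitude information, which sharpens the correlation peak and is what makes this family of methods comparatively robust in reverberant rooms.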


Fusing Audio and Video Information for Online Speaker Diarization

J. Schmalenstroeer, M. Kelling, V. Leutnant, R. Haeb-Umbach, in: Interspeech 2009, 2009

In this paper we present a system for identifying and localizing speakers using distant microphone arrays and a steerable pan-tilt-zoom camera. Audio and video streams are processed in real-time to obtain the diarization information "who speaks when and where" with low latency to be used in advanced video conferencing systems or user-adaptive interfaces. A key feature of the proposed system is to first glean information about the speaker's location and identity from the audio and visual data streams separately and then to fuse these data in a probabilistic framework employing the Viterbi algorithm. Here, visual evidence of a person is utilized through a priori state probabilities, while location and speaker change information are employed via time-variant transition probabilities. Experiments show that video information yields a substantial improvement compared to pure audio-based diarization.


Parameter Estimation of a State-Space Model of Noise for Robust Speech Recognition

S. Windmann, R. Haeb-Umbach, IEEE Transactions on Audio, Speech, and Language Processing (2009), 17(8), pp. 1577-1590

In this paper, parameter estimation of a state-space model of noise or noisy speech cepstra is investigated. A blockwise EM algorithm is derived for the estimation of the state and observation noise covariance from noise-only input data. It is supposed to be used during the offline training mode of a speech recognizer. Further a sequential online EM algorithm is developed to adapt the observation noise covariance on noisy speech cepstra at its input. The estimated parameters are then used in model-based speech feature enhancement for noise-robust automatic speech recognition. Experiments on the AURORA4 database lead to improved recognition results with a linear state model compared to the assumption of stationary noise.



2008

Speech enhancement with a new generalized eigenvector blocking matrix for application in a generalized sidelobe canceller

E. Warsitz, A. Krueger, R. Haeb-Umbach, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2008), 2008, pp. 73-76

The generalized sidelobe canceller by Griffiths and Jim is a robust beamforming method to enhance a desired (speech) signal in the presence of stationary noise. Its performance depends to a high degree on the construction of the blocking matrix which produces noise reference signals for the subsequent adaptive interference canceller. Especially in reverberant environments the beamformer may suffer from signal leakage and reduced noise suppression. In this paper a new blocking matrix is proposed. It is based on a generalized eigenvalue problem whose solution provides an indirect estimation of the transfer functions from the source to the sensors. The quality of the new generalized eigenvector blocking matrix is studied in simulated rooms with different reverberation times and is compared to alternatives proposed in the literature.


Uncertainty Decoding in Automatic Speech Recognition

R. Haeb-Umbach, 2008 ITG Conference on Voice Communication (SprachKommunikation) (2008), pp. 1-7

The term uncertainty decoding has been coined for a class of robustness enhancing algorithms in automatic speech recognition that replace point estimates and plug-in rules by posterior densities and optimal decision rules. While uncertainty can be incorporated in the model domain, in the feature domain, or even in both, we concentrate here on feature domain approaches as they tend to be computationally less demanding. We derive optimal decision rules in the presence of uncertain observations and discuss simplifications which result in computationally efficient realizations. The usefulness of the presented statistical framework is then exemplified for two types of real-world problems: The first is improving the robustness of speech recognition towards incomplete or corrupted feature vectors due to a lossy communication link between the speech capturing front end and the backend recognition engine. And the second is the well-known and extensively studied issue of improving the robustness of the recognizer towards environmental noise.
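For Gaussian models, the core effect of feature-domain uncertainty decoding can be shown in a few lines: if the clean-feature posterior is modeled as N(x; x_hat, var_x), integrating it against a class-conditional Gaussian N(x; mu, sigma2) again yields a Gaussian with the variances added, so unreliable features automatically lose influence on the decision. The two 1-D classes below are illustrative stand-ins, not taken from the paper.

```python
import numpy as np
from scipy.stats import norm

# Two 1-D Gaussian classes as stand-ins for HMM state output densities.
mu = {"a": 0.0, "b": 4.0}
sigma2 = 1.0

def point_loglik(x_hat, c):
    """Classical plug-in rule: treat the estimate x_hat as if it were exact."""
    return norm.logpdf(x_hat, mu[c], np.sqrt(sigma2))

def uncertainty_loglik(x_hat, var_x, c):
    """Uncertainty decoding: integrating the feature posterior
    N(x; x_hat, var_x) against the class density adds the variances."""
    return norm.logpdf(x_hat, mu[c], np.sqrt(sigma2 + var_x))

# A highly unreliable observation should barely influence the decision.
x_hat, var_x = 4.0, 100.0
gap_point = point_loglik(x_hat, "b") - point_loglik(x_hat, "a")
gap_uncertain = (uncertainty_loglik(x_hat, var_x, "b")
                 - uncertainty_loglik(x_hat, var_x, "a"))
```

With the plug-in rule the log-likelihood gap between the classes is 8 nats; with the uncertainty accounted for it shrinks to roughly 0.08 nats, so a corrupted frame cannot dominate the utterance score.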


Error Concealment

R. Haeb-Umbach, V. Ion, in: Automatic Speech Recognition on Mobile Devices and over Communication Networks, Springer, 2008, pp. 187-210

In distributed and network speech recognition the actual recognition task is not carried out on the user's terminal but rather on a remote server in the network. While there are good reasons for doing so, a disadvantage of this client-server architecture is clearly that the communication medium may introduce errors, which then impairs speech recognition accuracy. Even sophisticated channel coding cannot completely prevent the occurrence of residual bit errors in the case of temporarily adverse channel conditions, and in packet-oriented transmission packets of data may arrive too late for the given real-time constraints and have to be declared lost. The goal of error concealment is to reduce the detrimental effect that such errors may induce on the recipient of the transmitted speech signal by exploiting residual redundancy in the bit stream at the source coder output. In classical speech transmission a human is the recipient, and erroneous data are reconstructed so as to reduce the subjectively annoying effect of corrupted bits or lost packets. Here, however, a statistical classifier is at the receiving end, which can benefit from knowledge about the quality of the reconstruction. In this book chapter we show how the classical Bayesian decision rule needs to be modified to account for uncertain features, and illustrate how the required feature posterior density can be estimated in the case of distributed speech recognition. Some other techniques for error concealment can be related to this approach. Experimental results are given for both a small and a medium vocabulary recognition task and both for a channel exhibiting bit errors and a packet erasure channel.


A segmental HMM based on a modified emission probability

S. Windmann, R. Haeb-Umbach, V. Leutnant, 2008 ITG Conference on Voice Communication (SprachKommunikation) (2008), pp. 1-4

In this paper, a novel segmental Hidden Markov Model (HMM) is proposed. The model is based on a modified emission density where additional statistical dependencies between subsequent frames of the speech signal are considered. We further derive an effective search strategy for the modified statistical model and introduce an approach to parameter reduction. Experiments were carried out on the AURORA2 database, where consistent improvements were obtained.


A Novel Uncertainty Decoding Rule With Applications to Transmission Error Robust Speech Recognition

V. Ion, R. Haeb-Umbach, IEEE Transactions on Audio, Speech, and Language Processing (2008), 16(5), pp. 1047-1060

In this paper, we derive an uncertainty decoding rule for automatic speech recognition (ASR), which accounts for both corrupted observations and inter-frame correlation. The conditional independence assumption, prevalent in hidden Markov model-based ASR, is relaxed to obtain a clean speech posterior that is conditioned on the complete observed feature vector sequence. This is a more informative posterior than one conditioned only on the current observation. The novel decoding is used to obtain a transmission-error robust remote ASR system, where the speech capturing unit is connected to the decoder via an error-prone communication network. We show how the clean speech posterior can be computed for communication links being characterized by either bit errors or packet loss. Recognition results are presented for both distributed and network speech recognition, where in the latter case common voice-over-IP codecs are employed.


Blinde Akustische Strahlformung fuer Anwendungen im KFZ

A. Krueger, E. Warsitz, R. Haeb-Umbach, in: 34. Deutsche Jahrestagung fuer Akustik (DAGA 2008), 2008

In this contribution two novel acoustic beamforming algorithms for use in automobiles are discussed. In both methods, for each frequency component the eigenvector corresponding to the largest eigenvalue of a generalized eigenvalue problem has to be determined: in one variant the filter coefficients are given directly by this eigenvector, and speech distortions are compensated with a single-channel post-filter. In the other variant, which has the structure of a Generalized Sidelobe Canceller (GSC), the blocking matrix is based on this eigenvector decomposition.


Blind Speech Separation in Presence of Correlated Noise with Generalized Eigenvector Beamforming

D.H. Tran Vu, R. Haeb-Umbach, 2008 ITG Conference on Voice Communication (SprachKommunikation) (2008), pp. 1-4

This paper considers the convolutive blind source separation of speech sources in the presence of spatially correlated noise. We introduce a method for estimating the scaled mixing matrix from the sources to the microphones even if coherent noise is present. This is achieved by combining time-frequency sparseness with the generalized eigenvalue decomposition of the power spectral density (PSD) matrix of the noisy speech and noise-only microphone signals. Separation is performed by spatial filtering with coefficients constructed by Gram-Schmidt orthogonalization which places spatial nulls at the interferer's direction. Experimental results show that our approach is capable of separating 2 sources in a reverberant environment (RT60=0ms..500ms) degraded by significant directional noise.


A novel approach to noise estimation in model-based speech feature enhancement

S. Windmann, R. Haeb-Umbach, 2008 ITG Conference on Voice Communication (SprachKommunikation) (2008), pp. 1-4

In this paper, noise estimation for model-based speech feature enhancement in automatic speech recognition (ASR) is investigated. Besides a stationary noise prior, three linear state space models for the (cepstral) noise process are considered. We have derived novel EM algorithms for the estimation of the noise model parameters: A blockwise EM algorithm is applied to noise-only input data and is intended for the offline training mode of the recognizer. Further, a sequential online EM algorithm is employed to adapt the observation variance in recognition mode, which works both under the assumption of a stationary noise prior and with a linear state model for the noise. Experiments on the AURORA4 database lead to improved recognition results with the new state model compared to the assumption of stationary noise.


Investigations into Uncertainty Decoding Employing a Discrete Feature Space for Noise Robust Automatic Speech Recognition

V. Ion, R. Haeb-Umbach, 2008 ITG Conference on Voice Communication (SprachKommunikation) (2008), pp. 1-4

This paper addresses the robustness of automatic speech recognition to environmental noise. In order to account for reliability of the clean feature estimate we employ the feature posterior density conditioned on observed noisy features to perform uncertainty decoding. We investigate two approaches to estimate the posterior using a discrete feature space, first conditioning only on the current observation, and second on the whole feature sequence of an utterance. Experiments with Aurora 2 showed that the latter provides slightly better performance, as it allows for exploiting the temporal correlations between consecutive features.


Generalized Eigenvector Blind Speech Separation Under Coherent Noise In A GSC Configuration

D.H. Tran Vu, A. Krueger, R. Haeb-Umbach, in: International Workshop on Acoustic Echo and Noise Control (IWAENC 2008), 2008

This paper deals with a new technique for multi-channel separation of speech signals from convolutive mixtures under coherent noise. We demonstrate how the scaled transfer functions from the sources to the microphones can be estimated even in the presence of stationary coherent noise. The key to this is generalized eigenvalue decompositions of the power spectral density (PSD) matrices of the noisy speech and noise-only microphone signals, with a controlled estimation of these matrices exploiting time-frequency sparseness of the speech sources. Separation is further improved by subsequent Gram-Schmidt orthogonalization which places spatial nulls at the interferers' directions, while noise reduction is improved by employing a novel blocking matrix and adaptive interference canceller in a Generalized Sidelobe Canceller (GSC)-like structure. We report promising experimental results for 2 speech sources with significant coherent noise in reverberant environments (RT60=0ms..500ms).


Modeling the dynamics of speech and noise for speech feature enhancement in ASR

S. Windmann, R. Haeb-Umbach, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2008), 2008, pp. 4409-4412

In this paper a switching linear dynamical model (SLDM) approach for speech feature enhancement is improved by employing more accurate models for the dynamics of speech and noise. The model of the clean speech feature trajectory is improved by augmenting the state vector to capture information derived from the delta features. Further a hidden noise state variable is introduced to obtain a more elaborated model for the noise dynamics. Approximate Bayesian inference in the SLDM is carried out by a bank of extended Kalman filters, whose outputs are combined according to the a posteriori probability of the individual state models. Experimental results on the AURORA2 database show improved recognition accuracy.


2007

OFDM Channel Estimation Based on Combined Estimation in Time and Frequency Domain

R. Haeb-Umbach, M. Bevermeier, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007), 2007, pp. III-277-III-280

In this paper we present a novel channel impulse response estimation technique for block-oriented OFDM transmission based on combining estimators: the estimates provided by a Kalman filter operating in the time domain and a Wiener filter in the frequency domain are optimally combined by taking into account their estimated error covariances. The resulting estimator turns out to be identical to the MAP estimator of correlated jointly Gaussian mean vectors. Different variants of the proposed scheme are experimentally investigated in an IEEE 802.11a-like system setup. They compare favourably with known approaches from the literature, resulting in reduced mean square estimation error and bit error rate. Further, robustness and complexity issues are discussed.


Amigo Context Management Service with Applications in Ambient Communication Scenarios

J. Schmalenstroeer, V. Leutnant, R. Haeb-Umbach, in: AMI-07 - European Conference on Ambient Intelligence, 2007


Projekt Amigo - Sprachsignalverarbeitung im vernetzten Haus

J. Schmalenstroeer, E. Warsitz, R. Haeb-Umbach, in: 33. Deutsche Jahrestagung fuer Akustik (DAGA 2007), 2007


Zweistufige Sprache/Pause-Detektion in stark gestoerter Umgebung

E. Warsitz, R. Haeb-Umbach, J. Schmalenstroeer, in: 33. Deutsche Jahrestagung fuer Akustik (DAGA 2007), 2007


A Novel Similarity Measure for Positioning Cellular Phones by a Comparison With a Database of Signal Power Levels

R. Haeb-Umbach, S. Peschke, IEEE Transactions on Vehicular Technology (2007), 56(1), pp. 368-372

In this paper, we propose a novel similarity measure to be used for localizing mobile terminals by comparing measured signal power levels with a database of predictions. The proposed measure provides the possibility to incorporate inherent information about signal power level measurements requested by the serving base station but not reported by the mobile terminal. Increased positioning accuracy was observed both in simulations and with real field data


Velocity Estimation of Mobile Terminals by Exploiting GSM Downlink Signalling

S. Peschke, R. Haeb-Umbach, in: 4th Workshop on Positioning Navigation and Communication (WPNC 2007), 2007, pp. 217-222

In this paper, we experimentally evaluate algorithms for velocity estimation of a GSM 900 mobile terminal which are based on the analysis of the statistical properties of the fast fading process. It is shown how these statistics can be obtained from the training sequences present in downlink transmission bursts without establishing an active connection. Realistic simulations of a GSM channel according to the COST 207 channel models have been conducted. These models incorporate effects like multipath propagation, fading, cochannel interference and additive noise. It is shown that velocity estimation by searching for the maximum slope of the power density spectrum of the fast fading performs best.


Blind Acoustic Beamforming Based on Generalized Eigenvalue Decomposition

E. Warsitz, R. Haeb-Umbach, IEEE Transactions on Audio, Speech, and Language Processing (2007), 15(5), pp. 1529-1539

Maximizing the output signal-to-noise ratio (SNR) of a sensor array in the presence of spatially colored noise leads to a generalized eigenvalue problem. While this approach has extensively been employed in narrowband (antenna) array beamforming, it is typically not used for broadband (microphone) array beamforming due to the uncontrolled amount of speech distortion introduced by a narrowband SNR criterion. In this paper, we show how the distortion of the desired signal can be controlled by a single-channel post-filter, resulting in a performance comparable to the generalized minimum variance distortionless response beamformer, where arbitrary transfer functions relate the source and the microphones. Results are given both for directional and diffuse noise. A novel gradient ascent adaptation algorithm is presented, and its good convergence properties are experimentally revealed by comparison with alternatives from the literature. A key feature of the proposed beamformer is that it operates blindly, i.e., it neither requires knowledge about the array geometry nor an explicit estimation of the transfer functions from source to sensors or the direction-of-arrival.
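The max-SNR criterion described above reduces, per frequency bin, to a generalized eigenvalue problem. The following sketch (an illustrative narrowband scenario with made-up PSD matrices, not the paper's blind adaptive algorithm) picks the principal generalized eigenvector with SciPy and checks the resulting SNR:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
M = 4                                              # microphones

# Simulated quantities for a single STFT bin: a random acoustic transfer
# function vector and a spatially colored noise PSD matrix.
d = rng.normal(size=M) + 1j * rng.normal(size=M)
Phi_ss = 2.0 * np.outer(d, d.conj())               # rank-1 speech PSD matrix
A = rng.normal(size=(M, M)) + 1j * rng.normal(size=(M, M))
Phi_nn = A @ A.conj().T + 0.1 * np.eye(M)          # full-rank noise PSD matrix
Phi_xx = Phi_ss + Phi_nn                           # noisy-speech PSD matrix

# Maximizing w^H Phi_ss w / w^H Phi_nn w is equivalent to the generalized
# eigenvalue problem Phi_xx w = lambda * Phi_nn w; the beamforming filter is
# the eigenvector belonging to the largest eigenvalue (lambda_max = 1 + SNR).
eigvals, eigvecs = eigh(Phi_xx, Phi_nn)            # eigenvalues in ascending order
w = eigvecs[:, -1]

def output_snr(w):
    return (w.conj() @ Phi_ss @ w).real / (w.conj() @ Phi_nn @ w).real
```

Since the criterion is narrowband, the filter is only defined up to a complex scale per bin; this is the speech-distortion issue that the single-channel post-filter in the paper addresses.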


2006

A Probabilistic Similarity Measure and a Non-Linear Post-Filter for Mobile Phone Positioning using GSM Signal Power Measurements

S. Peschke, R. Haeb-Umbach, in: European Navigation Conference \& Exhibition (ENC 2006), 2006

In this paper we present the design of a particle filter for post filtering instantaneous positioning estimates of GSM mobile terminals. The instantaneous estimates are obtained by comparing signal power levels, which are reported by the mobile terminal to the base station, with a database of predictions using a novel statistically motivated similarity measure. Unlike a simple Euclidean distance measure, the proposed scheme incorporates inherent information about signal power level measurements requested by the serving base station but not reported by the mobile terminal. Furthermore, we show how the Monte Carlo method of particle filtering helps to obtain better position estimates and, surprisingly, also helps to reduce the computational complexity. Results are presented for real field data.


Mehrkanalige Sprachsignalverarbeitung durch adaptives Eigenbeamforming fuer Freisprecheinrichtungen im Kraftfahrzeug

E. Warsitz, R. Haeb-Umbach, in: 32. Deutsche Jahrestagung fuer Akustik (DAGA 2006), 2006

Broadband adaptive beamformers, which use a narrowband SNR-maximization optimization criterion for noise reduction, typically cause distortions of the desired speech signal at the beamformer output. In this paper two methods are investigated to control the speech distortion by comparing the eigenvector beamformer with a maximum likelihood beamformer: One is an analytic solution for the ideal case of absence of reverberation and the other one is a statistically motivated approach. We use the recently introduced gradient-ascent algorithm for adaptive principal eigenvector beamforming and then normalize the filter coefficients by the proposed distortion control methods. Experimental results in terms of the achievable SNR gain and a perceptual speech quality measure are given for the normalized eigenvector beamformer and are compared to standard beamforming methods.


Einkanalige Sprachsignalverbesserung mit Hilfe eines marginalisierten Partikelfilters

S. Windmann, R. Haeb-Umbach, in: 7. ITG-Fachtagung Sprachkommunikation, 2006

A marginalized particle filter is described which is to be employed for single-channel speech enhancement with a non-linear dynamic state model. The system consists of a particle filter for tracking LSP parameters and, for each particle, a Kalman filter used for speech enhancement. In our approach the parameters are assumed to be constant within short blocks of the speech signal, while the speech signal itself changes with every sample. For white noise, SNR gains similar to those of a Kalman-EM-iterative algorithm are achieved, while the residual background noise and the log-spectral distance are somewhat lower. Investigations for colored noise were also carried out with an extended state model.


Comparison of Decoder-based Transmission Error Compensation Techniques for Distributed Speech Recognition

V. Ion, R. Haeb-Umbach, in: 7. ITG-Fachtagung Sprachkommunikation, 2006

In this study we evaluate transmission error compensation techniques for distributed speech recognition systems based on modification of the speech decoder. The candidates are marginalization, weighted Viterbi and our recently proposed soft-feature uncertainty decoding. For the latter, it is shown how the Bayesian speech recognition approach must be reformulated for recognition at the server side. The resulting predictive classifier is able to take account of the transmission errors by changing the contribution of the affected speech features to the acoustic score. The comparison of the experimental results has proven the superiority of our approach.


Particle Filtering of Database assisted Positioning Estimates using a novel Similarity Measure for GSM Signal Power Level Measurements

S. Peschke, R. Haeb-Umbach, in: 3rd Workshop on Positioning Navigation and Communication (WPNC 2006), 2006

In this paper we present a novel and statistically motivated similarity measure for database assisted positioning of GSM mobile terminals by evaluating signal power level reports which are transmitted regularly. Unlike a simple Euclidean distance measure, the proposed scheme incorporates inherent information about signal power level measurements requested by the serving base station but not reported by the mobile terminal. Furthermore we show how the Monte Carlo method of nonlinear post filtering using particle filtering helps to obtain better position estimates and surprisingly also helps to reduce the computational complexity. Results are presented for real field data.


Controlling Speech Distortion in Adaptive Frequency-Domain Principal Eigenvector Beamforming

E. Warsitz, R. Haeb-Umbach, in: International Workshop on Acoustic Echo and Noise Control (IWAENC 2006), 2006

Broadband adaptive beamformers, which use a narrowband SNR-maximization optimization criterion for noise reduction, typically cause distortions of the desired speech signal at the beamformer output. In this paper two methods are investigated to control the speech distortion by comparing the eigenvector beamformer with a maximum likelihood beamformer: One is an analytic solution for the ideal case of absence of reverberation and the other one is a statistically motivated approach. We use the recently introduced gradient-ascent algorithm for adaptive principal eigenvector beamforming and then normalize the filter coefficients by the proposed distortion control methods. Experimental results in terms of the achievable SNR gain and a perceptual speech quality measure are given for the normalized eigenvector beamformer and are compared to standard beamforming methods.


Iterative Speech Enhancement using a Non-Linear Dynamic State Model of Speech and its Parameters

S. Windmann, R. Haeb-Umbach, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2006), 2006, pp. I

A marginalized particle filter is proposed for performing single channel speech enhancement with a non-linear dynamic state model. The system consists of a particle filter for tracking line spectral pair (LSP) parameters and a Kalman filter per particle for speech enhancement. The state model for the LSPs has been learnt on clean speech training data. In our approach parameters and speech samples are processed at different time scales by assuming the parameters to be constant for small blocks of data. Further enhancement is obtained by an iteration which can be applied on these small blocks. The experiments show that similar SNR gains are obtained as with the Kalman-EM-iterative algorithm. However, better values of the noise level and the log-spectral distance are achieved.


An Inexpensive Packet Loss Compensation Scheme for Distributed Speech Recognition Based on Soft-Features

V. Ion, R. Haeb-Umbach, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2006), 2006, pp. I

Soft-feature based speech recognition, which is an example of uncertainty decoding, has been proven to be a robust error mitigation method for distributed speech recognition over wireless channels exhibiting bit errors. In this paper we extend this concept to packet-oriented transmissions. The a posteriori probability density function of the lost feature vector, given the closest received neighbours, is computed. In the experiments, the nearest frame repetition, which is shown to be equivalent to the MAP estimate, outperforms the MMSE estimate for long bursts. Taking the variance into account at the speech recognition stage results in superior performance compared to classical schemes using point estimates. A computationally and memory efficient implementation of the proposed packet loss compensation scheme based on table lookup is presented


Uncertainty decoding for distributed speech recognition over error-prone networks

V. Ion, R. Haeb-Umbach, Speech Communication (2006), 48(11), pp. 1435-1446

In this paper, we propose an enhanced error concealment strategy at the server side of a distributed speech recognition (DSR) system, which is fully compatible with the existing DSR standard. It is based on a Bayesian approach, where the a posteriori probability density of the error-free feature vector is computed, given all received feature vectors which are possibly corrupted by transmission errors. Rather than computing a point estimate, such as the MMSE estimate, and plugging it into the Bayesian decision rule, we employ uncertainty decoding, which results in an integration over the uncertainty in the feature domain. In a typical scenario the communication between the thin client, often a mobile device, and the recognition server spreads across heterogeneous networks. Both bit errors on circuit-switched links and lost data packets on IP connections are mitigated by our approach in a unified manner. The experiments reveal improved robustness both for small- and large-vocabulary recognition tasks.


Online Speaker Change Detection by Combining BIC with Microphone Array Beamforming

J. Schmalenstroeer, R. Haeb-Umbach, in: Interspeech 2006, 2006

In this paper we consider the problem of detecting speaker changes in audio signals recorded by distant microphones. It is shown that the possibility to exploit the spatial separation of speakers more than makes up the degradation in detection accuracy due to the increased source-to-sensor distance compared to close-talking microphones. Speaker direction information is derived from the filter coefficients of an adaptive Filter-and-Sum Beamformer and is combined with BIC analysis. The experimental results reveal significant improvements compared to BIC-only change detection, be it with the distant or close-talking microphone.
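The BIC analysis referred to above compares a one-Gaussian model of an analysis window against a two-Gaussian model split at a hypothesized change point. A self-contained ΔBIC sketch with full-covariance Gaussians follows; the synthetic features, dimensionality, and penalty weight λ are illustrative assumptions, and the beamformer-direction fusion of the paper is not included.

```python
import numpy as np

def delta_bic(X1, X2, lam=1.0):
    """Delta-BIC for a hypothesized change between segments X1, X2
    (frames x dims). Positive values favour the speaker-change hypothesis."""
    X = np.vstack([X1, X2])
    n1, n2, n = len(X1), len(X2), len(X1) + len(X2)
    d = X.shape[1]
    logdet = lambda S: np.linalg.slogdet(np.cov(S, rowvar=False))[1]
    # Model-complexity penalty: d mean parameters + d(d+1)/2 covariance
    # parameters for the extra Gaussian.
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(X) - n1 * logdet(X1) - n2 * logdet(X2)) - penalty

# Synthetic 5-dimensional "features": same speaker vs. a shifted distribution.
rng = np.random.default_rng(2)
d = 5
same = delta_bic(rng.normal(0, 1, (200, d)), rng.normal(0, 1, (200, d)))
diff = delta_bic(rng.normal(0, 1, (200, d)), rng.normal(3, 1, (200, d)))
```

For segments drawn from one distribution the penalty dominates and ΔBIC stays negative; a mean shift inflates the pooled covariance determinant and drives ΔBIC strongly positive.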


Improved Source Modeling and Predictive Classification for Channel Robust Speech Recognition

V. Ion, R. Haeb-Umbach, in: Interspeech 2006, 2006

The accuracy of distributed speech recognition has been shown to be very sensitive to errors occurring during transmission. One reason for this is that the classifier, usually trained under error free conditions, is unable to cope with the mismatch between an error free and an error prone channel. In this paper we present a novel decision rule for classification which is able to account for channel errors. To achieve this, the classical Bayesian speech recognition approach has been reformulated for the server side, where the observation is known only to the extent given by its a posteriori density function. We present a method to estimate the a posteriori density which is based on a Markov model of the source, which captures correlations of both static and dynamic features. A practical implementation is given, accompanied by experimental results for distributed speech recognition over an IP-network.


2005

Adaptive Filter-and-Sum Beamforming in Spatially Correlated Noise

R. Haeb-Umbach, E. Warsitz, in: International Workshop on Acoustic Echo and Noise Control (IWAENC 2005), 2005

In this paper we propose a novel adaptation algorithm for Filter-and-Sum beamforming in spatially correlated noise. Deterministic and stochastic gradient ascent algorithms are derived from a constrained optimization problem, which iteratively estimate the principal eigenvector of a generalized eigenvalue problem. The method does not require an explicit estimation of the speaker location. It is shown that the well-known Delay-and-Sum beamformer and the previously introduced Filter-and-Sum beamformer in spatially white noise are obtained as special cases. Further, bounds on the maximally achievable SNR gains are derived and it is shown that the proposed adaptation algorithm is able to approach these performance bounds.
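The beamformer coefficients that maximize the output SNR solve a generalized eigenvalue problem in the speech and noise cross power spectral density matrices. As a non-adaptive illustration of the fixed point the gradient ascent converges to, here is a batch power iteration (a sketch under simplifying assumptions; the paper's algorithm is adaptive, and the toy matrices below are illustrative):

```python
import numpy as np

def principal_gev(phi_xx, phi_nn, n_iter=200):
    """Principal generalized eigenvector of phi_xx w = lambda phi_nn w,
    found by power iteration on inv(phi_nn) @ phi_xx.  This vector
    maximizes the Rayleigh quotient (w' phi_xx w) / (w' phi_nn w),
    i.e. the output SNR of a filter-and-sum beamformer."""
    a = np.linalg.solve(phi_nn, phi_xx)
    w = np.ones(phi_xx.shape[0])
    for _ in range(n_iter):
        w = a @ w
        w = w / np.linalg.norm(w)  # renormalize each iteration
    return w
```

With spatially white noise (phi_nn proportional to the identity) the iteration degenerates to ordinary power iteration on phi_xx, matching the special case noted in the abstract.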


A Unified Probabilistic Approach to Error Concealment for Distributed Speech Recognition

V. Ion, R. Haeb-Umbach, in: Interspeech 2005, 2005

The transmission errors in a wireless or packet oriented network may dramatically decrease the performance of a distributed speech recognition (DSR) system. Error concealment has been shown to be an effective way to maintain an acceptable word error rate when dealing with error prone communication channels. In this paper we propose an extension of our previously introduced soft features approach for the case that the soft-output of the channel decoder is not available at the server side of the DSR system. We found a simple method to estimate bit reliability information which still gives good speech recognition results. It is shown that some other error concealment schemes turn out to be special cases of the method proposed here.


Acoustic filter-and-sum beamforming by adaptive principal component analysis

E. Warsitz, R. Haeb-Umbach, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2005), 2005, pp. iv/797-iv/800 Vol. 4

For human-machine interfaces in distant-talking environments multichannel signal processing is often employed to obtain an enhanced signal for subsequent processing. In this paper we propose a novel adaptation algorithm for a filter-and-sum beamformer to adjust the coefficients of FIR filters to changing acoustic room impulse responses, e.g. due to speaker movement. A deterministic and a stochastic gradient ascent algorithm are derived from a constrained optimization problem, which iteratively estimates the eigenvector corresponding to the largest eigenvalue of the cross power spectral density of the microphone signals. The method does not require an explicit estimation of the speaker location. The experimental results show fast adaptation and excellent robustness of the proposed algorithm.


A Comparison of Soft-Feature Distributed Speech Recognition with Candidate Codecs for Speech Enabled Mobile Services

V. Ion, R. Haeb-Umbach, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2005), 2005, pp. 333-336

In this paper we present a comparison of the recently proposed Soft-Feature Distributed Speech Recognition (SFDSR) with the two evaluated candidate codecs for Speech Enabled Services over wireless networks: Adaptive Multirate Codec (AMR) and the ETSI Extended Advanced Front-End for Distributed Speech Recognition (XAFE). It is shown that SFDSR achieves the best recognition performance on a simulated GSM transmission, followed by XAFE and AMR. We also present some new results concerning SFDSR which demonstrate the versatility of the approach. Further, a simple method is introduced which considerably reduces the computational effort.




2004

Soft Features for Improved Distributed Speech Recognition over Wireless Networks

R. Haeb-Umbach, V. Ion, in: International Conference on Spoken Language Processing (ICSLP 2004), 2004

A major drawback of distributed versus terminal-based speech recognition is the fact that transmission errors can lead to degraded recognition performance. In this paper we employ soft features to mitigate the effect of bit errors on wireless transmission links: At the receiver a posteriori probabilities of the transmitted feature vectors are computed by combining bit reliability information provided by the channel decoder and a priori knowledge about residual redundancy in the feature vectors. While the first-order moment of the a posteriori probability function is the MMSE estimate, the second-order moment is a measure of the uncertainty in the reconstructed features. We conducted realistic simulations of GSM transmission and achieved significant improvements in word accuracy compared to the error mitigation strategy described in the ETSI standard.
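The two moments can be illustrated on a toy scalar feature with a two-entry quantizer codebook, assuming independent bit errors with known per-bit correctness probabilities (all names and the tiny codebook are illustrative; the actual system combines channel decoder soft output with residual redundancy in the ETSI feature bitstream):

```python
import numpy as np

def feature_posterior_moments(codebook, prior, received_bits,
                              p_correct, bit_patterns):
    """A posteriori moments of a quantized feature after transmission.

    Combines a priori knowledge (prior over codebook entries) with bit
    reliability information: the first moment is the MMSE feature
    estimate, the second central moment measures the uncertainty that
    can be passed on to the recognizer."""
    like = np.ones(len(codebook))
    for i, pattern in enumerate(bit_patterns):
        for b, r, pc in zip(pattern, received_bits, p_correct):
            like[i] *= pc if b == r else 1.0 - pc
    post = prior * like
    post = post / post.sum()          # normalize posterior
    mean = post @ codebook            # MMSE estimate
    var = post @ (codebook - mean) ** 2
    return mean, var
```

With perfectly reliable bits the variance collapses to zero and the MMSE estimate coincides with the received codebook entry; as reliability drops, the posterior spreads over the prior and the variance signals the recognizer to trust the feature less.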


Robust speaker direction estimation with particle filtering

E. Warsitz, R. Haeb-Umbach, in: IEEE Workshop on Multimedia Signal Processing (MMSP 2004), 2004, pp. 367-370

The paper is concerned with binaural signal processing for a bimodal human-robot interface with hearing and vision. The two microphone signals are processed to obtain an enhanced single-channel input signal for the subsequent speech recognizer and to localize the acoustic source, important information for establishing a natural human-robot communication. We utilize a robust adaptive algorithm for filter-and-sum beamforming (FSB) and extract speaker direction information from the resulting FIR filter coefficients. Further, particle filtering is applied which conducts a nonlinear Bayesian tracking of speaker movement. Good location accuracy can be achieved even in highly reverberant environments. The results obtained outperform the conventional generalized cross correlation (GCC) method.


Adaptive Beamforming Combined with Particle Filtering for Acoustic Source Localization

E. Warsitz, R. Haeb-Umbach, S. Peschke, in: International Conference on Spoken Language Processing (ICSLP 2004), 2004

While the main objective of adaptive Filter-and-Sum beamforming is to obtain an enhanced speech signal for subsequent processing like speech recognition, we show how speaker localization information can be derived from the filter coefficients. To increase localization accuracy, speaker tracking is performed by non-linear Bayesian state estimation, which is realized by sequential Monte Carlo methods. Improved acquisition and tracking performance was achieved even in highly reverberant environments, in comparison with both a Kalman Filter and a recently proposed Particle Filter operating on the output of a nonadaptive Delay-and-Sum beamformer.


Multipath-Resistant Time of Arrival Estimation for Satellite Positioning

R. Bischoff, R. Haeb-Umbach, S.R. Nammi, AEUe, Int. Journal on Electronics and Communications (2004), 58(1)

Satellite positioning systems, such as GPS or the future European system Galileo, employ direct-sequence spread-spectrum signals. The positioning accuracy is strongly affected by the quality of the pseudo range measurements. These measurements necessitate code and carrier synchronization of the received signal with the internally generated reference signals. In this type of system one major error source is the multipath phenomenon, which results in a sum of delayed and weighted copies of the original signal being present at the receiver input. This can cause a systematic error of the code tracking loop, resulting in range errors in the order of several tens of meters. In this paper we propose an extension of the standard code tracking loop capable of estimating the parameters of the line-of-sight (LOS) signal and separating the LOS from the reflected signal portions. It is based on an analysis of the cross correlation of the received signal with a locally generated code sequence in the vicinity of the tracking point of a Delay-Locked Loop (DLL). For this reason, we call this method Cross Correlation Function (CCF) Analysis. The proposed method achieves considerably more accurate estimates than a DLL. Its performance is comparable to that of the Multipath Estimating Delay-Locked Loop (MEDLL), which has so far been considered the best method for reducing multipath-induced errors. However, the computational complexity of the CCF Analysis is a factor of three smaller than that of the MEDLL. Extensive simulations have been conducted for the proposed method and the MEDLL in order to assess the robustness of the two approaches under various signal constellations.


2003

Auf ein Wort - Möglichkeiten und Grenzen der automatischen Spracherkennung

R. Haeb-Umbach, Forschungsforum Paderborn (2003), pp. 68-71


2002

Employment of a multipath receiver structure in a combined GALILEO/UMTS receiver

R. Bischoff, R. Haeb-Umbach, W. Schulz, G. Heinrichs, in: IEEE 55th Vehicular Technology Conference (VTC 2002 Spring), 2002, pp. 1844-1848 vol.4

Current navigation systems like GPS (Global Positioning System) and its Russian counterpart GLONASS (Global Navigation Satellite System) only evaluate the direct signal path. The receivers treat the reflected paths also reaching the receiver antenna as disturbance which has to be suppressed. Multipath affects the tracking accuracy by degenerating the S-curve of the DLL (delay locked loop). Nowadays the future European system GALILEO and GPSIIF/III with two new signals are on the way to the market and it is time to think about new receiver structures. Therefore we investigated whether it is possible to use multipath for navigation constructively.


Estimation of Bias Location Error due to Absence of the LOS-Signal in a UMTS-System

T. Hesse, R. Bischoff, W. Schulz, R. Haeb-Umbach, in: International Symposium on Location Based Services for Cellular Users (LOCELLUS 2002), 2002

Current location methods for cellular communication systems, TOA and E-OTD, exploit time delays and time differences of various base station signals measured in a mobile phone to determine its location. These methods assume line-of-sight (LOS) connections to all utilized base stations. Since mobile radio channels are mainly characterized by scattered propagation paths and non-line-of-sight (NLOS) propagation, bias errors occur when measured time delays and time differences are utilized in position calculation algorithms. In this paper, the distribution of the time error due to NLOS propagation is estimated based on the channel model proposed in [1]. In combination with actual channel measurements in [2] the NLOS time error and its probability distribution function is estimated. With this information being determined for each received signal, position calculation algorithms can utilize the reliability information to enhance positioning accuracy. [1] L. J. Greenstein, V. Erceg, Y. S. Yeh, M. V. Clark, "A new path-gain/delay-spread propagation model for digital cellular channels", IEEE Transactions on Vehicular Technology, Vol. 46, No. 2, May 1997 [2] H. Asplund, "Wideband Channel Measurements in Central Stockholm", T1P1.5/98-242r1


Large Vocabulary Continuous Speech Recognition of Broadcast News - The Philips/RWTH Approach

P. Beyerlein, X. Aubert, R. Haeb-Umbach, M. Harris, D. Klakow, A. Wendemuth, S. Molau, N. Ney, M. Pitz, A. Sixtus, Speech Communication (2002)(37), pp. 109-131

Automatic speech recognition of real-life broadcast news (BN) data (Hub-4) has become a challenging research topic in recent years. This paper summarizes our key efforts to build a large vocabulary continuous speech recognition system for the heterogeneous BN task without incurring undue complexity and computational resources. These key efforts included: - automatic segmentation of the audio signal into speech utterances; - efficient one-pass trigram decoding using look-ahead techniques; - optimal log-linear interpolation of a variety of acoustic and language models using discriminative model combination (DMC); - handling short-range and weak longer-range correlations in natural speech and language by the use of phrases and of distance-language models; - improving the acoustic modeling by a robust feature extraction, channel normalization, adaptation techniques as well as automatic script selection and verification. The starting point of the system development was the Philips 64k-NAB word-internal triphone trigram system. On the speaker-independent but microphone-dependent NAB-task (transcription of read newspaper texts) we obtained a word error rate of about 10%. Now, at the conclusion of the system development, we have arrived at a DMC-interpolated phrase-based crossword-pentaphone 4-gram system at Philips. This system transcribes BN data with an overall word error rate of about 17%.


A Joint Time Multiplex Receiver for UMTS and Galileo

R. Bischoff, R. Haeb-Umbach, G. Heinrichs, in: ION-GPS 2002, 2002

Currently the future satellite navigation system Galileo and the third generation mobile communications system UMTS are on their way to the market in Europe. cdma2000 is under development in the USA and, furthermore, a new civil GPS signal in L2 band and the new frequency band L5 are added. In a hybrid receiver for satellite navigation and mobile radio communications, the possibility of an additional usage of the mobile radio signals for navigation purposes could also be a remedy to one problem of satellite navigation systems, which is the reduced location accuracy inside of buildings and urban canyons. A hybrid receiver with two fully separated receiver branches would lead to an increased bill of material and to increased power consumption in the receiver. This paper, therefore, introduces a hybrid receiver capable of evaluating Galileo/GPS as well as UMTS/cdma2000 signals with reduced computational efforts. Furthermore, the proposed structure performs a constructive superposition of the incoming paths to improve location accuracy.


2001

Implementation of a Rake Receiver Architecture into a Galileo Receiver

R. Bischoff, R. Haeb-Umbach, W. Schulz, G. Heinrichs, in: 1st ESA Workshop on Satellite Navigation User Equipment Technology (Navitec 2001), 2001


Automatic generation of phonetic regression class trees for MLLR adaptation

R. Haeb-Umbach, IEEE Transactions on Speech and Audio Processing (2001), 9(3), pp. 299-302

In this paper, it is shown that a correlation criterion is the appropriate criterion for bottom-up clustering to obtain broad phonetic class regression trees for maximum likelihood linear regression (MLLR)-based speaker adaptation. The correlation structure among speech units is estimated on the speaker-independent training data. In adaptation experiments the tree outperformed a regression tree obtained from clustering according to closeness in acoustic space and achieved results comparable with those of a manually designed broad phonetic class tree.


Multiclass linear dimension reduction by weighted pairwise Fisher criteria

M. Loog, R. Duin, R. Haeb-Umbach, IEEE Transactions on Pattern Analysis and Machine Intelligence (2001), 23(7), pp. 762-766

We derive a class of computationally inexpensive linear dimension reduction criteria by introducing a weighted variant of the well-known K-class Fisher criterion associated with linear discriminant analysis (LDA). It can be seen that LDA weights contributions of individual class pairs according to the Euclidean distance of the respective class means. We generalize upon LDA by introducing a different weighting function.
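A sketch of the idea in NumPy, assuming a user-supplied weighting function of the (whitened) distance between class means; with a constant weight it reduces to a pairwise formulation of classical LDA (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def weighted_pairwise_lda(X, y, weight, n_dims):
    """Linear dimension reduction from a weighted sum of pairwise
    between-class scatter matrices.  `weight(d)` maps the whitened
    distance d between two class means to a pair weight."""
    classes = np.unique(y)
    means = {c: X[y == c].mean(axis=0) for c in classes}
    # pooled within-class scatter (biased, sample-weighted)
    Sw = sum(np.cov(X[y == c].T, bias=True) * (y == c).sum()
             for c in classes) / len(y)
    Sb = np.zeros_like(Sw)
    for i, ci in enumerate(classes):
        for cj in classes[i + 1:]:
            pi, pj = (y == ci).mean(), (y == cj).mean()
            diff = (means[ci] - means[cj])[:, None]
            d = np.sqrt(float(diff.T @ np.linalg.solve(Sw, diff)))
            Sb += pi * pj * weight(d) * (diff @ diff.T)
    # leading eigenvectors of inv(Sw) @ Sb span the reduced space
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-evals.real)
    return evecs.real[:, order[:n_dims]]
```

Downweighting distant pairs (e.g. a decreasing `weight`) is the abstract's point: pairs that are already well separated should not dominate the projection at the expense of confusable ones.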


2000

Multi-class Linear Feature Extraction by Nonlinear PCA

R.P. Duin, M. Loog, R. Haeb-Umbach, in: International Conference on Pattern Recognition (ICPR 2000), 2000

The traditional way to find a linear solution to the feature extraction problem is based on the maximization of the between-class scatter over the within-class scatter (Fisher mapping). For the multi-class problem this is, however, sub-optimal due to class conjunctions, even for the simple situation of normally distributed classes with identical covariance matrices. We propose a novel, equally fast method, based on nonlinear PCA. Although still sub-optimal, it may avoid the class conjunction. The proposed method is experimentally compared with Fisher mapping and with a neural network based approach to nonlinear PCA. It appears to outperform both methods, the first one even in a dramatic way.


Data-driven Phonetic Regression Class Tree Estimation for MLLR Adaptation

R. Haeb-Umbach, in: International Conference on Spoken Language Processing (ICSLP 2000), 2000


LDA derived cepstral trajectory filters in adverse environmental conditions

M. Lieb, R. Haeb-Umbach, in: IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2000), 2000, pp. II1105-II1108 vol.2

Amongst several data driven approaches for designing filters for the time sequence of spectral parameters, the linear discriminant analysis (LDA) based method has been proposed for automatic speech recognition. Here we apply LDA-based filter design to cepstral features, which better match the inherent assumption of this method that feature vector components are uncorrelated. Extensive recognition experiments have been conducted both on the standard TIMIT phone recognition task and on a proprietary 130-words command word task under various adverse environmental conditions, including reverberant data with real-life room impulse responses and data processed by acoustic echo cancellation algorithms. Significant error rate reductions have been achieved when applying the novel long-range feature filters compared to standard approaches employing cepstral mean normalization and delta and delta-delta features, in particular when facing acoustic echo cancellation scenarios and room reverberation. For example, the phone accuracy on reverberated TIMIT data could be increased from 50.7% to 56.0%.


Multi-class Linear Dimension Reduction by Generalized Fisher Criteria

M. Loog, R. Haeb-Umbach, in: International Conference on Spoken Language Processing (ICSLP 2000), 2000


1999

An Investigation of Cepstral Parameterisations for Large Vocabulary Speech Recognition

R. Haeb-Umbach, M. Loog, in: Eurospeech, 1999

We examined variants of MFCC and PLP cepstral parameterisations in the context of large vocabulary continuous speech recognition under different acoustical environmental conditions: Compared to MFCC, mel-frequency PLP uses a cubic root intensity-to-loudness law, and an LPC analysis is applied to the mel-warped spectrum. In LPC-smoothed MFCC, the only difference to MFCC is the additional LPC smoothing of the warped spectrum. While neither technique was able to significantly outperform the MFCC parameterisation in our setup which includes an LDA feature transformation, feature set combination via DMC at the acoustic likelihood level and via ROVER at the recognized word level delivered small but consistent improvements.


A study of broadcast news audio stream segmentation and segment clustering

M.J. Harris, X.L. Aubert, R. Haeb-Umbach, P. Beyerlein, in: Eurospeech, 1999

In transcription of broadcast news, dividing the signal into homogeneous segments, and clustering together similar segments, is important. Decoding a complete broadcast news program in one chunk is technically difficult. Also, through creation of homogeneous clusters of segments, improvement from adaptation can be increased. Two systems of segmentation and clustering are compared. The best system used the BIC algorithm to produce long, homogeneous segments, and a nearest neighbour bottom-up agglomerative clustering algorithm to produce homogeneous clusters. Adaptation brought a word error rate (WER) improvement from 23.4% to 21.0% using the automatic segmentation and clustering, compared to an improvement from 21.8% to 20.0% using a handmade "correct" segmentation and clustering.


Investigations on inter-speaker variability in the feature space

R. Haeb-Umbach, in: ICASSP99 Phoenix, AZ, 1999

We apply Fisher variate analysis to measure the effectiveness of speaker normalization techniques. A trace criterion, which measures the ratio of the variations due to different phonemes compared to variations due to different speakers, serves as a first assessment of a feature set without the need for recognition experiments. By using this measure and by recognition experiments we demonstrate that cepstral mean normalization also has a speaker normalization effect, in addition to the well-known channel normalization effect. Similarly, vocal tract normalization (VTN) is shown to remove inter-speaker variability. For VTN we show that normalization on a per sentence basis performs better than normalization on a per speaker basis. Recognition results are given on Wall Street Journal and Hub-4 databases.


The Philips/RWTH system for transcription of broadcast news

P. Beyerlein, X.L. Aubert, R. Haeb-Umbach, M.J. Harris, D. Klakow, A. Wendemuth, S. Molau, M. Pitz, A. Sixtus, in: Eurospeech, 1999

This paper contains a description of the Philips/RWTH 1998 HUB4 system which has been built in a joint effort of Philips Research Laboratories Aachen and Aachen University of Technology. We will focus our discussion on recent improvements compared to the original 1997 HUB4 system and evaluate them on the HUB4'97 evaluation data. The paper will deal with 1. a rough system overview including feature extraction, acoustic training, audio stream segmentation, and decoding, 2. log-linear interpolation of distance-language models, 3. and the integration of various acoustic and language models via Discriminative Model Combination (DMC). The performance of the described system is 23% (relative) better than the performance of the 1997 Philips HUB4 system. A word error rate of 17.9% was achieved on the 1997 HUB4 evaluation set, compared to 23.5% using the original 1997 system.


The Philips/RWTH System for Transcription of Broadcast News

P. Beyerlein, X.L. Aubert, R. Haeb-Umbach, M.J. Harris, D. Klakow, A. Wendemuth, S. Molau, M. Pitz, A. Sixtus, in: Broadcast News Transcription and Understanding Workshop, Washington, 1999

This paper contains a description of the Philips/RWTH 1998 HUB4 system which has been built in a joint effort of Philips Research Laboratories Aachen and Aachen University of Technology. We will focus our discussion on recent improvements compared to the original 1997 HUB4 system and evaluate them on the HUB4'97 evaluation data. The paper will deal with 1. a rough system overview including feature extraction, acoustic training, audio stream segmentation, and decoding, 2. log-linear interpolation of distance-language models, 3. and the integration of various acoustic and language models via Discriminative Model Combination (DMC). The performance of the described system is 23% (relative) better than the performance of the 1997 Philips HUB4 system. A word error rate of 17.9% was achieved on the 1997 HUB4 evaluation set, compared to 23.5% using the original 1997 system.


1998

Acoustic Modeling in the Philips Hub-4 Continuous-Speech Recognition System

R. Haeb-Umbach, X.L. Aubert, P. Beyerlein, D. Klakow, M. Ullrich, A. Wendemuth, P. Wilcox, in: DARPA Broadcast News Transcription and Understanding Workshop, Landsdowne, 1998

In this paper we describe some characteristics of the acoustic modeling used in the Philips continuous-speech recognition system for the DARPA Hub-4 1997 evaluation, which are related to robustness issues. We aimed at a conceptually simple system: We trained two model sets on 70 hours of the Hub-4 training data, one for within-word and one for cross-word decoding. These model sets were used for both genders and all environmental conditions. In order to be able to do so, channel normalization (mean, variance normalization) and speaker normalization (vocal tract length normalization, realized by an appropriate shift of the center frequencies of the mel filter bank) have been applied, as well as adaptation techniques. MLLR-based unsupervised batch adaptation on clusters of segments was conducted both after a first within-word decoding and a cross-word decoding pass. The training strategy and the effects of the various normalization and adaptation techniques will be discussed in the paper.


A Study on Speaker Normalization Using Vocal Tract Normalization and Speaker Adaptive Training

L. Welling, R. Haeb-Umbach, X. Aubert, N. Haberland, in: ICASSP 1998, Seattle, 1998

Although speaker normalization is attempted in very different manners, vocal tract normalization (VTN) and speaker adaptive training (SAT) share many common properties. We show that both lead to more compact representations of the phonetically relevant variations of the training data and that both achieve improved error rate performance only if a complementary normalization or adaptation operation is conducted on the test data. Algorithms for fast test speaker enrollment are presented for both normalization methods: in the framework of SAT, a pre-transformation step is proposed, which alone, i.e. without subsequent unsupervised MLLR adaptation, reduces the error rate by almost 10% on the WSJ 5k test sets. For VTN, the use of a Gaussian mixture model makes a first recognition pass to obtain a preliminary transcription of the test utterance obsolete, with hardly any loss in performance.


Language-Model Investigations related to Broadcast News

D. Klakow, X.L. Aubert, R. Haeb-Umbach, P. Beyerlein, M. Ullrich, A. Wendemuth, P. Wilcox, in: DARPA Broadcast News Transcription and Understanding Workshop, Landsdowne, 1998

In this paper we present some experiments that have been performed while developing language models for the PHILIPS Broadcast News system. Three main issues will be discussed: construction of phrases, adaptation of remote corpora to this task, and the combination of the different models. Also, perplexities on the 1997 evaluation data are reported.


Automatic Transcription of English Broadcast News

P. Beyerlein, X.L. Aubert, R. Haeb-Umbach, D. Klakow, M. Ullrich, A. Wendemuth, P. Wilcox, in: DARPA Broadcast News Transcription and Understanding Workshop, Landsdowne, 1998

In this paper the Philips Broadcast News transcription system is described. The Broadcast News task aims at the recognition of "found" speech in radio and television broadcasts without any additional side information (e.g. speaking style, background conditions). The system was derived from the Philips continuous mixture density crossword HMM system, using MFCC features and Laplacian densities. A segmentation was performed to obtain sentence-like partitions of the broadcasts. Using data-driven clustering, the obtained segments were grouped into clusters with similar acoustic conditions for adaptation purposes. Gender independent word-internal and crossword triphone models were trained on 70 hours of the HUB4 training data. No focus condition specific training was applied. Channel and speaker normalization was done by mean and variance normalization as well as VTN and MLLR. The transcription was produced by an adaptive multiple pass decoder starting with phrase-bigram decoding using word-internal triphones and finishing with a phrase-trigram decoding using MLLR-adapted crossword models.


1997

The development of a command-based speech interface for a telephone answering machine

S. Gamm, R. Haeb-Umbach, D. Langmann, Speech Communication (1997)

This paper reports the design of a command-based speech interface for an answering machine or a voice mail system. Automatic speech recognition was integrated in order to facilitate the remote control and the retrieval of voice messages from any telephone in a speech-only dialogue. The design goal was that consumers would perceive the speech interface as a benefit compared with the common touch-tone interface. In this paper we will first describe the speech technology underlying the system. Then it will be shown how, based on this technology, the user interface was designed in a top-down approach. We started with the development of a concept and tested it by means of a Wizard-of-Oz simulation. After refining the concept in parallel design, it was implemented in a high-fidelity prototype. By means of qualitative user testing the design was improved in three iteration steps. The achievement of the design goal was finally verified with user tests in two countries.


Investigation of Acoustic Front Ends for Speaker-Independent Speech Recognition in the Car

D. Langmann, F. Wuppermann, R. Haeb-Umbach, A. Fischer, T. Eisele, in: Aachener Kolloquium on Signal Theory, 1997


Signal Representations for Hidden Markov Model Based On-Line Handwriting Recognition

J. Dolfing, R. Haeb-Umbach, in: ICASSP, Munich, 1997

Addresses the problem of online, writer-independent, unconstrained handwriting recognition. Based on hidden Markov models (HMM), which are successfully employed in speech recognition tasks, we focus on representations which address scalability, recognition performance and compactness. 'Delayed' features are introduced which integrate more global, handwriting specific knowledge into the HMM representation. These features lead to larger error-rate reduction than 'delta' features which are known from speech recognition and even require fewer additional components. Scalability is addressed with a size-independent representation. Compactness is achieved with linear discriminant analysis. The representations are discussed and the results for a mixed-style word recognition task with vocabularies of 200 (up to 99% correct words) and 20000 words (up to 88.8% correct words) are given.


Robust Speech Recognition for Wireless Networks and Mobile Telephony

R. Haeb-Umbach, in: Eurospeech, 1997

The increased popularity of mobile telephony introduces both challenges and opportunities for automatic speech recognition. ASR offers ways to simplify the use of mobile phones, notably in hands- and eyes-busy situations. However, the acoustic environment can be severely degraded and the wireless network may add additional distortions to the speech signal. This paper gives an overview of the sources of degradation and of approaches to robust speech recognition for mobile communications. Emphasis is placed on approaches which are suitable for implementation in mobile terminals. Two example applications are described which illustrate the robustness issues and design considerations typical of low-cost noisy speech recognition: voice-dialling in a GSM phone and hands-free digit recognition in the car.


European Speech Databases for Telephone Applications

H. Hoege, H.S. Tropf, R. Winsky, H. van den Heuvel, R. Haeb-Umbach, K. Choukri, in: ICASSP, Munich, 1997

The SpeechDat project aims to produce speech databases for all official languages of the European Union and some major dialectal variants and minority languages resulting in 28 speech databases. They will be recorded over fixed and mobile telephone networks. This will provide a realistic basis for training and assessment of both isolated and continuous-speech utterances, employing whole-word or subword approaches, and thus can be used for developing voice driven teleservices including speaker verification. The specification of the databases has been developed jointly, and is essentially the same for each language to facilitate dissemination and use. There will be a controlled variation among the speakers concerning sex, age, dialect, environment of call, etc. The validation of all databases will be carried out centrally. The SpeechDat databases will be transferred to ELRA for distribution. The next databases to be recorded will cover East European languages.


Acoustic Front Ends for Speaker-Independent Digit Recognition in Car Environments

D. Langmann, A. Fischer, F. Wuppermann, R. Haeb-Umbach, T. Eisele, in: Eurospeech, 1997

This paper describes speaker-independent speech recognition experiments concerning acoustic front end processing on a speech database that was recorded in 3 different cars. We investigate different feature analysis approaches (mel-filter bank, mel-cepstrum, perceptually linear predictive coding) and present results with noise compensation techniques based on spectral subtraction. Although the methods employed lead to a considerable reduction in error rate, the error analysis shows that low signal-to-noise ratios remain a problem.


1996

FRESCO: The French Telephone Speech Data Collection - Part of the European SpeechDat(M) Project

D. Langmann, R. Haeb-Umbach, in: ICSLP, Philadelphia, 1996

The paper describes the design, collection and postprocessing of the French SpeechDat corpus FRESCO. Being a database of approximately 35000 utterances recorded from 1000 callers over the terrestrial telephone network in France, it comprises immediately usable and relevant speech for the initial training and assessment of speaker independent phoneme model or word model based speech recognizers, as they are employed in automated telephone services. FRESCO is one of the 1000 speaker telephone speech databases produced as "case studies" within the European project SpeechDat(M).


Robust Rejection Modeling for a Small-Vocabulary Application

D. Langmann, R. Haeb-Umbach, T. Eisele, in: ITG Fachtagung Sprachkommunikation, Frankfurt, 1996


A Comparative Study of Linear Feature Transformation Techniques for Automatic Speech Recognition

T. Eisele, R. Haeb-Umbach, D. Langmann, in: ICSLP , Philadelphia, 1996

Although widely used, there are still open questions concerning which properties of linear discriminant analysis (LDA) account for its success in many speech recognition systems. In order to gain more insight into the nature of the transformation we compare LDA with mel-cepstral feature vectors with respect to the following criteria: decorrelation and ordering property; invariance under linear transforms; automatic learning of dynamical features; and data dependence of the transformation.
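The decorrelation and ordering properties examined in this comparison can be illustrated with a minimal LDA estimate (a hedged sketch only; the function name, test data, and eigen-solution route are assumptions, not the paper's implementation):

```python
import numpy as np

def lda_transform(X, y, dims):
    """Estimate a `dims`-dimensional LDA projection from labelled
    feature vectors X (one row per frame). Minimal illustrative sketch."""
    mu = X.mean(axis=0)
    D = X.shape[1]
    Sw = np.zeros((D, D))  # within-class scatter
    Sb = np.zeros((D, D))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        d = (mc - mu)[:, None]
        Sb += len(Xc) * (d @ d.T)
    # generalized eigenproblem Sb v = lambda Sw v; eigenvectors come out
    # ranked by class separability, which gives the "ordering property"
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order[:dims]]
```

Keeping only the leading columns of the returned matrix discards the least discriminative directions, which is why the transform can also absorb dynamic features when adjacent frames are stacked into the input vector.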


Findings with the Design of a Command-Based Speech Interface for a Voice Mail System

S. Gamm, R. Haeb-Umbach, D. Langmann, in: IEEE Workshop on Interactive Voice Technology for Telecommunications Applications, 1996

This paper tells the story of the design of a command-based speech interface for a voice mail system. Speech recognition was integrated in the voice mail system in order to allow the remote interrogation of messages in a speech-only dialogue. Our design goal was that consumers would perceive voice control as a clear benefit versus touch-tone control. It is shown how the speech interface was designed in a top-down approach. We started with a concept development and tested it by means of a Wizard-of-Oz simulation. After refining the concept in parallel design, the design was implemented in a high-fidelity prototype. By means of qualitative user testing it was improved in three iteration steps. We verified the achievement of our design goal with tests in two countries


1995

Application of Clustering Techniques to Mixture Density Modelling for Continuous-Speech Recognition

C. Dugast, P. Beyerlein, R. Haeb-Umbach, in: ICASSP, Detroit, 1995

Clustering techniques have been integrated at different levels into the training procedure of a continuous-density hidden Markov model (HMM) speech recognizer. These clustering techniques can be used in two ways. First, acoustically similar states are tied together; this helps to reduce the number of parameters and also allows otherwise rarely seen states to be trained together with more robust ones (state-tying). Secondly, densities are clustered across states; this reduces the number of densities while at the same time keeping the best performance of our recognizer (density-clustering). We have applied these techniques both to word-based small-vocabulary and phoneme-based large-vocabulary recognition tasks. On the WSJ task, we could achieve a reduction of the word error rate by 7%. On the TI/NIST connected-digit task, the number of parameters was reduced by a factor of 2-3 while keeping the same string error rate.
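The density-clustering step can be illustrated by a greatly simplified bottom-up sketch that merges the closest pair of mixture-component means until a target count is reached (the function name and distance criterion are assumptions; the paper's actual procedure also accounts for variances and likelihood loss):

```python
import numpy as np

def cluster_densities(means, target):
    """Greedy bottom-up clustering of mixture-component means:
    repeatedly merge the closest pair until `target` densities remain."""
    means = [np.asarray(m, float) for m in means]
    counts = [1] * len(means)
    while len(means) > target:
        # find the closest pair of component means
        best = None
        for i in range(len(means)):
            for j in range(i + 1, len(means)):
                d = np.sum((means[i] - means[j]) ** 2)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # merge component j into i with a count-weighted average
        n = counts[i] + counts[j]
        means[i] = (counts[i] * means[i] + counts[j] * means[j]) / n
        counts[i] = n
        del means[j], counts[j]
    return means
```

Applied across the states of a word or phoneme model, this kind of merging is what trades a smaller parameter count against (ideally unchanged) recognition accuracy.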


User interface design of voice controlled consumer electronics

S. Gamm, R. Haeb-Umbach, Philips Journal of Research (1995)

Today speech recognition of a small vocabulary can be realized so cost-effectively that the technology can penetrate into consumer electronics. However, as the first applications that failed on the market show, it is by no means obvious how to incorporate voice control into a user interface. This paper addresses the issue of how to design voice control so that the user perceives it as a benefit. User interface guidelines that are adapted or specific to voice control are presented. Then the process of designing voice control in the user-centred approach is described. By means of two examples, the car stereo and the telephone answering machine, it is shown how this is put into practice.


Human Factors of a Voice-Controlled Car Stereo

S. Gamm, R. Haeb-Umbach, in: Eurospeech, Madrid, 1995


Continuous speech dictation - From theory to practice

V. Steinbiss, H.J. Ney, U. Essen, B.H. Tran, X.L. Aubert, C. Dugast, R. Kneser, H.G. Meier, M. Oerder, R. Haeb-Umbach, D. Geller, W. Hoellerbauer, H. Bartosik, Speech Communication (1995)

This paper gives an overview of the Philips research system for phoneme-based, large-vocabulary, continuous-speech recognition. The system has been successfully applied to various tasks in the German and (American) English languages, ranging from small vocabulary tasks to very large vocabulary tasks. Here, we concentrate on continuous-speech recognition for dictation in real applications, the dictation of legal reports and radiology reports in German. We describe this task and report on experimental results. We also describe a commercial PC-based dictation system which includes a PC implementation of our scientific recognition prototype. In order to allow for a comparison with the performance of other systems, a section with an evaluation on the standard Wall Street Journal task (dictation of American English newspaper text) is supplied. The recognition architecture is based on an integrated statistical approach. We describe the characteristic features of the system as opposed to other systems: 1. the Viterbi criterion is consistently applied both in training and testing; 2. continuous mixture densities are used without tying or smoothing; 3. time-synchronous beam search in connection with a phoneme look-ahead is applied to a tree-organized lexicon.


The Philips Research system for continuous-speech dictation

V. Steinbiss, H.J. Ney, X.L. Aubert, S. Besling, C. Dugast, U. Essen, D. Geller, R. Haeb-Umbach, R. Kneser, H.G. Meier, M. Oerder, B.H. Tran, Philips Journal of Research (1995)

This paper gives an overview of the Philips Research system for continuous-speech recognition. The recognition architecture is based on an integrated statistical approach. The system has been successfully applied to various tasks in American English and German, ranging from small vocabulary tasks to very large vocabulary tasks and from recognition only to speech understanding. Here, we concentrate on phoneme-based continuous-speech recognition for large vocabulary recognition as used for dictation, which covers a significant part of our research work on speech recognition. We describe this task and report on experimental results. In order to allow a comparison with the performance of other systems, a section with an evaluation on the standard North American Business news (NAB2) task (dictation of American English newspaper text) is supplied.


Speech recognition algorithms for voice control interfaces

R. Haeb-Umbach, P. Beyerlein, D. Geller, Philips Journal of Research (1995)

Recognition accuracy has been the primary objective of most speech recognition research, and impressive results have been obtained, e.g. less than 0.3% word error rate on a speaker-independent digit recognition task. When it comes to real-world applications, robustness and real-time response might be more important issues. For the first requirement we review some of the work on robustness and discuss one specific technique, spectral normalization, in more detail. The requirement of real-time response has to be considered in the light of the limited hardware resources in voice control applications, which are due to the tight cost constraints. In this paper we discuss in detail one specific means to reduce the processing and memory demands: a clustering technique applied at various levels within the acoustic modelling.


Automatic Transcription of Unknown Words in a Speech Recognition System

R. Haeb-Umbach, P. Beyerlein, E. Thelen, in: ICASSP, Detroit, 1995

We address the problem of automatically finding an acoustic representation (i.e. a transcription) of unknown words as a sequence of subword units, given a few sample utterances of the unknown words, and an inventory of speaker-independent subword units. The problem arises if a user wants to add his own vocabulary to a speaker-independent recognition system simply by speaking the words a few times. Two methods are investigated which are both based on a maximum-likelihood formulation of the problem. The experimental results show that both automatic transcription methods provide a good estimate of the acoustic models of unknown words. The recognition error rates obtained with such models in a speaker-independent recognition task are clearly better than those resulting from separate whole-word models. They are comparable with the performance of transcriptions drawn from a dictionary.


The Usability Engineering of a Voice-Controlled Answering Machine

S. Gamm, R. Haeb-Umbach, D. Langmann, in: International Symposium on Human Factors in Telecommunications, Melbourne, 1995


1994

Improvements in beam search for 10000-word continuous-speech recognition

R. Haeb-Umbach, H. Ney, IEEE Transactions on Speech and Audio Processing (1994)

The authors describe the improvements in a time-synchronous beam search strategy for a 10000-word continuous-speech recognition task. They introduced two measures, namely a tree organization of the pronunciation lexicon and a novel look-ahead technique at the phoneme level. The experimental tests performed showed that the number of state hypotheses could be reduced from 50000 to 3000, i.e., by a factor of about 17. At the same time, the word error rate did not increase.
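The pruning idea behind such a time-synchronous beam search can be sketched as follows (a toy illustration; the state graph, score function, and beam width are assumptions, and the tree-organized lexicon and phoneme look-ahead are not modelled):

```python
def beam_search(frames, transitions, emit, beam=5.0):
    """Time-synchronous Viterbi beam search over a toy state graph:
    at each frame, surviving state hypotheses are expanded, rescored,
    and pruned against the best current log-score."""
    hyps = {0: 0.0}  # state -> best log-score so far; start in state 0
    for obs in frames:
        new = {}
        for state, score in hyps.items():
            for nxt in transitions.get(state, []):
                s = score + emit(nxt, obs)
                if s > new.get(nxt, float("-inf")):
                    new[nxt] = s
        best = max(new.values())
        # beam pruning: drop hypotheses far below the current best
        hyps = {st: sc for st, sc in new.items() if sc >= best - beam}
    return hyps
```

Shrinking the beam keeps fewer state hypotheses alive per frame; the paper's measures aim to shrink that active set drastically without pruning away the correct path.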


Progress in Large-Vocabulary, Continuous Speech Recognition

H. Ney, V. Steinbiss, X.L. Aubert, R. Haeb-Umbach, in: Artificial Intelligence, Progress and Prospects of Speech Research and Technology, Munich, 1994


An Overview of the Philips Research System for Large Vocabulary Continuous Speech Recognition

H. Ney, V. Steinbiss, R. Haeb-Umbach, B.H. Tran, International Journal on Pattern Recognition and Artificial Intelligence (1994)

This paper gives an overview of a research system for phoneme based, large vocabulary continuous speech recognition. The system to be described has been applied to the SPICOS task, the DARPA RM task and a 12000 word dictation task. Experimental results for these three tasks will be presented. Like many other systems, the recognition architecture is based on an integrated statistical approach. In this paper, we describe the characteristic features of the system as opposed to other systems: (1) The Viterbi criterion is consistently applied both in training and testing. (2) Continuous mixture densities are used without any tying or smoothing; this approach can be viewed as a sort of ‘statistical template matching’. (3) Time-synchronous beam search is used consistently throughout all tasks; extensions using a tree organization of the vocabulary and phoneme lookahead are presented so that a 12000 word task can be handled.


1993

Improvements in Connected Digit Recognition Using Linear Discriminant Analysis and Mixture Densities

R. Haeb-Umbach, D. Geller, H. Ney, in: ICASSP, Minneapolis, 1993

Four methods were used to reduce the error rate of a continuous-density hidden Markov-model-based speech recognizer on the TI/NIST connected-digits recognition task. Energy thresholding sets a lower limit on the energy in each frequency channel to suppress spurious distortion accumulation caused by random noise. This led to an improvement in error rate by 15%. Spectrum normalization was used to compensate for across-speaker variations, resulting in an additional improvement by 20%. The acoustic resolution was increased up to 32 component densities per mixture. Each doubling of the number of component densities yielded a reduction in error rate by roughly 20%. Linear discriminant analysis was used for improved feature selection. A single class-independent transformation matrix was applied to a large input vector consisting of several adjacent frames, resulting in an improvement by 20% for high acoustic resolution. The final string error rate was 0.84%.


The Philips Research System for Large-Vocabulary Continuous-Speech Recognition

V. Steinbiss, H. Ney, R. Haeb-Umbach, B.H. Tran, U. Essen, R. Kneser, M. Oerder, H.G. Meier, X. Aubert, C. Dugast, D. Geller, W. Hoellerbauer, H. Bartosik, in: EUROSPEECH, Berlin, 1993

This paper gives a status report of the Philips research system for phoneme-based, large-vocabulary, continuous-speech recognition. As with many other systems, the recognition architecture is based on an integrated statistical approach. We describe the characteristic features of the system as opposed to other systems: 1. The Viterbi criterion is consistently applied both in training and testing. 2. Continuous mixture densities are used without tying or smoothing. 3. Time-synchronous beam search in connection with a phoneme look-ahead is applied to a tree-organized lexicon. The system has been successfully applied to the American English DARPA RM task. Here, we report experimental results for a German 13 000-word Philips internal dictation task. In addition to the scientific prototype, a PC version has been set up which is described here for the first time.


Continuous Mixture Densities and Linear Discriminant Analysis for Improved Context-Dependent Acoustic Models

X.L. Aubert, R. Haeb-Umbach, H. Ney, in: ICASSP, Minneapolis, 1993

Linear discriminant analysis (LDA) experiments reported previously (ICASSP-92 vol.1, p.13-16), are extended to context-dependent models and speaker-independent large vocabulary continuous speech recognition. Two variants of using mixture densities are compared: state-specific modeling and the monophone-tying approach where densities are shared across the states relevant to the same phoneme. Results are presented on the DARPA Resource Management (RM) task for both speaker-dependent (SD) and speaker-independent (SI) parts. Using triphone models based on LDA and continuous mixture densities, significant improvements have been observed and the following word error rates have been achieved: for the SD part, 7.8% without grammar and 1.5% with word pair; and for the SI part, 17.2% and 4.6%, respectively. These scores are averaged over 1200 SD or SI evaluation sentences and are among the best published so far on the RM database.


Design and use of speech recognition algorithms for a mobile radio telephone

S. Dobler, D. Geller, R. Haeb-Umbach, P. Meyer, H. Ney, H.W. Ruehl, Speech Communication (1993)

To decrease the hazards of using mobile phones while driving, voice processing provides several tools that simplify their use: echo cancellation allows comfortable hands-free conversation, feedback and user guidance by voice allow the phone to be operated in eyes-busy situations, and last but not least speech recognition frees the user from keypad data entry when operating the telephone. A comprehensive view of a device incorporating the above-mentioned technologies, which has been realized as an add-on for the Philips car telephone family, is presented. Emphasis is placed on the speech recognition algorithms. Robustness of the algorithms to a changing acoustic environment was improved by estimating and subtracting the long-term spectrum. We show that, if this operation is done recursively, it is equivalent to the high-pass filtering or RASTA (RelAtive SpecTrAl) methods recently proposed in the literature.
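The recursive long-term spectrum subtraction referred to above can be sketched as a first-order recursive mean estimate per log-spectral channel (a hedged illustration; the function name and smoothing constant are assumptions):

```python
import numpy as np

def recursive_spectral_norm(log_spec, alpha=0.9):
    """Subtract a recursively estimated long-term average from each
    log-spectral frame; `alpha` controls the estimator's time constant."""
    mean = np.zeros_like(log_spec[0])
    out = np.empty_like(log_spec)
    for t, frame in enumerate(log_spec):
        # first-order recursive (exponentially weighted) mean estimate
        mean = alpha * mean + (1.0 - alpha) * frame
        out[t] = frame - mean
    return out
```

Subtracting a recursively updated mean is algebraically a first-order high-pass filter on each channel, which is the equivalence to the RASTA-style methods that the abstract points out: any stationary (convolutional) component of the log spectrum decays away geometrically.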


1992


Improvements in Speech Recognition for Voice Dialling in the Car Environment

D. Geller, R. Haeb-Umbach, H. Ney, in: ESCA Workshop on Speech Recognition in Adverse Conditions, Cannes-Mandelieu, 1992


Improvements in Beam Search for 10,000-Word-Continuous Speech Recognition

H. Ney, R. Haeb-Umbach, B.H. Tran, M. Oerder, in: ICASSP, San Francisco, 1992

This paper describes the improvements in a time synchronous beam search strategy for a 10000-word continuous speech recognition task. The improvements are based on two measures: a tree-organization of the pronunciation lexicon and a novel look-ahead technique at the phoneme level, both of which interact directly with the detailed search at the state levels of the phoneme models. Experimental tests were performed for four speakers on a 12306-word task. As a result of the above measures, the overall search effort was reduced by a factor of 17 without a loss in recognition accuracy.


Trellis codes for partial-response magnetooptical direct overwrite recording

R. Haeb-Umbach, R. Lynch, IEEE Journal on Selected Areas in Communications (1992)

The authors present conditions on the error sequences between channel input sequences which guarantee certain lower bounds on the free Euclidean distance at the output of a partial-response (PR) class I or II channel. From these expressions, trellis codes are derived which improve performance of binary signaling over noisy PR channels with reduced complexity maximum-likelihood sequence detection. They are shown to be compatible with the input restriction caused by the magnetooptical resonant coil direct overwrite recording scheme. The codes achieve high signal-to-noise ratio coding gains of 3 dB (on PR class I) and 2.2 dB (on PR class II) with rates as close to, but strictly less than, the capacity of the initial input restriction as desired. The performance of these codes is analyzed with an optical channel simulation system which shows that one code has the rare but highly desirable property that its maximum-likelihood sequence detector (MLSD) is less complex than the MLSD of the reference system and still achieves an error rate performance gain of 1.8 dB.


Linear Discriminant Analysis for Improved Large Vocabulary Continuous Speech Recognition

R. Haeb-Umbach, H. Ney, in: ICASSP, San Francisco, 1992

The interaction of linear discriminant analysis (LDA) and a modeling approach using continuous Laplacian mixture density HMM is studied experimentally. The largest improvements in speech recognition could be obtained when the classes for the LDA transform were defined to be sub-phone units. On a 12000 word German recognition task with small overlap between training and test vocabulary a reduction in error rate by one-fifth was achieved compared to the case without LDA. On the development set of the DARPA RM1 task the error rate was reduced by one-third. For the DARPA speaker-dependent no-grammar case, the error rate averaged over 12 speakers was 9.9%. This was achieved with a recognizer using LDA and a set of only 47 Viterbi-trained context-independent phonemes.


A modified trellis coding technique for partial response channels

R. Haeb-Umbach, IEEE Transactions on Communications (1992)

The problem of trellis coding for multilevel baseband transmission over partial response channels with transfer polynomials of the form (1 ± D^N) is addressed. The novel method presented here accounts for the channel memory by using multidimensional signal sets and partitioning the signal set present at the noiseless channel output. It is shown that this coding technique can be viewed as a generalization of a well-known procedure for binary signaling: the concatenation of convolutional codes and inner block codes that are tuned to the channel polynomial. It results in high coding gains with moderate complexity if some bandwidth expansion is accepted.


1991

A Look-Ahead Technique for Large Vocabulary Continuous Speech Recognition

R. Haeb-Umbach, H. Ney, in: EUROSPEECH, Genova, 1991

In a large vocabulary continuous speech recognition task the search for the "best" (in the maximum-a-posteriori sense) word sequence is the most (computing) time consuming part of the system. End-of-word hypotheses are created almost every time frame. With a stochastic language model every lexicon entry is an admissible successor candidate. By using a "fast match" module which scores the word candidates according to their acoustic feasibility ahead of the current time frame, the search cost can be considerably reduced. Only the fraction of the words with favourable fast match scores will be further processed in the detailed match, where the likelihood of a segment of acoustics given the word model is computed. We derive a novel word selection strategy which is "consistent" in the sense that it introduces no additional decoding errors and which still reduces the search space by a factor of 2 - 3 compared to standard Viterbi beam search. Giving up the consistency requirement, pruning strategies can be deduced which further reduce the search effort significantly: the size of the word startup list is reduced to 2% - 4% of its original size with a modest increase in error rate by 1% - 2%.


1990

Coding and Signal Processing for a Magneto-optic Resonant Bias Coil Overwrite Experiment

R. Haeb-Umbach, D. Rugar, T. Howell, G.P. Coleman, in: International Conference on Communication, Atlanta, 1990

The resonant bias coil technique requires even-numbered transitions to be written on even-numbered clock cycles, and odd-numbered transitions on odd-numbered clock cycles. This constraint is met by double-spaced run-length-limited (d, k, 2) codes where the number of consecutive zeros is even. Exploiting the polarity of the transitions, a detection window that has double the size of the code bit period is possible. The authors describe a detection circuit that achieves the enlarged detection window, and present a phase detector which is particularly simple to implement. Channel bit-error-rate measurements have been carried out employing a rate 1/3 (2, 8, 2) code. Error rates of 10^-8 were achieved for recording densities up to 33 kb/in. The results demonstrate the expected excellent immunity of the direct overwrite scheme to bloom.


1989

A systematic approach to carrier recovery and detection of digitally phase modulated signals on fading channels

R. Haeb-Umbach, H. Meyr, IEEE Transactions on Communications (1989)

The problem of optimal carrier recovery and detection of digitally phase modulated signals on fading channels is treated using a nonstructured approach, i.e., no constraint is placed on the receiver structure. First, the optimal receiver is derived for digitally phase-modulated signals when transmitted over a frequency-nonselective fading channel with memory. The memory results from the fact that usually the coherence time of the channel is larger than the symbol period. Symbols adjacent in time cannot be detected independently and therefore the well-known quadratic receiver is not optimal in this case. A maximum a posteriori (MAP) detector is derived which explicitly utilizes the channel memory for carrier recovery. The derivation shows that the optimal carrier recovery is, under certain conditions, a Kalman filter. Some attractive properties of this carrier recovery unit (including the absence of hang up) are discussed. Then the error rate of several digital modulation schemes is calculated taking the performance of the filter into account. The differences in susceptibility of the modulation schemes to carrier phase jitter are specified.


1988

A Comparison of Coherent and Differentially Coherent Detection Schemes for Fading Channels

R. Haeb-Umbach, in: International Conference on Vehicular Technology, Philadelphia, 1988

The common basis of coherent and differentially coherent detection is considered from the viewpoint of carrier recovery as the estimation of fading distortion. In the differentially coherent receiver a simple estimate of the fading distortion is used whereas the coherent receiver uses the optimal estimate. The bit error rates (BERs) for M-ary PSK (phase-shift keyed) and DPSK (differential phase-shift keyed) transmission are calculated using a single method of calculation for both detection schemes. The calculation takes into account nonperfect carrier recovery, cochannel interference, and diversity. The results allow a direct comparison of the two schemes and show that coherent detection is preferable in many realistic fading environments.


A digital Synchronizer for Linearly Modulated Signals Transmitted over a Frequency-Nonselective Fading Channel

R. Haeb-Umbach, H. Meyr, in: International Conference on Communications, Philadelphia, 1988

A digital carrier recovery structure which allows coherent detection on frequency-nonselective fading channels is presented. The synchronizer estimates the multiplicative distortion introduced by the channel. It is shown that the structure is superior to a phase-locked loop and is well suited for a fully digital realization. A detailed synchronizer design and simulation results are presented for a land-mobile radio channel. This includes a novel scheme for a fully digital frequency offset estimation and correction.


1987

Optimal Carrier Recovery and Detection on Frequency-Nonselective Fading Channels

R. Haeb-Umbach, H. Meyr, in: Proc. Symposium on Inf. Theory and Appl. (SITA), Tokyo, 1987


1986

An all digital implementation of a receiver for bandwidth-efficient communication

M. Oerder, G. Ascheid, R. Haeb-Umbach, H. Meyr, Signal Processing: Theories and Applications (1986)

