
Job Description

Blind source separation (BSS) can be seen as a generalization of denoising a noisy signal to the case where several sensors are available. Each sensor records the same physical phenomenon in a different way: this diversity can then be exploited to separate the signals present, for instance by independent component analysis (ICA) or sparse component analysis (SCA).
The main objective of speech separation/extraction is to mimic, with a machine, the human ability to separate multiple sound sources from their mixtures, i.e. a computer-based solution of the so-called cocktail party problem. The term was coined by Colin Cherry in 1953, who first asked: "How do we [humans] recognize what one person is saying when others are speaking at the same time?" Despite being studied extensively, it remains a scientific challenge as well as an active research area. A major line of effort in the signal processing community over the past decade has been to address the problem within the framework of convolutive blind source separation (CBSS), in which the sound recordings are modeled as linear convolutive mixtures of the unknown speech sources.
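As a minimal sketch of the convolutive mixing model mentioned above (variable names and dimensions are illustrative, not from the posting), each microphone signal is a sum of filtered sources, x_m(t) = Σ_n Σ_k h_mn(k) s_n(t−k), which can be simulated in a few lines of NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000      # samples per source (illustrative)
N, M = 2, 2   # number of sources, number of microphones
L = 8         # length of each mixing filter

s = rng.standard_normal((N, T))     # unknown speech sources
h = rng.standard_normal((M, N, L))  # room impulse responses h_mn

def convolutive_mix(h, s):
    """Linear convolutive mixture: x_m = sum_n h_mn * s_n."""
    M, N, L = h.shape
    T = s.shape[1]
    x = np.zeros((M, T + L - 1))    # full convolution length
    for m in range(M):
        for n in range(N):
            x[m] += np.convolve(h[m, n], s[n])
    return x

x = convolutive_mix(h, s)           # the microphone recordings
```

CBSS then tries to recover the sources s (and/or the filters h) from x alone, which is what makes the problem "blind".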

In the last decades, many unimodal algorithms have been developed, i.e. algorithms operating only in the audio domain. However, as is widely accepted, both speech production and perception are inherently audio-visual processes involving information from multiple modalities. On the one hand, the production of speech is usually coupled with the visible movement of the mouth and facial muscles. On the other hand, looking at the lip movements of a speaker (i.e. lip reading) helps listeners understand what is being said in a conversation, in particular when multiple competing conversations and background noise are present simultaneously. Research in this direction, i.e. integrating visual information into audio-only speech source separation, is emerging as an exciting new area of signal processing: multimodal (audio-visual) speech separation.

In this PhD thesis, we plan to extract speech sources using not only audio information but also video information, within a multimodal framework. Several approaches will be considered. For instance, we plan to extend our previous multimodal speech source separation work [6] in a more efficient way by using recent advances in the separation of multidimensional sources, or by analyzing the audio scene through its accompanying video. That is, we plan to perform scene analysis from the video (for instance, speaker localization) and use the result to design new audio source separation algorithms. As an example, this latter idea will allow us to estimate the filters between the speakers and the microphones. Several questions then have to be investigated, among them: how can we robustly estimate these filters?
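To make the filter-estimation question concrete, here is a hedged sketch for the simplified single-source, noiseless case where the source signal is assumed known (e.g. isolated with the help of video-based scene analysis); the setup and function name are illustrative assumptions, not the project's actual method. The FIR filter is recovered by least squares from the convolution relation x ≈ h * s:

```python
import numpy as np

rng = np.random.default_rng(1)
T, L = 2000, 16                    # signal length, filter length (illustrative)

s = rng.standard_normal(T)         # source signal, assumed known here
h_true = rng.standard_normal(L)    # unknown speaker-to-microphone filter
x = np.convolve(h_true, s)[:T]     # microphone observation x(t) = (h * s)(t)

def estimate_filter(s, x, L):
    """Least-squares estimate of an FIR filter h such that x ~= h * s."""
    T = len(x)
    # Convolution matrix: S[t, k] = s[t - k] (zero for t < k)
    S = np.zeros((T, L))
    for k in range(L):
        S[k:, k] = s[:T - k]
    h_hat, *_ = np.linalg.lstsq(S, x, rcond=None)
    return h_hat

h_hat = estimate_filter(s, x, L)
```

In the realistic multi-speaker, noisy setting the problem is far harder, which is precisely where robust estimation techniques and the video modality are expected to help.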

Job Information

Contact
email redacted
Related URL
http://www.gipsa-lab.grenob...
Institution
Grenoble University, GIPSA-Lab
Topic Categories
Location
Grenoble, France
Closing Date
Sept. 1, 2014