The Visual Microphone: Videos and Results

 

This page contains captured videos and results produced by our technique. When recovering speech, we additionally denoise the sounds recovered by our method using a speech enhancement algorithm. For non-speech clips, we denoise using spectral subtraction. More details can be found in our paper.
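
For readers unfamiliar with spectral subtraction, the following is a minimal sketch of the basic magnitude-domain variant, not the implementation used to produce the results on this page. It assumes a noise-only clip is available to estimate the noise spectrum; the function name, the frame length, and the `floor` parameter are illustrative choices that do not come from the paper.

```python
import numpy as np

def spectral_subtraction(signal, noise_clip, frame_len=256, floor=0.02):
    """Basic magnitude spectral subtraction (illustrative sketch only).

    signal:     1-D array, the noisy audio to clean up
    noise_clip: 1-D array assumed to contain noise only
    floor:      fraction of the noisy magnitude kept as a spectral floor
    """
    signal = np.asarray(signal, dtype=float)
    hop = frame_len // 2
    # Periodic Hann window: copies shifted by half a frame sum to 1
    window = np.hanning(frame_len + 1)[:-1]

    def frames(x):
        x = np.asarray(x, dtype=float)
        n = 1 + (len(x) - frame_len) // hop
        return np.stack([x[i * hop:i * hop + frame_len] * window for i in range(n)])

    # Average noise magnitude spectrum, estimated from the noise-only clip
    noise_mag = np.abs(np.fft.rfft(frames(noise_clip), axis=1)).mean(axis=0)

    spec = np.fft.rfft(frames(signal), axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    # Subtract the noise estimate, never dropping below the spectral floor
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)

    # Overlap-add the processed frames back into a waveform
    out = np.zeros(len(signal))
    for i, frame in enumerate(clean):
        out[i * hop:i * hop + frame_len] += frame
    return out
```

The noisy phase is reused unchanged, which is the usual simplification in this family of methods. The first and last half-frame of the output are attenuated by the window taper.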

The videos are Motion JPEG compressed and stored in AVI format (our results were produced on those compressed videos).
The spectrograms shown on this page are on a log scale specified in dB. The specific scale for each plot is shown next to it.
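
As a rough illustration of what a log-scale spectrogram in dB means, the sketch below computes one with NumPy. The window type, frame length, hop size, and the choice of the strongest bin as the 0 dB reference are all assumptions made for this example; the plots on this page may use different settings.

```python
import numpy as np

def spectrogram_db(signal, frame_len=256, hop=128, eps=1e-12):
    """Power spectrogram in dB, shape (frequency bins, frames).

    0 dB is assigned to the strongest time-frequency bin (an assumed
    convention for this sketch).
    """
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    ref = max(power.max(), eps)  # guard against an all-zero input
    # Clamp tiny values so the logarithm stays finite
    return 10.0 * np.log10(np.maximum(power / ref, eps)).T
```

On a dB scale, a bin with one tenth of the reference power reads as -10 dB, which is why low-level noise that is invisible on a linear color axis becomes visible here.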

* NOTE: our audio results are best experienced using good speakers, preferably headphones.


 


 

 

Visual Sound Recovery

In these experiments, we played a MIDI recording of "Mary had a little lamb" at a bag of chips and a plant, and recovered the audio from a video of those objects using our technique. These examples are presented in Figure 1 in the paper. However, in that figure, the color axis of the spectrograms is shown on a linear scale. Here, we show them on a logarithmic scale to give a sense of the noise characteristics of our algorithm. Because the input audio consists of pure tones that change frequency abruptly, the spectrum is smeared across all frequencies at these changes, as shown both in the input and in our result.

Input video | Input audio | Recovered audio
Chips2, 2200Hz, 704x400: .avi (10.8GB) | Mary had a little lamb MIDI: .wav | .wav
Plant, 2200Hz, 704x400: .avi (12.1GB) | Mary had a little lamb MIDI: .wav | .wav

In these experiments, a male speaker recited the nursery rhyme "Mary had a little lamb..." near a different bag of chips. The audio was recorded by a microphone and also recovered from a video of the object using our technique. We provide both for comparison, as well as a version of the microphone recording resampled to the same rate as the video.

Input video | Recorded by microphone | Microphone recording resampled to video rate | Recovered audio
Chips1, 2200Hz, 704x704: .avi (13.3GB) | "Mary had a little lamb...": .wav | .wav | .wav
Chip1, 20000Hz, 192x192: .avi (11.6GB) | "Mary had a little lamb...": .wav | .wav | .wav
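
To make the comparison at the video rate concrete, here is a minimal sketch of bringing a microphone recording down to the video's sampling rate. It is an assumption-laden example, not the procedure used for the files above: the function name is hypothetical, the 44100 Hz microphone rate in the usage note is a guess, and the method (truncating the spectrum, which acts as an ideal low-pass filter) is just one simple way to resample.

```python
import numpy as np

def resample_fft(signal, rate_in, rate_out):
    """Resample a finite clip by truncating or zero-padding its spectrum.

    When downsampling, everything above the new Nyquist frequency is
    discarded, which prevents aliasing.
    """
    signal = np.asarray(signal, dtype=float)
    n_out = int(round(len(signal) * rate_out / rate_in))
    spectrum = np.fft.rfft(signal)
    n_bins = n_out // 2 + 1
    resized = np.zeros(n_bins, dtype=complex)
    keep = min(n_bins, len(spectrum))
    resized[:keep] = spectrum[:keep]
    # irfft divides by n_out instead of len(signal), so rescale amplitudes
    return np.fft.irfft(resized, n=n_out) * (n_out / len(signal))
```

For example, `resample_fft(mic, 44100, 2200)` would keep only content below 1100 Hz, which is the highest frequency a 2200Hz video can represent. Listening to this version gives a fairer sense of what the recovered audio could sound like at best.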

 

 

Rolling Shutter

These videos were taken with a Pentax K-01 with a 31mm lens. The camera recorded at 60 FPS at a resolution of 1280x720, with an exposure time of 1/2000 sec.

Input video | Input audio | Recovered audio
KitKat, 60Hz rolling shutter, 1280x720: .avi (173MB) | Mary had a little lamb MIDI: .wav | .wav
KitKat, 60Hz rolling shutter, 1280x720: .avi (32MB) | "Once upon a midnight dreary": .wav | .wav
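
A rolling-shutter sensor exposes the rows of a frame one after another, so each row records the object at a slightly different time. The sketch below works out those row times for the camera settings listed above. It is a simplified model, not the one from the paper: `readout_fraction` is a hypothetical parameter, and its default of 1.0 assumes rows are read out evenly over the whole frame period with no idle gap between frames, which a real camera may not satisfy. The exposure time is ignored here.

```python
def row_capture_times(fps, n_rows, n_frames, readout_fraction=1.0):
    """Start-of-exposure time in seconds for every row of every frame.

    readout_fraction (hypothetical) is the share of the frame period
    spent reading rows; 1.0 means no idle gap between frames.
    """
    frame_period = 1.0 / fps
    line_delay = readout_fraction * frame_period / n_rows
    return [f * frame_period + r * line_delay
            for f in range(n_frames) for r in range(n_rows)]

# Settings from this section: 60 FPS, 720 rows per frame
times = row_capture_times(fps=60, n_rows=720, n_frames=2)
line_rate = 1.0 / (times[1] - times[0])  # rows per second in this model
```

Under the no-gap assumption the rows arrive at roughly 60 x 720 = 43,200 per second, far above the 60 Hz frame rate, which suggests why a rolling-shutter video can carry audio frequencies that a global-shutter video at the same frame rate cannot.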