Description
This page provides details of our paper Joint Speech Recognition and Audio Captioning, submitted to ICASSP 2022. For better model interpretability and a holistic understanding of real-world audio samples, we aim to bring together the growing field of automated audio captioning (AAC) and the well-studied field of automatic speech recognition (ASR) in an end-to-end manner. The goal of AAC is to generate natural language descriptions of the contents of an audio sample, while ASR extracts a transcript of its speech content. Example outputs from our jointly trained models, compared with independently trained ASR and AAC models, are shown below.
| Method | Generated Output(s) |
|---|---|
| ASR-only | n. e. scale now that’s a break job |
| AAC-only | a train approaches and blows a horn |
| Cat-AAC-ASR (concatenating AAC & ASR outputs) | nice giel now that’s a break job<br>a gun fires and a person whistles |
| Dual-decoder | nise feel now that’s a break job<br>gunshots fire and male voices with gunshots and blowing while a duck quacks in the background |
| Human Generated | nice kill, now that’s a great shot<br>a man speaking and gunshots ringing out |
Mixing Samples from WSJ and AudioCaps
A major hurdle in evaluating joint ASR-AAC models is the lack of labeled audio datasets with both speech transcriptions and audio captions. Therefore, we create a multi-task dataset (see the instructions below) by mixing the clean-speech Wall Street Journal (WSJ) corpus with multiple levels of background noise drawn from the AudioCaps dataset.
Requirements
```
pip install youtube-dl
```
Clone the repository
```
git clone https://github.com/chintu619/Joint-ASR-AAC.git
cd Joint-ASR-AAC
```
Download AudioCaps data
- Download the YouTube audio using the `youtube-dl` and `ffmpeg` packages (a per-clip sketch is given below the directory tree):
```
cat data/train_audiocaps/files_su.txt | ./download_audiocaps.sh
```
- Expected directory structure:
```
corpora/audiocaps_data
│
└───train
│   │   000AjsqXq54.wav
│   │   001_HxkADSI.wav
│   │   004NnY1farU.wav
│   │   ...
```
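For intuition, a single clip is fetched and trimmed roughly as sketched below. This is only an illustration of how `youtube-dl` and `ffmpeg` fit together; the actual file naming, segment start times (read from the AudioCaps metadata), and target format are handled by `./download_audiocaps.sh`, and `YTID`, `START_SEC`, and the 16 kHz mono output here are assumptions.
```
# Hypothetical per-clip download; YTID and START_SEC are placeholders that would
# come from the AudioCaps metadata, and 16 kHz mono is an assumed target format.
YTID="000AjsqXq54"
START_SEC=30

mkdir -p raw corpora/audiocaps_data/train

# Fetch the full audio track of the YouTube video as a wav file
youtube-dl -x --audio-format wav -o "raw/${YTID}.%(ext)s" \
    "https://www.youtube.com/watch?v=${YTID}"

# Trim to the 10-second AudioCaps segment, downmix to mono, and resample
ffmpeg -i "raw/${YTID}.wav" -ss "${START_SEC}" -t 10 -ar 16000 -ac 1 \
    "corpora/audiocaps_data/train/${YTID}.wav"
```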
Convert WSJ data to `.wav` format
- Convert WSJ (WSJ0, WSJ1) samples from the original `.wv1` format to `.wav` format, arranged in the following directory structure (a conversion sketch is given below):
```
corpora/wsj_wav
│
└───train_si284
│   │   011c0201.wav
│   │   ...
│
└───test_dev93
│   │   4k0c0301.wav
│   │   ...
│
└───test_eval92
│   │   440c0401.wav
│   │   ...
```
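The conversion tool is not prescribed here; as a minimal sketch, assuming the standard LDC `sph2pipe` utility (WSJ `.wv1` files are shorten-compressed NIST SPHERE audio) and an illustrative source path, one split could be converted as:
```
# Minimal sketch: convert every .wv1 file of one WSJ split to .wav with sph2pipe.
# /path/to/wsj/... is a placeholder for the local WSJ0/WSJ1 location.
mkdir -p corpora/wsj_wav/train_si284
for f in /path/to/wsj/si_tr_s/*/*.wv1; do
  utt=$(basename "$f" .wv1)
  sph2pipe -f wav "$f" "corpora/wsj_wav/train_si284/${utt}.wav"
done
```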
Mix WSJ and AudioCaps samples
- Mix samples from both datasets with a specified mixing weight; the output samples will be saved to `corpora/wsj_mixnospwav{0.1,0.2,...}` (an illustrative single-pair mix is sketched below):
```
mix_audio.py "train_si284 test_dev93 test_eval92" train_audiocaps 0.1
mix_audio.py "train_si284 test_dev93 test_eval92" train_audiocaps 0.2
...
```
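For intuition, mixing one WSJ utterance with one AudioCaps clip at weight 0.1 can be approximated with `sox` as sketched below; this only illustrates weight-based mixing, and the actual scaling, alignment, and trimming performed by `mix_audio.py` may differ.
```
# Illustrative weight-based mix of a WSJ utterance and an AudioCaps clip (weight 0.1).
# Both files are assumed to already share the same sampling rate and channel count.
sox -m -v 1.0 corpora/wsj_wav/train_si284/011c0201.wav \
       -v 0.1 corpora/audiocaps_data/train/000AjsqXq54.wav \
       mixed_011c0201.wav
```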
Notes
- Some of the audio files on YouTube might no longer be available (shown as `ERROR: Video unavailable` while downloading).
- The audio samples in the AudioCaps dataset are not publicly available. Alternatively, one can use the Clotho-V2 dataset available here. We also provide a script to download and reformat this dataset: `./download_clothov2.sh`.