
Joint-ASR-AAC

Joint speech recognition and audio captioning

Description

This page describes our paper Joint Speech Recognition and Audio Captioning, submitted to ICASSP 2022. To improve model interpretability and enable a holistic understanding of real-world audio samples, we bring together the growing field of automated audio captioning (AAC) and the well-studied task of automatic speech recognition (ASR) in an end-to-end manner. AAC generates natural-language descriptions of the contents of an audio sample, while ASR extracts a transcript of its speech content. Example outputs from our jointly trained models, compared against independently trained ASR and AAC models, are shown below.

| Method | Generated Output(s) |
| --- | --- |
| ASR-only | n. e. scale now that’s a break job |
| AAC-only | a train approaches and blows a horn |
| Cat-AAC-ASR (concatenating AAC & ASR outputs) | nice giel now that’s a break job<br>a gun fires and a person whistles |
| Dual-decoder | nise feel now that’s a break job<br>gunshots fire and male voices with gunshots and blowing while a duck quacks in the background |
| Human Generated | nice kill, now that’s a great shot<br>a man speaking and gunshots ringing out |

Mixing Samples from WSJ and AudioCaps

A major hurdle in evaluating joint ASR-AAC models is the lack of labeled audio datasets with both speech transcriptions and audio captions. We therefore create a multi-task dataset (see the instructions below) by mixing clean speech from the Wall Street Journal (WSJ) corpus with background audio from the AudioCaps dataset at multiple noise levels.

Requirements

  pip install youtube-dl

Clone the repository

  git clone https://github.com/chintu619/Joint-ASR-AAC.git
  cd Joint-ASR-AAC

Download AudioCaps data
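AudioCaps labels 10-second segments of YouTube videos by video ID and start timestamp. A minimal sketch of fetching one clip with youtube-dl, assuming the AudioCaps CSV has already been parsed; the video ID, start time, and output directory below are placeholders:

```python
import subprocess

def audiocaps_download_cmd(youtube_id, start_sec, out_dir="audiocaps_wav"):
    """Build a youtube-dl command that extracts a 10-second WAV clip.

    AudioCaps annotates 10-second windows of YouTube audio; the clip is
    trimmed via ffmpeg postprocessor arguments after audio extraction.
    All paths here are hypothetical placeholders.
    """
    url = f"https://www.youtube.com/watch?v={youtube_id}"
    return [
        "youtube-dl", url,
        "-x", "--audio-format", "wav",           # extract audio track as WAV
        "-o", f"{out_dir}/{youtube_id}.%(ext)s",  # one file per video ID
        "--postprocessor-args",                   # trim to the labeled window
        f"-ss {start_sec} -t 10",
    ]

cmd = audiocaps_download_cmd("abc123", 30)
# subprocess.run(cmd, check=True)  # uncomment to actually download
```

Looping this over the rows of the AudioCaps train/val/test CSVs yields the background-audio pool used for mixing.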

Convert WSJ data to .WAV format
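WSJ audio ships in NIST SPHERE format (often shorten-compressed `.wv1` files), which most audio tooling cannot read directly; the standard converter is `sph2pipe`. A sketch of building the conversion command, assuming `sph2pipe` is installed and on `PATH` (the input path below is a placeholder):

```python
import subprocess
from pathlib import Path

def sph_to_wav_cmd(sph_path, wav_dir="wsj_wav"):
    """Build a sph2pipe command converting a SPHERE file to RIFF WAV.

    sph2pipe decodes both plain and shorten-compressed SPHERE input.
    The example input path is hypothetical.
    """
    sph = Path(sph_path)
    wav = Path(wav_dir) / (sph.stem + ".wav")
    return ["sph2pipe", "-f", "wav", str(sph), str(wav)]

cmd = sph_to_wav_cmd("wsj0/si_tr_s/011/011c0201.wv1")
# subprocess.run(cmd, check=True)  # requires sph2pipe installed
```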

Mix WSJ and AudioCaps samples
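The core mixing step scales the background audio so that the speech-to-noise power ratio hits a target SNR, then adds the two waveforms. A minimal NumPy sketch of that step (the actual repository scripts handle file I/O, resampling, and the specific SNR levels; the signals below are synthetic stand-ins):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix clean speech with background noise at a target SNR in dB.

    The noise is tiled/truncated to the speech length, then rescaled so
    that 10*log10(P_speech / P_noise) equals snr_db before summation.
    """
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # stand-in for a WSJ utterance (1 s @ 16 kHz)
noise = rng.standard_normal(8000)    # stand-in for an AudioCaps clip
mixture = mix_at_snr(speech, noise, snr_db=10.0)
```

Lower `snr_db` values bury the speech deeper in the background audio, which is how the multiple noise levels of the mixed dataset are produced.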

Notes