.. rst-class:: sphx-glr-example-title

.. _sphx_glr_beginner_audio_data_augmentation_tutorial.py:


Audio Data Augmentation
=======================

``torchaudio`` provides a variety of ways to augment audio data.

In this tutorial, we look into a way to apply effects, filters,
RIR (room impulse response), and codecs.

At the end, we synthesize noisy speech over a phone from clean speech.

.. code-block:: default

    import torch
    import torchaudio
    import torchaudio.functional as F

    print(torch.__version__)
    print(torchaudio.__version__)


Preparation
-----------

First, we import the modules and download the audio assets we use in
this tutorial.

.. code-block:: default

    import math

    from IPython.display import Audio
    import matplotlib.pyplot as plt

    from torchaudio.utils import download_asset

    SAMPLE_WAV = download_asset("tutorial-assets/steam-train-whistle-daniel_simon.wav")
    SAMPLE_RIR = download_asset("tutorial-assets/Lab41-SRI-VOiCES-rm1-impulse-mc01-stu-clo-8000hz.wav")
    SAMPLE_SPEECH = download_asset("tutorial-assets/Lab41-SRI-VOiCES-src-sp0307-ch127535-sg0042-8000hz.wav")
    SAMPLE_NOISE = download_asset("tutorial-assets/Lab41-SRI-VOiCES-rm1-babb-mc01-stu-clo-8000hz.wav")


Applying effects and filtering
------------------------------

:py:mod:`torchaudio.sox_effects` allows for directly applying filters similar
to those available in ``sox`` to Tensor objects and file-object audio sources.

There are two functions for this:

-  :py:func:`torchaudio.sox_effects.apply_effects_tensor` for applying effects
   to a Tensor.
-  :py:func:`torchaudio.sox_effects.apply_effects_file` for applying effects to
   other audio sources.

Both functions accept effect definitions in the form ``List[List[str]]``.
This is mostly consistent with how the ``sox`` command works, but one caveat
is that ``sox`` adds some effects automatically, whereas ``torchaudio``’s
implementation does not.

For the list of available effects, please refer to `the sox
documentation <http://sox.sourceforge.net/sox.html>`__.

**Tip** If you need to load and resample your audio data on the fly,
then you can use :py:func:`torchaudio.sox_effects.apply_effects_file`
with effect ``"rate"``.

**Note** :py:func:`torchaudio.sox_effects.apply_effects_file` accepts a
file-like object or path-like object. Similar to :py:func:`torchaudio.load`,
when the audio format cannot be inferred from either the file extension or
header, you can provide argument ``format`` to specify the format of the
audio source.

**Note** This process is not differentiable.

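To make the tip above concrete, here is a minimal sketch of on-the-fly
resampling with :py:func:`torchaudio.sox_effects.apply_effects_file`. The
target rate of 8000 Hz and the reuse of ``SAMPLE_WAV`` as the source are
illustrative choices for this sketch, not part of the tutorial recipe.

.. code-block:: default

    # Load the file and resample it to 8000 Hz in a single call.
    # The "rate" effect performs the resampling, so no separate resample
    # step is needed after loading.
    resampled, resampled_rate = torchaudio.sox_effects.apply_effects_file(
        SAMPLE_WAV,
        effects=[["rate", "8000"]],
    )
    print(resampled.shape, resampled_rate)  # resampled_rate is 8000
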
.. code-block:: default

    # Load the data
    waveform1, sample_rate1 = torchaudio.load(SAMPLE_WAV)

    # Define effects
    effects = [
        ["lowpass", "-1", "300"],  # apply single-pole lowpass filter
        ["speed", "0.8"],  # reduce the speed
        # This only changes sample rate, so it is necessary to
        # add `rate` effect with original sample rate after this.
        ["rate", f"{sample_rate1}"],
        ["reverb", "-w"],  # Reverberation gives some dramatic feeling
    ]

    # Apply effects
    waveform2, sample_rate2 = torchaudio.sox_effects.apply_effects_tensor(waveform1, sample_rate1, effects)

    print(waveform1.shape, sample_rate1)
    print(waveform2.shape, sample_rate2)


Note that the number of frames and the number of channels are different from
those of the original after the effects are applied. Let’s listen to the
audio.

.. code-block:: default

    def plot_waveform(waveform, sample_rate, title="Waveform", xlim=None):
        waveform = waveform.numpy()

        num_channels, num_frames = waveform.shape
        time_axis = torch.arange(0, num_frames) / sample_rate

        figure, axes = plt.subplots(num_channels, 1)
        if num_channels == 1:
            axes = [axes]
        for c in range(num_channels):
            axes[c].plot(time_axis, waveform[c], linewidth=1)
            axes[c].grid(True)
            if num_channels > 1:
                axes[c].set_ylabel(f"Channel {c+1}")
            if xlim:
                axes[c].set_xlim(xlim)
        figure.suptitle(title)
        plt.show(block=False)


.. code-block:: default

    def plot_specgram(waveform, sample_rate, title="Spectrogram", xlim=None):
        waveform = waveform.numpy()

        num_channels, _ = waveform.shape

        figure, axes = plt.subplots(num_channels, 1)
        if num_channels == 1:
            axes = [axes]
        for c in range(num_channels):
            axes[c].specgram(waveform[c], Fs=sample_rate)
            if num_channels > 1:
                axes[c].set_ylabel(f"Channel {c+1}")
            if xlim:
                axes[c].set_xlim(xlim)
        figure.suptitle(title)
        plt.show(block=False)


Original:
~~~~~~~~~

.. code-block:: default

    plot_waveform(waveform1, sample_rate1, title="Original", xlim=(-0.1, 3.2))
    plot_specgram(waveform1, sample_rate1, title="Original", xlim=(0, 3.04))
    Audio(waveform1, rate=sample_rate1)


Effects applied:
~~~~~~~~~~~~~~~~

.. code-block:: default

    plot_waveform(waveform2, sample_rate2, title="Effects Applied", xlim=(-0.1, 3.2))
    plot_specgram(waveform2, sample_rate2, title="Effects Applied", xlim=(0, 3.04))
    Audio(waveform2, rate=sample_rate2)


Doesn’t it sound more dramatic?


Simulating room reverberation
-----------------------------

`Convolution reverb <https://en.wikipedia.org/wiki/Convolution_reverb>`__ is a
technique used to make clean audio sound as though it has been produced in a
different environment.

Using a Room Impulse Response (RIR), for instance, we can make clean speech
sound as though it has been uttered in a conference room.

For this process, we need RIR data. The following data are from the VOiCES
dataset, but you can record your own: just turn on your microphone and clap
your hands.

.. code-block:: default

    rir_raw, sample_rate = torchaudio.load(SAMPLE_RIR)
    plot_waveform(rir_raw, sample_rate, title="Room Impulse Response (raw)")
    plot_specgram(rir_raw, sample_rate, title="Room Impulse Response (raw)")
    Audio(rir_raw, rate=sample_rate)


First, we need to clean up the RIR. We extract the main impulse, normalize
the signal power, then flip along the time axis. The flip is needed because
``torch.nn.functional.conv1d`` computes cross-correlation, so a time-reversed
kernel yields a true convolution.

.. code-block:: default

    rir = rir_raw[:, int(sample_rate * 1.01) : int(sample_rate * 1.3)]
    rir = rir / torch.norm(rir, p=2)
    RIR = torch.flip(rir, [1])

    plot_waveform(rir, sample_rate, title="Room Impulse Response")


Then, we convolve the speech signal with the RIR filter.

.. code-block:: default

    speech, _ = torchaudio.load(SAMPLE_SPEECH)

    speech_ = torch.nn.functional.pad(speech, (RIR.shape[1] - 1, 0))
    augmented = torch.nn.functional.conv1d(speech_[None, ...], RIR[None, ...])[0]


Original:
~~~~~~~~~

.. code-block:: default

    plot_waveform(speech, sample_rate, title="Original")
    plot_specgram(speech, sample_rate, title="Original")
    Audio(speech, rate=sample_rate)


RIR applied:
~~~~~~~~~~~~

.. code-block:: default

    plot_waveform(augmented, sample_rate, title="RIR Applied")
    plot_specgram(augmented, sample_rate, title="RIR Applied")
    Audio(augmented, rate=sample_rate)

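On recent torchaudio releases, the same reverberation can be obtained without
the manual padding and flipping by using
:py:func:`torchaudio.functional.fftconvolve`, if that function is available in
your installed version. The following is an alternative sketch under that
assumption, not the method used in the rest of this tutorial.

.. code-block:: default

    # Alternative sketch: assumes torchaudio.functional.fftconvolve exists
    # (recent releases). fftconvolve performs a true convolution, so the
    # un-flipped ``rir`` is used and no manual padding is required.
    # With the default mode ("full"), the output keeps the whole reverb tail,
    # so it is ``rir.shape[1] - 1`` samples longer than ``augmented`` above.
    rir_convolved = F.fftconvolve(speech, rir)
    print(augmented.shape, rir_convolved.shape)
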
Adding background noise
-----------------------

To add background noise to audio data, you can simply add a noise Tensor to
the Tensor representing the audio data. A common method to adjust the
intensity of noise is changing the Signal-to-Noise Ratio (SNR).
[`wikipedia <https://en.wikipedia.org/wiki/Signal-to-noise_ratio>`__]

.. math::

   \mathrm{SNR} = \frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}}

.. math::

   \mathrm{SNR_{dB}} = 10 \log_{10} \mathrm{SNR}

Note that the code below measures signal strength with the L2 norm of the
waveform, which is an amplitude rather than a power, so the SNR in dB is
converted to an amplitude ratio via :math:`10^{\mathrm{SNR_{dB}}/20}`.

.. code-block:: default

    speech, _ = torchaudio.load(SAMPLE_SPEECH)
    noise, _ = torchaudio.load(SAMPLE_NOISE)
    noise = noise[:, : speech.shape[1]]

    speech_power = speech.norm(p=2)
    noise_power = noise.norm(p=2)

    snr_dbs = [20, 10, 3]
    noisy_speeches = []
    for snr_db in snr_dbs:
        snr = 10 ** (snr_db / 20)
        scale = snr * noise_power / speech_power
        noisy_speeches.append((scale * speech + noise) / 2)


Background noise:
~~~~~~~~~~~~~~~~~

.. code-block:: default

    plot_waveform(noise, sample_rate, title="Background noise")
    plot_specgram(noise, sample_rate, title="Background noise")
    Audio(noise, rate=sample_rate)


SNR 20 dB:
~~~~~~~~~~

.. code-block:: default

    snr_db, noisy_speech = snr_dbs[0], noisy_speeches[0]
    plot_waveform(noisy_speech, sample_rate, title=f"SNR: {snr_db} [dB]")
    plot_specgram(noisy_speech, sample_rate, title=f"SNR: {snr_db} [dB]")
    Audio(noisy_speech, rate=sample_rate)


SNR 10 dB:
~~~~~~~~~~

.. code-block:: default

    snr_db, noisy_speech = snr_dbs[1], noisy_speeches[1]
    plot_waveform(noisy_speech, sample_rate, title=f"SNR: {snr_db} [dB]")
    plot_specgram(noisy_speech, sample_rate, title=f"SNR: {snr_db} [dB]")
    Audio(noisy_speech, rate=sample_rate)


SNR 3 dB:
~~~~~~~~~

.. code-block:: default

    snr_db, noisy_speech = snr_dbs[2], noisy_speeches[2]
    plot_waveform(noisy_speech, sample_rate, title=f"SNR: {snr_db} [dB]")
    plot_specgram(noisy_speech, sample_rate, title=f"SNR: {snr_db} [dB]")
    Audio(noisy_speech, rate=sample_rate)

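Recent torchaudio releases also provide
:py:func:`torchaudio.functional.add_noise`, which performs this scaling for
you. The sketch below assumes that function is available in your installed
version; the choice of 10 dB is arbitrary.

.. code-block:: default

    # Sketch: assumes torchaudio.functional.add_noise exists (recent releases).
    # It mixes ``noise`` into ``speech`` at the requested SNR, given in dB as a
    # tensor that broadcasts over the leading dimensions.
    snr = torch.tensor([10.0])
    noisy_via_functional = F.add_noise(speech, noise, snr)
    print(noisy_via_functional.shape)
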
Applying codec to Tensor object
-------------------------------

:py:func:`torchaudio.functional.apply_codec` can apply codecs to a Tensor
object.

**Note** This process is not differentiable.

.. code-block:: default

    waveform, sample_rate = torchaudio.load(SAMPLE_SPEECH)

    configs = [
        {"format": "wav", "encoding": "ULAW", "bits_per_sample": 8},
        {"format": "gsm"},
        {"format": "vorbis", "compression": -1},
    ]
    waveforms = []
    for param in configs:
        augmented = F.apply_codec(waveform, sample_rate, **param)
        waveforms.append(augmented)


Original:
~~~~~~~~~

.. code-block:: default

    plot_waveform(waveform, sample_rate, title="Original")
    plot_specgram(waveform, sample_rate, title="Original")
    Audio(waveform, rate=sample_rate)


8 bit mu-law:
~~~~~~~~~~~~~

.. code-block:: default

    plot_waveform(waveforms[0], sample_rate, title="8 bit mu-law")
    plot_specgram(waveforms[0], sample_rate, title="8 bit mu-law")
    Audio(waveforms[0], rate=sample_rate)


GSM-FR:
~~~~~~~

.. code-block:: default

    plot_waveform(waveforms[1], sample_rate, title="GSM-FR")
    plot_specgram(waveforms[1], sample_rate, title="GSM-FR")
    Audio(waveforms[1], rate=sample_rate)


Vorbis:
~~~~~~~

.. code-block:: default

    plot_waveform(waveforms[2], sample_rate, title="Vorbis")
    plot_specgram(waveforms[2], sample_rate, title="Vorbis")
    Audio(waveforms[2], rate=sample_rate)


Simulating a phone recording
----------------------------

Combining the previous techniques, we can simulate audio that sounds like a
person talking over a phone in an echoey room with people talking in the
background.

.. code-block:: default

    sample_rate = 16000
    original_speech, sample_rate = torchaudio.load(SAMPLE_SPEECH)

    plot_specgram(original_speech, sample_rate, title="Original")

    # Apply RIR
    speech_ = torch.nn.functional.pad(original_speech, (RIR.shape[1] - 1, 0))
    rir_applied = torch.nn.functional.conv1d(speech_[None, ...], RIR[None, ...])[0]

    plot_specgram(rir_applied, sample_rate, title="RIR Applied")

    # Add background noise
    # Because the noise is recorded in the actual environment, we consider that
    # the noise contains the acoustic feature of the environment. Therefore, we add
    # the noise after RIR application.
    noise, _ = torchaudio.load(SAMPLE_NOISE)
    noise = noise[:, : rir_applied.shape[1]]

    snr_db = 8
    # Convert SNR in dB to an amplitude ratio, as in the previous section.
    scale = 10 ** (snr_db / 20) * noise.norm(p=2) / rir_applied.norm(p=2)
    bg_added = (scale * rir_applied + noise) / 2

    plot_specgram(bg_added, sample_rate, title="BG noise added")

    # Apply filtering and change sample rate
    filtered, sample_rate2 = torchaudio.sox_effects.apply_effects_tensor(
        bg_added,
        sample_rate,
        effects=[
            ["lowpass", "4000"],
            [
                "compand",
                "0.02,0.05",
                "-60,-60,-30,-10,-20,-8,-5,-8,-2,-8",
                "-8",
                "-7",
                "0.05",
            ],
            ["rate", "8000"],
        ],
    )

    plot_specgram(filtered, sample_rate2, title="Filtered")

    # Apply telephony codec
    codec_applied = F.apply_codec(filtered, sample_rate2, format="gsm")
    plot_specgram(codec_applied, sample_rate2, title="GSM Codec Applied")


Original speech:
~~~~~~~~~~~~~~~~

.. code-block:: default

    Audio(original_speech, rate=sample_rate)


RIR applied:
~~~~~~~~~~~~

.. code-block:: default

    Audio(rir_applied, rate=sample_rate)


Background noise added:
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: default

    Audio(bg_added, rate=sample_rate)


Filtered:
~~~~~~~~~

.. code-block:: default

    Audio(filtered, rate=sample_rate2)


Codec applied:
~~~~~~~~~~~~~~

.. code-block:: default

    Audio(codec_applied, rate=sample_rate2)

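If you want to reuse this chain as a single augmentation step, for example
inside a ``Dataset``, a small helper can bundle the stages. In the sketch
below, the function name ``simulate_phone_recording``, the simplified effect
chain (the ``compand`` stage is omitted), and the default ``snr_db`` are
illustrative assumptions, not part of the tutorial.

.. code-block:: default

    def simulate_phone_recording(speech, sample_rate, rir_flipped, noise, snr_db=8.0):
        # Reverberation: pad, then convolve with a time-flipped RIR
        # (such as ``RIR`` above), as in the tutorial.
        padded = torch.nn.functional.pad(speech, (rir_flipped.shape[1] - 1, 0))
        reverbed = torch.nn.functional.conv1d(padded[None, ...], rir_flipped[None, ...])[0]

        # Mix in background noise at the requested SNR
        # (amplitude ratio = 10 ** (snr_db / 20)).
        noise = noise[:, : reverbed.shape[1]]
        scale = 10 ** (snr_db / 20) * noise.norm(p=2) / reverbed.norm(p=2)
        mixed = (scale * reverbed + noise) / 2

        # Band-limit, resample to 8 kHz, then apply the GSM telephony codec.
        filtered, rate = torchaudio.sox_effects.apply_effects_tensor(
            mixed, sample_rate, effects=[["lowpass", "4000"], ["rate", "8000"]]
        )
        return F.apply_codec(filtered, rate, format="gsm"), rate


    phone_like, phone_rate = simulate_phone_recording(original_speech, sample_rate, RIR, noise)
    print(phone_like.shape, phone_rate)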