Spatialized WSJ0-2mix

The recipes/sp_wsj02mix directory provides an end-to-end pipeline for the Spatialized WSJ0-2mix speech separation benchmark. The recipe automates data generation, preprocessing, model training, and evaluation on noisy, reverberant two-speaker mixtures rendered with spatial room impulse responses.

The Makefile is organized into GNU Make stages (stage0 through stage5). Each stage writes a .done stamp file when it completes, so you can resume from intermediate results. Run a single stage with make stageN, or execute the complete pipeline with make all. Variables such as data, duration, and train_path can be overridden on the command line. For example:

make stage2 duration=96000 train_path=models/nfca/unet
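If you launch stages from a script, the same override syntax can be assembled programmatically. The following is a minimal sketch and not part of the recipe: build_make_command is a hypothetical helper that only builds the argument list, so nothing is executed here.

```python
import shlex


def build_make_command(stage, **overrides):
    """Return the argv list for `make <stage> key=value ...`."""
    argv = ["make", stage]
    # GNU Make accepts variable overrides as plain key=value arguments.
    argv += [f"{key}={value}" for key, value in overrides.items()]
    return argv


argv = build_make_command("stage2", duration=96000, train_path="models/nfca/unet")
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run(argv, cwd="recipes/sp_wsj02mix", check=True), assuming GNU Make is installed and the recipe directory is checked out at that path.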

Configurable Make variables

You can inspect the full Makefile on GitHub: recipes/sp_wsj02mix/Makefile.

| Variable | Default | Description |
| --- | --- | --- |
| data | derev | Selects which preprocessed data stream (e.g., dereverberated vs. reverberant) to pack into HDF5 files. |
| duration | 64000 | Number of audio samples per excerpt when building the HDF5 dataset. |
| train_path | models/nfca/unet | Directory that stores the model training configuration, checkpoints, and job logs. |
| tag | nfca | Short identifier used when naming inference runs. |
| inference_name | $(tag)_<timestamp>_<rand> | Inference results directory name; automatically generated but can be overridden for resumption. |
| inference_command | python -m sbss.nfca.pipelines.separate batch $(train_path) $$src_path $$dst_path | Command executed during Stage 4 for inference; modify to plug in alternative pipelines. |
| inference_path | results/$(inference_name) | Output directory for separated signals, logs, and evaluation artifacts. |
| cmd / job_ops | inherited from recipes/globals.mk | Cluster submission command plus optional job arguments; default invokes aiaccel-job local. |
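The default inference_name follows the pattern $(tag)_<timestamp>_<rand>. As a rough illustration of how such a name is shaped, the sketch below builds one in Python. The timestamp format (%Y%m%d-%H%M%S) and the four-character hex suffix are assumptions made for this example; the Makefile may use different formats. make_inference_name is a hypothetical helper.

```python
import secrets
from datetime import datetime


def make_inference_name(tag="nfca", now=None, rand=None):
    """Build a name shaped like <tag>_<timestamp>_<rand>."""
    # Assumed formats: the Makefile's actual timestamp and suffix may differ.
    now = now or datetime.now()
    rand = rand or secrets.token_hex(2)  # 4 hex characters
    return f"{tag}_{now:%Y%m%d-%H%M%S}_{rand}"


name = make_inference_name()
print(name)
print(f"results/{name}")  # inference_path defaults to results/$(inference_name)
```

Because a fresh name is generated for every run, resuming an earlier run means passing the existing directory name explicitly, for example make stage4 inference_name=<existing name>.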

Stage-by-stage guide

Stage 0: dataset preparation

make stage0 calls scripts/0_prepare_dataset.py to download/generate the dry, noisy, and reverberant Spatialized WSJ0-2mix waveforms. Ensure the hdf5/ directory exists before launching this step.

Stage 1: preprocessing

make stage1 launches two commands per split (tr, cv, tt):

  • scripts/1_add_noise.py adds diffuse noise to the clean mixtures

  • scripts/1_dereverberate.py removes the simulated room effect to provide auxiliary dereverberated targets

Both commands are submitted through $(cmd) so they can fan out across job slots on a cluster.
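To make the fan-out concrete, the sketch below enumerates the jobs that Stage 1 implies: two scripts for each of three splits, six jobs in total. It only lists (script, split) pairs. The actual command lines, their arguments, and any ordering between the two scripts are defined in the Makefile and are not reproduced here; stage1_jobs is a hypothetical helper.

```python
from itertools import product

SPLITS = ("tr", "cv", "tt")
STAGE1_SCRIPTS = ("scripts/1_add_noise.py", "scripts/1_dereverberate.py")


def stage1_jobs():
    """List one (script, split) pair per Stage 1 job."""
    return [(script, split) for split, script in product(SPLITS, STAGE1_SCRIPTS)]


jobs = stage1_jobs()
print(len(jobs))
for script, split in jobs:
    # Each pair corresponds to one submission through $(cmd).
    print(f"{script} ({split})")
```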

Stage 2: HDF5 generation

make stage2 converts the processed audio into chunked HDF5 datasets by executing scripts/2_make_hdf5_unsupervised.py with the current data and duration arguments. Only the tr and cv splits are packaged, as defined by HDF5_SPLITS.
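The effect of duration on the number of excerpts can be sketched as follows. This example assumes non-overlapping excerpts and drops a trailing remainder shorter than duration; the actual script may pad or overlap instead. excerpt_bounds is a hypothetical helper used only for illustration.

```python
def excerpt_bounds(n_samples, duration=64000):
    """Return (start, end) sample indices of full-length excerpts."""
    if duration <= 0:
        raise ValueError("duration must be positive")
    n_excerpts = n_samples // duration  # a shorter trailing remainder is dropped
    return [(i * duration, (i + 1) * duration) for i in range(n_excerpts)]


print(excerpt_bounds(200000))         # default duration=64000
print(excerpt_bounds(200000, 96000))  # as with `make stage2 duration=96000`
```

Under these assumptions, a 200000-sample waveform yields three excerpts at the default duration and two at duration=96000, so a larger duration trades the number of training excerpts for longer context per excerpt.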

Stage 3: model training

make stage3 triggers aiaccel.torch.apps.train with the Lightning/Aiaccel configuration stored under $(train_path)/config.yaml. Before training kicks off, existing checkpoints and logs are removed after an interactive confirmation. The step finishes when $(train_path)/.train.done is created.

Stage 4: inference

make stage4 separates the cv and tt sets by running python -m sbss.nfca.pipelines.separate batch with the trained checkpoint specified by train_path. Outputs (and job logs) are stored in results/<inference_name>. Each split produces a stamp file named .4_inference.<split>.done under the corresponding results directory.
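The stamp files make it easy to check which splits still need inference before resuming a run. The sketch below assumes the stamp files sit directly inside the results directory; pending_splits is a hypothetical helper, and the demo uses a temporary directory in place of results/<inference_name>.

```python
import tempfile
from pathlib import Path


def pending_splits(results_dir, splits=("cv", "tt")):
    """Return the splits whose Stage 4 stamp file is missing."""
    results_dir = Path(results_dir)
    return [s for s in splits if not (results_dir / f".4_inference.{s}.done").exists()]


# Demo in a temporary directory standing in for results/<inference_name>.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / ".4_inference.cv.done").touch()
    print(pending_splits(tmp))  # cv is stamped, so only tt remains
```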

Stage 5: evaluation

make stage5 evaluates SDR, STOI, and PESQ with scripts/5_evaluate_sdr.py, scripts/5_evaluate_stoi.py, and scripts/5_evaluate_pesq.py, respectively. After all metric jobs succeed, the scores target runs scripts/summarize_scores.py to aggregate the measurements for the current inference_name.
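For readers unfamiliar with the first metric, the sketch below computes the basic signal-to-distortion ratio for a single reference/estimate pair. It is only the textbook definition, not the recipe's implementation: scripts/5_evaluate_sdr.py may compute a different variant (for example a BSS Eval or scale-invariant SDR) and has to handle speaker permutation, which this example ignores. sdr_db is a hypothetical helper.

```python
import math


def sdr_db(reference, estimate):
    """Plain SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    signal = sum(r * r for r in reference)
    error = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error == 0.0:
        return math.inf  # perfect reconstruction
    if signal == 0.0:
        return -math.inf  # silent reference, nonzero error
    return 10.0 * math.log10(signal / error)


reference = [1.0, -1.0, 1.0, -1.0]
estimate = [0.9, -0.9, 0.9, -0.9]
print(round(sdr_db(reference, estimate), 2))
```

In this toy case the error has 1% of the reference energy, which corresponds to roughly 20 dB.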

Configuration file

The dataset configuration is stored in recipes/sp_wsj02mix/config.yaml, which inherits recipes/globals.yaml. Key fields include:

| Key | Default | Description |
| --- | --- | --- |
| n_mixtures.tr / n_mixtures.cv / n_mixtures.tt | 20000 / 5000 / 3000 | Numbers of mixtures for the corresponding splits. |
| n_mics | 4 | Microphone channels assumed in STFTs and SCMs. |
| n_fft / hop_length | 512 / 128 | STFT window size and hop length shared by encoders/decoders. |
| snr | 30 | Target SNR (dB) for adding white noise when generating mixtures. |
| wpe.taps / wpe.delay | 10 / 3 | WPE dereverberation hyperparameters used in preprocessing scripts. |
| blacklist | list of WSJ0 filenames | Mixtures excluded due to anomalies when rendering Spatialized WSJ0-2mix. |
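A few quantities follow directly from these defaults, and the sketch below works them out. It assumes a standard one-sided STFT; whether the recipe pads the signal before framing is not stated here, so both the unpadded and the center-padded frame counts are shown. The helper names are hypothetical.

```python
N_FFT, HOP_LENGTH, SNR_DB = 512, 128, 30


def n_freq_bins(n_fft=N_FFT):
    """One-sided STFT bins for a real-valued signal."""
    return n_fft // 2 + 1


def n_frames(n_samples, n_fft=N_FFT, hop_length=HOP_LENGTH, center=False):
    """STFT frame count; center=True assumes n_fft // 2 padding on both sides."""
    if center:
        n_samples += 2 * (n_fft // 2)
    return 1 + (n_samples - n_fft) // hop_length


def noise_power(signal_power, snr_db=SNR_DB):
    """Noise power that yields the requested SNR in dB."""
    return signal_power / 10 ** (snr_db / 10)


print(n_freq_bins())                 # 257
print(n_frames(64000))               # 497
print(n_frames(64000, center=True))  # 501
print(noise_power(1.0))              # 0.001
```

With the default duration of 64000 samples, each excerpt therefore spans about 500 STFT frames of 257 frequency bins per microphone, and an SNR of 30 dB places the noise power at one thousandth of the signal power.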