Yes, an AudioSet-finetuned checkpoint of the base model is available; it reaches 49.7 mAP. One can use it as follows:

```python
from typing import Any, Mapping

import dasheng
import torch


class DashengAudiosetClassifier(torch.nn.Module):
    """Dasheng encoder with a linear head for the 527 AudioSet classes."""

    def __init__(self) -> None:
        super().__init__()
        self.dashengmodel = dasheng.dasheng_base()
        self.classifier = torch.nn.Sequential(
            torch.nn.LayerNorm(self.dashengmodel.embed_dim),
            torch.nn.Linear(self.dashengmodel.embed_dim, 527),
        )

    def load_state_dict(self, state_dict: Mapping[str, Any], strict: bool = True, assign: bool = False):
        # The checkpoint stores encoder and classifier weights in one flat
        # dict; classifier weights carry an 'outputlayer.' prefix.
        self.dashengmodel.load_state_dict(state_dict, strict=False)
        for_classifier_dict = {}
        for k, v in state_dict.items():
            if 'outputlayer' in k:
                for_classifier_dict[k.replace('outputlayer.', '')] = v
        self.classifier.load_state_dict(for_classifier_dict)
        return self

    def forward(self, x):
        # Mean-pool frame embeddings over time, then predict per-class
        # probabilities with a sigmoid (AudioSet is multi-label).
        x = self.dashengmodel(x).mean(1)
        return self.classifier(x).sigmoid()


mdl = DashengAudiosetClassifier()
check = torch.hub.load_state_dict_from_url(
    'https://zenodo.org/records/13315686/files/dasheng_audioset_mAP497.pt?download=1',
    map_location='cpu')
mdl.load_state_dict(check)

# One second of 16 kHz audio -> (1, 527) class probabilities
prediction = mdl(torch.randn(1, 16000))
```
Citation
```bibtex
@inproceedings{dinkel2024dasheng,
  title={Scaling up masked audio encoder learning for general audio classification},
  author={Dinkel, Heinrich and Yan, Zhiyong and Wang, Yongqing and Zhang, Junbo and Wang, Yujun and Wang, Bin},
  booktitle={Interspeech 2024},
  year={2024}
}
```
Dasheng (大声)
Official PyTorch code for Deep Audio-Signal Holistic Embeddings
Scaling up masked audio encoder learning for general audio classification
TL;DR
This repo provides checkpoints for the Interspeech 2024 paper Scaling up masked audio encoder learning for general audio classification. The goal of this work is to investigate the scalability of masked autoencoders for audio. Prior work did not scale beyond 10,000 hours of audio, while Dasheng used 272,000 hours of training data.
Huggingface 🤗
Please see the Dasheng model page on Hugging Face for usage instructions.
Models
Dasheng models have been trained on 272k hours of general audio, mainly VGGSound, Audioset, MTG-Jamendo and ACAV100M.
Models with their evaluation results on the HEAR benchmark, averaged across different domains.
K-Nearest Neighbor results
Performance of features without parameterized training.
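As a sketch of this evaluation protocol (not the repository's own code), cosine-similarity nearest-neighbour classification over frozen clip embeddings can look as follows; random vectors stand in for real encoder outputs, and the shapes and class count are illustrative:

```python
import torch

torch.manual_seed(0)

# Random stand-ins for frozen-encoder clip embeddings:
# 100 labelled training clips, 20 test clips, 768-dim features, 10 classes.
train_emb = torch.randn(100, 768)
train_labels = torch.randint(0, 10, (100,))
test_emb = torch.randn(20, 768)

# Cosine-similarity 1-NN: L2-normalize, then each test clip inherits the
# label of its most similar training clip. No parameters are trained.
train_n = torch.nn.functional.normalize(train_emb, dim=1)
test_n = torch.nn.functional.normalize(test_emb, dim=1)
nearest = (test_n @ train_n.T).argmax(dim=1)
preds = train_labels[nearest]

print(preds.shape)  # torch.Size([20])
```

Accuracy is then just the fraction of `preds` matching the held-out labels; the benchmark averages such scores across domains.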
1. Installation (Recommended for inference)
Install the package.
1.2 Installation for Training
2. Usage
Forward some audio data (note: input must be sampled at 16 kHz).
3. Training
Install dependencies:
3.1 Prepare data
We rely on the excellent webdataset library for I/O. Thus one simply needs to pack their data into a set of .tar files. A simple example of such a file would be:
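For illustration, a shard in this layout can be assembled with the standard library alone. The file names and payloads below are placeholders; real shards would contain actual WAV bytes, and webdataset groups files sharing a basename into one sample:

```python
import io
import tarfile

# Placeholder payloads; in practice these are real WAV files on disk.
samples = {
    "0000000.wav": b"fake-wav-bytes-0",
    "0000001.wav": b"fake-wav-bytes-1",
}

# Write one shard; webdataset treats each tar member as (key, extension).
with tarfile.open("shard-000000.tar", "w") as tar:
    for name, payload in samples.items():
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

with tarfile.open("shard-000000.tar") as tar:
    names = sorted(tar.getnames())
print(names)  # ['0000000.wav', '0000001.wav']
```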
We also provide a simple script, wavlist_to_tar, that automates this process; it is installed with the package. Creating your_data.tsv is simple:
3.2 Training from source
To train, one should first adjust the config in dasheng/train/config/*.yaml accordingly by adding their training data. Multi-GPU support is realized using Accelerate.
FAQ
Is there an Audioset-finetuned Dasheng?
Yes, the finetuned base model reaches 49.7 mAP on AudioSet; the checkpoint and a usage snippet are shown above.