Language Resources and Tools

Datasets and scripts for basic natural language and speech processing.

This is not an official Google product.

Natural Languages

Directory	Language Available
af	Afrikaans
bn	Bengali / Bangla
hi_ur	Hindi & Urdu
is	Icelandic
jv	Javanese
km	Khmer
lo	Lao
my	Burmese / Myanmar
ne	Nepali
si	Sinhala
su	Sundanese
xh	Xhosa
zu	Zulu

Tools

This repository includes a few tools for working with the natural language datasets. The tools are written in C++ and Python and are built with Bazel. To compile and use them, install a recent version of Bazel (release 0.4.5 or later is required).
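As a minimal sketch of checking the Bazel version requirement stated above, the Python snippet below parses the "Build label" line printed by `bazel version` and compares it against 0.4.5. The helper names (`parse_build_label`, `bazel_is_recent_enough`, `installed_bazel_output`) are hypothetical and not part of this repository, and the sketch assumes the installed Bazel reports a numeric build label.

```python
import re
import shutil
import subprocess

MIN_BAZEL = (0, 4, 5)  # minimum release named in the Tools section


def parse_build_label(text):
    """Return the version on the 'Build label:' line as a tuple of ints, or None."""
    match = re.search(r"Build label:\s*(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        return None
    # A missing patch component (e.g. "1.2") is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def bazel_is_recent_enough(version_output, minimum=MIN_BAZEL):
    """True if the build label in `version_output` is at least `minimum`."""
    found = parse_build_label(version_output)
    return found is not None and found >= minimum


def installed_bazel_output():
    """Run `bazel version` and return its stdout, or '' if bazel is not on PATH."""
    bazel = shutil.which("bazel")
    if bazel is None:
        return ""
    result = subprocess.run([bazel, "version"], capture_output=True, text=True)
    return result.stdout
```

Calling `bazel_is_recent_enough(installed_bazel_output())` returns False both when Bazel is missing and when it is older than 0.4.5. Versions are compared as integer tuples rather than strings so that, for example, 0.10.0 is correctly treated as newer than 0.4.5.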

Open-Sourced Audio Data

Resource	Link
Sinhala TTS recordings (~3K)	https://www.openslr.org/30/
TTS recordings for four South African languages (af, st, tn, xh)	https://www.openslr.org/32/
Large Javanese ASR training data set (~185K)	https://www.openslr.org/35/
Large Sundanese ASR training data set (~220K)	https://www.openslr.org/36/
High quality TTS data for Bengali languages	https://www.openslr.org/37/
High quality TTS data for Javanese	https://www.openslr.org/41/
High quality TTS data for Khmer	https://www.openslr.org/42/
High quality TTS data for Nepali	https://www.openslr.org/43/
High quality TTS data for Sundanese	https://www.openslr.org/44/
Large Sinhala ASR training data set	https://www.openslr.org/52/
Large Bengali ASR training data set	https://www.openslr.org/53/
Large Nepali ASR training data set	https://www.openslr.org/54/
Crowdsourced high-quality Argentinian Spanish speech data set	https://www.openslr.org/61/
Crowdsourced high-quality Malayalam multi-speaker speech data set	https://www.openslr.org/63/
Crowdsourced high-quality Marathi multi-speaker speech data set	https://www.openslr.org/64/
Crowdsourced high-quality Tamil multi-speaker speech data set	https://www.openslr.org/65/
Crowdsourced high-quality Telugu multi-speaker speech data set	https://www.openslr.org/66/
Data set which contains recordings of Catalan	https://www.openslr.org/69
Crowdsourced high-quality Nigerian English speech data set	https://www.openslr.org/70
Crowdsourced high-quality Chilean Spanish speech data set	https://www.openslr.org/71
Crowdsourced high-quality Colombian Spanish speech data set	https://www.openslr.org/72
Crowdsourced high-quality Peruvian Spanish speech data set	https://www.openslr.org/73
Crowdsourced high-quality Puerto Rico Spanish speech data set	https://www.openslr.org/74
Crowdsourced high-quality Venezuelan Spanish speech data set	https://www.openslr.org/75
Crowdsourced high-quality Basque speech data set	https://www.openslr.org/76
Crowdsourced high-quality Galician speech data set	https://www.openslr.org/77
Crowdsourced high-quality Gujarati multi-speaker speech data set	https://www.openslr.org/78
Crowdsourced high-quality Kannada multi-speaker speech data set	https://www.openslr.org/79
Crowdsourced high-quality Burmese speech data set	https://www.openslr.org/80
Data set which contains male and female recordings of English from various dialects of the UK and Ireland	https://www.openslr.org/83
Crowdsourced high-quality Yoruba speech data set	https://www.openslr.org/86

Publications

License

Unless otherwise noted, all original files are licensed under the Apache License, Version 2.0.

Where specifically noted, some datasets are licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

The directory third_party/ contains third-party works, which we are including under the respective licenses of the upstream projects. See third_party/README.md for further details.