Datasets and scripts for basic natural language and speech processing.
This is not an official Google product.
Natural Languages
Directory    Language
af           Afrikaans
bn           Bengali / Bangla
hi_ur        Hindi & Urdu
is           Icelandic
jv           Javanese
km           Khmer
lo           Lao
my           Burmese / Myanmar
ne           Nepali
si           Sinhala
su           Sundanese
xh           Xhosa
zu           Zulu
Tools
This repository includes a few tools for working with the natural language
datasets. The tools are written in C++ and Python and are built with
Bazel. To compile and use them, install a recent version of Bazel
(release 0.4.5 or later is required).
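As a rough illustration of the version requirement, the sketch below checks whether an installed Bazel reports release 0.4.5 or later. It is not part of this repository: the helper names parse_version and meets_minimum are hypothetical, and the script assumes that the output of `bazel version` contains the release number in X.Y.Z form (for example on a "Build label:" line).

```python
import re
import shutil
import subprocess

MINIMUM_BAZEL = (0, 4, 5)  # minimum release stated in this README


def parse_version(text):
    """Return the first X.Y.Z version found in text as a tuple of ints, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def meets_minimum(text, minimum=MINIMUM_BAZEL):
    """True if text contains a version that is at least the given minimum."""
    version = parse_version(text)
    return version is not None and version >= minimum


if __name__ == "__main__":
    bazel = shutil.which("bazel")  # locate the bazel binary on PATH
    if bazel is None:
        print("bazel not found on PATH")
    else:
        # Assumption: the release number appears in the standard output.
        result = subprocess.run([bazel, "version"], capture_output=True, text=True)
        if meets_minimum(result.stdout):
            print("Bazel version is sufficient")
        else:
            print("Bazel release 0.4.5 or later is required")
```

Versions are compared as integer tuples rather than strings, so that, for instance, 0.10.0 is correctly treated as newer than 0.4.5.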
The directory third_party/ contains third-party works, which we
are including under the respective licenses of the upstream projects. See third_party/README.md for further details.
Language Resources and Tools
Opensourced Audio Data
Other reading resources
SLTU 2016 Tutorial - https://sites.google.com/site/sltututorial/overview
Publications
Burmese Speech Corpus, Finite-State Text Normalization and Pronunciation Grammars with an Application to Text-to-Speech
Crowdsourcing Latin American Spanish for Low-Resource Text-to-Speech
Open-source Multi-speaker Corpora of the English Accents in the British Isles
Open-source Multi-speaker Speech Corpora for Building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu Speech Synthesis Systems
Google Crowdsourced Speech Corpora and Related Open-Source Resources for Low-Resource Languages and Dialects: An Overview
Open-Source High Quality Speech Datasets for Basque, Catalan and Galician
Text Normalization for Bangla, Khmer, Nepali, Javanese, Sinhala, and Sundanese TTS Systems
A Step-by-Step Process for Building TTS Voices Using Open Source Data and Framework for Bangla, Javanese, Khmer, Nepali, Sinhala, and Sundanese
FonBund: A Library for Combining Cross-lingual Phonological Segment Data
Building Open Javanese and Sundanese Corpora for Multilingual Text-to-Speech
Rapid development of TTS corpora for four South African languages
Building Statistical Parametric Multi-speaker Synthesis for Bangladeshi Bangla
TTS for Low Resource Languages: A Bangla Synthesizer
License
Unless otherwise noted, all original files are licensed under an Apache License, Version 2.0.
Where specifically noted, some datasets are licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).