OSCAR, or Open Super-large Crawled coRpus, is a huge multilingual corpus obtained by language classification and filtering of a crawled corpus. Data is distributed by language in both original and deduplicated form.
Wikipedia dataset containing cleaned articles in all languages. The datasets are built from the Wikipedia dump, with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip markdown and unwanted sections (references, etc.).
iapp_wiki_qa_squad is an extractive question answering dataset built from Thai Wikipedia articles by iApp Technology. It has been adapted to SQuAD format, resulting in 5761/742/739 questions from 1529/191/192 articles.
thaiqa_squad is an open-domain, extractive question answering dataset (4,000 questions in train and 74 questions in dev) in SQuAD format. It was originally created from Wikipedia articles and later adapted to SQuAD format.
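Both question answering datasets above are described as extractive and SQuAD-formatted, which means each answer is a span of the context identified by its text and a character offset. The following is a minimal sketch of that layout, assuming the flattened field names commonly used for SQuAD-style records (`context`, `question`, `answers` with `text` and `answer_start`). The record itself is hypothetical and does not come from either dataset.

```python
# Hypothetical SQuAD-style record; the content is invented for illustration
# and is NOT taken from iapp_wiki_qa_squad or thaiqa_squad.
record = {
    "id": "example-0001",
    "title": "Bangkok",
    "context": "Bangkok is the capital of Thailand.",
    "question": "What is the capital of Thailand?",
    "answers": {"text": ["Bangkok"], "answer_start": [0]},
}


def answer_spans_match(rec):
    """Return True if every answer text is found in the context at its stated offset."""
    ctx = rec["context"]
    texts = rec["answers"]["text"]
    starts = rec["answers"]["answer_start"]
    return all(ctx[s:s + len(t)] == t for t, s in zip(texts, starts))


if __name__ == "__main__":
    print(answer_spans_match(record))
```

A check like this can be useful after adapting a dataset to SQuAD format, since an offset that no longer points at the answer text usually indicates that the context was modified during conversion.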
A large-scale corpus for Thai text summarization obtained from several online news websites, namely Thairath, ThaiPBS, Prachathai, and The Standard. The dataset consists of over 350,000 article and summary pairs written by journalists.
Thai UCC Corpus is a corpus translated into Thai by PyThaiNLP Translator and Google Translator.