# Free Working Datasets

## Pretraining

{% embed url="<https://huggingface.co/datasets/oscar>" %}
OSCAR, or **O**pen **S**uper-large **C**rawled [**A**LMAnaCH](https://team.inria.fr/almanach/) co**R**pus, is a huge multilingual corpus obtained by language classification and filtering of the [Common Crawl](https://commoncrawl.org/) corpus using the [goclassy](https://github.com/pjox/goclassy) architecture. Data is distributed by language in both original and deduplicated form.
{% endembed %}
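
Since OSCAR is split by language and by original/deduplicated form, picking a subset comes down to choosing a config name. The sketch below assumes the `unshuffled_<variant>_<lang>` naming used by the `oscar` dataset on the Hugging Face Hub and the `datasets` package; the helper names `oscar_config_name` and `stream_oscar` are hypothetical, and recent `datasets` releases may handle script-based datasets differently.

```python
def oscar_config_name(lang: str, deduplicated: bool = True) -> str:
    """Build a config name such as 'unshuffled_deduplicated_th' (assumed naming)."""
    variant = "deduplicated" if deduplicated else "original"
    return f"unshuffled_{variant}_{lang}"


def stream_oscar(lang: str = "th", deduplicated: bool = True):
    """Return a streaming iterator over the OSCAR train split (needs network access)."""
    # Imported lazily so the helper above works without the `datasets` package.
    from datasets import load_dataset

    return load_dataset(
        "oscar",
        oscar_config_name(lang, deduplicated),
        split="train",
        streaming=True,  # avoid downloading the whole corpus up front
    )
```

Streaming is chosen here because the per-language files can be very large; drop `streaming=True` if a local copy is preferred.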

{% embed url="<https://huggingface.co/datasets/wikipedia>" %}
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (<https://dumps.wikimedia.org/>) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
{% endembed %}

## Finetuning

{% embed url="<https://huggingface.co/datasets/iapp_wiki_qa_squad>" %}
`iapp_wiki_qa_squad` is an extractive question answering dataset by iApp Technology, built from Thai Wikipedia articles. It is adapted from [the original iapp-wiki-qa-dataset](https://github.com/iapp-technology/iapp-wiki-qa-dataset) to [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) format, resulting in 5,761/742/739 questions from 1,529/191/192 articles in the train/validation/test splits.
{% endembed %}

{% embed url="<https://huggingface.co/datasets/thaiqa_squad>" %}
`thaiqa_squad` is an open-domain, extractive question answering dataset (4,000 questions in `train` and 74 questions in `dev`), originally created by [NECTEC](https://www.nectec.or.th/en/) from Wikipedia articles and adapted to [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) format by [PyThaiNLP](https://github.com/PyThaiNLP/).
{% endembed %}
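
Both question answering datasets above are described as SQuAD-format, meaning each answer is a span of the context located by a character offset. The following is a minimal sketch of that idea. It assumes the field names of the standard `squad` layout (`context`, `question`, `answers.text`, `answers.answer_start`); the actual Thai datasets may name these fields differently, and the example record is made up for illustration.

```python
def answer_spans(example: dict) -> list:
    """Slice each answer out of the context using its character start offset."""
    context = example["context"]
    answers = example["answers"]
    return [
        context[start : start + len(text)]
        for text, start in zip(answers["text"], answers["answer_start"])
    ]


def is_consistent(example: dict) -> bool:
    """Check that every recorded offset really points at its answer text."""
    return answer_spans(example) == list(example["answers"]["text"])


# Hypothetical record in the assumed SQuAD-style layout (not taken from any dataset).
example = {
    "id": "hypothetical-0",
    "question": "What is the capital of Thailand?",
    "context": "Bangkok is the capital of Thailand.",
    "answers": {"text": ["Bangkok"], "answer_start": [0]},
}

print(is_consistent(example))
```

A check like `is_consistent` can be useful after any preprocessing that changes the context string, since offsets silently go stale when text is normalized or truncated.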

{% embed url="<https://huggingface.co/datasets/thaisum>" %}
A large-scale corpus for Thai text summarization obtained from several online news websites, namely Thairath, ThaiPBS, Prachathai, and The Standard. The dataset consists of over 350,000 article-summary pairs written by journalists.
{% endembed %}

## Unhealthy Comments Corpus

{% embed url="<https://huggingface.co/datasets/nakcnx/Thai-UCC>" %}
The Thai UCC corpus was translated from [UCC (Unhealthy Comments Corpus)](https://github.com/conversationai/unhealthy-conversations) using PyThaiNLP Translator and Google Translator.
{% endembed %}
