
# Free Working Datasets

## Pretraining

{% embed url="<https://huggingface.co/datasets/oscar>" %}
OSCAR or **O**pen **S**uper-large **C**rawled [**A**LMAnaCH](https://team.inria.fr/almanach/) co**R**pus is a huge multilingual corpus obtained by language classification and filtering of the [Common Crawl](https://commoncrawl.org/) corpus using the [goclassy](https://github.com/pjox/goclassy) architecture. Data is distributed by language in both original and deduplicated form.
{% endembed %}

{% embed url="<https://huggingface.co/datasets/wikipedia>" %}
Wikipedia dataset containing cleaned articles in all languages. The datasets are built from the Wikipedia dumps (<https://dumps.wikimedia.org/>) with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip markup and unwanted sections (references, etc.).
{% endembed %}

## Finetuning

{% embed url="<https://huggingface.co/datasets/iapp_wiki_qa_squad>" %}
`iapp_wiki_qa_squad` is an extractive question answering dataset built from Thai Wikipedia articles by iApp Technology. It is adapted from [the original iapp-wiki-qa-dataset](https://github.com/iapp-technology/iapp-wiki-qa-dataset) to [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) format, resulting in 5,761/742/739 questions from 1,529/191/192 articles across the train/validation/test splits.
{% endembed %}

{% embed url="<https://huggingface.co/datasets/thaiqa_squad>" %}
`thaiqa_squad` is an open-domain, extractive question answering dataset (4,000 questions in `train` and 74 questions in `dev`), originally created by [NECTEC](https://www.nectec.or.th/en/) from Wikipedia articles and adapted to [SQuAD](https://rajpurkar.github.io/SQuAD-explorer/) format by [PyThaiNLP](https://github.com/PyThaiNLP/).
{% endembed %}
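
Both question answering datasets above are described as SQuAD-format, in which each answer is a span of text inside the context, located by a character offset. The following is a minimal sketch of a sanity check for such records. It assumes the common Hugging Face SQuAD layout (`context`, `question`, and `answers` with parallel `text` and `answer_start` lists); the actual field names in `iapp_wiki_qa_squad` and `thaiqa_squad` may differ, so inspect a record before relying on it. The function `misaligned_answers` and the sample record are hypothetical and written for illustration only.

```python
from typing import Any, Dict, List


def misaligned_answers(record: Dict[str, Any]) -> List[str]:
    """Return answer texts whose answer_start offset does not match the context.

    Assumes the common SQuAD-style layout; field names are an assumption
    and may need adjusting for a specific dataset.
    """
    context = record["context"]
    answers = record["answers"]
    bad = []
    for text, start in zip(answers["text"], answers["answer_start"]):
        # answer_start is treated as a character offset into the context.
        if context[start:start + len(text)] != text:
            bad.append(text)
    return bad


# Hypothetical record for illustration; not taken from either dataset.
example = {
    "id": "demo-1",
    "context": "กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย",
    "question": "เมืองหลวงของประเทศไทยคืออะไร",
    "answers": {"text": ["กรุงเทพมหานคร"], "answer_start": [0]},
}

print(misaligned_answers(example))  # prints [] because the span lines up
```

A check like this can be useful before finetuning, since an extractive model is trained to predict the span boundaries and a misaligned offset silently corrupts the label.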

{% embed url="<https://huggingface.co/datasets/thaisum>" %}
A large-scale corpus for Thai text summarization, obtained from several online news websites: Thairath, ThaiPBS, Prachathai, and The Standard. The dataset consists of over 350,000 article-summary pairs written by journalists.
{% endembed %}

## Unhealthy Comments Corpus

{% embed url="<https://huggingface.co/datasets/nakcnx/Thai-UCC>" %}
Thai UCC Corpus is translated from [UCC (Unhealthy Comments Corpus)](https://github.com/conversationai/unhealthy-conversations) using PyThaiNLP Translator and Google Translator.
{% endembed %}

