OpenThaiGPT
Free Working Datasets for OpenThaiGPT
Pretraining
Finetuning
Unhealthy Comments Corpus
Pretraining
oscar · Datasets at Hugging Face
OSCAR, or Open Super-large Crawled ALMAnaCH coRpus, is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. Data is distributed by language in both original and deduplicated form.
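
To make the difference between the original and deduplicated forms concrete, here is a minimal sketch of exact line-level deduplication. It is only an illustration under the assumption that duplicates are identical lines after trimming whitespace. It is not OSCAR's actual pipeline, and `dedup_lines` is a hypothetical helper name.

```python
import hashlib


def dedup_lines(lines):
    """Keep the first occurrence of each line, comparing lines after stripping whitespace."""
    seen = set()
    kept = []
    for line in lines:
        # Hash the normalized line so the set stores fixed-size keys, not full text.
        key = hashlib.sha1(line.strip().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept


if __name__ == "__main__":
    sample = ["สวัสดี", "hello", "สวัสดี ", "world", "hello"]
    print(dedup_lines(sample))
```

Order is preserved, so the deduplicated text still reads in its original sequence. Near-duplicate detection would need a different technique.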
wikipedia · Datasets at Hugging Face
Wikipedia dataset containing cleaned articles of all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article with cleaning to strip markdown and unwanted sections (references, etc.).
Finetuning
iapp_wiki_qa_squad · Datasets at Hugging Face
iapp_wiki_qa_squad is an extractive question answering dataset built from Thai Wikipedia articles. It is adapted from the original iapp-wiki-qa-dataset by iApp Technology to SQuAD format, resulting in 5,761/742/739 questions from 1,529/191/192 articles (train/validation/test).
thaiqa_squad · Datasets at Hugging Face
thaiqa_squad is an open-domain, extractive question answering dataset (4,000 questions in train and 74 questions in dev) in SQuAD format, originally created by NECTEC from Wikipedia articles and adapted to SQuAD format by PyThaiNLP.
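
Both question answering datasets above are described as extractive and SQuAD-formatted, which means each answer is a span copied from the context and located by a character offset. The sketch below shows that idea on a made-up record. The field names follow the generic SQuAD layout and are an assumption. The actual column names in iapp_wiki_qa_squad and thaiqa_squad may differ, so check each dataset card before relying on them.

```python
# Hypothetical SQuAD-style record; the text is invented for illustration
# and is not taken from iapp_wiki_qa_squad or thaiqa_squad.
record = {
    "id": "example-0001",
    "title": "Bangkok",
    "context": "Bangkok is the capital of Thailand. It lies on the Chao Phraya River.",
    "question": "What is the capital of Thailand?",
    "answers": {"text": ["Bangkok"], "answer_start": [0]},
}


def answer_spans(rec):
    """Return (start, end, text) for every answer in a SQuAD-style record."""
    answers = rec["answers"]
    return [
        (start, start + len(text), text)
        for text, start in zip(answers["text"], answers["answer_start"])
    ]


def is_extractive(rec):
    """True if every answer is an exact substring of the context at its stated offset."""
    return all(rec["context"][s:e] == t for s, e, t in answer_spans(rec))


if __name__ == "__main__":
    print(answer_spans(record))   # [(0, 7, 'Bangkok')]
    print(is_extractive(record))  # True
```

A check like `is_extractive` is a cheap sanity test to run after any preprocessing that alters the context, since changing the text can shift the character offsets.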
thaisum · Datasets at Hugging Face
A large-scale corpus for Thai text summarization obtained from several online news websites, namely Thairath, ThaiPBS, Prachathai, and The Standard. The dataset consists of over 350,000 article-summary pairs written by journalists.
Unhealthy Comments Corpus
nakcnx/Thai-UCC · Datasets at Hugging Face
Thai UCC Corpus is translated from UCC (Unhealthy Comments Corpus) using PyThaiNLP Translator and Google Translator.
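
The datasets listed on this page are all hosted on the Hugging Face Hub, so they can be collected in one place and loaded with the `datasets` library. The following is a minimal sketch, not an official OpenThaiGPT loader. The identifiers are copied from the card titles above. The grouping by stage and the helper names `hub_url` and `load` are assumptions made for illustration. Some datasets (OSCAR and Wikipedia, for example) need a configuration name that selects the language, and availability can vary with the installed library version, so consult each dataset card before loading.

```python
# Hub identifiers as they appear in the card titles on this page.
# The grouping by stage is an assumption based on the page's section names.
DATASETS = {
    "pretraining": ["oscar", "wikipedia"],
    "finetuning": ["iapp_wiki_qa_squad", "thaiqa_squad", "thaisum"],
    "unhealthy_comments": ["nakcnx/Thai-UCC"],
}


def hub_url(dataset_id):
    """Build the dataset page URL on the Hugging Face Hub."""
    return f"https://huggingface.co/datasets/{dataset_id}"


def load(dataset_id, config=None, **kwargs):
    """Load a dataset by identifier (requires `pip install datasets` and network access)."""
    # Imported lazily so the listing helpers above work without the library installed.
    from datasets import load_dataset

    return load_dataset(dataset_id, config, **kwargs)


if __name__ == "__main__":
    for stage, ids in DATASETS.items():
        for dataset_id in ids:
            print(f"{stage}: {dataset_id} -> {hub_url(dataset_id)}")
```

Calling `load("thaisum", split="train")` would then download that split on first use. The call is left out of the script above because it needs network access.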