OpenThaiGPT

Free Working Datasets

For OpenThaiGPT

Last updated 2 years ago

Pretraining

  • oscar (Datasets at Hugging Face): OSCAR, or Open Super-large Crawled ALMAnaCH coRpus, is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. Data is distributed by language in both original and deduplicated form.
  • wikipedia (Datasets at Hugging Face): A Wikipedia dataset containing cleaned articles in all languages. The datasets are built from the Wikipedia dump (https://dumps.wikimedia.org/) with one split per language. Each example contains the content of one full Wikipedia article, cleaned to strip markdown and unwanted sections (references, etc.).

Finetuning

  • iapp_wiki_qa_squad (Datasets at Hugging Face): iapp_wiki_qa_squad is an extractive question answering dataset built from Thai Wikipedia articles. It is adapted from the original iapp-wiki-qa-dataset to SQuAD format, resulting in 5761/742/739 questions from 1529/191/192 articles by iApp Technology.
  • thaiqa_squad (Datasets at Hugging Face): thaiqa_squad is an open-domain, extractive question answering dataset (4,000 questions in train and 74 questions in dev) in SQuAD format, originally created by NECTEC from Wikipedia articles and adapted to SQuAD format by PyThaiNLP.
  • thaisum (Datasets at Hugging Face): A large-scale corpus for Thai text summarization obtained from several online news websites, namely Thairath, ThaiPBS, Prachathai, and The Standard. This dataset consists of over 350,000 article and summary pairs written by journalists.

Unhealthy Comments Corpus

  • nakcnx/Thai-UCC (Datasets at Hugging Face): The Thai UCC Corpus is translated from UCC (Unhealthy Comments Corpus) by PyThaiNLP Translator and Google Translator.
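
The question answering datasets listed on this page are described as extractive and in SQuAD format, which means each answer is a literal span of its context. The sketch below shows one possible way to turn such a record into a prompt/response pair for finetuning. It assumes the common SQuAD-style layout (`context`, `question`, and `answers` with `text` and `answer_start` lists); the function name `squad_to_instruction`, the prompt template, and the sample record are hypothetical and are not taken from OpenThaiGPT or from any of the datasets.

```python
from typing import Dict


def squad_to_instruction(record: dict) -> Dict[str, str]:
    """Turn one SQuAD-style record into a prompt/response pair.

    Assumes the record has 'context', 'question', and 'answers' with
    parallel 'text' and 'answer_start' lists (first answer is used).
    """
    context = record["context"]
    answer = record["answers"]["text"][0]
    start = record["answers"]["answer_start"][0]

    # Extractive QA: the answer must be a literal span of the context,
    # located at the given character offset.
    if context[start:start + len(answer)] != answer:
        raise ValueError("answer span does not match context")

    # Hypothetical prompt template, chosen only for illustration.
    prompt = (
        f"Context: {context}\n"
        f"Question: {record['question']}\n"
        "Answer:"
    )
    return {"prompt": prompt, "response": answer}


# Made-up record for illustration; not drawn from any dataset above.
sample = {
    "context": "Bangkok is the capital of Thailand.",
    "question": "Which country is Bangkok the capital of?",
    "answers": {"text": ["Thailand"], "answer_start": [26]},
}

pair = squad_to_instruction(sample)
print(pair["prompt"])
print(pair["response"])
```

Checking the span before building the pair is a cheap way to catch records whose offsets were shifted during conversion or translation. Field names can differ between datasets, so the dataset card should be checked before reusing this on real data.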