HuggingFaceFW/fineweb-2

FineWeb-2 is a multilingual dataset of cleaned and deduplicated web data from CommonCrawl. This is the second iteration of the popular 🍷 FineWeb dataset, bringing high-quality pretraining data to over 1000 🗣️ languages. The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license, and extensively validated through hundreds of ablation experiments. In particular, on the set of 9 diverse languages we used to guide our processing decisions, 🥂 FineWeb2 outperforms other popular multilingual pretraining datasets (such as CC-100, mC4, CulturaX and HPLT) while being substantially larger. In some cases it even performs better than datasets specifically curated for a single one of these languages on FineTasks, our diverse set of carefully selected evaluation tasks.

Detailed Introduction

FineWeb-2 is a large-scale dataset designed to provide high-quality web data for training large language models.


More
Datasets

Psych-101: Human Psychological Experiment Transcripts Dataset

Psych-101 is a dataset of natural language transcripts from human psychological experiments. It comprises trial-by-trial data from 160 experiments in which 60,092 participants made 10,681,650 choices. It provides valuable insights into human decision-making processes and is available under the Apache License 2.0.

Emotional First Aid Raw Dataset: Psychological Counseling QA Raw Corpus

The Emotional First Aid Raw Dataset is a collection of raw, unannotated psychological counseling Q&A data, designed to support research in AI applications for mental health. It contains over 172,000 topics with 2,381,273 messages, totaling 44,514,786 characters, providing a rich source of data for natural language processing and AI development.

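MentalManip: Psychological Manipulation Detection Dataset

The MentalManip dataset, introduced by Wang et al. (2024b), is a dialogue dataset built for detecting and classifying mental manipulation. It contains 4,000 multi-turn fictional dialogues sourced from online movie scripts and annotated at multiple levels, including the presence of manipulation, the manipulation techniques used, and the vulnerabilities targeted. The dataset was created with high-quality annotation to ensure consistent and accurate data, in support of research on mental manipulation detection.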

Website URL

https://huggingface.co/datasets/HuggingFaceFW/fineweb-2
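
Since the dataset spans over 1000 languages, it is usually read one language subset at a time. The following is a minimal sketch, assuming the Hugging Face `datasets` library is installed and that subsets are named by language code plus script (for example `fra_Latn`). The helper names `fineweb2_config` and `preview` are hypothetical and introduced here only for illustration.

```python
from itertools import islice


def fineweb2_config(language: str, script: str) -> str:
    """Build a subset name such as 'fra_Latn'.

    Assumption: subsets are named '<language code>_<script code>'.
    """
    return f"{language.lower()}_{script.capitalize()}"


def preview(config: str, n: int = 3) -> list:
    """Return the first n rows of one language subset using streaming mode."""
    # Imported lazily so fineweb2_config works without `datasets` installed.
    from datasets import load_dataset

    rows = load_dataset(
        "HuggingFaceFW/fineweb-2",
        name=config,
        split="train",
        streaming=True,  # iterate over the data without a full download
    )
    return list(islice(rows, n))


# Example call (needs network access and the `datasets` package):
# for row in preview(fineweb2_config("fra", "Latn")):
#     print(row)
```

Streaming mode is used here because individual language subsets can be very large; it lets a few rows be inspected before committing to a full download.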

Categories

Dataset

Keywords

HuggingFace, fineweb-2, Dataset