FineWeb-2 is a dataset of over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. This is the second iteration of the popular 🍷 FineWeb dataset, bringing high quality pretraining data to over 1000 🗣️ languages.The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license and extensively validated through hundreds of ablation experiments.In particular, on the set of 9 diverse languages we used to guide our processing decisions, 🥂 FineWeb2 outperforms other popular pretraining datasets covering multiple languages (such as CC-100, mC4, CulturaX or HPLT, while being substantially larger) and, in some cases, even performs better than some datasets specifically curated for a single one of these languages, in our diverse set of carefully selected evaluation tasks: FineTasks.
FineWeb-2 is a large-scale dataset designed to provide high-quality web data for training large language models. This is the second iteration of the popular 🍷 FineWeb dataset, bringing high quality pretraining data to over 1000 🗣️ languages.The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license and extensively validated through hundreds of ablation experiments.In particular, on the set of 9 diverse languages we used to guide our processing decisions, 🥂 FineWeb2 outperforms other popular pretraining datasets covering multiple languages (such as CC-100, mC4, CulturaX or HPLT, while being substantially larger) and, in some cases, even performs better than some datasets specifically curated for a single one of these languages, in our diverse set of carefully selected evaluation tasks: FineTasks.
The SimpleToM dataset is designed to evaluate models' ability to reason about beliefs and actions in various scenarios. It includes a variety of situations with multiple choice questions and answers, covering topics such as food items, personal belongings, and service industries.
SoulChat2.0 is a framework for constructing the digital twin of psychological counselors, designed to support the development of AI applications in mental health. It includes a data generation module and a modeling module, enabling the creation of personalized counseling models based on limited real-world counseling cases.
MentalManip数据集是由Wang等人(2024b)引入的,专门用于检测和分类心理操纵的对话数据集。该数据集包含4000个多轮虚构对话,来源于在线电影剧本,并进行了多层次的标注,包括操纵的存在、操纵技巧和目标脆弱性。数据集的创建旨在通过高质量的标注确保数据的一致性和准确性,从而支持心理操纵检测的研究。