FineWeb-2 is a large-scale dataset of cleaned and deduplicated web data from CommonCrawl, designed to provide high-quality data for training large language models. It is the second iteration of the popular 🍷 FineWeb dataset, bringing high-quality pretraining data to over 1000 🗣️ languages. The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license, and extensively validated through hundreds of ablation experiments. In particular, on the set of 9 diverse languages we used to guide our processing decisions, 🥂 FineWeb2 outperforms other popular multilingual pretraining datasets (such as CC-100, mC4, CulturaX or HPLT) while being substantially larger. In some cases, it even performs better than datasets specifically curated for a single one of these languages. These comparisons use FineTasks, our diverse set of carefully selected evaluation tasks.
SoulChat2.0 is a framework for constructing digital twins of psychological counselors, designed to support the development of AI applications in mental health. It includes a data generation module and a modeling module, which together enable the creation of personalized counseling models from a limited number of real-world counseling cases.
Collaborative assessment as an intervention in the treatment of mental illness: a systematic review
The American National Mental Health Services Survey (N-MHSS) is an annual survey conducted by the Substance Abuse and Mental Health Services Administration (SAMHSA) to collect data on mental health treatment facilities across the United States. The survey provides detailed information on the services and characteristics of these facilities, helping to inform policy and improve mental health care.