SoulChat2.0 is a framework for constructing digital twins of psychological counselors, designed to support the development of AI applications in mental health. It comprises a data generation module and a modeling module, which together enable personalized counseling models to be built from a limited number of real-world counseling cases.
The framework uses LLMs to generate synthetic data that captures the language style and therapeutic techniques of a specific counselor. This data is then used to fine-tune a model, so that the resulting AI system can provide personalized counseling support in that counselor's style.
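Psychology Wiki Dataset: The psychology_wiki dataset is built from English Wikipedia content in the field of psychology; systematic data collection and curation give it broad coverage and depth. Every article in the dataset has been rigorously filtered and annotated along several dimensions, including title, body text, relevance, popularity, and ranking, providing a rich text resource for psychology research.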
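To make the record layout described above concrete, here is a minimal sketch of how one article might be represented and queried. The class name, field names, field types, and the convention that a smaller rank value means a higher-ranked article are all assumptions for illustration; the dataset's actual column names and encodings may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class WikiArticle:
    """One hypothetical psychology_wiki record (field names are assumed)."""
    title: str
    text: str
    relevance: float   # assumed: higher means more relevant to psychology
    popularity: float  # assumed: higher means more popular
    rank: int          # assumed: 1 is the top-ranked article


def top_ranked(articles: Iterable[WikiArticle], k: int) -> List[WikiArticle]:
    """Return the k articles with the smallest rank value."""
    return sorted(articles, key=lambda a: a.rank)[:k]


def filter_relevant(articles: Iterable[WikiArticle], threshold: float) -> List[WikiArticle]:
    """Keep articles whose relevance is at least the threshold, in input order."""
    return [a for a in articles if a.relevance >= threshold]
```

A selection such as `top_ranked(filter_relevant(articles, 0.5), 100)` would then pick a high-relevance, high-rank subset, with the threshold and subset size chosen by the user.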
This paper discusses Helply, a synthesized ML training dataset focused on psychology and therapy, created by Alex Scott and published by NamelessAI. It is a collection of synthesized data designed to train LLMs to understand psychological and therapeutic contexts. The dataset aims to simulate real-world interactions between therapists and patients, so that ML models can learn from a wide range of scenarios and therapeutic techniques.
FineWeb-2 is the second iteration of the popular 🍷 FineWeb dataset (over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl), bringing high-quality pretraining data to over 1000 🗣️ languages. The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license, and extensively validated through hundreds of ablation experiments. In particular, on the set of 9 diverse languages we used to guide our processing decisions, 🥂 FineWeb2 outperforms other popular pretraining datasets covering multiple languages (such as CC-100, mC4, CulturaX, or HPLT, while being substantially larger) and, in some cases, even performs better than some datasets specifically curated for a single one of these languages, in our diverse set of carefully selected evaluation tasks: FineTasks.