The Substance Abuse and Mental Health Data Archive (SAMHDA) is a collection of data sets on mental health and substance use, intended for researchers, policymakers, and other professionals working in these areas. It covers ongoing studies, population surveys, treatment facility surveys, and client-level data, including the National Mental Health Services Survey (N-MHSS), Mental Health Client-Level Data (MH-CLD), and the National Survey on Drug Use and Health (NSDUH). Together these data sets span the range from treatment facilities to individual-level records.
FineWeb is a dataset of over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. It is optimized for LLM performance and processed using the datatrove library. The dataset aims to provide high-quality data for training large language models and outperforms other commonly used web datasets.
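Because the full corpus is far too large to download casually, a reader may want to peek at a few records first. The following is a minimal sketch, not an official recipe. It assumes the dataset is published on the Hugging Face Hub as `HuggingFaceFW/fineweb` with a `sample-10BT` configuration and a `text` field; none of these are stated in the description above. The helper names `take_texts` and `stream_fineweb` are hypothetical and introduced only for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator


def take_texts(rows: Iterable[dict], limit: int, field: str = "text") -> Iterator[str]:
    """Return an iterator over `field` for the first `limit` rows where it is non-empty."""
    # Skip rows where the field is missing or empty, then stop after `limit` hits.
    texts = (row[field] for row in rows if row.get(field))
    return islice(texts, limit)


def stream_fineweb(config: str = "sample-10BT"):
    """Open FineWeb in streaming mode (needs `pip install datasets` and network access).

    Assumption: the repository id and config name below match the Hub listing.
    """
    # Imported lazily so that take_texts() stays usable without the library installed.
    from datasets import load_dataset

    return load_dataset("HuggingFaceFW/fineweb", name=config, split="train", streaming=True)
```

With network access, `for text in take_texts(stream_fineweb(), 3): print(text[:80])` would print the start of three documents without downloading the whole dataset, since streaming mode fetches records lazily.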
The CaiTI_dataset repository contains datasets for Motivational Interviewing and Cognitive Behavioral Therapy, curated by therapists to train CaiTI.
Psy-Insight is a bilingual, interpretable multi-turn dataset for mental health counseling dialogues. It includes 6,208 rounds of multi-turn counseling dialogues in English and 5,776 rounds in Chinese, annotated with step-by-step reasoning labels and multi-task labels. This dataset is designed to support the application of large language models in mental health and is suitable for tasks such as emotion classification and psychological treatment interpretation.