FineWeb is a dataset of over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. It is optimized for LLM performance and processed using the datatrove library. The dataset aims to provide high-quality data for training large language models and outperforms other commonly used web datasets.We’re on a journey to advance and democratize artificial intelligence through open source and open science.
FineWeb is a large-scale dataset designed to provide high-quality web data for training large language models. It includes over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. The dataset is processed using the datatrove library and is optimized for LLM performance. It outperforms other commonly used web datasets in benchmark tasks.
The ISSP is a cross-national collaboration program conducting annual surveys on diverse topics relevant to social sciences. Established in 1984, it includes members from various cultures around the globe. Over one million respondents have participated in ISSP surveys, and all collected data and documentation are available free of charge.
This dataset contains survey responses from individuals in the tech industry about their mental health, including questions about treatment, workplace resources, and attitudes towards discussing mental health in the workplace. By analyzing this dataset, we can better understand how prevalent mental health issues are among those who work in the tech sector—and what kinds of resources they rely upon to find help—so that more can be done to create a healthier working environment for all.
The Substance Abuse and Mental Health Data Archive (SAMHDA) provides a comprehensive collection of data sets related to mental health and substance use. It includes ongoing studies, population surveys, treatment facility surveys, and client-level data, offering valuable insights for researchers and policymakers.