FineWeb is a dataset of over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. It is optimized for LLM performance and processed using the datatrove library. The dataset aims to provide high-quality data for training large language models and outperforms other commonly used web datasets.
FineWeb is a large-scale dataset designed to provide high-quality web data for training large language models. It includes over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. The dataset is processed using the datatrove library and is optimized for LLM performance. It outperforms other commonly used web datasets in benchmark tasks.
This study surveys the attitudes and behaviors of US higher education faculty members regarding online resources, the library, and related topics. It covers a wide range of issues, including faculty dependence on electronic scholarly resources, the transition from print to electronic journals, publishing preferences, e-books, and the preservation of scholarly journals.
The ISSP is a cross-national collaboration program conducting annual surveys on diverse topics relevant to social sciences. Established in 1984, it includes members from various cultures around the globe. Over one million respondents have participated in ISSP surveys, and all collected data and documentation are available free of charge.
Dataset Card for Psychology Therapy Dataset: This dataset card aims to provide information about a dataset focused on psychology therapy conversations. Language(s) (NLP): Turkish (tr).