The Data Science for COVID-19 (DS4C) dataset is a structured collection of COVID-19 data from South Korea, compiled from reports by the Korea Centers for Disease Control & Prevention (KCDC) and local governments. It covers infections, patient routes, and related analyses, and has been used in research and visualization projects, including competitions and academic studies.
FineWeb is a dataset of over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. It is optimized for LLM performance and processed using the datatrove library. The dataset aims to provide high-quality data for training large language models and outperforms other commonly used web datasets.
The DAIC-WOZ dataset contains clinical interviews designed to support the diagnosis of psychological distress conditions such as anxiety, depression, and post-traumatic stress disorder. This repository provides code for extracting question-level features from the DAIC-WOZ dataset, which can be used for multimodal analysis of depression levels.
The Mental Health Corpus contains labeled comments on mental health issues, used for sentiment and toxic language analysis.