t4d: Conversion Algorithm from ToMi to T4D Dataset

This project implements the conversion algorithm from the ToMi dataset to the T4D (Thinking is for Doing) dataset, as introduced in the paper https://arxiv.org/abs/2310.03051. It filters examples with Theory of Mind (ToM) questions and adapts the algorithm to account for second-order false beliefs.

Detailed introduction

The t4d project is a direct implementation of the conversion algorithm from the ToMi dataset to the T4D dataset. It is designed to filter and process examples that involve Theory of Mind questions, providing a valuable resource for researchers working on cognitive and social AI models. The project is built to convert a predefined dataset A (ToMi) to dataset B (T4D) and is licensed under the Apache License, Version 2.0.
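The filtering step described above can be illustrated with a minimal sketch. It assumes each ToMi example is a dictionary carrying a question-type label. The field names (`story`, `question`, `question_type`), the label values, and the function `filter_tom_examples` are hypothetical and chosen for illustration only; they are not taken from the repository, whose actual schema and logic may differ.

```python
from typing import Dict, List

# Hypothetical labels for questions that probe a character's belief.
# The labels used by the real ToMi data may differ.
TOM_QUESTION_TYPES = {"first_order_tom", "second_order_tom"}


def filter_tom_examples(examples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only examples whose question type is a Theory of Mind type."""
    return [ex for ex in examples if ex.get("question_type") in TOM_QUESTION_TYPES]


# Toy examples in the assumed format (not real ToMi records).
story = (
    "Sally put the ball in the basket. Sally left the room. "
    "Anne moved the ball to the box."
)
examples = [
    {"story": story,
     "question": "Where will Sally look for the ball?",
     "question_type": "first_order_tom"},
    {"story": story,
     "question": "Where is the ball really?",
     "question_type": "reality"},
    {"story": story,
     "question": "Where does Anne think Sally will look for the ball?",
     "question_type": "second_order_tom"},
    {"story": story,
     "question": "Where was the ball at the beginning?",
     "question_type": "memory"},
]

tom_examples = filter_tom_examples(examples)
for ex in tom_examples:
    print(ex["question_type"], "|", ex["question"])
```

In this sketch the reality and memory questions are dropped, while the first-order and second-order belief questions are kept for conversion.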


More datasets

Psy-Insight: Mental Health Counseling Dataset

Psy-Insight is a bilingual, interpretable multi-turn dataset for mental health counseling dialogues. It includes 6,208 rounds of multi-turn counseling dialogues in English and 5,776 rounds in Chinese, annotated with step-by-step reasoning labels and multi-task labels. This dataset is designed to support the application of large language models in mental health and is suitable for tasks such as emotion classification and psychological treatment interpretation.

SoulChat2.0: Psychological Counselor's Digital Twin Framework

SoulChat2.0 is a framework for constructing the digital twin of psychological counselors, designed to support the development of AI applications in mental health. It includes a data generation module and a modeling module, enabling the creation of personalized counseling models based on limited real-world counseling cases.

HuggingFaceFW/fineweb-2

FineWeb-2 is a dataset of over 15 trillion tokens of cleaned and deduplicated English web data from CommonCrawl. This is the second iteration of the popular 🍷 FineWeb dataset, bringing high quality pretraining data to over 1000 🗣️ languages. The 🥂 FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license and extensively validated through hundreds of ablation experiments. In particular, on the set of 9 diverse languages we used to guide our processing decisions, 🥂 FineWeb2 outperforms other popular pretraining datasets covering multiple languages (such as CC-100, mC4, CulturaX or HPLT, while being substantially larger) and, in some cases, even performs better than some datasets specifically curated for a single one of these languages, in our diverse set of carefully selected evaluation tasks: FineTasks.

Website URL

https://github.com/sachith-gunasekara/t4d

Categories

Dataset, AI, LLM

Keywords

t4d, ToMi, T4D, Theory of Mind, Conversion Algorithm, AI, Research