data-curation

fastdup is a free tool for rapidly generating insights from image and video datasets. It helps improve the quality of both images and labels while reducing data operation costs, and it scales to large datasets.

Python
1720
8 days ago

[ICLR 2025] Official implementation of the paper "Improving Data Efficiency via Curating LLM-Driven Rating Systems"

Python
97
5 months ago

Metamapper is a data discovery and documentation platform for improving how teams understand and interact with their data.

Python
79
10 days ago

A library for detecting problematic data segments in structured and unstructured data with a few lines of code.

Python
64
2 years ago

Learn2Clean: Optimizing the Sequence of Tasks for Data Preparation and Cleaning

Python
51
3 years ago

Lesson guide and textbook for the "Data as a Science" course.

Jupyter Notebook
47
4 years ago

A tool for downloading from public image boards (those that allow scraping), previewing images and tags, and editing images and tags. Additional tabs let you download other code repositories as well as state-of-the-art diffusion and auto-tagging/captioning models. Custom datasets can be added.

Python
38
16 days ago

🧼🔎 A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates and label errors (NeurIPS'24).

Python
36
5 months ago

A web service for semi-automated conversion of raw imaging data to BIDS.

Vue
31
1 month ago

Target-oriented Proactive Dialogue Systems with Personalization: Problem Formulation and Dataset Curation (EMNLP 2023)

Python
30
1 year ago

Curation of BIDS (CuBIDS): A sanity-preserving software package for processing BIDS datasets.

Python
28
2 days ago

AqSolDB: A curated aqueous solubility dataset containing 9,982 unique compounds.

Python
23
5 years ago