28 results found
- Filter by Primary Language:
- Python (15)
- Jupyter Notebook (3)
- CSS (1)
- TeX (1)
- TypeScript (1)
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Created
2018-05-11
1,772 commits to master branch, last one 15 days ago
Refine high-quality datasets and visual AI models
Created
2020-04-22
23,242 commits to develop branch, last one 12 hours ago
A Doctor for your data
Created
2023-05-02
33 commits to master branch, last one 3 months ago
The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.
Created
2022-07-04
70 commits to main branch, last one 4 months ago
Interactively explore unstructured datasets from your dataframe.
Created
2023-01-29
1,527 commits to main branch, last one 4 months ago
Resources for Data Centric AI
Created
2021-06-11
296 commits to main branch, last one 2 years ago
A curated, but incomplete, list of data-centric AI resources.
Created
2023-03-07
69 commits to main branch, last one 10 months ago
Automatically find issues in image datasets and practice data-centric computer vision.
Created
2022-05-26
338 commits to main branch, last one 22 days ago
Curated list of open source tooling for data-centric AI on unstructured data.
Topics: nlp, data-drift, awesome-list, noisy-labels, data-curation, deep-learning, bias-detection, explainable-ai, feature-vector, synthetic-data, active-learning, computer-vision, data-centric-ai, data-versioning, machine-learning, outlier-detection, data-visualization, documentation-only, uncertainty-estimation, robust-machine-learning
Created
2023-02-27
34 commits to main branch, last one about a year ago
Lab assignments for Introduction to Data-Centric AI, MIT IAP 2024
Created
2022-12-05
39 commits to master branch, last one about a month ago
Official PyTorch implementation of the paper "Dataset Distillation with Neural Characteristic Function: A Minmax Perspective" (NCFM) in CVPR 2025 (Highlight).
Created
2025-03-01
53 commits to main branch, last one 13 days ago
Official repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale"
Created
2024-09-09
12 commits to main branch, last one 6 days ago
[NeurIPS 2021] WRENCH: Weak supeRvision bENCHmark
Created
2021-08-23
173 commits to main branch, last one about a year ago
[NeurIPS 2023] This is the code for the paper `Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias`.
Created
2023-05-31
26 commits to main branch, last one about a year ago
pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
Created
2021-04-02
4,629 commits to develop branch, last one 6 days ago
Introduction to Data-Centric AI, MIT IAP 2023
Created
2022-12-05
214 commits to master branch, last one 2 months ago
OpenDataVal: a Unified Benchmark for Data Valuation in Python (NeurIPS 2023)
Created
2023-06-07
345 commits to main branch, last one 2 months ago
Self-Evolved Diverse Data Sampling for Efficient Instruction Tuning
Created
2023-11-14
6 commits to master branch, last one about a year ago
Papers about training data quality management for ML models.
Created
2024-03-05
58 commits to main branch, last one 2 months ago
nbsynthetic is a simple and robust tabular synthetic data generation library for small and medium-sized datasets
Created
2022-08-29
246 commits to master branch, last one 2 years ago
[ECCV 2022] Official Implementation for Unsupervised Selective Labeling for More Effective Semi-Supervised Learning
Created
2022-07-20
11 commits to main branch, last one about a year ago
A Data Centric NER annotation tool for your Named Entity Recognition projects
Created
2020-09-13
32 commits to main branch, last one about a year ago
Trending projects and awesome papers about data-centric LLM studies.
Created
2024-06-19
24 commits to main branch, last one 6 days ago
A curated list of awesome open source tools and commercial products to catalog, version, and manage data
Created
2022-04-20
1 commit to main branch, last one 3 years ago
Client interface to Cleanlab Studio and the Trustworthy Language Model
Topics: llm, automl, annotations, data-quality, data-science, noisy-labels, data-cleaning, data-curation, data-labeling, data-profiling, computer-vision, data-centric-ai, data-validation, structured-data, machine-learning, model-deployment, outlier-detection, text-classification, image-classification, natural-language-processing
Created
2022-03-03
851 commits to main branch, last one 2 months ago
A holistic self-supervised data cleaning strategy to detect irrelevant samples, near duplicates and label errors (NeurIPS'24).
Created
2024-02-14
111 commits to main branch, last one about a month ago
A list of data-efficient and data-centric LLM (Large Language Model) papers. Our Survey Paper: Towards Efficient LLM Post Training: A Data-centric Perspective
Created
2025-02-19
4 commits to main branch, last one 2 months ago
Intriguing Properties of Data Attribution on Diffusion Models (ICLR 2024)
Created
2023-11-01
4 commits to main branch, last one about a year ago