44 results found

684
8.9k
agpl-3.0
85
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Created 2018-05-11
1,629 commits to master branch, last one 7 days ago
202
8.6k
other
68
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
Created 2015-05-03
8,836 commits to main branch, last one 17 hours ago
510
6.8k
apache-2.0
53
The open-source tool for building high-quality datasets and computer vision models
Created 2020-04-22
20,368 commits to develop branch, last one a day ago
283
3.1k
mit
18
A light-weight, flexible, and expressive statistical data testing library
Created 2018-11-01
722 commits to main branch, last one a day ago
1.9k
2.1k
unknown
197
Jupyter notebook and datasets from the pandas video series
Created 2016-03-31
88 commits to master branch, last one 2 months ago
233
1.4k
apache-2.0
38
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Created 2017-07-13
6,411 commits to develop branch, last one about a year ago
130
1.4k
other
36
Simple tools for data cleaning in R
Created 2016-04-12
978 commits to main branch, last one 9 days ago
The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
Created 2017-11-27
509 commits to master branch, last one 2 months ago
90
1.0k
bsd-3-clause
21
Prepping tables for machine learning
Created 2018-03-12
1,541 commits to main branch, last one 2 days ago
62
621
unknown
16
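An open-source educational chat model from ICALK, East China Normal University. An open-source Chinese-English large language model for educational dialogue. (General-purpose base model, GPU deployment, data cleaning.) A tribute to: LLaMA, MOSS, BELLE, Ziya, vLLM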
Created 2023-06-26
89 commits to main branch, last one 7 days ago
Schema-Inspector is a simple JavaScript object sanitization and validation module.
Created 2014-01-02
169 commits to master branch, last one 8 months ago
51
479
mit
5
Easy to use Python library of customized functions for cleaning and analyzing data.
Created 2020-03-25
853 commits to main branch, last one 4 days ago
23
423
apache-2.0
10
The toolkit to test, validate, and evaluate your models and surface, curate, and prioritize the most valuable data for labeling.
Created 2022-09-21
652 commits to main branch, last one about a month ago
37
403
unknown
19
Professional data validation for the R environment
Created 2014-02-21
803 commits to master branch, last one 14 days ago
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
Created 2018-10-05
30 commits to master branch, last one 2 years ago
82
370
apache-2.0
25
🗣️ A book and repo to get you started programming voice computing applications in Python (10 chapters and 200+ scripts).
Created 2018-06-18
646 commits to master branch, last one 3 years ago
Desbordante is a high-performance data profiler that is capable of discovering many different patterns in data using various algorithms. It also allows to run data cleaning scenarios using these algor...
Created 2020-04-09
1,194 commits to main branch, last one 23 hours ago
31
215
unknown
22
A domain-specific probabilistic programming language for scalable Bayesian data cleaning
Created 2019-09-30
22 commits to master branch, last one 2 years ago
Pydantic extension for annotating autocorrecting fields.
Created 2024-02-17
106 commits to main branch, last one 2 months ago
26
141
unknown
10
An R package for data screening
Created 2016-09-26
493 commits to master branch, last one 2 years ago
36
139
apache-2.0
5
🤖 An automated machine learning framework for audio, text, image, video, or .CSV files (50+ featurizers and 15+ model trainers). Python 3.6 required.
Created 2019-05-26
3,418 commits to master branch, last one 8 months ago
8
138
gpl-3.0
6
CSV Lint plug-in for Notepad++ for syntax highlighting, csv validation, automatic column and datatype detecting, fixed width datasets, change datetime format, decimal separator, sort data, count uniqu...
Created 2019-12-15
331 commits to master branch, last one 2 months ago
35
137
apache-2.0
12
🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)
Created 2018-08-16
2,174 commits to develop branch, last one about a year ago
LLM-based text extraction from unstructured data like PDFs, Words and HTMLs. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster R&D!
Created 2023-09-05
890 commits to main branch, last one 29 days ago
5
118
bsd-2-clause
2
Outlier Detection Thresholding
Created 2022-05-29
365 commits to main branch, last one a day ago
🗺️ Data Cleaning and Textual Data Visualization 🗺️
Created 2022-05-19
1,022 commits to main branch, last one 10 days ago
5
102
unknown
8
Cluster and merge similar string values: an R implementation of Open Refine clustering algorithms
Created 2017-03-04
246 commits to master branch, last one 2 months ago
9
84
lgpl-3.0
4
pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
Created 2021-04-02
3,467 commits to develop branch, last one 22 days ago