51 results found

751
9.8k
agpl-3.0
90
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Created 2018-05-11
1,743 commits to master branch, last one 28 days ago
217
9.0k
other
71
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
Created 2015-05-03
8,946 commits to main branch, last one a day ago
311
3.4k
mit
20
A light-weight, flexible, and expressive statistical data testing library
Created 2018-11-01
779 commits to main branch, last one 8 days ago
1.9k
2.2k
unknown
198
Jupyter notebook and datasets from the pandas video series
Created 2016-03-31
88 commits to master branch, last one 8 months ago
232
1.5k
apache-2.0
37
:truck: Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Created 2017-07-13
6,411 commits to develop branch, last one about a year ago
133
1.4k
other
36
Simple tools for data cleaning in R
Created 2016-04-12
981 commits to main branch, last one a day ago
The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
Created 2017-11-27
510 commits to master branch, last one 26 days ago
97
1.2k
bsd-3-clause
20
Prepping tables for machine learning
Created 2018-03-12
1,696 commits to main branch, last one a day ago
76
716
unknown
16
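An open-source educational chat model from ICALK, East China Normal University. An open-source Chinese-English educational dialogue large language model. (General-purpose base model, GPU deployment, data cleaning.) Tribute to: LLaMA, MOSS, BELLE, Ziya, vLLM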
Created 2023-06-26
91 commits to main branch, last one about a month ago
Schema-Inspector is a simple JavaScript object sanitization and validation module.
Created 2014-01-02
169 commits to master branch, last one about a year ago
54
502
mit
5
Easy to use Python library of customized functions for cleaning and analyzing data.
Created 2020-03-25
884 commits to main branch, last one 19 days ago
25
436
apache-2.0
10
The toolkit to test, validate, and evaluate your models and surface, curate, and prioritize the most valuable data for labeling.
Created 2022-09-21
653 commits to main branch, last one 5 months ago
39
407
unknown
19
Professional data validation for the R environment
Created 2014-02-21
809 commits to master branch, last one about a month ago
Desbordante is a high-performance data profiler capable of discovering many different patterns in data using various algorithms. It also allows running data cleaning scenarios using these algor...
Created 2020-04-09
1,429 commits to main branch, last one 19 hours ago
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
Created 2018-10-05
30 commits to master branch, last one 3 years ago
84
376
apache-2.0
25
🗣️ A book and repo to get you started programming voice computing applications in Python (10 chapters and 200+ scripts).
Created 2018-06-18
646 commits to master branch, last one 3 years ago
32
217
unknown
22
A domain-specific probabilistic programming language for scalable Bayesian data cleaning
Created 2019-09-30
22 commits to master branch, last one 2 years ago
Pydantic extension for annotating autocorrecting fields.
Created 2024-02-17
106 commits to main branch, last one 8 months ago
LLM-based text extraction from unstructured data such as PDFs, Word documents, and HTML pages. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster R&D!
Created 2023-09-05
890 commits to main branch, last one 6 months ago
10
167
gpl-3.0
6
CSV Lint plug-in for Notepad++ for syntax highlighting, CSV validation, automatic column and datatype detection, fixed width datasets, change datetime format, decimal separator, sort data, count uniqu...
Created 2019-12-15
347 commits to master branch, last one 2 months ago
🗺️ Data Cleaning and Textual Data Visualization 🗺️
Created 2022-05-19
1,023 commits to main branch, last one 5 months ago
26
143
unknown
10
An R package for data screening
Created 2016-09-26
493 commits to master branch, last one 2 years ago
35
140
apache-2.0
12
🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)
Created 2018-08-16
2,174 commits to develop branch, last one 2 years ago
35
140
apache-2.0
5
🤖 An automated machine learning framework for audio, text, image, video, or .CSV files (50+ featurizers and 15+ model trainers). Python 3.6 required.
Created 2019-05-26
3,418 commits to master branch, last one about a year ago
5
124
bsd-2-clause
1
Outlier Detection Thresholding
Created 2022-05-29
383 commits to main branch, last one about a month ago
8
108
lgpl-3.0
5
pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
Created 2021-04-02
3,956 commits to develop branch, last one 2 months ago
5
104
unknown
8
Cluster and merge similar string values: an R implementation of Open Refine clustering algorithms
Created 2017-03-04
246 commits to master branch, last one 8 months ago