53 results found

754
9.8k
agpl-3.0
88
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Created 2018-05-11
1,749 commits to master branch, last one a day ago
220
9.1k
other
70
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
Created 2015-05-03
8,962 commits to main branch, last one 6 days ago
316
3.5k
mit
22
A light-weight, flexible, and expressive statistical data testing library
Created 2018-11-01
805 commits to main branch, last one 3 days ago
1.9k
2.2k
unknown
198
Jupyter notebook and datasets from the pandas video series
Created 2016-03-31
88 commits to master branch, last one 9 months ago
232
1.5k
apache-2.0
38
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Created 2017-07-13
6,411 commits to develop branch, last one about a year ago
130
1.4k
other
36
simple tools for data cleaning in R
Created 2016-04-12
985 commits to main branch, last one 7 days ago
The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
Created 2017-11-27
510 commits to master branch, last one 2 months ago
107
1.3k
bsd-3-clause
22
Prepping tables for machine learning
Created 2018-03-12
1,725 commits to main branch, last one 18 days ago
79
735
unknown
16
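An open-source educational chat model from ICALK, East China Normal University. An open-source Chinese-English educational dialogue large language model (general-purpose base model, GPU deployment, data cleaning). Acknowledgments: LLaMA, MOSS, BELLE, Ziya, vLLM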
Created 2023-06-26
91 commits to main branch, last one 2 months ago
Schema-Inspector is a simple JavaScript object sanitization and validation module.
Created 2014-01-02
170 commits to master branch, last one 26 days ago
54
502
mit
5
Easy to use Python library of customized functions for cleaning and analyzing data.
Created 2020-03-25
886 commits to main branch, last one 29 days ago
26
445
apache-2.0
10
The toolkit to test, validate, and evaluate your models and surface, curate, and prioritize the most valuable data for labeling.
Created 2022-09-21
653 commits to main branch, last one 6 months ago
39
409
unknown
19
Professional data validation for the R environment
Created 2014-02-21
812 commits to master branch, last one 25 days ago
Desbordante is a high-performance data profiler that is capable of discovering many different patterns in data using various algorithms. It also allows to run data cleaning scenarios using these algor...
Created 2020-04-09
1,495 commits to main branch, last one 19 days ago
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
Created 2018-10-05
30 commits to master branch, last one 3 years ago
85
377
apache-2.0
25
🗣️ A book and repo to get you started programming voice computing applications in Python (10 chapters and 200+ scripts).
Created 2018-06-18
646 commits to master branch, last one 3 years ago
32
218
unknown
22
A domain-specific probabilistic programming language for scalable Bayesian data cleaning
Created 2019-09-30
22 commits to master branch, last one 2 years ago
Pydantic extension for annotating autocorrecting fields.
Created 2024-02-17
106 commits to main branch, last one 9 months ago
LLM-based text extraction from unstructured data like PDFs, Words and HTMLs. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster R&D!
Created 2023-09-05
890 commits to main branch, last one 8 months ago
13
169
gpl-3.0
6
CSV Lint plug-in for Notepad++ for syntax highlighting, csv validation, automatic column and datatype detecting, fixed width datasets, change datetime format, decimal separator, sort data, count uniqu...
Created 2019-12-15
347 commits to master branch, last one 3 months ago
🗺️ Data Cleaning and Textual Data Visualization 🗺️
Created 2022-05-19
1,023 commits to main branch, last one 6 months ago
26
143
unknown
10
An R package for data screening
Created 2016-09-26
493 commits to master branch, last one 2 years ago
35
140
apache-2.0
12
🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)
Created 2018-08-16
2,174 commits to develop branch, last one 2 years ago
35
140
apache-2.0
5
🤖 An automated machine learning framework for audio, text, image, video, or .CSV files (50+ featurizers and 15+ model trainers). Python 3.6 required.
Created 2019-05-26
3,418 commits to master branch, last one about a year ago
5
127
bsd-2-clause
1
Outlier Detection Thresholding
Created 2022-05-29
392 commits to main branch, last one 11 days ago
8
110
lgpl-3.0
5
pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
Created 2021-04-02
3,956 commits to develop branch, last one 3 months ago
5
104
unknown
8
Cluster and merge similar string values: an R implementation of OpenRefine clustering algorithms
Created 2017-03-04
246 commits to master branch, last one 9 months ago