123 results found

22
5.2k
other
8
A small library for converting tokenized PHP source code into XML (and potentially other formats)
Created 2017-04-05
75 commits to master branch, last one about a year ago
211
2.6k
apache-2.0
29
Parser Building Toolkit for JavaScript
Created 2015-04-15
2,345 commits to master branch, last one 11 days ago
Persian NLP Toolkit
Created 2013-10-29
1,411 commits to master branch, last one 9 months ago
109
1.2k
mit
57
Solves basic Russian NLP tasks, API for lower level Natasha projects
Created 2016-08-03
413 commits to master branch, last one 5 months ago
125
1.1k
mit
10
Online playground for OpenAPI tokenizers
Created 2023-03-02
57 commits to master branch, last one about a month ago
185
958
other
41
A Python library for Korean natural language processing. Provides word extraction, tokenization, part-of-speech tagging, and preprocessing.
Created 2017-05-13
766 commits to master branch, last one 4 years ago
56
853
mit
22
Self-contained Japanese Morphological Analyzer written in pure Go
Created 2014-06-26
816 commits to v2 branch, last one a day ago
68
843
bsd-3-clause
11
Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
Created 2017-03-10
363 commits to main branch, last one about a year ago
93
714
gpl-3.0
26
An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation
Created 2018-08-23
1,499 commits to main branch, last one 19 hours ago
Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashta...
Created 2017-02-07
77 commits to master branch, last one 2 years ago
95
664
other
6
A SQLite3 fts5 full-text search extension and tokenizer that supports Chinese and Pinyin
Created 2020-03-03
165 commits to master branch, last one about a month ago
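Data Annotation is a tool dedicated to processing and annotating text data. Through a streamlined annotation workflow and dynamic algorithmic feedback, it lets users annotate keywords quickly and uses algorithms to keep reducing the cost and time of manual annotation. The process starts with manual annotation to build a foundation, then automatic annotation feeds back into manual annotation, and finally manual annotation corrects any deviations, greatly improving annotation accuracy and efficiency. Data Annotation relies on the open-source digital base platform for personnel and role management.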
Created 2024-08-26
30 commits to main branch, last one 2 months ago
Open Korean Text Processor - An Open-source Korean Text Processor
Created 2017-01-24
799 commits to master branch, last one about a year ago
117
602
other
22
The fast scanner generator for Java™ with full Unicode support
Created 2015-02-15
2,110 commits to master branch, last one 2 months ago
Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript
Created 2023-05-12
196 commits to main branch, last one about a year ago
Implement llama3 inference step by step: grasp the core concepts, master the process derivation, and write the code.
Created 2025-02-19
9 commits to main branch, last one about a month ago
72
542
bsd-3-clause
18
🌿 NodeJS PHP Parser - extract AST or tokens
Created 2014-12-07
1,844 commits to main branch, last one 2 months ago
The fastest JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT-2 / GPT-3 / GPT-4 / GPT-4o / GPT-o1. Port of OpenAI's tiktoken with additional features.
Created 2023-03-22
118 commits to main branch, last one 23 days ago
Tiny JavaScript tokenizer.
Created 2014-03-08
153 commits to main branch, last one about a month ago
91
496
apache-2.0
32
High performance Chinese tokenizer with both GBK and UTF-8 charset support, based on the MMSEG algorithm and developed in ANSI C. Completely based on modular implementation and can be easily embedded in other...
Created 2014-03-31
148 commits to master branch, last one about a year ago
Python port of Moses tokenizer, truecaser and normalizer
Created 2018-04-20
374 commits to master branch, last one about a year ago
VSCode extension to highlight nested code blocks
Created 2021-05-12
201 commits to main branch, last one 5 months ago
144
475
other
61
CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, t...
Created 2015-10-18
2,621 commits to master branch, last one 2 years ago
A multilingual command line sentence tokenizer in Golang
Created 2015-08-07
223 commits to main branch, last one about a year ago
41
441
mit
6
A multilingual morphological analysis library.
Created 2020-01-22
527 commits to main branch, last one a day ago
36
439
mit
7
A Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis.
Created 2019-10-14
284 commits to main branch, last one 2 months ago
28
410
other
11
Lex machinery for Go.
Created 2014-04-28
157 commits to master branch, last one 2 years ago
23
396
mit
10
A Japanese tokenizer based on recurrent neural networks
Created 2018-02-14
194 commits to master branch, last one 9 months ago
44
385
apache-2.0
31
Juman++ (a Morphological Analyzer Toolkit)
Created 2016-10-11
1,093 commits to master branch, last one 2 years ago
15
353
apache-2.0
7
🎤 vibrato: Viterbi-based accelerated tokenizer
Created 2022-07-06
176 commits to main branch, last one 9 days ago