Search Results - RepositoryStats

tokenizer theseer

22

5.2k

other

8

A small library for converting tokenized PHP source code into XML (and potentially other formats)

php xml tokenizer

Created 2017-04-05

75 commits to master branch, last one about a year ago

chevrotain Chevrotain

213

2.6k

apache-2.0

29

Parser Building Toolkit for JavaScript

lexer parsing grammars tokenizer javascript typescript open-source parser-library

Created 2015-04-15

2,345 commits to master branch, last one about a month ago

hazm roshan-research

188

1.3k

mit

23

Persian NLP Toolkit

nlp farsi python persian tokenizer embeddings persian-nlp pos-tagging lemmatization normalization text-processing dependency-parser natural-language-processing

Created 2013-10-29

1,411 commits to master branch, last one 10 months ago

natasha natasha

109

1.2k

mit

57

Solves basic Russian NLP tasks, API for lower level Natasha projects

ner nlp python syntax russian tokenizer embeddings morphology visualization sentence-segmentation

Created 2016-08-03

413 commits to master branch, last one 6 months ago

tiktokenizer dqbd

130

1.1k

mit

10

Online playground for OpenAI tokenizers

nextjs openai chatgpt t3-stack tiktoken tokenizer

Created 2023-03-02

57 commits to master branch, last one 2 months ago

soynlp lovit

185

961

other

40

A Python library for Korean natural language processing. Provides word extraction, tokenization, part-of-speech tagging, and preprocessing.

nlp tokenizer korean-nlp postagging word-extraction korean-text-processing

Created 2017-05-13

766 commits to master branch, last one 4 years ago

kagome ikawaha

56

859

mit

22

Self-contained Japanese Morphological Analyzer written in pure Go

korean japanese tokenizer nlp-library pos-tagging segmentation hacktoberfest japanese-language morphological-analysis

Created 2014-06-26

821 commits to v2 branch, last one 22 days ago

moo no-context

68

848

bsd-3-clause

11

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.

lexer regexp tokenizer javascript

Created 2017-03-10

363 commits to main branch, last one about a year ago

Wordless BLKSerene

93

717

gpl-3.0

26

An Integrated Corpus Tool With Multilingual Support for the Study of Language, Literature, and Translation

corpus tagger stopword tokenizer lemmatizer literature linguistics translation corpus-tools corpus-search corpus-analysis corpus-processing corpus-statistics dependency-parser corpus-linguistics

Created 2018-08-23

1,499 commits to main branch, last one 27 days ago

simple wangfenjin

96

672

other

6

A SQLite3 fts5 full-text search extension and tokenizer that supports Chinese and Pinyin

fts fts5 cpp14 pinyin sqlite chinese sqlite3 tokenizer sqlite3-fts5

Created 2020-03-03

169 commits to master branch, last one 5 days ago

Data-Labeling risesoft-y9

95

670

gpl-3.0

68

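Data Labeling is a tool dedicated to processing and annotating text data. Through a streamlined, fast annotation workflow and dynamic algorithmic feedback, it lets users label keywords quickly while the algorithm continually reduces the cost and time of manual annotation. The process starts with manual annotation to build a baseline, then automatic annotation feeds back into manual annotation, and finally manual annotation corrects any deviations, which greatly improves annotation accuracy and efficiency. Data Labeling depends on the open-source digital base platform for managing personnel and positions.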

java vue3 nacos docker chinese tokenizer springboot2 elasticsearch data-annotations tokenizer-parser data-annotation-tools

Created 2024-08-26

30 commits to main branch, last one 3 months ago

ekphrasis cbaziotis

91

668

mit

18

Ekphrasis is a text processing tool, geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashta...

nlp semeval tokenizer nlp-library tokenization spell-corrector text-processing text-segmentation word-segmentation word-normalization spelling-correction

Created 2017-02-07

77 commits to master branch, last one 2 years ago

open-korean-text open-korean-text

98

623

apache-2.0

49

Open Korean Text Processor - An Open-source Korean Text Processor

korean tokenizer text-processing korean-tokenizer korean-text-processing natural-language-processing

Created 2017-01-24

799 commits to master branch, last one about a year ago

jflex jflex-de

117

604

other

22

The fast scanner generator for Java™ with full Unicode support

cup dfa nfa flex java yacc lexer regexp grammar parsing scanner tokenizer bazel-rules maven-plugin lexer-generator dfa-minimization lexical-analyzer scanner-generator

Created 2015-02-15

2,110 commits to master branch, last one 3 months ago

tokenmonster alasdairforsythe

21

576

mit

11

Ungreedy subword tokenizer and vocabulary trainer for Python, Go & Javascript

tokenize tokenizer tokenizing vocabulary tokenisation tokenization text-tokenization vocabulary-builder vocabulary-generator

Created 2023-05-12

196 commits to main branch, last one about a year ago

Deepdive-llama3-from-scratch therealoliver

44

570

mit

4

Implement llama3 inference step by step: grasp the core concepts, master the derivation of the process, and write the code.

Created 2025-02-19

9 commits to main branch, last one about a month ago

gpt-tokenizer niieani

39

557

mit

4

The fastest JavaScript BPE Tokenizer Encoder Decoder for OpenAI's GPT-2 / GPT-3 / GPT-4 / GPT-4o / GPT-o1. Port of OpenAI's tiktoken with additional features.

bpe gpt-2 gpt-3 gpt-4 gpt-4o gpt-o1 openai decoder encoder tokenizer machine-learning

Created 2023-03-22

118 commits to main branch, last one about a month ago

php-parser glayzzle

71

542

bsd-3-clause

18

🌿 NodeJS PHP Parser - extract AST or tokens

ast php lexer parser php-ast tokenizer javascript php-parser development static-code-analysis

Created 2014-12-07

1,846 commits to main branch, last one 6 days ago

js-tokens lydell

34

517

mit

6

Tiny JavaScript tokenizer.

regex tokenizer ecmascript javascript

Created 2014-03-08

156 commits to main branch, last one 19 days ago

friso lionsoul2014

92

498

apache-2.0

32

High-performance Chinese tokenizer with both GBK and UTF-8 charset support, based on the MMSEG algorithm and written in ANSI C. Fully modular in its implementation and can be easily embedded in other...

c tokenizer cjk-tokenizer php-tokenizer full-text-search korean-tokenizer chinese-tokenizer japanese-tokenizer chinese-word-segmentation

Created 2014-03-31

148 commits to master branch, last one about a year ago

sacremoses hplt-project

59

494

mit

11

Python port of Moses tokenizer, truecaser and normalizer

nlp tokenizer machine-translation

Created 2018-04-20

374 commits to master branch, last one about a year ago

vscode-blockman leodevbro

17

478

mit

7

VSCode extension to highlight nested code blocks

ast parser tokenizer vscode-api indentation vscode-blockman highlight-blocks vscode-extension abstract-syntax-tree

Created 2021-05-12

201 commits to main branch, last one 6 months ago

cogcomp-nlp CogComp

144

475

other

61

CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, t...

ner nlp pos cogcomp big-data tokenizer lemmatizer similarity data-mining pos-tagging lemmatization transliteration dependency-parsing relation-extraction parts-of-speech-tagging named-entity-recognition natural-language-processing natural-language-understanding

Created 2015-10-18

2,621 commits to master branch, last one 2 years ago