LanguageMachines / ucto

Unicode tokeniser. Ucto tokenises text files: it separates words from punctuation and splits sentences. It also offers other basic preprocessing steps, such as changing case, that you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for several languages and can be easily extended to suit other languages. It has been incorporated for tokenising Dutch text in Frog, our Dutch morpho-syntactic processor. http://ilk.uvt.nl/ucto

Date Created 2013-03-26 (11 years ago)
Commits 1,584 (last one 2 months ago)
Stargazers 67 (0 this week)
Watchers 13 (0 this week)
Forks 14
License gpl-3.0
Ranking

RepositoryStats indexes 621,960 repositories. Of these, LanguageMachines/ucto is ranked #398,118 (36th percentile) for total stargazers and #168,490 for total watchers. GitHub reports the primary language for this repository as C++; among repositories using this language, it is ranked #22,218/33,184.
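
The percentile figure appears to follow from the rank and the size of the index. A minimal sketch of that arithmetic, assuming the percentile is the share of indexed repositories ranked below this one (the site's exact formula is not stated, and `percentile_from_rank` is a hypothetical helper name):

```python
def percentile_from_rank(rank: int, total: int) -> float:
    """Percentage of indexed repositories ranked below `rank` (assumed definition)."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return 100.0 * (total - rank) / total


# Figures taken from the ranking paragraph: rank #398,118 of 621,960 repositories.
stars_percentile = percentile_from_rank(398_118, 621_960)
print(round(stars_percentile))  # 36 under this assumed definition
```

Under this reading, a lower rank number gives a higher percentile, which matches the reported 36th percentile for stargazers.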

LanguageMachines/ucto is also tagged with popular topics. For these, it is ranked: nlp (#1,787/2,492), natural-language-processing (#1,082/1,461), language (#564/760)

Other Information

LanguageMachines/ucto has GitHub issues enabled. There are 12 open issues and 81 closed issues.

There have been 47 releases. The latest one, named v0.35, was published on 2024-12-16 (2 months ago).

Homepage URL: https://languagemachines.github.io/ucto

Star History

GitHub stargazers over time

[Chart: y-axis 0 to 70 stargazers; x-axis 2014 to 2025]

Watcher History

GitHub watchers over time (collection started in '23)

[Chart: y-axis 11 to 13 watchers; x-axis Feb '23 to Feb '25]

Recent Commit History

297 commits on the default branch (master) since Jan '22

[Chart: y-axis 0 to 300 commits; x-axis Jul '22 to 2025]

Yearly Commits

Commits to the default branch (master) per year

[Chart: y-axis 0 to 250 commits; x-axis 2013 to 2024]

Issue History

Total Issues
Open Issues
Closed Issues
[Chart: y-axis 0 to 100 issues; x-axis 2016 to 2025]

Languages

The primary language is C++, but other languages are also present:

C++, Verilog, Coq, Shell, M4, Python, V, NewLisp, Dockerfile, Makefile

updated: 2025-02-09 @ 12:12am, id: 9028617 / R_kgDOAInECQ