LanguageMachines / ucto
Unicode tokeniser. Ucto tokenises text files: it separates words from punctuation and splits sentences. It also offers several other basic preprocessing steps, such as changing case, that you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for several languages and can easily be extended to suit other languages. It has been incorporated for tokenising Dutch text in Frog, our Dutch morpho-syntactic processor. http://ilk.uvt.nl/ucto
RepositoryStats indexes 582,612 repositories; of these, LanguageMachines/ucto is ranked #387,387 (34th percentile) for total stargazers and #165,538 for total watchers. GitHub reports the primary language for this repository as C++; among repositories using this language, it is ranked #21,608/31,193.
LanguageMachines/ucto has GitHub issues enabled; there are 12 open issues and 81 closed issues.
There have been 46 releases; the latest, named v0.34, was published on 2024-09-12 (2 months ago).
Homepage URL: https://languagemachines.github.io/ucto
Star History
GitHub stargazers over time
Watcher History
GitHub watchers over time; collection started in '23
Recent Commit History
292 commits on the default branch (master) since Jan '22
Yearly Commits
Commits to the default branch (master) per year
Issue History
Languages
The primary language is C++, but there are also others.
updated: 2024-09-25 @ 07:31am, id: 9028617 / R_kgDOAInECQ