LanguageMachines / ucto

Unicode tokeniser. Ucto tokenises text files: it separates words from punctuation and splits sentences. It also offers other basic preprocessing steps, such as changing case, that you can use to make your text suitable for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for several languages and can easily be extended to suit other languages. It has been incorporated into Frog, our Dutch morpho-syntactic processor, for tokenising Dutch text. http://ilk.uvt.nl/ucto
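
To make the two steps named in the description concrete (separating words from punctuation, then splitting sentences), here is a minimal Python sketch. It is an illustration only and does not use ucto or reproduce its rule files: the regular expression, the set of sentence-ending marks, and the function names `tokenize` and `split_sentences` are all assumptions chosen for this example.

```python
import re

# Hypothetical, naive rules for illustration; ucto's real rules are
# language-specific and far more complete (abbreviations, URLs, etc.).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")   # a run of word characters, or one punctuation mark
SENTENCE_END = {".", "!", "?"}          # assumed sentence-final punctuation


def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_RE.findall(text)


def split_sentences(tokens):
    """Group tokens into sentences, closing a sentence after ., ! or ?"""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that lack a final punctuation mark
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences(tokenize("Hello, world! It works.")):
        print(" ".join(sentence))
```

A rule this simple mishandles abbreviations such as "e.g." and decimal numbers, which is the kind of case a dedicated rule-based tokeniser is meant to cover.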

Date Created 2013-03-26 (11 years ago)

Commits 1,584 (last one 2 months ago)

Stargazers 67 (0 this week)

Watchers 13 (0 this week)

Forks 14

License gpl-3.0

Ranking

RepositoryStats indexes 621,960 repositories. Of these, LanguageMachines/ucto is ranked #398,118 (36th percentile) for total stargazers and #168,490 for total watchers. GitHub reports the primary language for this repository as C++; among repositories using this language it is ranked #22,218/33,184.

LanguageMachines/ucto is also tagged with popular topics, where it is ranked: nlp (#1,787/2,492), natural-language-processing (#1,082/1,461), language (#564/760).
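
The page does not say how the percentile is derived from the rank. One reading that is consistent with the stargazer figures above is the share of indexed repositories ranked below this one, rounded to a whole number. The sketch below works through that arithmetic; the formula and the function name `percentile_from_rank` are assumptions, not RepositoryStats' documented method.

```python
def percentile_from_rank(rank, total):
    """Percentage of repositories ranked below `rank` out of `total`.

    Assumed formula: 100 * (total - rank) / total, rounded to the
    nearest integer. Rank 1 is the best position.
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return round(100 * (total - rank) / total)


if __name__ == "__main__":
    # Stargazer rank from the page: #398,118 of 621,960 indexed repositories.
    print(percentile_from_rank(398_118, 621_960))  # 100 * 223,842 / 621,960 is about 35.99
```

Under this assumption, 223,842 of the 621,960 indexed repositories rank below ucto for stargazers, which rounds to the 36th percentile shown on the page.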

Other Information

LanguageMachines/ucto has GitHub issues enabled; there are 12 open issues and 81 closed issues.

There have been 47 releases; the latest, v0.35, was published on 2024-12-16 (2 months ago).

Homepage URL: https://languagemachines.github.io/ucto

All Topics

nlp folia language tokeniser punctuation computational-linguistics natural-language-processing

Star History

GitHub stargazers over time

Watcher History

GitHub watchers over time (collection started in '23)

Recent Commit History

297 commits on the default branch (master) since Jan '22

Yearly Commits

Commits to the default branch (master) per year

Issue History

Languages

The primary language is C++, but the repository also contains code in other languages.

updated: 2025-02-09 @ 12:12am, id: 9028617 / R_kgDOAInECQ