Provides an implementation of today's most used tokenizers, with a
focus on performance and versatility.

Main features:
- Train new vocabularies and tokenize, using today's most used
  tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust
  implementation. Takes less than 20 seconds to tokenize a GB of text
  on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignment tracking: it is always possible
  to recover the part of the original sentence that corresponds to a
  given token.
- Does all the pre-processing: truncates, pads, and adds the special
  tokens your model needs (see the sketch after this list).
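
As a brief illustration of these features, here is a minimal sketch of
the Python API for training a BPE vocabulary and encoding text with
offset tracking, truncation, and padding. The class and method names
follow the upstream tokenizers documentation; the training file
"corpus.txt" is a hypothetical placeholder.

    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    # Build a BPE tokenizer and train a new vocabulary from raw text.
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]"])
    tokenizer.train(files=["corpus.txt"], trainer=trainer)  # placeholder file

    # Truncation and padding are handled by the tokenizer itself.
    tokenizer.enable_truncation(max_length=128)
    tokenizer.enable_padding(pad_token="[PAD]",
                             pad_id=tokenizer.token_to_id("[PAD]"))

    # Encoding keeps character offsets into the original sentence, so
    # each token can be mapped back to the text it came from.
    encoding = tokenizer.encode("Hello, y'all! How are you?")
    print(encoding.tokens)   # the produced tokens
    print(encoding.offsets)  # (start, end) character spans in the input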