Provides an implementation of today's most used tokenizers, with a
focus on performance and versatility.
Main features:
- Train new vocabularies and tokenize with today's most used tokenizers (a training sketch follows this list).
- Extremely fast (both training and tokenization), thanks to the Rust
implementation. Takes less than 20 seconds to tokenize a GB of text
on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignment tracking: it is always possible to recover the part of the original sentence that corresponds to a given token (see the offsets sketch below).
- Does all the pre-processing: truncate, pad, and add the special tokens your model needs (see the padding/truncation sketch below).
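As a minimal sketch of training and tokenizing through the Python bindings (the BPE model, the whitespace pre-tokenizer, and the `data.txt` path are illustrative assumptions, not requirements):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build an untrained BPE tokenizer; the [UNK] token is an assumption for this sketch.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Train a new vocabulary from raw text ("data.txt" is a placeholder path).
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["data.txt"], trainer=trainer)

# Tokenize with the freshly trained vocabulary.
output = tokenizer.encode("Hello, y'all! How are you?")
print(output.tokens)
```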
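The alignment tracking mentioned above is exposed as per-token character offsets on the returned encoding. A small sketch, reusing the `tokenizer` trained in the previous example:

```python
sentence = "Hello, y'all! How are you?"
output = tokenizer.encode(sentence)

# Each token carries (start, end) character offsets into the original text,
# so the exact source span of any token can be recovered, even after normalization.
for token, (start, end) in zip(output.tokens, output.offsets):
    print(token, "->", repr(sentence[start:end]))
```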
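Truncation, padding, and special-token insertion can be sketched with a BERT-style post-processor; the `[CLS]`/`[SEP]`/`[PAD]` tokens and the maximum length of 128 are assumptions chosen for illustration, again reusing the trained `tokenizer`:

```python
from tokenizers.processors import TemplateProcessing

# Wrap every single sequence in the special tokens a BERT-style model expects
# ([CLS]/[SEP] are assumptions for this sketch).
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)

# Truncate long sequences and pad shorter ones so a batch lines up.
tokenizer.enable_truncation(max_length=128)  # 128 is an illustrative limit
tokenizer.enable_padding(pad_id=tokenizer.token_to_id("[PAD]"), pad_token="[PAD]")

batch = tokenizer.encode_batch(["A short sentence.", "A slightly longer sentence."])
print([encoding.tokens for encoding in batch])
```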