Tokenization
Word-based:: Split a sentence on spaces, as well as applying language-specific rules to try to separate parts of meaning even when there are no spaces (such as turning “don’t” into “do n’t”). Generally, punctuation marks are also split into separate tokens.
Subword based:: Split words into smaller parts, based on the most commonly occurring substrings. For instance, “occasion” might be tokenized as “o c ca sion.”
Character-based:: Split a sentence into its individual characters.
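As a rough plain-Python illustration of these granularities (the splits below are naive, not what a real tokenizer produces):
sentence = "Don't stop me now!"
# Naive word-based split on whitespace; a real word tokenizer would also separate
# punctuation and handle contractions (e.g. "Don't" -> "Do n't"):
word_tokens = sentence.split()    # ["Don't", 'stop', 'me', 'now!']
# Character-based split:
char_tokens = list(sentence)      # ['D', 'o', 'n', "'", 't', ' ', 's', ...]
# Subword tokenization needs a vocab of frequent substrings learned from a corpus,
# so there is no one-liner; see the SubwordTokenizer example later in this section.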
Word Tokenization with fastai
from fastai.text.all import *
path = untar_data(URLs.IMDB)
files = get_text_files(path, folders = ['train', 'test', 'unsup'])
txt = files[0].open().read(); txt[:75]
> 'This movie, which I just discovered at the video store, has apparently sit '
We use the coll_repr(collection, n) function to display the results. It shows the first n items of collection along with the full size; it's what L uses by default.
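For example:
coll_repr(L(range(100)), 5)   # returns a string roughly like "(#100) [0,1,2,3,4...]"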
Note that fastai’s tokenizers take a collection of documents to tokenize, so we have to wrap txt in a list:
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks, 30))
> (#201) ['This','movie',',','which','I','just','discovered','at','the','video','store',',','has','apparently','sit','around','for','a','couple','of','years','without','a','distributor','.','It',"'s",'easy','to','see'...]
Here are some of the main special tokens you’ll see:
xxbos:: Indicates the beginning of a text (here, a review)
xxmaj:: Indicates the next word begins with a capital (since we lowercased everything)
xxunk:: Indicates the word is unknown
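These tokens are added by fastai's Tokenizer class, which wraps a word tokenizer such as WordTokenizer and applies the preprocessing rules; a quick sketch following the pattern above:
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 10))
> (#228) ['xxbos','xxmaj','this','movie',',','which','i','just','discovered','at'...]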
# To see the rules that were used, you can check the default rules:
defaults.text_proc_rules
[<function fastai.text.core.fix_html(x)>,
<function fastai.text.core.replace_rep(t)>,
<function fastai.text.core.replace_wrep(t)>,
<function fastai.text.core.spec_add_spaces(t)>,
<function fastai.text.core.rm_useless_spaces(t)>,
<function fastai.text.core.replace_all_caps(t)>,
<function fastai.text.core.replace_maj(t)>,
<function fastai.text.core.lowercase(t, add_bos=True, add_eos=False)>]
Here is a brief summary of what each does:
fix_html:: Replaces special HTML characters with a readable version (IMDb reviews have quite a few of these)
replace_rep:: Replaces any character repeated three times or more with a special token for repetition (xxrep), the number of times it’s repeated, then the character
replace_wrep:: Replaces any word repeated three times or more with a special token for word repetition (xxwrep), the number of times it’s repeated, then the word
spec_add_spaces:: Adds spaces around / and #
rm_useless_spaces:: Removes all repetitions of the space character
replace_all_caps:: Lowercases a word written in all caps and adds a special token for all caps (xxup) in front of it
replace_maj:: Lowercases a capitalized word and adds a special token for capitalized (xxmaj) in front of it
lowercase:: Lowercases all text and adds a special token at the beginning (xxbos) and/or the end (xxeos)
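To see a few of these rules in action, you can pass a string that exercises them through the Tokenizer tkn from the sketch above (the input string here is just an illustration):
coll_repr(tkn('&copy;   Fast.ai www.fast.ai/INDEX'), 31)
# Expect fix_html to decode &copy;, replace_rep to turn 'www' into 'xxrep 3 w',
# spec_add_spaces to split around '/', and replace_all_caps to turn 'INDEX' into 'xxup index'.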
Subword Tokenization
Word tokenization relies on spaces (plus language-specific rules) to mark word boundaries, which doesn't work for languages such as Chinese and Japanese, or for sequences like genomic data that have no spaces at all. To handle these cases, it's generally best to use subword tokenization. This proceeds in two steps:
Analyze a corpus of documents to find the most commonly occurring groups of letters. These become the vocab.
Tokenize the corpus using this vocab of subword units.
txts = L(o.open().read() for o in files[:2000])
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return ' '.join(first(sp([txt]))[:40])
subword(1000)
> '▁This ▁movie , ▁which ▁I ▁just ▁dis c over ed ▁at ▁the ▁video ▁st or e , ▁has ▁a p par ent ly ▁s it ▁around ▁for ▁a ▁couple ▁of ▁years ▁without ▁a ▁dis t ri but or . ▁It'
The special character ▁ represents a space in the original text.
If we use a smaller vocab, then each token will represent fewer characters, and it will take more tokens to represent a sentence:
subword(200)
> '▁ T h i s ▁movie , ▁w h i ch ▁I ▁ j us t ▁ d i s c o ver ed ▁a t ▁the ▁ v id e o ▁ st or e , ▁h a s'
On the other hand, if we use a larger vocab, then most common English words will end up in the vocab themselves, and we will not need as many tokens to represent a sentence:
subword(10000)
> "▁This ▁movie , ▁which ▁I ▁just ▁discover ed ▁at ▁the ▁video ▁store , ▁has ▁apparently ▁sit ▁around ▁for ▁a ▁couple ▁of ▁years ▁without ▁a ▁distributor . ▁It ' s ▁easy ▁to ▁see ▁why . ▁The ▁story ▁of ▁two ▁friends ▁living"
Numericalization with fastai
Numericalization is the process of mapping tokens to integers. The steps are basically identical to those necessary to create a Category variable, such as the dependent variable of digits in MNIST:
Make a list of all possible levels of that categorical variable (the vocab).
Replace each level with its index in the vocab.
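Conceptually (a minimal plain-Python sketch, not the fastai implementation), those two steps look like this:
# 1. Build the vocab: the unique tokens, here ordered by frequency
from collections import Counter
sample_toks = ['xxbos', 'the', 'movie', 'the', 'film']
vocab = [t for t, _ in Counter(sample_toks).most_common()]   # ['the', 'xxbos', 'movie', 'film']
# 2. Replace each token with its index in the vocab
tok2idx = {t: i for i, t in enumerate(vocab)}
nums = [tok2idx[t] for t in sample_toks]                      # [1, 0, 2, 0, 3]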
# Tokenize the first 200 reviews (tkn is the fastai Tokenizer shown earlier):
toks200 = txts[:200].map(tkn)
toks200[0]
> (#228) ['xxbos','xxmaj','this','movie',',','which','i','just','discovered','at'...]
num = Numericalize()
num.setup(toks200)
coll_repr(num.vocab,20)
> "(#2000) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','a','and','of','to','is','in','i','it'...]"
The defaults for Numericalize are min_freq=3 and max_vocab=60000. max_vocab=60000 means that fastai replaces all words other than the most common 60,000 with a special unknown-word token, xxunk, and min_freq=3 means that any word appearing fewer than three times is also replaced with xxunk.
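Both can be overridden when constructing the Numericalize object; the values below are hypothetical, just to show the parameters:
num_small = Numericalize(min_freq=5, max_vocab=10000)  # hypothetical: stricter cutoff, smaller vocab
num_small.setup(toks200)
len(num_small.vocab)  # no larger than (roughly) 10,000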
# Once we've created our `Numericalize` object, we can use it as if it were a function:
toks = tkn(txt)
nums = num(toks)[:20]; nums
> tensor([ 2, 8, 21, 28, 11, 90, 18, 59, 0, 45, 9, 351, 499, 11, 72, 533, 584, 146, 29, 12])
' '.join(num.vocab[o] for o in nums)
> 'xxbos xxmaj this movie , which i just xxunk at the video store , has apparently sit around for a'
Putting Our Texts into Batches for a Language Model
Suppose we take a short example text (stream, defined in the code below) that tokenizes into 90 tokens, separated by spaces. Let's say we want a batch size of 6. We need to break this text into 6 contiguous parts of length 15:
# The example text we'll use as our token stream (the opening of this chapter):
stream = "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the preprocessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
tokens = tkn(stream)
bs,seq_len = 6,15
d_tokens = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))
xxbos xxmaj in this chapter , we will go back over the example of classifying
movie reviews we studied in chapter 1 and dig deeper under the surface . xxmaj
first we will look at the processing steps necessary to convert text into numbers and
how to customize it . xxmaj by doing this , we 'll have another example
of the preprocessor used in the data block xxup api . \n xxmaj then we
will study how we build a language model and train it for a while .
But in practice these mini-streams would be too long to fit in GPU memory all at once, so we need to divide them further into subsequences of a fixed sequence length (here, 5).
Going back to our previous example with 6 batches of length 15, if we chose a sequence length of 5, that would mean we first feed the following array:
# First, the first 5 tokens of each of the 6 mini-streams:
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15:i*15+seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))
# Then the next 5 tokens of each mini-stream:
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+seq_len:i*15+2*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))
# And finally the last 5 tokens of each mini-stream:
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+10:i*15+15] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))
So to recap, at every epoch we shuffle our collection of documents and concatenate them into a stream of tokens.
We then cut that stream into a batch of fixed-size consecutive mini-streams. Our model will then read the mini-streams in order,
and thanks to an inner state, it will produce the same activation whatever sequence length we picked.
This is all done behind the scenes by the fastai library when we create an LMDataLoader. We do this by first applying our Numericalize object to the tokenized texts:
nums200 = toks200.map(num)
and then passing that to LMDataLoader:
dl = LMDataLoader(nums200)
Let’s confirm that this gives the expected results, by grabbing the first batch:
x,y = first(dl)
x.shape,y.shape
and then looking at the first row of the independent variable, which should be the start of the first text:
' '.join(num.vocab[o] for o in x[0][:20])
> 'xxbos xxmaj this movie , which i just xxunk at the video store , has apparently sit around for a'
The dependent variable is the same thing offset by one token:
' '.join(num.vocab[o] for o in y[0][:20])
> 'xxmaj this movie , which i just xxunk at the video store , has apparently sit around for a couple'
Training a Text Classifier
Language Model Using DataBlock
fastai handles tokenization and numericalization automatically when a TextBlock is passed to a DataBlock.
All of the arguments that can be passed to Tokenize and Numericalize can also be passed to TextBlock.
And don't forget about DataBlock's handy summary method, which is very useful for debugging data issues (see the sketch after the next code block).
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)
dls_lm.show_batch(max_n=2)
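If something looks wrong, the summary method mentioned above shows how a sample flows through every processing step. A sketch, building the block without .dataloaders so we can inspect it:
dblock_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1))
dblock_lm.summary(path)  # prints how samples are tokenized, numericalized, and collated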
Fine-Tuning the Language Model
As we discussed earlier, the embeddings in the pretrained model are merged with random embeddings added for words that weren’t in the pretraining vocabulary. This is handled automatically inside language_model_learner:
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]).to_fp16()
The loss function used by default is cross-entropy loss, since we essentially have a classification problem (the different categories being the words in our vocab).
The perplexity metric used here is often used in NLP for language models: it is the exponential of the loss (i.e., torch.exp(cross_entropy)).
We also include the accuracy metric, to see how many times our model is right when trying to predict the next word, since cross-entropy (as we’ve seen) is both hard to interpret,
and tells us more about the model’s confidence than its accuracy.
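As a quick sanity check of that relationship (plain PyTorch, with a made-up loss value):
import torch
loss = torch.tensor(4.0)      # hypothetical cross-entropy value
perplexity = torch.exp(loss)  # ~54.6: perplexity is just the exponential of the loss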
It takes quite a while to train each epoch, so we’ll be saving the intermediate model results during the training process.
Since fine_tune doesn’t do that for us, we’ll use fit_one_cycle. Just like vision_learner,
language_model_learner automatically calls freeze when using a pretrained model (which is the default),
so this will only train the embeddings (the only part of the model that contains randomly initialized weights—i.e.,
embeddings for words that are in our IMDb vocab, but aren’t in the pretrained model vocab):
learn.fit_one_cycle(1, 2e-2)
This model takes a while to train, so it’s a good opportunity to talk about saving intermediary results.
You can easily save the state of your model like so:
learn.save('1epoch')
# If you want to load your model on another machine after creating your Learner the same way, or resume training later, you can load the content of this file with:
learn = learn.load('1epoch')
# Once the initial training has completed, we can continue fine-tuning the model after unfreezing:
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)
# Once this is done, we save all of our model except the final layer that converts activations to probabilities of picking each token in our vocabulary.
# The model not including the final layer is called the encoder. We can save it with save_encoder:
learn.save_encoder('finetuned')
# Use the fine-tuned language model to generate some text:
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75)
         for _ in range(N_SENTENCES)]
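Each element of preds is a generated string, so we can simply print them:
print("\n".join(preds))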
Creating the Classifier DataLoaders
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab), CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)
dls_clas.show_batch(max_n=3)
Looking at the DataBlock definition, every piece is familiar from previous data blocks we’ve built, with two important exceptions:
TextBlock.from_folder no longer has the is_lm=True parameter.
We pass the vocab we created for the language model fine-tuning.
By passing is_lm=False (or not passing is_lm at all, since it defaults to False) we tell TextBlock that we have regular labeled data, rather than using the next tokens as labels.
Because the reviews have different lengths, they have to be padded to a common length within each mini-batch so they can be collated into a tensor (sorting roughly by length first keeps the amount of padding small). This sorting and padding is done automatically by the data block API when using a TextBlock with is_lm=False.
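As an illustration of what that padding means (plain PyTorch, not the fastai internals; index 1 corresponds to xxpad in the vocab shown earlier):
import torch
from torch.nn.utils.rnn import pad_sequence
# Three numericalized "documents" of different lengths
docs = [torch.tensor([2, 8, 21, 28]), torch.tensor([2, 8, 21]), torch.tensor([2, 8])]
# Pad to the length of the longest so they can be stacked into one batch tensor
batch = pad_sequence(docs, batch_first=True, padding_value=1)  # 1 == xxpad
# batch is now a 3x4 tensor; shorter documents are filled with the pad index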
# We can now create a model to classify our texts:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
                                metrics=accuracy).to_fp16()
# The final step prior to training the classifier is to load the encoder from our fine-tuned language model.
learn = learn.load_encoder('finetuned')
Fine-Tuning the Classifier
For NLP classifiers, rather than unfreezing the whole model at once, it works better to unfreeze a few layers at a time (gradual unfreezing), using discriminative learning rates so that lower layers get smaller learning rates:
learn.fit_one_cycle(1, 2e-2)
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))