Transform
# When we studied tokenization and numericalization in the last chapter, we started by grabbing a bunch of texts:
files = get_text_files(path, folders = ['train', 'test'])
txts = L(o.open().read() for o in files[:2000])
# We then showed how to tokenize them with a Tokenizer:
tok = Tokenizer.from_folder(path)
tok.setup(txts)
toks = txts.map(tok)
toks[0]
# and how to numericalize, including automatically creating the vocab for our corpus:
num = Numericalize()
num.setup(toks)
nums = toks.map(num)
nums[0][:10]
# The classes also have a decode method. For instance, Numericalize.decode gives us back the string tokens:
nums_dec = num.decode(nums[0][:10]); nums_dec
(#10) ['xxbos','xxmaj','well',',','"','cube','"','(','1997',')']
# and Tokenizer.decode turns this back into a single string:
tok.decode(nums_dec)
'xxbos xxmaj well , " cube " ( 1997 )'
# decode is used by fastai’s show_batch and show_results, as well as some other inference methods,
# to convert predictions and mini-batches into a human-understandable representation.
# These steps are needed for most data preprocessing tasks, so fastai provides a class that encapsulates them.
# This is the Transform class. Both Tokenize and Numericalize are Transforms.
# In general, a Transform is an object that behaves like a function and has an optional setup method that will initialize
# some inner state (like the vocab inside num) and an optional decode that will reverse the function (this reversal may not be perfect, as we saw with tok).
# A special behavior of Transforms is that they are always applied over tuples: in general, your data is a tuple (input, target), and the transform is applied (the same way, when the types match) to each element of the tuple.
# The simplest way to create a Transform is to pass a function to its constructor:
def f(x:int): return x+1
tfm = Transform(f)
tfm(2),tfm(2.0)
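# Expected output: (3, 2.0); the transform is applied to 2 but not to 2.0, because the x:int annotation restricts it to ints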
# Alternatively, use Transform as a decorator:
@Transform
def f(x:int): return x+1
f(2),f(2.0)
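# To illustrate the tuple behavior described above, here is a minimal sketch (double is just an illustrative helper, not part of the original example):
def double(x:int): return x*2
tfm = Transform(double)
tfm((2, 3, 4.0))  # applied element-wise: the ints become 4 and 6, the float 4.0 passes through unchanged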
# If you need either setup or decode, you will need to subclass Transform and implement the actual encoding behavior in encodes, then (optionally)
# the setup behavior in setups and the decoding behavior in decodes:
class NormalizeMean(Transform):
    def setups(self, items): self.mean = sum(items)/len(items)
    def encodes(self, x): return x-self.mean
    def decodes(self, x): return x+self.mean
tfm = NormalizeMean()
tfm.setup([1,2,3,4,5])
start = 2
y = tfm(start)
z = tfm.decode(y)
tfm.mean,y,z
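# Expected: tfm.mean == 3.0 (the mean of [1,2,3,4,5]), y == -1.0, and z == 2.0, since decode adds the mean back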
Pipeline
To compose several transforms together, fastai provides the Pipeline class.
We define a Pipeline by passing it a list of Transforms; it will then compose the transforms inside it.
When you call a Pipeline on an object, it will automatically call the transforms inside it, in order:
tfms = Pipeline([tok, num])
t = tfms(txts[0]); t[:20]
And you can call `decode` on the result of your encoding, to get back something you can display and analyze:
tfms.decode(t)[:100]
Note, however, that a Pipeline does not call the setup method of its Transforms; it works here only because tok and num were already set up above. To properly set up a Pipeline of Transforms on some data, you need a TfmdLists.
TfmdLists
The class that groups together this Pipeline with your raw items is called TfmdLists.
tls = TfmdLists(files, [Tokenizer.from_folder(path), Numericalize])
# At initialization, the TfmdLists will automatically call the setup method of each Transform in order,
# providing them not with the raw items but the items transformed by all the previous Transforms in order.
# We can get the result of our Pipeline on any raw element just by indexing into the TfmdLists:
t = tls[0]; t[:20]
> tensor([ 2, 8, 91, 11, 22, 5793, 22, 37, 4910, 34, 11, 8, 13042, 23, 107, 30, 11, 25, 44, 14])
# And the TfmdLists knows how to decode for show purposes:
tls.decode(t)[:100]
> 'xxbos xxmaj well , " cube " ( 1997 ) , xxmaj vincenzo \'s first movie , was one of the most interesti'
# In fact, it even has a show method:
tls.show(t)
# A TfmdLists can also handle a training and a validation set with a splits argument.
# You just need to pass the indices of the elements that are in the training set, and of those in the validation set:
cut = int(len(files)*0.8)
splits = [list(range(cut)), list(range(cut,len(files)))]
tls = TfmdLists(files, [Tokenizer.from_folder(path), Numericalize],
                splits=splits)
tls.valid[0][:20]
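# Instead of computing the split indices by hand, you can use one of fastai's splitter functions; for example, a random 80/20 split (a sketch, the seed is arbitrary):
splits = RandomSplitter(valid_pct=0.2, seed=42)(files)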
However, a TfmdLists applies a single Pipeline to every item, so it cannot apply different Transforms to the inputs and the targets.
Datasets can.
Datasets
Datasets will apply two (or more) pipelines in parallel to the same raw object and build a tuple with the result.
Like TfmdLists, it will automatically do the setup for us, and when we index into a Datasets, it will return a tuple with the results of each pipeline:
x_tfms = [Tokenizer.from_folder(path), Numericalize]
y_tfms = [parent_label, Categorize()]
dsets = Datasets(files, [x_tfms, y_tfms], splits=splits)
x,y = dsets.valid[0]
x[:20],y
t = dsets.valid[0]
dsets.decode(t)
# To convert our Datasets into a DataLoaders, call the dataloaders method; before_batch=pad_input pads the items of each batch to the same length:
dls = dsets.dataloaders(bs=64, before_batch=pad_input)
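# To sanity-check the result, grab one batch (a sketch; the exact shapes depend on your data and batch size):
xb, yb = dls.one_batch()
xb.shape, yb.shape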
# As a conclusion, here is the full code necessary to prepare the data for text classification:
tfms = [[Tokenizer.from_folder(path), Numericalize], [parent_label, Categorize]]
files = get_text_files(path, folders = ['train', 'test'])
splits = GrandparentSplitter(valid_name='test')(files)
dsets = Datasets(files, tfms, splits=splits)
dls = dsets.dataloaders(dl_type=SortedDL, before_batch=pad_input)
# SortedDL constructs batches by putting samples of roughly the same length together, which minimizes the amount of padding needed.
# This does the exact same thing as our previous DataBlock:
path = untar_data(URLs.IMDB)
dls = DataBlock(
    blocks=(TextBlock.from_folder(path), CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path)
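# As a quick check (a sketch), you can display a couple of samples to verify the result:
dls.show_batch(max_n=2)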
Example Using the Mid-Level API
# A Siamese model takes two images and determines whether they belong to the same class (here, the same pet breed) or not.
from fastai.vision.all import *
path = untar_data(URLs.PETS)
files = get_image_files(path/"images")
If you want to use show, you need to define a class for your data and implement a show method on it:
class SiameseImage(fastuple):
    def show(self, ctx=None, **kwargs):
        img1,img2,same_breed = self
        if not isinstance(img1, Tensor):
            if img2.size != img1.size: img2 = img2.resize(img1.size)
            t1,t2 = tensor(img1),tensor(img2)
            t1,t2 = t1.permute(2,0,1),t2.permute(2,0,1)
        else: t1,t2 = img1,img2
        line = t1.new_zeros(t1.shape[0], t1.shape[1], 10)
        return show_image(torch.cat([t1,line,t2], dim=2),
                          title=same_breed, ctx=ctx)
img = PILImage.create(files[0])
s = SiameseImage(img, img, True)
s.show();
As we saw earlier, an important property of Transforms is that they dispatch over tuples and their subclasses.
That is precisely why we subclass fastuple here: any transform that works on images can be applied to our
SiameseImage, and it will be applied to each image in the tuple:
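For instance, applying an image transform such as Resize directly to our SiameseImage resizes both images and leaves the boolean label untouched (a quick sketch using the s created above):
s2 = Resize(224)(s)
s2.show();
With that in place, we can write the Transform that builds the Siamese pairs from our list of files: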
def label_func(fname):
    return re.match(r'^(.*)_\d+.jpg$', fname.name).groups()[0]
class SiameseTransform(Transform):
    def __init__(self, files, label_func, splits):
        self.labels = files.map(label_func).unique()
        self.lbl2files = {l: L(f for f in files if label_func(f) == l)
                          for l in self.labels}
        self.label_func = label_func
        self.valid = {f: self._draw(f) for f in files[splits[1]]}
    def encodes(self, f):
        f2,t = self.valid.get(f, self._draw(f))
        img1,img2 = PILImage.create(f),PILImage.create(f2)
        return SiameseImage(img1, img2, t)
    def _draw(self, f):
        same = random.random() < 0.5
        cls = self.label_func(f)
        if not same:
            cls = random.choice(L(l for l in self.labels if l != cls))
        return random.choice(self.lbl2files[cls]),same
splits = RandomSplitter()(files)
tfm = SiameseTransform(files, label_func, splits)
tfm(files[0]).show();
tls = TfmdLists(files, tfm, splits=splits)
show_at(tls.valid, 0);
dls = tls.dataloaders(after_item=[Resize(224), ToTensor],
                      after_batch=[IntToFloatTensor, Normalize.from_stats(*imagenet_stats)])
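# To make sure everything works end to end, grab one batch (a sketch; the tensor shapes depend on the batch size):
b = dls.one_batch()
len(b), b[0].shape  # a tuple of (first images, second images, labels), each batched together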