Mainly original notebook. Some notes added.
import fastbook
fastbook.setup_book()
from fastbook import *
%config Completer.use_jedi = False
def get_data(url, presize, resize):
    path = untar_data(url)
    return DataBlock(
        blocks=(ImageBlock, CategoryBlock), get_items=get_image_files,
        splitter=GrandparentSplitter(valid_name='val'),
        get_y=parent_label, item_tfms=Resize(presize),
        batch_tfms=[*aug_transforms(min_scale=0.5, size=resize),
                    Normalize.from_stats(*imagenet_stats)],
    ).dataloaders(path, bs=128)
dls = get_data(URLs.IMAGENETTE_160, 160, 128)
dls.show_batch(max_n=4)
Average Pooling:
def avg_pool(x): return x.mean((2,3))
Note: need to understand how x.mean((2,3)) works.
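A quick sketch of what that call does, using NumPy here (whose axis argument matches PyTorch's dim): for activations shaped (batch, channels, height, width), taking the mean over dims 2 and 3 collapses each feature map to one number per channel.

```python
import numpy as np

def avg_pool(x):
    # Mean over the two spatial axes (height=2, width=3):
    # (bs, ch, h, w) -> (bs, ch), one number per feature map.
    return x.mean(axis=(2, 3))

x = np.arange(2 * 3 * 4 * 4, dtype=float).reshape(2, 3, 4, 4)
pooled = avg_pool(x)
print(pooled.shape)   # (2, 3)
print(pooled[0, 0])   # mean of x[0, 0, :, :], i.e. mean of 0..15 = 7.5
```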
def block(ni, nf): return ConvLayer(ni, nf, stride=2)
def get_model():
    return nn.Sequential(
        block(3, 16),
        block(16, 32),
        block(32, 64),
        block(64, 128),
        block(128, 256),
        nn.AdaptiveAvgPool2d(1),
        Flatten(),
        nn.Linear(256, dls.c))
Note: breakpoint usage: https://youtu.be/2AdGJVtP3ak?t=1757
def get_learner(m):
    return Learner(dls, m, loss_func=nn.CrossEntropyLoss(), metrics=accuracy
                  ).to_fp16()
learn = get_learner(get_model())
learn.lr_find()
SuggestedLRs(valley=tensor(0.0006))
learn.fit_one_cycle(5, .0007)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 2.031526 | 1.935686 | 0.292994 | 00:05 |
1 | 1.730698 | 1.553754 | 0.493758 | 00:05 |
2 | 1.510944 | 1.496508 | 0.507516 | 00:05 |
3 | 1.377473 | 1.410298 | 0.551338 | 00:05 |
4 | 1.306587 | 1.341539 | 0.581146 | 00:05 |
The authors of the ResNet paper were the first to make a very important leap:
: Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model.
As this is an academic paper this process is described in a rather inaccessible way, but the concept is actually very simple: start with a 20-layer neural network that is trained well, and add another 36 layers that do nothing at all (for instance, they could be linear layers with a single weight equal to 1, and bias equal to 0). The result will be a 56-layer network that does exactly the same thing as the 20-layer network, proving that there are always deep networks that should be at least as good as any shallow network. But for some reason, SGD does not seem able to find them.
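The "solution by construction" argument can be sketched with scalars standing in for layers (a toy model, not real network code): each extra "layer" is y = w*x + b with w=1 and b=0, so appending any number of them changes nothing.

```python
# Toy sketch of the construction: a "layer" with weight 1 and bias 0
# is the identity, so extra identity layers leave the output unchanged.
def identity_layer(x, w=1.0, b=0.0):
    return w * x + b

def deep_model(x, n_extra=36):
    out = x * 2.0 + 1.0          # stand-in for the trained 20-layer model
    for _ in range(n_extra):     # 36 extra layers that do nothing at all
        out = identity_layer(out)
    return out

print(deep_model(3.0))           # 7.0, same as the "shallow" model alone
```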
jargon: Identity mapping: Returning the input without changing it at all. This process is performed by an identity function.
Actually, there is another way to create those extra 36 layers, which is much more interesting. What if we replaced every occurrence of conv(x) with x + conv(x), where conv is the function from the previous chapter that adds a second convolution, then a batchnorm layer, then a ReLU? Furthermore, recall that batchnorm does gamma*y + beta. What if we initialized gamma to zero for every one of those final batchnorm layers? Then our conv(x) for those extra 36 layers will always be equal to zero, which means x+conv(x) will always be equal to x.
What has that gained us? The key thing is that those 36 extra layers, as they stand, are an identity mapping, but they have parameters, which means they are trainable. So, we can start with our best 20-layer model, add these 36 extra layers which initially do nothing at all, and then fine-tune the whole 56-layer model. Those extra 36 layers can then learn the parameters that make them most useful.
The ResNet paper actually proposed a variant of this, which is to instead "skip over" every second convolution, so effectively we get x+conv2(conv1(x)). This is shown by the diagram in <<resnet_block>> (from the paper).
Important: BatchNorm again. I need to learn how gamma and beta work, and why.
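The gamma*y + beta step can be sketched in NumPy. This is a simplified 1-D version of what batchnorm does per channel (the real layer also tracks running statistics): normalize the activations, then scale by the learnable gamma and shift by the learnable beta. Setting gamma to zero makes the layer output zero, which is exactly why BatchZero init makes x + conv(x) equal x.

```python
import numpy as np

def batchnorm(x, gamma, beta, eps=1e-5):
    # Normalize over the batch, then scale by gamma and shift by beta.
    mean, var = x.mean(), x.var()
    y = (x - mean) / np.sqrt(var + eps)
    return gamma * y + beta

x = np.array([1.0, 2.0, 3.0, 4.0])
print(batchnorm(x, gamma=1.0, beta=0.0))  # roughly standardized values
print(batchnorm(x, gamma=0.0, beta=0.0))  # all zeros: the conv path vanishes
```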
: Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x)−x. The original mapping is recast into F(x)+x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.
Note: Again, this is rather inaccessible prose, so let's try to restate it in plain English! If the outcome of a given layer is x, when using a ResNet block that returns y = x+block(x) we're not asking the block to predict y, we are asking it to predict the difference between y and x. So the job of those blocks isn't to predict certain features, but to minimize the error between x and the desired y. A ResNet is, therefore, good at learning about slight differences between doing nothing and passing through a block of two convolutional layers (with trainable weights). This is how these models got their name: they're predicting residuals (reminder: "residual" is prediction minus target).
One key concept that both of these two ways of thinking about ResNets share is the idea of ease of learning. This is an important theme. Recall the universal approximation theorem, which states that a sufficiently large network can learn anything. This is still true, but there turns out to be a very important difference between what a network can learn in principle, and what it is easy for it to learn with realistic data and training regimes. Many of the advances in neural networks over the last decade have been like the ResNet block: the result of realizing how to make something that was always possible actually feasible.
note: True Identity Path: The original paper didn't actually do the trick of using zero for the initial value of gamma in the last batchnorm layer of each block; that came a couple of years later. So, the original version of ResNet didn't quite begin training with a truly identity path through the ResNet blocks, but nonetheless having the ability to "navigate through" the skip connections did indeed make it train better. Adding the batchnorm gamma init trick made the models train at even higher learning rates.
Here's the definition of a simple ResNet block (where norm_type=NormType.BatchZero causes fastai to init the gamma weights of the last batchnorm layer to zero):
A nice explanation of F(x)+x is here; roll it back a couple of minutes.
youtube: https://youtu.be/2AdGJVtP3ak?t=3198
class ResBlock(Module):
    def __init__(self, ni, nf):
        self.convs = nn.Sequential(
            ConvLayer(ni,nf),
            ConvLayer(nf,nf, norm_type=NormType.BatchZero))

    def forward(self, x): return x + self.convs(x)
def _conv_block(ni,nf,stride):
    return nn.Sequential(
        ConvLayer(ni, nf, stride=stride),
        ConvLayer(nf, nf, act_cls=None, norm_type=NormType.BatchZero))

class ResBlock(Module):
    def __init__(self, ni, nf, stride=1):
        self.convs = _conv_block(ni,nf,stride)
        self.idconv = noop if ni==nf else ConvLayer(ni, nf, 1, act_cls=None)
        self.pool = noop if stride==1 else nn.AvgPool2d(2, ceil_mode=True)

    def forward(self, x):
        return F.relu(self.convs(x) + self.idconv(self.pool(x)))
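The idconv and pool exist because x and convs(x) can only be added when their shapes match: a 1x1 conv fixes the channel count when ni != nf, and the average pool fixes the spatial size when stride == 2. A pure-Python check of the spatial sizes (assuming 3x3 convs with padding 1, as ConvLayer uses by default) shows the pooled skip path lines up with the strided conv path, and why ceil_mode=True matters for odd sizes:

```python
def conv_out_size(n, kernel=3, stride=1, padding=1):
    # Standard conv output-size formula: floor((n + 2p - k) / s) + 1
    return (n + 2 * padding - kernel) // stride + 1

def avgpool_out_size(n, kernel=2, stride=2):
    # nn.AvgPool2d(2, ceil_mode=True): ceil((n - k) / s) + 1
    return -((n - kernel) // -stride) + 1

print(conv_out_size(32, stride=2))   # 16: the conv path halves the size
print(avgpool_out_size(32))          # 16: the pooled skip path matches
print(conv_out_size(33, stride=2))   # 17
print(avgpool_out_size(33))          # 17: ceil_mode keeps odd sizes aligned
```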
def block(ni,nf): return ResBlock(ni, nf, stride=2)
learn = get_learner(get_model())
learn.fit_one_cycle(5, 0.0007)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 2.103873 | 1.946268 | 0.327134 | 00:08 |
1 | 1.849293 | 1.683211 | 0.445860 | 00:08 |
2 | 1.621537 | 1.475794 | 0.517452 | 00:08 |
3 | 1.427451 | 1.351159 | 0.579363 | 00:08 |
4 | 1.340495 | 1.328351 | 0.581656 | 00:08 |
def block(ni, nf):
return nn.Sequential(ResBlock(ni, nf, stride=2), ResBlock(nf, nf))
learn = get_learner(get_model())
learn.fit_one_cycle(5, 0.0007)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 2.059933 | 1.906990 | 0.352357 | 00:12 |
1 | 1.797654 | 1.609203 | 0.473885 | 00:11 |
2 | 1.509096 | 1.369431 | 0.544713 | 00:11 |
3 | 1.287784 | 1.194822 | 0.624204 | 00:11 |
4 | 1.180562 | 1.157007 | 0.636433 | 00:11 |
def _resnet_stem(*sizes):
    return [
        ConvLayer(sizes[i], sizes[i+1], 3, stride = 2 if i==0 else 1)
        for i in range(len(sizes)-1)
    ] + [nn.MaxPool2d(kernel_size=3, stride=2, padding=1)]
_resnet_stem(3,32,32,64)
[ConvLayer(
   (0): Conv2d(3, 32, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
   (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (2): ReLU()
 ), ConvLayer(
   (0): Conv2d(32, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (2): ReLU()
 ), ConvLayer(
   (0): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
   (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
   (2): ReLU()
 ), MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)]
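The stem quarters the spatial size before any ResBlock runs: one stride-2 conv, two stride-1 convs, then a stride-2 maxpool. Tracing a 128-pixel training image (the resize used above) through it, with the same output-size formula as before:

```python
def out_size(n, kernel=3, stride=1, padding=1):
    # floor((n + 2p - k) / s) + 1, for both convs and the maxpool here
    return (n + 2 * padding - kernel) // stride + 1

h = 128                      # training images are resized to 128
h = out_size(h, stride=2)    # first stem conv, stride 2 -> 64
h = out_size(h)              # second stem conv, stride 1 -> 64
h = out_size(h)              # third stem conv, stride 1 -> 64
h = out_size(h, stride=2)    # MaxPool2d(3, stride=2, padding=1) -> 32
print(h)                     # 32
```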
class ResNet(nn.Sequential):
    def __init__(self, n_out, layers, expansion=1):
        stem = _resnet_stem(3,32,32,64)
        self.block_szs = [64, 64, 128, 256, 512]
        for i in range(1,5): self.block_szs[i] *= expansion
        blocks = [self._make_layer(*o) for o in enumerate(layers)]
        super().__init__(*stem, *blocks,
                         nn.AdaptiveAvgPool2d(1), Flatten(),
                         nn.Linear(self.block_szs[-1], n_out))

    def _make_layer(self, idx, n_layers):
        stride = 1 if idx==0 else 2
        ch_in,ch_out = self.block_szs[idx:idx+2]
        return nn.Sequential(*[
            ResBlock(ch_in if i==0 else ch_out, ch_out, stride if i==0 else 1)
            for i in range(n_layers)
        ])
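To see what ResNet(dls.c, [2,2,2,2]) actually builds, the channel/stride plan of _make_layer can be replicated in pure Python (same logic, no torch needed): four stages, each downsampling once except the first.

```python
def stage_plan(layers, expansion=1):
    # Mirrors ResNet.__init__ / _make_layer: (n_blocks, ch_in, ch_out, stride)
    block_szs = [64, 64, 128, 256, 512]
    for i in range(1, 5):
        block_szs[i] *= expansion
    plan = []
    for idx, n_layers in enumerate(layers):
        stride = 1 if idx == 0 else 2
        ch_in, ch_out = block_szs[idx:idx + 2]
        plan.append((n_layers, ch_in, ch_out, stride))
    return plan

for stage in stage_plan([2, 2, 2, 2]):
    print(stage)
# (2, 64, 64, 1), (2, 64, 128, 2), (2, 128, 256, 2), (2, 256, 512, 2)
```

Only the first block of each stage changes channels and strides; the rest are ch_out -> ch_out at stride 1.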
rn = ResNet(dls.c, [2,2,2,2])
learn = get_learner(rn)
learn.fit_one_cycle(5,.0007)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 1.856616 | 1.625414 | 0.467516 | 00:10 |
1 | 1.412103 | 1.376660 | 0.543694 | 00:10 |
2 | 1.141276 | 1.159298 | 0.631847 | 00:10 |
3 | 0.952444 | 0.930670 | 0.702930 | 00:10 |
4 | 0.854764 | 0.873461 | 0.727389 | 00:10 |
def _conv_block(ni,nf,stride):
    return nn.Sequential(
        ConvLayer(ni, nf//4, 1),
        ConvLayer(nf//4, nf//4, stride=stride),
        ConvLayer(nf//4, nf, 1, act_cls=None, norm_type=NormType.BatchZero))
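This is the bottleneck design: instead of two 3x3 convs at full width, a 1x1 conv reduces to nf//4 channels, a 3x3 conv runs at that reduced width, and a 1x1 conv expands back to nf. Counting conv weights only (ignoring batchnorm and bias, a back-of-the-envelope sketch) shows why this lets deeper models afford 4x more filters:

```python
def plain_block_params(ni, nf, k=3):
    # Two 3x3 convs: ni -> nf and nf -> nf (weights only)
    return ni * nf * k * k + nf * nf * k * k

def bottleneck_params(ni, nf, k=3):
    # 1x1 reduce to nf//4, 3x3 at nf//4, 1x1 expand back to nf
    mid = nf // 4
    return ni * mid + mid * mid * k * k + mid * nf

print(plain_block_params(256, 256))   # 1179648
print(bottleneck_params(256, 256))    # 69632, roughly 17x fewer weights
```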
dls = get_data(URLs.IMAGENETTE_320, presize=320, resize=224)
File downloaded is broken. Remove /home/niyazi/.fastai/archive/imagenette2-320.tgz and try again.
rn = ResNet(dls.c, [3,4,6,3], 4)
learn = get_learner(rn)
learn.fit_one_cycle(20, .0006)
Training failed on the first batch. Abbreviated traceback: fit_one_cycle -> fit -> one_batch -> model forward, ending in ResBlock.forward:

----> return F.relu(self.convs(x) + self.idconv(self.pool(x)))

RuntimeError: Unable to find a valid cuDNN algorithm to run convolution