import sys
import torch.nn as nn
import torch
import warnings
sys.path.append('/home/jovyan/work/d2l_solutions/notebooks/exercises/d2l_utils/')
import d2l
from torchsummary import summary
warnings.filterwarnings("ignore")
class Alexnet(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.LazyConv2d(96, kernel_size=11, stride=4, padding=1),
            nn.ReLU(), nn.MaxPool2d(kernel_size=3, stride=2),
            nn.LazyConv2d(256, kernel_size=5, padding=2),
            nn.ReLU(), nn.MaxPool2d(kernel_size=3, stride=2),
            nn.LazyConv2d(384, kernel_size=3, padding=1), nn.ReLU(),
            nn.LazyConv2d(384, kernel_size=3, padding=1), nn.ReLU(),
            nn.LazyConv2d(256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2), nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(num_classes)
        )
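The number of parameters in the convolutional layers is $\sum^{layers}(c_i*c_o*k_h*k_w+c_o)$, where $c_i$ and $c_o$ are the input and output channels and $k_h$, $k_w$ the kernel height and width: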
3*96*11*11+96+96*256*5*5+256+256*384*3*3+384+384*384*3*3+384+384*256*3*3+256
3747200
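The number of parameters in the fully connected layers is $\sum^{layers}(x_i*x_o+x_o)$, where $x_i$ and $x_o$ are the input and output widths. The first layer takes the flattened $256*5*5=6400$ features:
256*5*5*4096+4096+4096*4096+4096+4096*10+10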
43040778
The fully connected layers dominate the parameter count: 43,040,778 of 46,787,978 parameters, roughly 92%.
model = Alexnet()
X = torch.randn(1,3, 224, 224)
_ = model(X)
params = {'conv':0, 'lr':0}
for idx, module in enumerate(model.net):
    # Lazy layers become nn.Conv2d / nn.Linear after the first forward pass
    if type(module) not in (nn.Linear, nn.Conv2d):
        continue
    num = sum(p.numel() for p in module.parameters())
    # print(f"Module {idx + 1}: {num} parameters type:{type(module)}")
    if type(module) == nn.Conv2d:
        params['conv'] += num
    else:
        params['lr'] += num
params
{'conv': 3747200, 'lr': 43040778}
summary(model, (3, 224, 224))
----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
================================================================
            Conv2d-1           [-1, 96, 54, 54]          34,944
              ReLU-2           [-1, 96, 54, 54]               0
         MaxPool2d-3           [-1, 96, 26, 26]               0
            Conv2d-4          [-1, 256, 26, 26]         614,656
              ReLU-5          [-1, 256, 26, 26]               0
         MaxPool2d-6          [-1, 256, 12, 12]               0
            Conv2d-7          [-1, 384, 12, 12]         885,120
              ReLU-8          [-1, 384, 12, 12]               0
            Conv2d-9          [-1, 384, 12, 12]       1,327,488
             ReLU-10          [-1, 384, 12, 12]               0
           Conv2d-11          [-1, 256, 12, 12]         884,992
             ReLU-12          [-1, 256, 12, 12]               0
        MaxPool2d-13            [-1, 256, 5, 5]               0
          Flatten-14                 [-1, 6400]               0
           Linear-15                 [-1, 4096]      26,218,496
             ReLU-16                 [-1, 4096]               0
          Dropout-17                 [-1, 4096]               0
           Linear-18                 [-1, 4096]      16,781,312
             ReLU-19                 [-1, 4096]               0
          Dropout-20                 [-1, 4096]               0
           Linear-21                   [-1, 10]          40,970
================================================================
Total params: 46,787,978
Trainable params: 46,787,978
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.57
Forward/backward pass size (MB): 10.22
Params size (MB): 178.48
Estimated Total Size (MB): 189.28
----------------------------------------------------------------
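The spatial sizes in the summary (54, 26, 12, 5) and the flattened width of 6400 follow from the standard output-size rule $\lfloor (n + 2p - k)/s \rfloor + 1$. The sketch below traces a 224-pixel input through the layers defined in `Alexnet`; the helper name `out_size` and the `layers` list are illustrative and not part of `d2l`.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Output height/width of a conv or pooling layer on an n x n input."""
    return (n + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding), copied from the Alexnet definition
layers = [
    ("conv1", 11, 4, 1), ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2), ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1), ("conv4", 3, 1, 1), ("conv5", 3, 1, 1),
    ("pool3", 3, 2, 0),
]

n = 224
sizes = []
for name, k, s, p in layers:
    n = out_size(n, k, s, p)
    sizes.append(n)
    print(f"{name}: {n} x {n}")

# The last conv layer has 256 channels, so Flatten yields 256 * 5 * 5 features
flat = 256 * n * n
print("flattened features:", flat)
```

The flattened width is what the first `LazyLinear(4096)` infers as its input size.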
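The computational cost of the convolutional layers, counted in multiply-accumulate operations, is $\sum^{layers}(c_i*c_o*k_h*k_w*h_o*w_o)$, where $h_o$ and $w_o$ are the output height and width: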
3*96*11*11*54*54+96*256*5*5*26*26+256*384*3*3*12*12+384*384*3*3*12*12+384*256*3*3*12*12
962858112
The computational cost of the fully connected layers is $\sum^{layers}(x_i*x_o+x_o)$, which equals their parameter count because each weight is used once per example:
256*5*5*4096+4096+4096*4096+4096+4096*10+10
43040778
x = torch.randn(1,3, 224, 224)
params = {'conv':0, 'lr':0}
for idx, module in enumerate(model.net):
    c_i = x.shape[1]  # input channels seen by this module
    x = module(x)
    if type(module) == nn.Conv2d:
        k = [p.shape for p in module.parameters()]  # k[0] is the weight shape
        c_o, h_o, w_o = x.shape[1], x.shape[2], x.shape[3]
        params['conv'] += c_i*c_o*h_o*w_o*k[0][-1]*k[0][-2]
    if type(module) == nn.Linear:
        params['lr'] += sum(p.numel() for p in module.parameters())
params
{'conv': 962858112, 'lr': 43040778}
X = torch.randn(1,3, 224, 224)
_ = model(X)
total_params = sum(p.numel() for p in model.parameters())
print("Total parameters:", total_params)
Total parameters: 46787978
Memory characteristics, including read and write bandwidth, latency, and size, have a significant impact on computation in both training and inference of neural networks. These factors can influence the overall performance, efficiency, and speed of the computation. Here's how these memory aspects affect computation and any potential differences between training and inference:
Read and Write Bandwidth:
Latency:
Memory Size:
Training and Inference Differences:
Memory Hierarchy:
In summary, memory characteristics significantly influence neural network computation. High bandwidth, low latency, sufficient memory size, and efficient memory hierarchy are all essential for achieving optimal performance in both training and inference. While there might be nuances in how these aspects affect training and inference, addressing memory-related bottlenecks is crucial for overall efficiency and speed in deep learning computations.
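To make the bandwidth point concrete, the sketch below compares how much arithmetic each byte of weights supports in the convolutional and fully connected parts of AlexNet, using the operation and parameter counts computed earlier in this notebook. It assumes float32 weights, that each weight is read from memory once per batch, and it ignores activation traffic; `macs_per_byte` is an illustrative helper, not a library function.

```python
BYTES_PER_WEIGHT = 4  # assumption: float32 weights

def macs_per_byte(macs_per_example, num_weights, batch_size=1):
    """Multiply-accumulates performed per byte of weights read.

    Assumes weights are fetched once per batch and reused across examples.
    """
    return macs_per_example * batch_size / (num_weights * BYTES_PER_WEIGHT)

# Counts taken from the cells above
CONV_MACS, CONV_PARAMS = 962_858_112, 3_747_200
FC_MACS, FC_PARAMS = 43_040_778, 43_040_778

conv_intensity = macs_per_byte(CONV_MACS, CONV_PARAMS)
fc_intensity = macs_per_byte(FC_MACS, FC_PARAMS)
fc_intensity_b128 = macs_per_byte(FC_MACS, FC_PARAMS, batch_size=128)

print(f"conv, batch 1:  {conv_intensity:.1f} MACs/byte")
print(f"fc, batch 1:    {fc_intensity:.2f} MACs/byte")
print(f"fc, batch 128:  {fc_intensity_b128:.1f} MACs/byte")
```

Under these assumptions the fully connected layers do very little work per byte at batch size 1, so single-example inference tends to be limited by weight bandwidth, while large training batches amortize each weight read over many examples.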
Optimizing the trade-off between computation and memory bandwidth in chip design is a complex task that involves careful consideration of various factors. The goal is to achieve a balance between computation speed, memory access efficiency, power consumption, chip size, and other performance metrics. Here's a step-by-step approach to optimizing this trade-off:
Define Performance Goals:
Profile Workloads:
Architectural Exploration:
Memory Hierarchy Design:
Power Efficiency:
Chip Area and Integration:
Memory Bandwidth Enhancement:
Parallelism and Pipelining:
Simulation and Modeling:
Feedback Loop:
Validation and Testing:
Real-World Constraints:
Ultimately, the optimization process involves a careful consideration of performance, power, area, and cost factors. Collaboration among chip architects, designers, and domain experts is essential to make informed decisions and strike the right balance between computation and memory bandwidth.
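A common first-order way to reason about this balance is the roofline model: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. The sketch below uses made-up hardware numbers (10 TFLOP/s peak, 500 GB/s bandwidth) purely for illustration; they do not describe any real chip.

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, flops_per_byte):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak_flops, bandwidth_bytes_per_s * flops_per_byte)

def ridge_point(peak_flops, bandwidth_bytes_per_s):
    """Arithmetic intensity above which a kernel becomes compute-bound."""
    return peak_flops / bandwidth_bytes_per_s

# Hypothetical chip, chosen only to illustrate the trade-off
PEAK = 10e12        # FLOP/s
BANDWIDTH = 500e9   # bytes/s

ridge = ridge_point(PEAK, BANDWIDTH)
low = attainable_flops(PEAK, BANDWIDTH, 0.5)    # memory-bound kernel
high = attainable_flops(PEAK, BANDWIDTH, 128)   # compute-bound kernel

print(f"ridge point: {ridge:.0f} FLOP/byte")
print(f"intensity 0.5 -> {low / 1e12:.2f} TFLOP/s")
print(f"intensity 128 -> {high / 1e12:.2f} TFLOP/s")
```

Adding compute units only helps workloads to the right of the ridge point; workloads to the left benefit from more bandwidth or larger on-chip caches instead.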
Aging Benchmark: AlexNet, while pioneering, was introduced in 2012, and its architecture might not represent the state-of-the-art in terms of efficiency and accuracy compared to more recent models. Newer models, architectures, and techniques have been developed that surpass the performance of AlexNet on various tasks.
Advancements in Architecture: Over the years, more advanced architectures like VGG, ResNet, Inception, and Transformer-based models (BERT, GPT, etc.) have been developed and have become more popular for benchmarking and research. These architectures often achieve better accuracy and efficiency than AlexNet.
Domain-Specific Models: Depending on the application domain, engineers might prefer to benchmark models that are tailored to specific tasks. For instance, models like EfficientNet for efficient image classification or object detection networks like Faster R-CNN might be more relevant and commonly used for specific tasks.
Diverse Benchmarks: With the increase in complexity and diversity of tasks, researchers often benchmark models across a wide range of datasets and tasks. This ensures that the performance of a model is tested across various scenarios rather than just focusing on a single benchmark.
Focus on Real-World Applications: Engineers are increasingly interested in benchmarking models that demonstrate their performance in real-world applications, such as medical image analysis, autonomous driving, natural language understanding, and more. This shift in focus might result in a move away from using AlexNet.
Evolving Hardware and Software: Performance benchmarks are often influenced by the underlying hardware (GPUs, TPUs) and software (deep learning frameworks, optimizations). As hardware and software landscapes evolve, engineers tend to benchmark newer models that are optimized for the latest hardware and software technologies.
Research Direction: The field of deep learning research has expanded, and engineers are exploring various directions such as model interpretability, robustness, fairness, and more. These aspects might take precedence over revisiting older models like AlexNet for benchmarking.
These points reflect trends up to September 2021. The field moves quickly, so recent conference proceedings, research papers, and published benchmark suites are the best source for current practice.
model = Alexnet(lr=0.01)
data = d2l.FashionMNIST(batch_size=128, resize=(224, 224),num_workers=0)
trainer = d2l.Trainer(max_epochs=20)
trainer.fit(model, data)
KeyboardInterrupt
class SmallAlexnet(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.LazyConv2d(256, kernel_size=3, padding=1), nn.ReLU(),
            nn.LazyConv2d(256, kernel_size=3, padding=1), nn.ReLU(),
            nn.LazyConv2d(256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.LazyConv2d(512, kernel_size=3, padding=1), nn.ReLU(),
            nn.LazyConv2d(512, kernel_size=3, padding=1), nn.ReLU(),
            nn.LazyConv2d(256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2), nn.Flatten(),
            nn.LazyLinear(1024), nn.ReLU(),
            nn.LazyLinear(num_classes)
        )
class LeNet(d2l.Classifier):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.LazyConv2d(6, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.LazyConv2d(16, kernel_size=5),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Flatten(),
            nn.LazyLinear(120),
            nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(84),
            nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(num_classes))
model = LeNet(lr=0.01)
data = d2l.FashionMNIST(batch_size=128, resize=(28, 28))
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
trainer.fit(model, data)
X,y = next(iter(data.get_dataloader(False)))
X = X.to('cuda')
y = y.to('cuda')
y_hat = model(X)
print(f'acc: {model.accuracy(y_hat,y).item():.2f}')
Yes, it is possible to intentionally make AlexNet overfit by manipulating certain features of the training process. Overfitting occurs when a model learns to perform exceptionally well on the training data but fails to generalize to unseen data. To make AlexNet overfit, you can modify or manipulate the following features:
Limited Training Data: Reduce the size of the training dataset significantly. With a small training dataset, the model can easily memorize the data, leading to overfitting. For example, if you have 10 classes, consider training with just a handful of images for each class.
Complex Model: Increase the model's capacity by adding more layers, units, or filters. This can lead to the model having more parameters than necessary to fit the small training data, resulting in overfitting.
Reduce Regularization: Decrease or entirely remove regularization techniques such as dropout, weight decay, and data augmentation. Regularization helps prevent overfitting by adding noise or constraints to the training process.
Learning Rate: Pick a learning rate that lets the training loss converge close to zero. A very small learning rate mainly slows convergence and does not cause overfitting by itself.
More Epochs: Train for many epochs without early stopping. Overfitting shows up when the training loss keeps falling while the validation loss rises; stopping after only a few epochs tends to leave the model underfit instead.
Label Noise: Clean labels do not cause overfitting. A high-capacity network can memorize even randomly assigned labels, which gives a clear demonstration: training accuracy climbs while test accuracy stays near chance.
Lack of Augmentation: Avoid data augmentation techniques like random rotations, flips, and cropping. Data augmentation introduces variability and helps the model generalize better.
Optimizer Choice: Adaptive optimizers such as Adam with default settings often fit the training data faster than plain SGD, so the overfitting regime is reached in fewer epochs. The optimizer alone is not the cause.
By combining these changes, you can create a scenario in which AlexNet overfits the training data. Keep in mind that this exercise is done for demonstration purposes and to understand the behavior of models under different conditions. In practice, overfitting is undesirable, and the goal is to achieve a model that generalizes well to unseen data. Regularization techniques, appropriate dataset splitting, and hyperparameter tuning are used to avoid overfitting and build models with strong generalization performance.
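The most direct of these levers is shrinking the training set. The sketch below picks a handful of example indices per class from a list of labels; `few_per_class` is a hypothetical helper and the label list is synthetic stand-in data, not Fashion-MNIST.

```python
import random
from collections import defaultdict

def few_per_class(labels, per_class, seed=0):
    """Return sorted indices with at most `per_class` examples for each label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):
        idxs = by_class[label]
        chosen.extend(rng.sample(idxs, min(per_class, len(idxs))))
    return sorted(chosen)

# Synthetic labels: 1000 examples spread evenly over 10 classes
labels = [i % 10 for i in range(1000)]
subset = few_per_class(labels, per_class=5)

counts = defaultdict(int)
for idx in subset:
    counts[labels[idx]] += 1
print(len(subset), dict(counts))
```

With a PyTorch dataset, indices like these could be passed to `torch.utils.data.Subset(dataset, subset)` to build the reduced training set; training `Alexnet` on it with dropout removed would then be expected to drive training accuracy far above validation accuracy.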