Open the TensorFlow Playground (https://playground.tensorflow.org) and select the checkerboard pattern on the left as the data set (see Exercise 3.3). As input features, select the two independent variables $x_1$ and $x_2$, and set the noise to $50\%$.
Choose a deep (many layers) and wide (many nodes) network and train it for more than 1000 epochs. Comment on your observations.
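To experiment with the same setting outside the browser, the Playground's checkerboard data can be reproduced approximately in NumPy. The function below is a sketch under assumptions: the exact sampling range and the meaning of the noise slider are not documented in the exercise, so the point range $[-5, 5]^2$ and the jitter scale are illustrative choices, and `make_checkerboard` is a hypothetical helper name.

```python
import numpy as np

def make_checkerboard(n=1000, noise=0.5, seed=0):
    """Sample a 2-D checkerboard (XOR-like) dataset, loosely mimicking
    the TensorFlow Playground: points in [-5, 5]^2 labelled by the sign
    of x1 * x2, then jittered so the classes overlap near the axes."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, size=(n, 2))
    y = np.sign(x[:, 0] * x[:, 1])          # +1 / -1 quadrant labels
    # The noise slider is approximated here as Gaussian jitter applied
    # *after* labelling, so labels near quadrant borders become noisy.
    x = x + rng.normal(scale=noise * 5, size=x.shape)
    return x, y
```

Training a deep, wide network on such noisy data for many epochs drives the training error down while the decision boundary contorts around individual noisy points, which is the overfitting the task asks you to observe.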
For very low regularization rates ($\lambda \rightarrow 0$), the L2 norm penalty barely contributes to the loss, so the network still overfits.
For high regularization rates ($\lambda \gg 0$), the L2 norm penalty dominates the loss, so all adaptive parameters are pushed toward zero.
For moderate regularization rates, no overfitting is observed.
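The effect of the regularization rate on the weight magnitudes can be illustrated with ridge regression, where the L2-penalized solution has a closed form. This is a minimal sketch, not the Playground's training procedure; the data and the $\lambda$ values are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([3.0, -2.0, 1.0, 0.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

def ridge(X, y, lam):
    """Closed-form L2-regularized least squares:
    w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# As lam grows, the norm of the weight vector shrinks monotonically:
for lam in (0.0, 1.0, 100.0, 1e6):
    print(lam, np.linalg.norm(ridge(X, y, lam)))
```

For $\lambda \to 0$ the fit is unregularized; for very large $\lambda$ the weights are driven close to (but not exactly) zero, mirroring the three regimes described above.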
Compare the effects of L1 and L2 regularization.
As observed in Task 1 and Task 2, L2 regularization with a moderate regularization rate pushes the weights toward smaller values, but not exactly to zero.
In contrast, L1 regularization pushes the weights of unimportant features exactly to zero. L1 regularization can therefore, in principle, be used as a feature-selection technique.
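The qualitative difference between the two penalties can be made concrete for the special case of an orthonormal design, where both regularized solutions have closed forms: L2 rescales every weight uniformly, while L1 applies soft-thresholding, which sets small weights exactly to zero. The weight vector below is a made-up example for illustration.

```python
import numpy as np

w_ols = np.array([2.5, -0.8, 0.05, -0.03, 1.2])  # hypothetical unregularized weights
lam = 0.1

# Closed forms for an orthonormal design:
w_l2 = w_ols / (1 + lam)                                     # L2: uniform shrinkage
w_l1 = np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0.0)  # L1: soft-thresholding

print(w_l2)  # every weight is shrunk, but none becomes exactly zero
print(w_l1)  # the small weights 0.05 and -0.03 are set exactly to zero
```

This is why L1 can act as a feature selector: weights whose magnitude falls below the threshold $\lambda$ vanish entirely, and the corresponding inputs drop out of the model.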