#!/usr/bin/env python
# coding: utf-8

# ### HW 1: Getting started with numpy, matplotlib, pandas and Kaggle

# __Total: 25 pts__

# Start date: Tuesday Sept. 3
# Due date: Tuesday Sept. 10

# If you don't already have a version of anaconda installed, start by downloading anaconda and installing it (see for example [here](https://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/)). When working on the exercises below, keep in mind that extensive Python documentation is available online. Don't hesitate to check the documentation and examples related to the functions you want to use.

# __1. (4pts) Numerical Linear Algebra: Numpy__
# 
# - Start by building a 10 by 10 matrix of random Gaussian entries. Then compute the two largest eigenvalues of the matrix.
# - Reshape the matrix that you built above first into a 2 by 50 array (call it $v$), then into a single vector (call it $w$). Return the vector obtained by sorting the elements of $w$ in descending order.
# - Generate two random vectors (you can choose the distribution you use to generate the entries). Let us call those vectors $v1$ and $v2$. Stack those vectors vertically, then horizontally. Store the respective results in two matrices $A$ and $B$.
# - Do the same with two random arrays $C_1 \in \mathbb{R}^{n\times n}$ and $C_2 \in \mathbb{R}^{n\times n}$. Store the results in the variables $Cv$ and $Ch$.

# In[ ]:


# put your code here


# __2. (2pts) Towards multiclass classification: one-hot encoding__
# 
# - Generate a vector (let us call it $v$) of integers taking values between 0 and 9.
# - Then build the vector corresponding to the one-hot encoding of each entry in $v$. A one-hot encoding represents each categorical value (here the digits 0 to 9 in your vector $v$) by a binary sequence in which only one entry (for example the one corresponding to the digit that is encoded) is nonzero.

# In[ ]:


# put your code here


# __3. (6pts) Towards regression: sampling and matplotlib__

# __3a. (2pts) One dimensional__ In this exercise, we will successively generate points according to a function, sample pairs $(t, f)$ from them, and plot the results.
# 
# - Using the 'linspace' function from numpy, generate $1000$ pairs $(t, f(t) = \frac{1}{1+e^{-t}})$ for values of $t$ between $-6$ and $6$. What does the function look like?
# - Generate 100 random pairs $(t_i, f_i)$ from the curve. Then plot the points $(t_i, f_i)$ on top of the line $(t, f(t))$ using matplotlib (you can choose how you randomly generate the points).
# - From the pairs

# In[ ]:


# put your code here


# __3b. (4pts) The two dimensional hyperplane__
# 
# - As an extension of the previous case, we now want to generate triples $(x, y, t)$ according to the following hyperplane:
# 
# $$t \equiv\pi(x, y) = x + y +1$$
# 
# using _Axes3D_, _matplotlib_ and _pyplot_, as well as the _meshgrid( )_ and _arange( )_ functions from numpy and the _plot_surface( )_ and _scatter( )_ methods of the 3D axes,
# 
# - Generate a regular grid of points $(x, y)$ covering the domain $[-20,20]\times [-20,20]$, say 200 by 200.
# - As in the 1D case, we now want to generate noisy samples that lie on the plane on average. Start by generating $(50\times 50)$ triples $(x,y,\pi(x,y))$ covering the domain $[-20,20]\times [-20,20]$.
# - Perturb the $50\times 50$ samples by adding random Gaussian noise of amplitude no larger than $0.1$.
# - Finally, using the _scatter( )_ function, plot the noisy samples on top of the plane.
# 

# In[ ]:


# put your code here


# __4. (3pts) Getting started with Pandas and Kaggle datasets__

# __4a__ Download the car dataset on [Kaggle](https://www.kaggle.com/toramky/automobile-dataset/downloads/automobile-dataset.zip/2) and open this dataset with pandas.
# 
# - Display a few (5-10) rows from the pandas data frame.
# - Find the brand that has the highest average price across cars.
# - Sort the cars according to their horsepower and return the corresponding pandas data frame. Display the first 10 lines from the frame.
# 

# In[ ]:


# put your code here. Don't hesitate to check the online
# documentation on the pandas library
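
# As a starting point for question 4a, here is a minimal sketch of the pandas calls involved, run on a small made-up table rather than the Kaggle file. The column names `make`, `price` and `horsepower`, and the use of `"?"` as a missing-value marker, are assumptions chosen for illustration; check them against the CSV you actually download.

```python
import pandas as pd

# Toy stand-in for the Kaggle file. The column names and the "?" marker
# for missing values are assumptions, not taken from the real dataset.
cars = pd.DataFrame({
    "make": ["audi", "audi", "bmw", "bmw", "honda"],
    "price": ["13950", "17450", "16430", "?", "7295"],
    "horsepower": ["102", "115", "101", "121", "76"],
})

# Convert text columns to numbers; entries that cannot be parsed become NaN.
for col in ["price", "horsepower"]:
    cars[col] = pd.to_numeric(cars[col], errors="coerce")

# Display the first few rows of the frame.
print(cars.head())

# Average price per brand (mean() skips NaN), then the brand with the largest mean.
mean_price = cars.groupby("make")["price"].mean()
top_brand = mean_price.idxmax()
print(top_brand)

# Sort by horsepower, largest first, and show the first 10 rows.
by_power = cars.sort_values("horsepower", ascending=False)
print(by_power.head(10))
```

# With the real file, the toy `DataFrame` would be replaced by a call to `pd.read_csv` on the downloaded CSV; the remaining steps stay the same if the column names match.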