HW 1: Getting started with numpy, matplotlib, pandas and Kaggle

Total: 25 pts

Start date: Tuesday Sept. 3
Due date: Tuesday Sept. 10

If you don't already have a version of Anaconda installed, start by downloading and installing it (see for example here). When working on the exercises below, keep in mind that extensive Python documentation is available online. Don't hesitate to check the documentation and examples for the functions you want to use.

1. (4pts) Numerical Linear Algebra: Numpy

  • Start by building a 10 by 10 matrix of random Gaussian entries. Then compute the two largest eigenvalues of the matrix.
  • Reshape the matrix that you built above first into a 2 by 50 array (call it $v$) and then into a single vector (call it $w$). Return the vector obtained by sorting the elements of $w$ in descending order.
  • Generate two random vectors (you can choose the distribution you use to generate the entries). Let us call those vectors $v1$ and $v2$. Stack those vectors vertically then horizontally. Store the respective results in two matrices $A$ and $B$.
  • Do the same with two random arrays $C_1 \in \mathbb{R}^{n\times n}$ and $C_2 \in \mathbb{R}^{n\times n}$. Store the results in the variables $Cv$ and $Ch$.
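
As a rough illustration of the numpy calls these steps rely on (not a reference solution), the sketch below works on a small 4 by 4 matrix. The matrix size, the seed, and the reading of "largest" as "largest in modulus" are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily, for reproducibility

# A small Gaussian matrix (4 by 4 here, purely for illustration).
M = rng.standard_normal((4, 4))

# Eigenvalues of a non-symmetric real matrix are complex in general,
# so "largest" is interpreted here as largest modulus (an assumption).
eigvals = np.linalg.eigvals(M)
top_two = eigvals[np.argsort(np.abs(eigvals))[::-1][:2]]

# Reshape into 2 rows, then flatten and sort in descending order.
v = M.reshape(2, 8)
w = M.reshape(-1)
w_desc = np.sort(w)[::-1]

# Vertical and horizontal stacking of two random vectors.
v1 = rng.standard_normal(3)
v2 = rng.standard_normal(3)
A = np.vstack((v1, v2))  # shape (2, 3)
B = np.hstack((v1, v2))  # shape (6,)
```

The same `np.vstack` and `np.hstack` calls accept two-dimensional arrays, which is what the last bullet asks for.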
In [ ]:
# put your code here

2. (2pts) Towards multiclass classification: one-hot encoding

  • Generate a vector (let us call it $v$) of integers taking values between 0 and 9.
  • Then build the one-hot encoding of each entry in $v$. A one-hot encoding represents each categorical value (here the digits 0 to 9 in your vector $v$) by a binary sequence in which only one entry (for example, the one at the position of the digit being encoded) is nonzero.
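
To make the definition concrete, here is a minimal sketch on a toy vector with only 3 classes. The labels and the name `num_classes` are hypothetical; indexing an identity matrix is one possible approach among several.

```python
import numpy as np

# Hypothetical toy labels with 3 classes (the exercise uses the digits 0 to 9).
labels = np.array([2, 0, 1, 2])
num_classes = 3

# Row i of the identity matrix is the one-hot code of class i, so indexing
# the identity with the label vector encodes every entry at once.
one_hot = np.eye(num_classes, dtype=int)[labels]
print(one_hot)
# [[0 0 1]
#  [1 0 0]
#  [0 1 0]
#  [0 0 1]]
```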
In [ ]:
# put your code here

3. (6pts) Towards regression: sampling and matplotlib

3a. (2pts) One dimensional

In this exercise, we will successively generate points according to a function, sample pairs $(t, f)$ from that function and plot the results.

  • Using the 'linspace' function from numpy, generate $1000$ pairs $(t, f(t) = \frac{1}{1+e^{-t}})$ for values of $t$ between $-6$ and $6$. What does the function look like?
  • Generate 100 random pairs $(t_i, f_i)$ from the curve. Then plot the points $(t_i, f_i)$ on top of the line $(t, f(t))$ using matplotlib (you can choose how you randomly generate the points).
  • From the pairs
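
One possible way to set up the curve and the samples is sketched below. The sampling scheme (uniform $t_i$ with a small Gaussian perturbation of $f(t_i)$), the noise level and the seed are assumptions, since the exercise leaves the choice open.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # arbitrary seed


def f(t):
    """Logistic function 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))


# 1000 evenly spaced values of t between -6 and 6.
t = np.linspace(-6, 6, 1000)

# Assumed sampling scheme: uniform t_i, plus small Gaussian noise on f(t_i).
t_i = rng.uniform(-6, 6, size=100)
f_i = f(t_i) + 0.05 * rng.standard_normal(100)

fig, ax = plt.subplots()
ax.plot(t, f(t), label="f(t)")
ax.scatter(t_i, f_i, s=10, color="tab:red", label="samples")
ax.set_xlabel("t")
ax.legend()
# In a notebook the figure renders at the end of the cell;
# in a script, call plt.show() to display it.
```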
In [ ]:
# put your code here

3b. (4pts) The two dimensional hyperplane

  • As an extension of the previous case, we now want to generate triples $(x, y, t)$ according to the following hyperplane:
$$t \equiv\pi(x, y) = x + y +1$$

Use Axes3D from matplotlib and pyplot, together with the meshgrid( ) and arange( ) functions from numpy and the plot_surface( ) and scatter( ) methods of the 3D axes, to carry out the following steps:

  • Generate a regular grid of points $(x, y)$ covering the domain $[-20,20]\times [-20,20]$. Let us say 200 by 200.
  • As in the 1D case, we now want to generate noisy samples that are lying on the plane on average. Start by generating $(50\times 50)$ triples $(x,y,\pi(x,y))$ covering the domain $[-20,20]\times [-20,20]$.
  • Perturb the $50\times 50$ samples by adding to them random Gaussian noise of amplitude no larger than $0.1$.
  • Finally, using the scatter( ) function, plot the noisy samples on top of the plane.
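
The steps above can be sketched on a coarser grid as follows. The grid spacings, the noise scale and the clipping used to keep the noise within $0.1$ are assumptions made for the illustration, and `projection="3d"` assumes a reasonably recent matplotlib (3.2 or later).

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # arbitrary seed


def plane(x, y):
    """The hyperplane t = x + y + 1."""
    return x + y + 1


# Regular grid for the surface. np.arange excludes the stop value,
# so 21 is used to include the endpoint 20.
axis = np.arange(-20, 21, 1)
X, Y = np.meshgrid(axis, axis)

# Coarser grid of sample locations (11 by 11 here instead of 50 by 50).
axis_s = np.arange(-20, 21, 4)
Xs, Ys = np.meshgrid(axis_s, axis_s)

# Gaussian noise, clipped so its amplitude never exceeds 0.1 (an assumption
# about how "no larger than 0.1" should be enforced).
noise = np.clip(0.03 * rng.standard_normal(Xs.shape), -0.1, 0.1)
Ts = plane(Xs, Ys) + noise

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(X, Y, plane(X, Y), alpha=0.4)
ax.scatter(Xs.ravel(), Ys.ravel(), Ts.ravel(), s=5, color="k")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("t")
```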
In [ ]:
# put your code here

4. (3pts) Getting started with Pandas and Kaggle datasets

4a. Download the car dataset from Kaggle and open it with pandas.

  • Display a couple (5-10) of rows from the pandas data frame.
  • Find the brand that has the highest average price across cars.
  • Sort the cars according to their horsepower and return the corresponding pandas data frame. Display the first 10 rows of the frame.
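
The pandas operations involved can be tried out on a toy data frame first. The data and the column names `brand`, `price` and `horsepower` below are hypothetical; the actual names depend on the CSV file you download, which you would load with `pd.read_csv`.

```python
import pandas as pd

# Hypothetical toy data standing in for the Kaggle file.
cars = pd.DataFrame({
    "brand": ["A", "A", "B", "C"],
    "price": [10000, 14000, 30000, 20000],
    "horsepower": [90, 110, 200, 150],
})

# Display the first few rows.
print(cars.head())

# Average price per brand, then the brand with the highest average.
mean_price = cars.groupby("brand")["price"].mean()
top_brand = mean_price.idxmax()
print(top_brand)  # B

# Sort by horsepower (descending here, an arbitrary choice) and show the top rows.
by_power = cars.sort_values("horsepower", ascending=False)
print(by_power.head(10))
```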
In [ ]:
# put your code here. Don't hesitate to check the online
# documentation of the pandas library