In [1]:

using Gadfly

Parallel computing¶

The dataset being analysed is large, about 70 GB in total. The first barrier before any analysis was reading the data quickly and efficiently.

We started by experimenting with the *DataFrames* package. The resulting dataframe had all the required properties and was easy to manipulate, but reading the files took too long. In addition, the DateTime columns were read as strings, and extracting information such as the time or date through string manipulation made the process very slow. We therefore switched to the *CSV* module. It read the data faster, but the columns of the resulting dataframe were NullableArrays. This was a problem because most of the functions for manipulating dataframes do not work with columns of Nullable type. Hence we wrote a *type* that reads a dataframe using *CSV.read()* and then converts it into another DataFrame whose columns are type-stable rather than NullableArrays. This gave better overall performance than using DataFrames alone.
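To illustrate why typed timestamps help, here is a minimal sketch, assuming Julia 1.x (where `Dates` is a standard library) and a `yyyy-mm-dd HH:MM:SS` timestamp layout. The format string, the helper name `parse_timestamp` and the sample values are assumptions for illustration, not the loader type described above.

```julia
using Dates

# Timestamp layout assumed for illustration; adjust to the actual files.
const TS_FORMAT = dateformat"yyyy-mm-dd HH:MM:SS"

# Parse a raw timestamp string once, so later steps can use typed accessors.
parse_timestamp(s::AbstractString) = DateTime(s, TS_FORMAT)

raw = ["2015-01-01 00:11:33", "2015-01-03 18:45:00"]   # hypothetical values
stamps = parse_timestamp.(raw)

# Typed accessors replace repeated string slicing.
pickup_hours = hour.(stamps)        # hour of day, 0-23
pickup_days  = dayofweek.(stamps)   # 1 = Monday ... 7 = Sunday
```

Once the column holds `DateTime` values, the hour or weekday of a trip is a cheap field lookup rather than a substring operation on every row.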

Even so, loading all the data sequentially would have been very time consuming, as each file is big enough to take a significant amount of time. This is where the parallel computing features of *Julia* come in handy. We used the *@everywhere* macro together with *SharedArrays* and *DistributedArrays (DArrays)*, which allow a function to be mapped onto different processors of the system and the outputs of these parallel processes to be collected. Because the files no longer had to be loaded one by one, we were able to load all the data in about as much time as it takes to load the single slowest dataset.
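The idea of loading files in parallel can be sketched as follows. This is a simplified illustration, assuming Julia 1.x and using `pmap` from the `Distributed` standard library instead of the SharedArrays/DArrays approach described above. The function `load_file`, the worker count and the temporary files are hypothetical stand-ins for the real loader and data.

```julia
using Distributed

addprocs(2)   # worker count chosen arbitrarily for the sketch

# Hypothetical stand-in for the real per-file loader: it only counts lines.
@everywhere function load_file(path)
    return countlines(path)
end

# Create a few small temporary files so the sketch runs on its own.
paths = map(1:4) do i
    path, io = mktemp()
    for _ in 1:(10 * i)
        println(io, "row")
    end
    close(io)
    path
end

# pmap hands each path to a free worker and returns results in input order.
row_counts = pmap(load_file, paths)

foreach(rm, paths)   # clean up the temporary files
```

With at least as many workers as files, the total wall time approaches that of the slowest single file, which matches the behaviour described above.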

This feature also helped us parallelize many functions that use multiple files for analysis and plotting. In one of the analyses we had to cluster the latitudes and longitudes. We used the *Clustering* package and its built-in k-means function for this, and processed all the files in parallel, which again saved time compared with a sequential run.
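For readers unfamiliar with k-means, the following is a hand-written sketch of the basic Lloyd iteration on latitude/longitude pairs, assuming Julia 1.x. It is not the implementation in the *Clustering* package. The function name `lloyd_kmeans`, the fixed iteration count and the sample coordinates are assumptions for illustration, and plain squared Euclidean distance on degrees is used for simplicity.

```julia
using Statistics

# Minimal Lloyd iteration; `points` is a 2 x n matrix (row 1 latitude, row 2 longitude).
function lloyd_kmeans(points::Matrix{Float64}, init_centers::Matrix{Float64}; iters::Int = 20)
    centers = copy(init_centers)
    k = size(centers, 2)
    n = size(points, 2)
    labels = zeros(Int, n)
    for _ in 1:iters
        # Assignment step: attach each point to its nearest center.
        for j in 1:n
            dists = [sum(abs2, points[:, j] .- centers[:, c]) for c in 1:k]
            labels[j] = argmin(dists)
        end
        # Update step: move each center to the mean of its members.
        for c in 1:k
            members = points[:, labels .== c]
            isempty(members) || (centers[:, c] = vec(mean(members, dims = 2)))
        end
    end
    return centers, labels
end

# Two synthetic hotspots (coordinates are illustrative, not taken from the dataset).
points = [ 40.64  40.65  40.63  40.77  40.78  40.76;
          -73.78 -73.79 -73.77 -73.87 -73.88 -73.86]
init = points[:, [1, 4]]            # seed with one point from each group
centers, labels = lloyd_kmeans(points, init)
```

Each column of `centers` is then a candidate hotspot, and `labels` records which hotspot every point belongs to.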

Here we demonstrate the parallel computing capabilities of *Julia* by estimating the value of pi using a *Monte Carlo* simulation with *10 billion* sample points. The number of workers was varied to measure the execution time:

In [3]:

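# Estimate pi using a Monte Carlo simulation.
# Note: @parallel is the macro name in the Julia version used here;
# from Julia 0.7 onward it is @distributed (Distributed standard library).
function parallel_findpi(n)
    # Iterations are split across workers and the per-point results are summed with (+).
    inside = @parallel (+) for i = 1:n
        x, y = rand(2)               # random point in the unit square
        x^2 + y^2 <= 1 ? 1 : 0       # 1 if it falls inside the quarter circle
    end
    4 * inside / n                   # the hit fraction approximates pi/4
end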

WARNING: Method definition parallel_findpi(Any) in module Main at In[2]:3 overwritten at In[3]:3.

Out[3]:

parallel_findpi (generic function with 1 method)
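For readers on Julia 1.x, the same estimator can be sketched with the renamed macro `@distributed`. The function name `distributed_findpi` and the reduced sample count are assumptions for illustration. The timing call only shows how a measurement could be taken and does not reproduce the measurements reported here.

```julia
using Distributed

# Same estimator, written with the Julia 1.x macro name.
function distributed_findpi(n)
    inside = @distributed (+) for i in 1:n
        x, y = rand(), rand()
        x^2 + y^2 <= 1 ? 1 : 0
    end
    return 4 * inside / n
end

n = 10^6                                     # far fewer samples than the 10 billion used above
estimate = distributed_findpi(n)             # first call also pays compilation cost
elapsed  = @elapsed distributed_findpi(n)    # second call gives a cleaner timing
println("pi estimate: ", estimate, " in ", elapsed, " s on ", nworkers(), " worker(s)")
```

Calling `addprocs(k)` before the measurement, for different values of `k`, is one way to vary the number of workers.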

In [5]:

#code to produce the graph-->
#time = [893.081954, 437.424609, 288.011919, 219.332380,
#    111.764335,98.156851,63.623772,51.070448,43.202897,34.187504,28.481205, 26.948234,26.782099]

#nprocs=[1,2,3,4,8,10,16,20,24,32,48,64,80]

#plot(x=nprocs,y=time, Geom.point, Geom.line, Theme(default_color=color("orange")),
#Guide.xlabel("no. of processors"), Guide.ylabel("Execution time"),
#Guide.title("Measurement of time in Parallel processing"))

HTML("""
<img src="Images/output_4_0.svg"></img>
""")

Out[5]:
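The timings listed in the commented-out plotting code above can also be expressed as speedup and parallel efficiency. The sketch below, assuming Julia 1.x, uses the standard definitions S(p) = T(1) / T(p) and E(p) = S(p) / p. The variable names are our own.

```julia
# Timings and processor counts copied from the commented-out plotting code above.
times = [893.081954, 437.424609, 288.011919, 219.332380, 111.764335, 98.156851,
         63.623772, 51.070448, 43.202897, 34.187504, 28.481205, 26.948234, 26.782099]
procs = [1, 2, 3, 4, 8, 10, 16, 20, 24, 32, 48, 64, 80]

speedup    = times[1] ./ times     # S(p) = T(1) / T(p)
efficiency = speedup ./ procs      # E(p) = S(p) / p

for (p, s, e) in zip(procs, speedup, efficiency)
    println(p, " procs: speedup ", round(s, digits = 2), ", efficiency ", round(e, digits = 2))
end
```

The speedup reaches roughly 33 at 80 processors, and the execution time changes very little beyond 48 processors, so adding further workers brings diminishing returns for this problem size.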

Maps¶

The maps in this section were plotted using the hexbin feature of *Gadfly* in Julia. Each plot has latitude on the Y-axis and longitude on the X-axis, and shows the pickup or dropoff location of every taxi ride taken between August 2013 and December 2015. From this data alone the map of New York City can be inferred, without actually using any map features or API.

Pickup for Green Taxi¶

The map below shows pickups for green taxis. The density is low in the lower Manhattan region, implying that not many green taxis are active there. Most of the pickups are in Brooklyn and in upper Manhattan.

In [29]:

HTML("""<img src="Images/green_pickup.png"></img> """)

Out[29]:

Dropoff for Green Taxi¶

The map below shows dropoffs for green taxis. Dropoffs occur across the entire city, and the map also shows the bridges connecting the islands. Most of the dropoffs are concentrated within the city.

In [23]:

HTML("""<img src="Images/green_dropoff.png"></img> """)

Out[23]:

Pickup for Yellow Taxis¶

We can see that the number of yellow taxi pickups in the Manhattan region is very large. The number of pickups for yellow taxis at JFK and LaGuardia airports is also large.

In [24]:

HTML("""<img src="Images/yellow_pickup.png"></img> """)

Out[24]:

Dropoff for Yellow Taxis¶

We can see that the number of yellow taxi dropoffs in the Manhattan region is very large. The number of dropoffs for yellow taxis at JFK and LaGuardia airports is also large. These facts suggest that it is mostly yellow taxis that operate at the two airports in New York. Also, the number of taxis operating outside the Manhattan region is low.

In [21]:

HTML("""<img src="Images/yellow_dropoff.png"></img> """)

Out[21]:

Pick-ups analysis¶

In the following graph we analyse the pickup trends for yellow taxis, which have two vendors. A few interesting features emerge. There is a clear decline in pickups for both vendors over time, implying that the popularity of yellow taxis has fallen. The pickups of the two vendors also appear to be correlated, implying that the number of taxi rides depends mostly on external conditions such as the time of year and temperature. Within a given year, pickups are higher during February to April and again in October.

In [6]:

#code to produce graph---> include("Graphs/yellow_taxi_monthly_analysis.jl")
HTML("""
<img src="Images/output_11_0.svg"></img>
""")

Out[6]:

In the following graph we analyse the pickup trends for green taxis, which were introduced in New York in August 2013. The analysis covers a few months from 2013 to 2015. There are two vendors for green taxis, labelled Vendor1 and Vendor2. As can be seen, Vendor2 has a much higher growth rate in the early stages, which has led to a larger number of pickups by Vendor2 over time. Both vendors tend to stabilise around an equilibrium value over time.

In [7]:

#code to generate graph--> include("Graphs/green_taxi_monthly_analysis.jl")
HTML("""
<img src="Images/output_13_0.svg"></img>
""")

Out[7]:

This graph compares Vendor1 and Vendor2 of the yellow taxis
over all days of the week for January 2015. It clearly shows the dominance of Vendor2 over Vendor1.
Another feature is that the number of taxi pickups rises as the week goes by:
it peaks on Saturday and drops sharply on *Sunday (as it is a holiday)*.

In [8]:

#code--> include("Graphs/vendorweekly.jl")
HTML("""
<img src="Images/output_15_0.svg"></img>
""")

Out[8]:

The following graph compares the two vendors of the yellow taxis on a per-hour basis. The rush hours can be read from the graph: they run from about 6:00 PM to 9:00 PM. This is plausible, as it is around the time when most people are returning from the office.
The global minimum at about 6:00 AM suggests that only a very small fraction of working people have to start travelling that early in the morning.
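A per-hour profile of this kind can be built by counting pickups by hour of day. The following is a minimal sketch, assuming Julia 1.x and pickup times already parsed as `DateTime` values. The function name `pickups_per_hour` and the sample timestamps are hypothetical, and this is not the code in `Graphs/vendorperhour.jl`.

```julia
using Dates

# Count pickups per hour of day; returns a 24-element vector (index 1 = 00:00-00:59).
function pickups_per_hour(stamps::Vector{DateTime})
    counts = zeros(Int, 24)
    for t in stamps
        counts[hour(t) + 1] += 1
    end
    return counts
end

# Hypothetical pickup times for illustration.
stamps = [DateTime(2015, 1, 5, 6, 10), DateTime(2015, 1, 5, 18, 30),
          DateTime(2015, 1, 5, 18, 55), DateTime(2015, 1, 5, 20, 5)]
counts = pickups_per_hour(stamps)
busiest_hour = argmax(counts) - 1   # convert the index back to a 0-23 hour
```

Running the same function separately on each vendor's trips, or on weekday and weekend subsets, gives the curves needed for the comparisons in this section.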

In [9]:

#code--> include("Graphs/vendorperhour.jl")
HTML("""
<img src="Images/output_17_0.svg"></img>
""")

Out[9]:

The following graph compares Vendor1 and Vendor2 on weekdays. As can be seen, the peak hours for both vendors are from 6:00 PM to 9:00 PM, after which pickups decrease and reach a minimum in the early morning.

In [10]:

#code--> include("Graphs/vendorperhourweekdays.jl")
HTML("""
<img src="Images/output_19_0.svg"></img>
""")

Out[10]:

The following graph compares the two vendors of the yellow taxis on weekends.
A clear difference from normal weekdays is that during daytime hours the number of pickups
for both vendors stays more or less flat.
Also, unlike on weekdays, we see a sharp rise in taxi use for both vendors late at night
(12:00 AM to 1:00 AM) on weekends, which reflects the fact that many people in New York stay out late at night on weekends.

In [11]:

#code-->include("Graphs/vendorperhourweekend.jl")
HTML("""
<img src="Images/output_21_0.svg"></img>
""")

Out[11]:

Here we present a one-to-one comparison of the total number of pickups for yellow taxis as a whole and green taxis as a whole. As can be seen from the graph, after green taxis were introduced in August 2013 their pickups increased steadily and then saturated around an equilibrium value (see the GIF showing how green taxi pickups spread). This value is far below the mean number of yellow taxi pickups, which, apart from some seasonal trends, decreased slightly after the introduction of green taxis and then levelled off.

In [12]:

#code--> include("Graphs/total_pickups.jl")
HTML("""
<img src="Images/output_23_0.svg"></img>
""")

Out[12]:

In [13]:

HTML("""
<img src="Images/green_pickup_heatmap.gif"></img>
""")

Out[13]:

The following GIF shows how the dropoff locations of green taxis vary over the years.

In [14]:

HTML("""
<img src="Images/green_dropoff_heatmap.gif"></img>
""")

Out[14]:

To extract the prime locations (hotspots) around which most pickups and dropoffs are concentrated,
we clustered the latitudes and longitudes of all pickup and dropoff points using the k-means algorithm (an unsupervised machine learning algorithm).
The following clusters appeared:

In [2]:

HTML("""
<img src="Images/JFK-mean-center.png"></img>
""")

Out[2]:

Airport Analysis¶

On the basis of the prime locations extracted from the clustering above, we analyse *JFK Airport* (as it is the international airport). The following analysis is for January 2015:

Airport pickup by days of week¶

The following graph shows the pickups and dropoffs at JFK Airport in New York by day of the week.
A few things can be inferred from the graph. Pickups are always significantly higher
than dropoffs. Dropoffs and pickups also seem correlated, as both follow the same trend.
Both peak on Friday and are at their minimum on Tuesday.
This suggests that the fewest flights to and from New York via JFK are scheduled on Tuesdays.

In [16]:

#code-->include("Graphs/JFKportperday.jl")
HTML("""
<img src="Images/output_31_0.svg"></img>
""")

Out[16]:

Airport Pickup by hour¶

The following graph analyses pickups and dropoffs by hour of the day (all days cumulative).
The graph suggests that most flights are scheduled to leave JFK at around
6:00 AM and 3:00 PM. These are also the times when a significant number of flights arrive at JFK
(as these points correspond to local maxima).
Another feature is that late at night almost no flights are scheduled to take off from
JFK, but relatively more flights land at JFK at that time.
The decrease in dropoffs at the airport after 3:00 PM also points to fewer flights scheduled to
take off from JFK later in the afternoon.

In [17]:

#code--> include("Graphs/JFKportperhour.jl")
HTML("""
<img src="Images/output_33_0.svg"></img>
""")

Out[17]:

Influx at JFK Airport¶

In the following graph we analyse the net influx of people through JFK Airport. The influx is defined as the difference between pickups and dropoffs. Of course, this measure cannot be taken as the exact number of people entering New York in any given month, but it gives an idea of how many people arrived in that month.

An interesting feature is a sharp drop in the influx in *February 2014*. Investigating this led us to the winter storm that hit New York in February 2014. This storm has been reported to be the *7th snowiest on record* in New York City.
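The influx measure defined above is a simple element-wise difference. Here is a minimal sketch, assuming Julia 1.x and per-month pickup and dropoff counts that have already been aggregated. The function name `net_influx`, the month labels and all numbers are hypothetical and are not taken from the dataset.

```julia
# Net influx per month = pickups at the airport minus dropoffs at the airport.
net_influx(pickups, dropoffs) = pickups .- dropoffs

# Hypothetical monthly counts (not taken from the dataset).
months   = ["m1", "m2", "m3"]
pickups  = [120_000, 95_000, 125_000]
dropoffs = [ 70_000, 68_000,  72_000]

influx = net_influx(pickups, dropoffs)
lowest_month = months[argmin(influx)]   # month with the smallest net influx
```

Locating the minimum of `influx` is one way to flag months that deserve a closer look.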

In [18]:

#code--> include("Graphs/netfluxJFK_graph.jl")
HTML("""
<img src="Images/output_35_0.svg"></img>
""")

Out[18]:

Card vs. Cash¶

From the available data we plotted the percentage of users who paid with cash and the percentage who paid by card in every month from 2013 to 2015. The graph shows clearly that the share of people paying by card is steadily increasing and the share paying with cash is steadily decreasing.
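The monthly percentages can be computed by counting payment types within each month. The sketch below assumes Julia 1.x. The labels `"card"` and `"cash"`, the function name `payment_shares` and the sample data are hypothetical, and the real files may encode the payment type differently.

```julia
# Percentage share of each payment type within one month of trips.
function payment_shares(payments::Vector{String})
    total = length(payments)
    shares = Dict{String, Float64}()
    for p in payments
        shares[p] = get(shares, p, 0.0) + 100 / total
    end
    return shares
end

# Hypothetical labels and values for illustration only.
month_payments = ["card", "cash", "card", "card", "cash"]
shares = payment_shares(month_payments)
```

Applying the function to each month in turn yields the two series compared in this section.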

Percentage of cash and card payments for yellow taxis¶

Card payments have grown from around 54% to 63%, and cash payments have fallen correspondingly. This could be due to the ease of payment that cards offer compared with cash.

In [19]:

#code--> include("Graphs/mode_count_graph.jl")
HTML("""
<img src="Images/output_38_0.svg"></img>
""")

Out[19]:

Number of tips given with cash and card payment¶

It is interesting to note that people tend to tip much more when they pay by card. This could be because the online payment portal includes a tip option and it is easier to transact with cards. Also, many vendors may not be reporting the tips they receive in cash, which could explain why so few tips are recorded for cash payments.

In [20]:

#code--> include("Graphs/tip_cash_card_graph.jl")
HTML("""
<img src="Images/output_40_0.svg"></img>
""")

Out[20]:

Regression of cost with distance and time¶

We performed a regression on cost, trip distance and journey time at different times of day, to obtain a model that predicts the cost of a journey from its distance and travel time. The contribution of time to the model would be expected to vary over the day, so we split the 24 hours into buckets and performed the regression separately on each bucket.

Here we present cross-sections of the regression model obtained for January 2015 in the time slot from 9:00 AM to 12:00 PM. To visualise the variation with journey distance, journey time has been kept constant, and to visualise the variation with journey time, journey distance has been kept constant. An interesting feature is that for Vendor1 distance seems to play the dominant part in deciding cost rather than time, whereas for Vendor2 time seems to dominate over distance. It seems the two vendors use different charging models.
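The bucketed regression described above can be sketched as an ordinary least-squares fit per time-of-day bucket. This is an illustration assuming Julia 1.x, a linear model of the form cost ≈ b0 + b1 · distance + b2 · duration, and 3-hour buckets. The model form, the bucket width, the function names and the synthetic data are all assumptions, not the fitted model shown in the plots.

```julia
using LinearAlgebra

# Ordinary least squares for cost ≈ b0 + b1 * distance + b2 * duration.
function fit_cost_model(distance::Vector{Float64}, duration::Vector{Float64}, cost::Vector{Float64})
    X = hcat(ones(length(distance)), distance, duration)
    return X \ cost              # returns [b0, b1, b2]
end

# Fit one model per time-of-day bucket; `bucket_hours` is the bucket width.
function fit_by_bucket(hours, distance, duration, cost; bucket_hours::Int = 3)
    models = Dict{Int, Vector{Float64}}()
    buckets = hours .÷ bucket_hours
    for b in unique(buckets)
        idx = findall(==(b), buckets)
        models[b] = fit_cost_model(distance[idx], duration[idx], cost[idx])
    end
    return models
end

# Synthetic trips generated from a known rule, so the fit can be checked.
hours    = [9, 9, 10, 10, 11, 11]
distance = [1.0, 2.0, 3.0, 4.0, 5.0, 2.5]
duration = [5.0, 12.0, 9.0, 20.0, 18.0, 30.0]
cost     = 2.5 .+ 2.0 .* distance .+ 0.5 .* duration
models   = fit_by_bucket(hours, distance, duration, cost)
```

With 3-hour buckets, key `3` corresponds to the 9:00 AM to 12:00 PM slot. Comparing the relative sizes of `b1` and `b2` between vendors is one way to see whether distance or time dominates the cost.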

In [21]:

#code--> include("Graphs/vendor_regression_plot_time.jl")
HTML("""
<img src="Images/output_42_0.svg"></img>
""")

Out[21]:

In [23]:

#code--> include("Graphs/vendor_regression_plot_dist.jl")
HTML("""
<img src="Images/output_43_0.svg"></img>
""")

Out[23]:

A 3D plot of the regression model obtained for Vendor1 in the above-mentioned bucket:

In [1]:

HTML("""
<img src="Images/figure_1.jpg"></img>
""")

Out[1]: