Bike Rental

This data set comes from https://www.lyft.com/bikes/bay-wheels/system-data and covers bike-share trips from January through December 2018.

Each trip is anonymized and includes:
Trip Duration (seconds)
Start Time and Date
End Time and Date
Start Station ID
Start Station Name
Start Station Latitude
Start Station Longitude
End Station ID
End Station Name
End Station Latitude
End Station Longitude
Bike ID
User Type (Subscriber or Customer – “Subscriber” = Member or “Customer” = Casual)
Member Year of Birth
Member Gender

Trip Duration is the feature most closely related to rental fees, so I'll explore how the other features relate to it.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

Load Data

In [2]:
# load all of 2018 trip data
bike_rental_mon1 = pd.read_csv('./201801-fordgobike-tripdata.csv')
bike_rental_mon2 = pd.read_csv('./201802-fordgobike-tripdata.csv')
bike_rental_mon3 = pd.read_csv('./201803-fordgobike-tripdata.csv')
bike_rental_mon4 = pd.read_csv('./201804-fordgobike-tripdata.csv')
bike_rental_mon5 = pd.read_csv('./201805-fordgobike-tripdata.csv')
bike_rental_mon6 = pd.read_csv('./201806-fordgobike-tripdata.csv')
bike_rental_mon7 = pd.read_csv('./201807-fordgobike-tripdata.csv')
bike_rental_mon8 = pd.read_csv('./201808-fordgobike-tripdata.csv')
bike_rental_mon9 = pd.read_csv('./201809-fordgobike-tripdata.csv')
bike_rental_mon10 = pd.read_csv('./201810-fordgobike-tripdata.csv')
bike_rental_mon11 = pd.read_csv('./201811-fordgobike-tripdata.csv')
bike_rental_mon12 = pd.read_csv('./201812-fordgobike-tripdata.csv')
In [3]:
bike_rental_mon12.head()
Out[3]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip
0 68529 2018-12-31 20:03:11.7350 2019-01-01 15:05:21.5580 217.0 27th St at MLK Jr Way 37.817015 -122.271761 217.0 27th St at MLK Jr Way 37.817015 -122.271761 3305 Customer NaN NaN No
1 63587 2018-12-31 19:00:32.1210 2019-01-01 12:40:19.3660 NaN NaN 37.400000 -121.940000 NaN NaN 37.400000 -121.940000 4281 Customer 1995.0 Male No
2 64169 2018-12-31 15:09:01.0820 2019-01-01 08:58:30.0910 NaN NaN 37.400000 -121.940000 NaN NaN 37.400000 -121.940000 4267 Customer 1988.0 Male No
3 30550 2018-12-31 19:26:20.7750 2019-01-01 03:55:30.7930 13.0 Commercial St at Montgomery St 37.794231 -122.402923 19.0 Post St at Kearny St 37.788975 -122.403452 5422 Subscriber 1986.0 Male Yes
4 2150 2018-12-31 23:59:12.0970 2019-01-01 00:35:02.1530 3.0 Powell St BART Station (Market St at 4th St) 37.786375 -122.404904 368.0 Myrtle St at Polk St 37.785434 -122.419622 4820 Customer NaN NaN No

This data frame has 16 columns. Because rental duration is related to the rental fee, I'll investigate the relationship between duration_sec and the other features.

Preliminary Wrangling

Concatenate 12 dataframes into 1 dataframe

In [4]:
bike_rental = pd.concat([bike_rental_mon1, bike_rental_mon2, bike_rental_mon3, bike_rental_mon4, bike_rental_mon5, bike_rental_mon6,
                         bike_rental_mon7, bike_rental_mon8, bike_rental_mon9, bike_rental_mon10, bike_rental_mon11, bike_rental_mon12])
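The load-and-concatenate steps above could also be written as a loop. The following is only a sketch: the helper name load_monthly_trips is hypothetical, and the file names are assumed to follow the same pattern as the twelve read_csv calls above.

```python
import io
import pandas as pd

def load_monthly_trips(sources):
    """Read each monthly CSV and stack the results into one DataFrame with a fresh index."""
    frames = [pd.read_csv(src) for src in sources]
    # ignore_index=True renumbers the rows 0..n-1, so no separate reset_index is needed
    return pd.concat(frames, ignore_index=True)

# Build the twelve file names for 2018 (same naming pattern as above).
monthly_files = ['./2018{:02d}-fordgobike-tripdata.csv'.format(month)
                 for month in range(1, 13)]

# With the CSV files in the working directory, this would replace cells 2, 4 and 5:
# bike_rental = load_monthly_trips(monthly_files)
```

Passing ignore_index=True to pd.concat gives the same row numbering as calling reset_index(drop=True) afterwards.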
In [5]:
bike_rental.reset_index(drop=True, inplace=True)
bike_rental.head()
Out[5]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip
0 75284 2018-01-31 22:52:35.2390 2018-02-01 19:47:19.8240 120.0 Mission Dolores Park 37.761420 -122.426435 285.0 Webster St at O'Farrell St 37.783521 -122.431158 2765 Subscriber 1986.0 Male No
1 85422 2018-01-31 16:13:34.3510 2018-02-01 15:57:17.3100 15.0 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 15.0 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 2815 Customer NaN NaN No
2 71576 2018-01-31 14:23:55.8890 2018-02-01 10:16:52.1160 304.0 Jackson St at 5th St 37.348759 -121.894798 296.0 5th St at Virginia St 37.325998 -121.877120 3039 Customer 1996.0 Male No
3 61076 2018-01-31 14:53:23.5620 2018-02-01 07:51:20.5000 75.0 Market St at Franklin St 37.773793 -122.421239 47.0 4th St at Harrison St 37.780955 -122.399749 321 Customer NaN NaN No
4 39966 2018-01-31 19:52:24.6670 2018-02-01 06:58:31.0530 74.0 Laguna St at Hayes St 37.776435 -122.426244 19.0 Post St at Kearny St 37.788975 -122.403452 617 Subscriber 1991.0 Male No
In [6]:
print(bike_rental.shape)
print(bike_rental.dtypes)
(1863721, 16)
duration_sec                 int64
start_time                  object
end_time                    object
start_station_id           float64
start_station_name          object
start_station_latitude     float64
start_station_longitude    float64
end_station_id             float64
end_station_name            object
end_station_latitude       float64
end_station_longitude      float64
bike_id                      int64
user_type                   object
member_birth_year          float64
member_gender               object
bike_share_for_all_trip     object
dtype: object

check if there are duplicated rows

In [7]:
bike_rental[bike_rental.duplicated()]
Out[7]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip

There are no duplicated rows in the data frame.
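Displaying the filtered frame works, but a single count is easier to read on a data set of this size. A minimal sketch on toy data (the values below are made up for illustration and are not taken from the real data set):

```python
import pandas as pd

# Toy data: the last row repeats the first one.
trips = pd.DataFrame({'duration_sec': [453, 180, 453],
                      'bike_id': [3571, 1403, 3571]})

# duplicated() marks rows identical to an earlier row; summing the flags counts them
n_duplicates = int(trips.duplicated().sum())

# If duplicates were present, they could be dropped like this
deduplicated = trips.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```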

drop null values

In [8]:
# drop null values
bike_rental.dropna(inplace=True)
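Judging by the shapes printed in this notebook, dropna removes 122,165 of the 1,863,721 rows (about 6.6%). Before dropping, it can help to see which columns the missing values come from. A minimal sketch on toy data (made-up values, not the real data set):

```python
import numpy as np
import pandas as pd

# Toy data with gaps in two columns.
trips = pd.DataFrame({'duration_sec': [68529, 63587, 30550],
                      'start_station_id': [217.0, np.nan, 13.0],
                      'member_birth_year': [np.nan, 1995.0, 1986.0]})

nulls_per_column = trips.isna().sum()          # missing values in each column
rows_lost = len(trips) - len(trips.dropna())   # rows a blanket dropna() would remove
print(nulls_per_column)
print(rows_lost)
```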
In [9]:
bike_rental.reset_index(drop=True, inplace=True)
bike_rental
Out[9]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip
0 75284 2018-01-31 22:52:35.2390 2018-02-01 19:47:19.8240 120.0 Mission Dolores Park 37.761420 -122.426435 285.0 Webster St at O'Farrell St 37.783521 -122.431158 2765 Subscriber 1986.0 Male No
1 71576 2018-01-31 14:23:55.8890 2018-02-01 10:16:52.1160 304.0 Jackson St at 5th St 37.348759 -121.894798 296.0 5th St at Virginia St 37.325998 -121.877120 3039 Customer 1996.0 Male No
2 39966 2018-01-31 19:52:24.6670 2018-02-01 06:58:31.0530 74.0 Laguna St at Hayes St 37.776435 -122.426244 19.0 Post St at Kearny St 37.788975 -122.403452 617 Subscriber 1991.0 Male No
3 453 2018-01-31 23:53:53.6320 2018-02-01 00:01:26.8050 110.0 17th & Folsom Street Park (17th St at Folsom St) 37.763708 -122.415204 134.0 Valencia St at 24th St 37.752428 -122.420628 3571 Subscriber 1988.0 Male No
4 180 2018-01-31 23:52:09.9030 2018-01-31 23:55:10.8070 81.0 Berry St at 4th St 37.775880 -122.393170 93.0 4th St at Mission Bay Blvd S 37.770407 -122.391198 1403 Subscriber 1980.0 Male No
5 996 2018-01-31 23:34:56.0040 2018-01-31 23:51:32.6740 134.0 Valencia St at 24th St 37.752428 -122.420628 4.0 Cyril Magnin St at Ellis St 37.785881 -122.408915 3675 Subscriber 1987.0 Male Yes
6 825 2018-01-31 23:34:14.0270 2018-01-31 23:47:59.8090 305.0 Ryland Park 37.342725 -121.895617 317.0 San Salvador St at 9th St 37.333955 -121.877349 1453 Subscriber 1994.0 Female Yes
7 432 2018-01-31 23:34:26.4840 2018-01-31 23:41:39.2970 89.0 Division St at Potrero Ave 37.769218 -122.407646 43.0 San Francisco Public Library (Grove St at Hyde... 37.778768 -122.415929 2928 Subscriber 1993.0 Male No
8 601 2018-01-31 23:29:46.8320 2018-01-31 23:39:48.0000 223.0 16th St Mission BART Station 2 37.764765 -122.420091 86.0 Market St at Dolores St 37.769305 -122.426826 3016 Subscriber 1957.0 Male No
9 887 2018-01-31 23:24:16.3570 2018-01-31 23:39:04.1230 308.0 San Pedro Square 37.336802 -121.894090 297.0 Locust St at Grant St 37.322980 -121.887931 55 Subscriber 1976.0 Female Yes
10 210 2018-01-31 23:33:03.0460 2018-01-31 23:36:33.7040 7.0 Frank H Ogawa Plaza 37.804562 -122.271738 186.0 Lakeside Dr at 14th St 37.801319 -122.262642 2602 Subscriber 1976.0 Male No
11 188 2018-01-31 23:30:58.1360 2018-01-31 23:34:06.3910 98.0 Valencia St at 16th St 37.765052 -122.421866 76.0 McCoppin St at Valencia St 37.771662 -122.422423 2556 Subscriber 1964.0 Female No
12 808 2018-01-31 23:19:58.6030 2018-01-31 23:33:27.5310 67.0 San Francisco Caltrain Station 2 (Townsend St... 37.776639 -122.395526 98.0 Valencia St at 16th St 37.765052 -122.421866 3041 Subscriber 1976.0 Male Yes
13 378 2018-01-31 23:23:23.0680 2018-01-31 23:29:42.0440 80.0 Townsend St at 5th St 37.775306 -122.397380 78.0 Folsom St at 9th St 37.773717 -122.411647 546 Subscriber 1995.0 Female No
14 686 2018-01-31 23:07:15.3130 2018-01-31 23:18:41.5580 312.0 San Jose Diridon Station 37.329732 -121.901782 317.0 San Salvador St at 9th St 37.333955 -121.877349 1886 Subscriber 1997.0 Female No
15 450 2018-01-31 23:07:13.0630 2018-01-31 23:14:43.8140 241.0 Ashby BART Station 37.852477 -122.270213 157.0 65th St at Hollis St 37.846784 -122.291376 3583 Subscriber 1994.0 Male No
16 294 2018-01-31 23:08:12.0000 2018-01-31 23:13:06.6360 239.0 Bancroft Way at Telegraph Ave 37.868813 -122.258764 244.0 Shattuck Ave at Hearst Ave 37.873792 -122.268618 2144 Subscriber 1983.0 Male No
17 150 2018-01-31 23:10:09.5860 2018-01-31 23:12:40.3330 182.0 19th Street BART Station 37.809013 -122.268247 183.0 Telegraph Ave at 19th St 37.808702 -122.269927 3468 Subscriber 1945.0 Male Yes
18 462 2018-01-31 23:03:48.9400 2018-01-31 23:11:31.0290 119.0 18th St at Noe St 37.761047 -122.432642 134.0 Valencia St at 24th St 37.752428 -122.420628 1432 Subscriber 1971.0 Male Yes
19 379 2018-01-31 23:04:27.7010 2018-01-31 23:10:46.9760 176.0 MacArthur BART Station 37.828410 -122.266315 189.0 Genoa St at 55th St 37.839649 -122.271756 997 Subscriber 1975.0 Male No
20 880 2018-01-31 22:53:41.3020 2018-01-31 23:08:21.4430 123.0 Folsom St at 19th St 37.760594 -122.414817 145.0 29th St at Church St 37.743684 -122.426806 3725 Subscriber 1986.0 Male No
21 1210 2018-01-31 22:45:37.1250 2018-01-31 23:05:47.5760 285.0 Webster St at O'Farrell St 37.783521 -122.431158 133.0 Valencia St at 22nd St 37.755213 -122.420975 1059 Subscriber 1991.0 Male No
22 259 2018-01-31 23:01:12.7920 2018-01-31 23:05:32.1570 239.0 Bancroft Way at Telegraph Ave 37.868813 -122.258764 266.0 Parker St at Fulton St 37.862464 -122.264791 1208 Subscriber 1994.0 Male No
23 592 2018-01-31 22:53:27.7790 2018-01-31 23:03:20.2900 202.0 Washington St at 8th St 37.800754 -122.274894 195.0 Bay Pl at Vernon St 37.812314 -122.260779 1834 Customer 1978.0 Male No
24 1059 2018-01-31 22:45:16.5700 2018-01-31 23:02:56.2850 141.0 Valencia St at Cesar Chavez St 37.747998 -122.420219 79.0 7th St at Brannan St 37.773492 -122.403673 1248 Subscriber 1988.0 Male No
25 375 2018-01-31 22:56:31.0820 2018-01-31 23:02:46.5830 15.0 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 6.0 The Embarcadero at Sansome St 37.804770 -122.403234 3401 Subscriber 1988.0 Male No
26 300 2018-01-31 22:57:24.0420 2018-01-31 23:02:24.2720 114.0 Rhode Island St at 17th St 37.764478 -122.402570 93.0 4th St at Mission Bay Blvd S 37.770407 -122.391198 3224 Subscriber 1981.0 Male No
27 2219 2018-01-31 22:24:39.9430 2018-01-31 23:01:39.5710 30.0 San Francisco Caltrain (Townsend St at 4th St) 37.776598 -122.395282 81.0 Berry St at 4th St 37.775880 -122.393170 1757 Subscriber 1991.0 Female No
28 330 2018-01-31 22:55:48.0730 2018-01-31 23:01:18.5060 99.0 Folsom St at 15th St 37.767037 -122.415443 124.0 19th St at Florida St 37.760447 -122.410807 3379 Subscriber 1983.0 Male No
29 870 2018-01-31 22:45:38.2350 2018-01-31 23:00:09.0340 285.0 Webster St at O'Farrell St 37.783521 -122.431158 106.0 Sanchez St at 17th St 37.763242 -122.430675 1503 Customer 1990.0 Female No
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1741526 331 2018-12-01 00:48:27.5290 2018-12-01 00:53:59.4950 70.0 Central Ave at Fell St 37.773311 -122.444293 39.0 Scott St at Golden Gate Ave 37.778999 -122.436861 968 Subscriber 1994.0 Male No
1741527 310 2018-12-01 00:45:44.8680 2018-12-01 00:50:55.7970 252.0 Channing Way at Shattuck Ave 37.865847 -122.267443 238.0 MLK Jr Way at University Ave 37.871719 -122.273068 367 Subscriber 1991.0 Male No
1741528 1338 2018-12-01 00:27:24.8750 2018-12-01 00:49:43.3490 371.0 Lombard St at Columbus Ave 37.802746 -122.413579 104.0 4th St at 16th St 37.767045 -122.390833 1985 Subscriber 1986.0 Male No
1741529 154 2018-12-01 00:44:24.8380 2018-12-01 00:46:59.3350 70.0 Central Ave at Fell St 37.773311 -122.444293 52.0 McAllister St at Baker St 37.777416 -122.441838 449 Subscriber 1994.0 Male No
1741530 862 2018-12-01 00:32:11.0630 2018-12-01 00:46:33.7900 4.0 Cyril Magnin St at Ellis St 37.785881 -122.408915 86.0 Market St at Dolores St 37.769305 -122.426826 4408 Customer 1975.0 Male No
1741531 1310 2018-12-01 00:23:53.3420 2018-12-01 00:45:43.5880 85.0 Church St at Duboce Ave 37.770083 -122.429156 13.0 Commercial St at Montgomery St 37.794231 -122.402923 1525 Subscriber 1994.0 Male Yes
1741532 230 2018-12-01 00:41:31.9550 2018-12-01 00:45:22.4160 312.0 San Jose Diridon Station 37.329732 -121.901782 314.0 Santa Clara St at Almaden Blvd 37.333988 -121.894902 1379 Subscriber 1957.0 Male No
1741533 2071 2018-12-01 00:09:55.5800 2018-12-01 00:44:26.6290 36.0 Folsom St at 3rd St 37.783830 -122.398870 97.0 14th St at Mission St 37.768265 -122.420110 316 Customer 1991.0 Male No
1741534 1958 2018-12-01 00:11:35.0220 2018-12-01 00:44:13.6320 44.0 Civic Center/UN Plaza BART Station (Market St ... 37.781074 -122.411738 5.0 Powell St BART Station (Market St at 5th St) 37.783899 -122.408445 4395 Subscriber 1993.0 Male No
1741535 268 2018-12-01 00:37:23.1340 2018-12-01 00:41:51.9670 147.0 29th St at Tiffany Ave 37.744067 -122.421472 121.0 Mission Playground 37.759210 -122.421339 4451 Subscriber 1989.0 Male No
1741536 1679 2018-12-01 00:13:35.1430 2018-12-01 00:41:34.8600 240.0 Haste St at Telegraph Ave 37.866043 -122.258804 245.0 Downtown Berkeley BART 37.870139 -122.268422 3701 Customer 1998.0 Female No
1741537 1658 2018-12-01 00:13:33.1240 2018-12-01 00:41:11.7630 240.0 Haste St at Telegraph Ave 37.866043 -122.258804 245.0 Downtown Berkeley BART 37.870139 -122.268422 3625 Customer 1998.0 Female No
1741538 293 2018-12-01 00:36:01.5250 2018-12-01 00:40:55.2850 95.0 Sanchez St at 15th St 37.766219 -122.431060 72.0 Page St at Scott St 37.772406 -122.435650 1043 Subscriber 1981.0 Male No
1741539 426 2018-12-01 00:32:13.1250 2018-12-01 00:39:19.8710 50.0 2nd St at Townsend St 37.780526 -122.390288 21.0 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 3154 Subscriber 1986.0 Male No
1741540 447 2018-12-01 00:31:36.1480 2018-12-01 00:39:03.3910 368.0 Myrtle St at Polk St 37.785434 -122.419622 17.0 Embarcadero BART Station (Beale St at Market St) 37.792251 -122.397086 1677 Subscriber 1980.0 Other Yes
1741541 1694 2018-12-01 00:09:17.1590 2018-12-01 00:37:31.2400 14.0 Clay St at Battery St 37.795001 -122.399970 147.0 29th St at Tiffany Ave 37.744067 -122.421472 209 Subscriber 1993.0 Male No
1741542 269 2018-12-01 00:31:00.0910 2018-12-01 00:35:29.8710 98.0 Valencia St at 16th St 37.765052 -122.421866 121.0 Mission Playground 37.759210 -122.421339 3147 Subscriber 1993.0 Male No
1741543 685 2018-12-01 00:21:16.2400 2018-12-01 00:32:42.0000 109.0 17th St at Valencia St 37.763316 -122.421904 60.0 8th St at Ringold St 37.774520 -122.409449 3247 Subscriber 1987.0 Female No
1741544 681 2018-12-01 00:19:41.3830 2018-12-01 00:31:02.6050 15.0 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 323.0 Broadway at Kearny 37.798014 -122.405950 3460 Subscriber 1959.0 Male No
1741545 1166 2018-12-01 00:11:04.8640 2018-12-01 00:30:31.3480 160.0 West Oakland BART Station 37.805318 -122.294837 155.0 Emeryville Public Market 37.840521 -122.293528 1579 Subscriber 1997.0 Male No
1741546 763 2018-12-01 00:17:34.4970 2018-12-01 00:30:18.1070 73.0 Pierce St at Haight St 37.771793 -122.433708 121.0 Mission Playground 37.759210 -122.421339 90 Subscriber 1991.0 Female No
1741547 760 2018-12-01 00:17:29.9600 2018-12-01 00:30:10.1780 73.0 Pierce St at Haight St 37.771793 -122.433708 121.0 Mission Playground 37.759210 -122.421339 2758 Subscriber 1991.0 Male No
1741548 538 2018-12-01 00:16:34.0850 2018-12-01 00:25:32.4550 5.0 Powell St BART Station (Market St at 5th St) 37.783899 -122.408445 39.0 Scott St at Golden Gate Ave 37.778999 -122.436861 4384 Subscriber 1991.0 Male No
1741549 671 2018-12-01 00:12:49.6400 2018-12-01 00:24:01.5120 34.0 Father Alfred E Boeddeker Park 37.783988 -122.412408 92.0 Mission Bay Kids Park 37.772301 -122.393028 4377 Subscriber 1972.0 Male Yes
1741550 498 2018-12-01 00:14:41.7250 2018-12-01 00:23:00.4080 7.0 Frank H Ogawa Plaza 37.804562 -122.271738 214.0 Market St at Brockhurst St 37.823321 -122.275732 2236 Subscriber 1992.0 Male No
1741551 1137 2018-12-01 00:01:49.6930 2018-12-01 00:20:47.5190 73.0 Pierce St at Haight St 37.771793 -122.433708 50.0 2nd St at Townsend St 37.780526 -122.390288 273 Subscriber 1990.0 Male No
1741552 473 2018-12-01 00:11:54.8110 2018-12-01 00:19:48.5470 345.0 Hubbell St at 16th St 37.766474 -122.398295 81.0 Berry St at 4th St 37.775880 -122.393170 3035 Subscriber 1982.0 Female No
1741553 841 2018-12-01 00:02:48.7260 2018-12-01 00:16:49.7660 10.0 Washington St at Kearny St 37.795393 -122.404770 58.0 Market St at 10th St 37.776619 -122.417385 2034 Subscriber 1999.0 Female No
1741554 260 2018-12-01 00:05:27.6150 2018-12-01 00:09:47.9560 245.0 Downtown Berkeley BART 37.870139 -122.268422 255.0 Virginia St at Shattuck Ave 37.876573 -122.269528 2243 Subscriber 1991.0 Male No
1741555 292 2018-12-01 00:03:06.5490 2018-12-01 00:07:59.0800 93.0 4th St at Mission Bay Blvd S 37.770407 -122.391198 126.0 Esprit Park 37.761634 -122.390648 545 Subscriber 1963.0 Male No

1741556 rows × 16 columns

change data type

I'll convert the columns to appropriate data types to make the analysis easier

start_time: object -> datetime
end_time: object -> datetime
start_station_id: float -> int
end_station_id: float -> int
user_type: object -> CategoricalDtype
member_birth_year: float -> int
member_gender: object -> CategoricalDtype
bike_share_for_all_trip: object -> CategoricalDtype

In [10]:
# target dtype for each datetime / integer column
# ('datetime64[ns]' instead of the unit-less 'datetime64', which newer pandas versions reject)
numeric_column_dtype = {'start_time': 'datetime64[ns]',
                        'end_time': 'datetime64[ns]',
                        'start_station_id': 'int64',
                        'end_station_id': 'int64',
                        'member_birth_year': 'int64'}

# allowed categories for each categorical column
# NOTE: values outside these lists (e.g. member_gender 'Other') become NaN after conversion
categorical_column_dtype = {'user_type': ['Subscriber', 'Customer'],
                            'member_gender': ['Male', 'Female'],
                            'bike_share_for_all_trip': ['Yes', 'No']}

def change_into_dtype(dataframe, column, dtype):
    '''convert one column of the dataframe to dtype (in place)'''
    dataframe[column] = dataframe[column].astype(dtype)

def iterate_dict_and_change_numeric_dtype(dataframe):
    '''convert the datetime and integer columns listed in numeric_column_dtype'''
    for column in numeric_column_dtype:
        change_into_dtype(dataframe, column, numeric_column_dtype[column])

def iterate_dict_and_change_categorical_dtype(dataframe):
    '''convert the columns listed in categorical_column_dtype to ordered categoricals'''
    for column in categorical_column_dtype:
        target_dtype = pd.api.types.CategoricalDtype(ordered=True,
                                                     categories=categorical_column_dtype[column])
        change_into_dtype(dataframe, column, target_dtype)
In [11]:
iterate_dict_and_change_numeric_dtype(bike_rental)
iterate_dict_and_change_categorical_dtype(bike_rental)
In [12]:
bike_rental.dtypes
Out[12]:
duration_sec                        int64
start_time                 datetime64[ns]
end_time                   datetime64[ns]
start_station_id                    int64
start_station_name                 object
start_station_latitude            float64
start_station_longitude           float64
end_station_id                      int64
end_station_name                   object
end_station_latitude              float64
end_station_longitude             float64
bike_id                             int64
user_type                        category
member_birth_year                   int64
member_gender                    category
bike_share_for_all_trip          category
dtype: object
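One side effect of the categorical conversion is worth checking: when a CategoricalDtype lists only some of the values that occur in a column, pandas turns the unlisted values into NaN. The member_gender column contains 'Other' in addition to 'Male' and 'Female', so those rows lose their gender value. A minimal sketch on a toy series (the variable names are illustrative only):

```python
import pandas as pd

gender = pd.Series(['Male', 'Female', 'Other', 'Male'])

two_level = pd.api.types.CategoricalDtype(categories=['Male', 'Female'])
three_level = pd.api.types.CategoricalDtype(categories=['Male', 'Female', 'Other'])

# Values that are not in the category list are silently replaced by NaN
lost = int(gender.astype(two_level).isna().sum())
kept = int(gender.astype(three_level).isna().sum())
print(lost, kept)
```

Whether to keep 'Other' as a third category is an analysis choice; the sketch only shows how to detect the difference.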

What is the structure of your dataset?

After dropping rows with null values, the data set has 1,741,556 rows × 16 columns

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is duration_sec, because it is closely related to rental fees.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

I expect start hour, month, user type, gender, and birth year to be the most relevant features.
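Start hour and month are not columns in the raw data, so they would have to be derived from start_time. A minimal sketch, assuming start_time is already a datetime column; the new column names start_hour, start_month and member_age are hypothetical, and the two rows are toy values:

```python
import pandas as pd

trips = pd.DataFrame({
    'start_time': pd.to_datetime(['2018-01-31 22:52:35.239',
                                  '2018-12-01 00:48:27.529']),
    'member_birth_year': [1986, 1994],
})

# The .dt accessor exposes the components of a datetime column
trips['start_hour'] = trips['start_time'].dt.hour
trips['start_month'] = trips['start_time'].dt.month

# Approximate rider age in the year of the data (2018)
trips['member_age'] = 2018 - trips['member_birth_year']
print(trips)
```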

Visualization

Univariate plots

duration_sec histogram

duration_sec is the most important feature because it is closely related to the rental fee, so I'll start by looking at its distribution.

In [13]:
# create duration_sec bins
duration_bins = np.arange(50, bike_rental.duration_sec.max(), 200)
duration_bins
Out[13]:
array([   50,   250,   450,   650,   850,  1050,  1250,  1450,  1650,
        1850,  2050,  2250,  2450,  2650,  2850,  3050,  3250,  3450,
        3650,  3850,  4050,  4250,  4450,  4650,  4850,  5050,  5250,
        5450,  5650,  5850,  6050,  6250,  6450,  6650,  6850,  7050,
        7250,  7450,  7650,  7850,  8050,  8250,  8450,  8650,  8850,
        9050,  9250,  9450,  9650,  9850, 10050, 10250, 10450, 10650,
       10850, 11050, 11250, 11450, 11650, 11850, 12050, 12250, 12450,
       12650, 12850, 13050, 13250, 13450, 13650, 13850, 14050, 14250,
       14450, 14650, 14850, 15050, 15250, 15450, 15650, 15850, 16050,
       16250, 16450, 16650, 16850, 17050, 17250, 17450, 17650, 17850,
       18050, 18250, 18450, 18650, 18850, 19050, 19250, 19450, 19650,
       19850, 20050, 20250, 20450, 20650, 20850, 21050, 21250, 21450,
       21650, 21850, 22050, 22250, 22450, 22650, 22850, 23050, 23250,
       23450, 23650, 23850, 24050, 24250, 24450, 24650, 24850, 25050,
       25250, 25450, 25650, 25850, 26050, 26250, 26450, 26650, 26850,
       27050, 27250, 27450, 27650, 27850, 28050, 28250, 28450, 28650,
       28850, 29050, 29250, 29450, 29650, 29850, 30050, 30250, 30450,
       30650, 30850, 31050, 31250, 31450, 31650, 31850, 32050, 32250,
       32450, 32650, 32850, 33050, 33250, 33450, 33650, 33850, 34050,
       34250, 34450, 34650, 34850, 35050, 35250, 35450, 35650, 35850,
       36050, 36250, 36450, 36650, 36850, 37050, 37250, 37450, 37650,
       37850, 38050, 38250, 38450, 38650, 38850, 39050, 39250, 39450,
       39650, 39850, 40050, 40250, 40450, 40650, 40850, 41050, 41250,
       41450, 41650, 41850, 42050, 42250, 42450, 42650, 42850, 43050,
       43250, 43450, 43650, 43850, 44050, 44250, 44450, 44650, 44850,
       45050, 45250, 45450, 45650, 45850, 46050, 46250, 46450, 46650,
       46850, 47050, 47250, 47450, 47650, 47850, 48050, 48250, 48450,
       48650, 48850, 49050, 49250, 49450, 49650, 49850, 50050, 50250,
       50450, 50650, 50850, 51050, 51250, 51450, 51650, 51850, 52050,
       52250, 52450, 52650, 52850, 53050, 53250, 53450, 53650, 53850,
       54050, 54250, 54450, 54650, 54850, 55050, 55250, 55450, 55650,
       55850, 56050, 56250, 56450, 56650, 56850, 57050, 57250, 57450,
       57650, 57850, 58050, 58250, 58450, 58650, 58850, 59050, 59250,
       59450, 59650, 59850, 60050, 60250, 60450, 60650, 60850, 61050,
       61250, 61450, 61650, 61850, 62050, 62250, 62450, 62650, 62850,
       63050, 63250, 63450, 63650, 63850, 64050, 64250, 64450, 64650,
       64850, 65050, 65250, 65450, 65650, 65850, 66050, 66250, 66450,
       66650, 66850, 67050, 67250, 67450, 67650, 67850, 68050, 68250,
       68450, 68650, 68850, 69050, 69250, 69450, 69650, 69850, 70050,
       70250, 70450, 70650, 70850, 71050, 71250, 71450, 71650, 71850,
       72050, 72250, 72450, 72650, 72850, 73050, 73250, 73450, 73650,
       73850, 74050, 74250, 74450, 74650, 74850, 75050, 75250, 75450,
       75650, 75850, 76050, 76250, 76450, 76650, 76850, 77050, 77250,
       77450, 77650, 77850, 78050, 78250, 78450, 78650, 78850, 79050,
       79250, 79450, 79650, 79850, 80050, 80250, 80450, 80650, 80850,
       81050, 81250, 81450, 81650, 81850, 82050, 82250, 82450, 82650,
       82850, 83050, 83250, 83450, 83650, 83850, 84050, 84250, 84450,
       84650, 84850, 85050, 85250, 85450, 85650, 85850, 86050, 86250],
      dtype=int64)
In [14]:
# plot histogram of duration_sec
base_color = sns.color_palette()[0]
plt.hist(data=bike_rental, x='duration_sec', color=base_color, bins=duration_bins)
plt.xlim(0, 4000)
plt.title('Distribution of Rental Duration')
plt.xlabel('Rental Duration (sec)')
plt.ylabel('Counts');

Many trips last about 500 seconds. The distribution is highly right-skewed, so I'll apply a log transformation.
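The bin edges built in In [13] run up to the maximum duration even though the plot only shows 0 to 4000 seconds. One alternative is to build edges for the visible range only. This is a sketch with toy durations, not the real data; bin_width and x_max are illustrative names:

```python
import numpy as np

bin_width = 200
x_max = 4000

# Edges 0, 200, ..., 4000 cover exactly the range shown on the x axis
duration_bins = np.arange(0, x_max + bin_width, bin_width)

# Toy durations in seconds; the last one lies far outside the plotted range
durations = np.array([61, 180, 453, 996, 2219, 75284])
counts, edges = np.histogram(durations, bins=duration_bins)
print(counts.sum())
```

Values beyond the last edge are simply left out of the counts, which matches what plt.xlim hides in the plot above.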

log transformation of duration_sec histogram

In [15]:
np.log10(bike_rental.duration_sec.describe())
Out[15]:
count    6.240937
mean     2.888102
std      3.288484
min      1.785330
25%      2.536558
50%      2.734800
75%      2.923762
max      4.935915
Name: duration_sec, dtype: float64
In [16]:
# plot histogram of duration_sec log transformation
log_duration_bins = 10 ** np.arange(1, 5.0+0.1, 0.1)
ticks = [10, 30, 100, 300, 1000, 3000, 10000, 30000, 100000]
labels = ['{}'.format(val) for val in ticks]

plt.hist(data=bike_rental, x='duration_sec', bins=log_duration_bins)
plt.xscale('log')
plt.xticks(ticks, labels)
plt.title('Distribution of Rental Duration (log scale)')
plt.xlabel('Rental Duration (sec, log scale)')
plt.ylabel('Counts');

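On a log scale the distribution looks approximately normal, which suggests that duration_sec is roughly log-normally distributed.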
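One way to back up the visual impression is to compare the skewness before and after the log transformation. The sketch below uses synthetic log-normal data as a stand-in for duration_sec; the median of about 540 seconds and the spread are rough assumptions based on the quartiles printed above, not fitted values.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for duration_sec (assumed parameters, not the real data)
rng = np.random.default_rng(0)
durations = pd.Series(rng.lognormal(mean=np.log(540), sigma=0.66, size=10_000))

raw_skew = durations.skew()            # strongly positive for right-skewed data
log_skew = np.log10(durations).skew()  # close to 0 if the data are log-normal
print(raw_skew, log_skew)
```

On the real data, the same two calls on bike_rental.duration_sec would show how close the log-transformed values come to a symmetric distribution.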

countplot of member_gender

In [17]:
sns.countplot(data=bike_rental, x='member_gender', color=base_color);
plt.title('Distribution of member gender');

Male riders account for about three times as many trips as female riders.

countplot of user_type

In [18]:
sns.countplot(data=bike_rental, x='user_type', color=base_color)
plt.title('Distribution of user type');

Subscribers account for about seven times as many trips as customers.
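Ratios like this can be read off the bar chart, but value_counts gives the exact numbers. A minimal sketch on a toy series (made-up counts chosen to give a 7:1 ratio, not the real data):

```python
import pandas as pd

# Toy data: 7 subscriber trips and 1 customer trip
user_type = pd.Series(['Subscriber'] * 7 + ['Customer'])

counts = user_type.value_counts()
shares = user_type.value_counts(normalize=True)   # fraction of trips per group
ratio = counts['Subscriber'] / counts['Customer']
print(shares)
print(ratio)
```

The same calls on bike_rental.user_type or bike_rental.member_gender would give the exact proportions behind the two countplots.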

histogram of member_birth_year

In [19]:
# keep trips shorter than 2500 seconds to cut off the long right tail of duration_sec
bike_rental_cut_duration_outliers = bike_rental.query('duration_sec < 2500').reset_index(drop=True)
bike_rental_cut_duration_outliers
Out[19]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip
0 453 2018-01-31 23:53:53.632 2018-02-01 00:01:26.805 110 17th & Folsom Street Park (17th St at Folsom St) 37.763708 -122.415204 134 Valencia St at 24th St 37.752428 -122.420628 3571 Subscriber 1988 Male No
1 180 2018-01-31 23:52:09.903 2018-01-31 23:55:10.807 81 Berry St at 4th St 37.775880 -122.393170 93 4th St at Mission Bay Blvd S 37.770407 -122.391198 1403 Subscriber 1980 Male No
2 996 2018-01-31 23:34:56.004 2018-01-31 23:51:32.674 134 Valencia St at 24th St 37.752428 -122.420628 4 Cyril Magnin St at Ellis St 37.785881 -122.408915 3675 Subscriber 1987 Male Yes
3 825 2018-01-31 23:34:14.027 2018-01-31 23:47:59.809 305 Ryland Park 37.342725 -121.895617 317 San Salvador St at 9th St 37.333955 -121.877349 1453 Subscriber 1994 Female Yes
4 432 2018-01-31 23:34:26.484 2018-01-31 23:41:39.297 89 Division St at Potrero Ave 37.769218 -122.407646 43 San Francisco Public Library (Grove St at Hyde... 37.778768 -122.415929 2928 Subscriber 1993 Male No
5 601 2018-01-31 23:29:46.832 2018-01-31 23:39:48.000 223 16th St Mission BART Station 2 37.764765 -122.420091 86 Market St at Dolores St 37.769305 -122.426826 3016 Subscriber 1957 Male No
6 887 2018-01-31 23:24:16.357 2018-01-31 23:39:04.123 308 San Pedro Square 37.336802 -121.894090 297 Locust St at Grant St 37.322980 -121.887931 55 Subscriber 1976 Female Yes
7 210 2018-01-31 23:33:03.046 2018-01-31 23:36:33.704 7 Frank H Ogawa Plaza 37.804562 -122.271738 186 Lakeside Dr at 14th St 37.801319 -122.262642 2602 Subscriber 1976 Male No
8 188 2018-01-31 23:30:58.136 2018-01-31 23:34:06.391 98 Valencia St at 16th St 37.765052 -122.421866 76 McCoppin St at Valencia St 37.771662 -122.422423 2556 Subscriber 1964 Female No
9 808 2018-01-31 23:19:58.603 2018-01-31 23:33:27.531 67 San Francisco Caltrain Station 2 (Townsend St... 37.776639 -122.395526 98 Valencia St at 16th St 37.765052 -122.421866 3041 Subscriber 1976 Male Yes
10 378 2018-01-31 23:23:23.068 2018-01-31 23:29:42.044 80 Townsend St at 5th St 37.775306 -122.397380 78 Folsom St at 9th St 37.773717 -122.411647 546 Subscriber 1995 Female No
11 686 2018-01-31 23:07:15.313 2018-01-31 23:18:41.558 312 San Jose Diridon Station 37.329732 -121.901782 317 San Salvador St at 9th St 37.333955 -121.877349 1886 Subscriber 1997 Female No
12 450 2018-01-31 23:07:13.063 2018-01-31 23:14:43.814 241 Ashby BART Station 37.852477 -122.270213 157 65th St at Hollis St 37.846784 -122.291376 3583 Subscriber 1994 Male No
13 294 2018-01-31 23:08:12.000 2018-01-31 23:13:06.636 239 Bancroft Way at Telegraph Ave 37.868813 -122.258764 244 Shattuck Ave at Hearst Ave 37.873792 -122.268618 2144 Subscriber 1983 Male No
14 150 2018-01-31 23:10:09.586 2018-01-31 23:12:40.333 182 19th Street BART Station 37.809013 -122.268247 183 Telegraph Ave at 19th St 37.808702 -122.269927 3468 Subscriber 1945 Male Yes
15 462 2018-01-31 23:03:48.940 2018-01-31 23:11:31.029 119 18th St at Noe St 37.761047 -122.432642 134 Valencia St at 24th St 37.752428 -122.420628 1432 Subscriber 1971 Male Yes
16 379 2018-01-31 23:04:27.701 2018-01-31 23:10:46.976 176 MacArthur BART Station 37.828410 -122.266315 189 Genoa St at 55th St 37.839649 -122.271756 997 Subscriber 1975 Male No
17 880 2018-01-31 22:53:41.302 2018-01-31 23:08:21.443 123 Folsom St at 19th St 37.760594 -122.414817 145 29th St at Church St 37.743684 -122.426806 3725 Subscriber 1986 Male No
18 1210 2018-01-31 22:45:37.125 2018-01-31 23:05:47.576 285 Webster St at O'Farrell St 37.783521 -122.431158 133 Valencia St at 22nd St 37.755213 -122.420975 1059 Subscriber 1991 Male No
19 259 2018-01-31 23:01:12.792 2018-01-31 23:05:32.157 239 Bancroft Way at Telegraph Ave 37.868813 -122.258764 266 Parker St at Fulton St 37.862464 -122.264791 1208 Subscriber 1994 Male No
20 592 2018-01-31 22:53:27.779 2018-01-31 23:03:20.290 202 Washington St at 8th St 37.800754 -122.274894 195 Bay Pl at Vernon St 37.812314 -122.260779 1834 Customer 1978 Male No
21 1059 2018-01-31 22:45:16.570 2018-01-31 23:02:56.285 141 Valencia St at Cesar Chavez St 37.747998 -122.420219 79 7th St at Brannan St 37.773492 -122.403673 1248 Subscriber 1988 Male No
22 375 2018-01-31 22:56:31.082 2018-01-31 23:02:46.583 15 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 6 The Embarcadero at Sansome St 37.804770 -122.403234 3401 Subscriber 1988 Male No
23 300 2018-01-31 22:57:24.042 2018-01-31 23:02:24.272 114 Rhode Island St at 17th St 37.764478 -122.402570 93 4th St at Mission Bay Blvd S 37.770407 -122.391198 3224 Subscriber 1981 Male No
24 2219 2018-01-31 22:24:39.943 2018-01-31 23:01:39.571 30 San Francisco Caltrain (Townsend St at 4th St) 37.776598 -122.395282 81 Berry St at 4th St 37.775880 -122.393170 1757 Subscriber 1991 Female No
25 330 2018-01-31 22:55:48.073 2018-01-31 23:01:18.506 99 Folsom St at 15th St 37.767037 -122.415443 124 19th St at Florida St 37.760447 -122.410807 3379 Subscriber 1983 Male No
26 870 2018-01-31 22:45:38.235 2018-01-31 23:00:09.034 285 Webster St at O'Farrell St 37.783521 -122.431158 106 Sanchez St at 17th St 37.763242 -122.430675 1503 Customer 1990 Female No
27 2032 2018-01-31 22:25:00.185 2018-01-31 22:58:53.071 182 19th Street BART Station 37.809013 -122.268247 266 Parker St at Fulton St 37.862464 -122.264791 2169 Subscriber 1990 Male No
28 196 2018-01-31 22:55:25.413 2018-01-31 22:58:42.142 93 4th St at Mission Bay Blvd S 37.770407 -122.391198 81 Berry St at 4th St 37.775880 -122.393170 1403 Subscriber 1980 Male No
29 1499 2018-01-31 22:33:06.643 2018-01-31 22:58:06.304 44 Civic Center/UN Plaza BART Station (Market St ... 37.781074 -122.411738 144 Precita Park 37.747300 -122.411403 2528 Subscriber 1970 Female No
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1706901 331 2018-12-01 00:48:27.529 2018-12-01 00:53:59.495 70 Central Ave at Fell St 37.773311 -122.444293 39 Scott St at Golden Gate Ave 37.778999 -122.436861 968 Subscriber 1994 Male No
1706902 310 2018-12-01 00:45:44.868 2018-12-01 00:50:55.797 252 Channing Way at Shattuck Ave 37.865847 -122.267443 238 MLK Jr Way at University Ave 37.871719 -122.273068 367 Subscriber 1991 Male No
1706903 1338 2018-12-01 00:27:24.875 2018-12-01 00:49:43.349 371 Lombard St at Columbus Ave 37.802746 -122.413579 104 4th St at 16th St 37.767045 -122.390833 1985 Subscriber 1986 Male No
1706904 154 2018-12-01 00:44:24.838 2018-12-01 00:46:59.335 70 Central Ave at Fell St 37.773311 -122.444293 52 McAllister St at Baker St 37.777416 -122.441838 449 Subscriber 1994 Male No
1706905 862 2018-12-01 00:32:11.063 2018-12-01 00:46:33.790 4 Cyril Magnin St at Ellis St 37.785881 -122.408915 86 Market St at Dolores St 37.769305 -122.426826 4408 Customer 1975 Male No
1706906 1310 2018-12-01 00:23:53.342 2018-12-01 00:45:43.588 85 Church St at Duboce Ave 37.770083 -122.429156 13 Commercial St at Montgomery St 37.794231 -122.402923 1525 Subscriber 1994 Male Yes
1706907 230 2018-12-01 00:41:31.955 2018-12-01 00:45:22.416 312 San Jose Diridon Station 37.329732 -121.901782 314 Santa Clara St at Almaden Blvd 37.333988 -121.894902 1379 Subscriber 1957 Male No
1706908 2071 2018-12-01 00:09:55.580 2018-12-01 00:44:26.629 36 Folsom St at 3rd St 37.783830 -122.398870 97 14th St at Mission St 37.768265 -122.420110 316 Customer 1991 Male No
1706909 1958 2018-12-01 00:11:35.022 2018-12-01 00:44:13.632 44 Civic Center/UN Plaza BART Station (Market St ... 37.781074 -122.411738 5 Powell St BART Station (Market St at 5th St) 37.783899 -122.408445 4395 Subscriber 1993 Male No
1706910 268 2018-12-01 00:37:23.134 2018-12-01 00:41:51.967 147 29th St at Tiffany Ave 37.744067 -122.421472 121 Mission Playground 37.759210 -122.421339 4451 Subscriber 1989 Male No
1706911 1679 2018-12-01 00:13:35.143 2018-12-01 00:41:34.860 240 Haste St at Telegraph Ave 37.866043 -122.258804 245 Downtown Berkeley BART 37.870139 -122.268422 3701 Customer 1998 Female No
1706912 1658 2018-12-01 00:13:33.124 2018-12-01 00:41:11.763 240 Haste St at Telegraph Ave 37.866043 -122.258804 245 Downtown Berkeley BART 37.870139 -122.268422 3625 Customer 1998 Female No
1706913 293 2018-12-01 00:36:01.525 2018-12-01 00:40:55.285 95 Sanchez St at 15th St 37.766219 -122.431060 72 Page St at Scott St 37.772406 -122.435650 1043 Subscriber 1981 Male No
1706914 426 2018-12-01 00:32:13.125 2018-12-01 00:39:19.871 50 2nd St at Townsend St 37.780526 -122.390288 21 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 3154 Subscriber 1986 Male No
1706915 447 2018-12-01 00:31:36.148 2018-12-01 00:39:03.391 368 Myrtle St at Polk St 37.785434 -122.419622 17 Embarcadero BART Station (Beale St at Market St) 37.792251 -122.397086 1677 Subscriber 1980 NaN Yes
1706916 1694 2018-12-01 00:09:17.159 2018-12-01 00:37:31.240 14 Clay St at Battery St 37.795001 -122.399970 147 29th St at Tiffany Ave 37.744067 -122.421472 209 Subscriber 1993 Male No
1706917 269 2018-12-01 00:31:00.091 2018-12-01 00:35:29.871 98 Valencia St at 16th St 37.765052 -122.421866 121 Mission Playground 37.759210 -122.421339 3147 Subscriber 1993 Male No
1706918 685 2018-12-01 00:21:16.240 2018-12-01 00:32:42.000 109 17th St at Valencia St 37.763316 -122.421904 60 8th St at Ringold St 37.774520 -122.409449 3247 Subscriber 1987 Female No
1706919 681 2018-12-01 00:19:41.383 2018-12-01 00:31:02.605 15 San Francisco Ferry Building (Harry Bridges Pl... 37.795392 -122.394203 323 Broadway at Kearny 37.798014 -122.405950 3460 Subscriber 1959 Male No
1706920 1166 2018-12-01 00:11:04.864 2018-12-01 00:30:31.348 160 West Oakland BART Station 37.805318 -122.294837 155 Emeryville Public Market 37.840521 -122.293528 1579 Subscriber 1997 Male No
1706921 763 2018-12-01 00:17:34.497 2018-12-01 00:30:18.107 73 Pierce St at Haight St 37.771793 -122.433708 121 Mission Playground 37.759210 -122.421339 90 Subscriber 1991 Female No
1706922 760 2018-12-01 00:17:29.960 2018-12-01 00:30:10.178 73 Pierce St at Haight St 37.771793 -122.433708 121 Mission Playground 37.759210 -122.421339 2758 Subscriber 1991 Male No
1706923 538 2018-12-01 00:16:34.085 2018-12-01 00:25:32.455 5 Powell St BART Station (Market St at 5th St) 37.783899 -122.408445 39 Scott St at Golden Gate Ave 37.778999 -122.436861 4384 Subscriber 1991 Male No
1706924 671 2018-12-01 00:12:49.640 2018-12-01 00:24:01.512 34 Father Alfred E Boeddeker Park 37.783988 -122.412408 92 Mission Bay Kids Park 37.772301 -122.393028 4377 Subscriber 1972 Male Yes
1706925 498 2018-12-01 00:14:41.725 2018-12-01 00:23:00.408 7 Frank H Ogawa Plaza 37.804562 -122.271738 214 Market St at Brockhurst St 37.823321 -122.275732 2236 Subscriber 1992 Male No
1706926 1137 2018-12-01 00:01:49.693 2018-12-01 00:20:47.519 73 Pierce St at Haight St 37.771793 -122.433708 50 2nd St at Townsend St 37.780526 -122.390288 273 Subscriber 1990 Male No
1706927 473 2018-12-01 00:11:54.811 2018-12-01 00:19:48.547 345 Hubbell St at 16th St 37.766474 -122.398295 81 Berry St at 4th St 37.775880 -122.393170 3035 Subscriber 1982 Female No
1706928 841 2018-12-01 00:02:48.726 2018-12-01 00:16:49.766 10 Washington St at Kearny St 37.795393 -122.404770 58 Market St at 10th St 37.776619 -122.417385 2034 Subscriber 1999 Female No
1706929 260 2018-12-01 00:05:27.615 2018-12-01 00:09:47.956 245 Downtown Berkeley BART 37.870139 -122.268422 255 Virginia St at Shattuck Ave 37.876573 -122.269528 2243 Subscriber 1991 Male No
1706930 292 2018-12-01 00:03:06.549 2018-12-01 00:07:59.080 93 4th St at Mission Bay Blvd S 37.770407 -122.391198 126 Esprit Park 37.761634 -122.390648 545 Subscriber 1963 Male No

1706931 rows × 16 columns

In [20]:
birth_bins = np.arange(bike_rental_cut_duration_outliers.member_birth_year.min(), bike_rental_cut_duration_outliers.member_birth_year.max(), 5)
birth_bins
Out[20]:
array([1881, 1886, 1891, 1896, 1901, 1906, 1911, 1916, 1921, 1926, 1931,
       1936, 1941, 1946, 1951, 1956, 1961, 1966, 1971, 1976, 1981, 1986,
       1991, 1996], dtype=int64)
In [21]:
plt.hist(data=bike_rental_cut_duration_outliers, x='member_birth_year', bins=birth_bins);
plt.title('Distribution of user\'s birth year')
plt.xlabel('User\'s birth year')
plt.ylabel('Count');

This histogram is strongly left-skewed, which means that most riders are relatively young.

histogram of start time (hours)

In [22]:
# extract start hour from start_time
bike_rental_cut_duration_outliers['start_hour'] = bike_rental_cut_duration_outliers.start_time.dt.hour
bike_rental_cut_duration_outliers.head()
C:\Users\weroo\Anaconda\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
Out[22]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip start_hour
0 453 2018-01-31 23:53:53.632 2018-02-01 00:01:26.805 110 17th & Folsom Street Park (17th St at Folsom St) 37.763708 -122.415204 134 Valencia St at 24th St 37.752428 -122.420628 3571 Subscriber 1988 Male No 23
1 180 2018-01-31 23:52:09.903 2018-01-31 23:55:10.807 81 Berry St at 4th St 37.775880 -122.393170 93 4th St at Mission Bay Blvd S 37.770407 -122.391198 1403 Subscriber 1980 Male No 23
2 996 2018-01-31 23:34:56.004 2018-01-31 23:51:32.674 134 Valencia St at 24th St 37.752428 -122.420628 4 Cyril Magnin St at Ellis St 37.785881 -122.408915 3675 Subscriber 1987 Male Yes 23
3 825 2018-01-31 23:34:14.027 2018-01-31 23:47:59.809 305 Ryland Park 37.342725 -121.895617 317 San Salvador St at 9th St 37.333955 -121.877349 1453 Subscriber 1994 Female Yes 23
4 432 2018-01-31 23:34:26.484 2018-01-31 23:41:39.297 89 Division St at Potrero Ave 37.769218 -122.407646 43 San Francisco Public Library (Grove St at Hyde... 37.778768 -122.415929 2928 Subscriber 1993 Male No 23
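
The SettingWithCopyWarning above most likely appears because bike_rental_cut_duration_outliers was created by filtering another DataFrame (an assumption, since the cell that creates it is not shown in this part). A minimal sketch of one way to avoid the warning follows. The frame `trips`, its values, and the 3600-second cutoff are made up for illustration and are not the notebook's actual data or threshold.

```python
import pandas as pd

# Hypothetical stand-in for the trip data (values are made up for illustration).
trips = pd.DataFrame({
    'duration_sec': [453, 180, 90000],
    'start_time': pd.to_datetime(['2018-01-31 23:53:53.632',
                                  '2018-01-31 08:52:09.903',
                                  '2018-01-30 17:34:56.004']),
})

# Filtering returns an object that pandas may treat as a slice of `trips`.
# .copy() makes it an independent DataFrame, so adding a column is unambiguous.
# The 3600-second cutoff is an arbitrary example value.
trips_cut = trips[trips['duration_sec'] < 3600].copy()

# The .dt accessor extracts the hour for the whole column at once.
trips_cut['start_hour'] = trips_cut['start_time'].dt.hour
print(trips_cut)
```

The original frame is left untouched, and the new column exists only on the copy.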
In [23]:
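# 25 edges (0..24) give one bin per hour; with edges 0..23 the hours 22 and 23 would share the last bin
hour_bins = np.arange(0, 24+1, 1)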
plt.hist(data=bike_rental_cut_duration_outliers, x='start_hour', color=base_color, bins=hour_bins)
plt.title('Distribution of start hours')
plt.xlabel('Start hours')
plt.ylabel('Counts');

Most rides start between 8:00 and 9:00 and between 17:00 and 18:00, that is, in the morning and evening peaks.
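
A note on bin edges for hour-of-day data: plt.hist treats the bins array as edges, so N bins need N+1 edges, and the last bin is closed on both ends. The sketch below uses np.histogram, which follows the same binning rule, on a few made-up hours to show the difference between 24 and 25 edges.

```python
import numpy as np

hours = np.array([8, 8, 17, 22, 23, 23])  # made-up start hours

# 24 edges (0..23) give only 23 bins; the last bin [22, 23] is closed on both
# ends, so hours 22 and 23 are counted together.
counts_23_bins, _ = np.histogram(hours, bins=np.arange(0, 23 + 1, 1))

# 25 edges (0..24) give 24 bins, one per hour.
counts_24_bins, _ = np.histogram(hours, bins=np.arange(0, 24 + 1, 1))

print(len(counts_23_bins), counts_23_bins[-1])
print(len(counts_24_bins), counts_24_bins[-2:])
```

With one bin per hour, the late-evening counts are not inflated by a merged final bin.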

histogram of end time (hours)

In [24]:
# extract end hour from end_time
bike_rental_cut_duration_outliers['end_hour'] = bike_rental_cut_duration_outliers.end_time.dt.hour
bike_rental_cut_duration_outliers.head()
Out[24]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip start_hour end_hour
0 453 2018-01-31 23:53:53.632 2018-02-01 00:01:26.805 110 17th & Folsom Street Park (17th St at Folsom St) 37.763708 -122.415204 134 Valencia St at 24th St 37.752428 -122.420628 3571 Subscriber 1988 Male No 23 0
1 180 2018-01-31 23:52:09.903 2018-01-31 23:55:10.807 81 Berry St at 4th St 37.775880 -122.393170 93 4th St at Mission Bay Blvd S 37.770407 -122.391198 1403 Subscriber 1980 Male No 23 23
2 996 2018-01-31 23:34:56.004 2018-01-31 23:51:32.674 134 Valencia St at 24th St 37.752428 -122.420628 4 Cyril Magnin St at Ellis St 37.785881 -122.408915 3675 Subscriber 1987 Male Yes 23 23
3 825 2018-01-31 23:34:14.027 2018-01-31 23:47:59.809 305 Ryland Park 37.342725 -121.895617 317 San Salvador St at 9th St 37.333955 -121.877349 1453 Subscriber 1994 Female Yes 23 23
4 432 2018-01-31 23:34:26.484 2018-01-31 23:41:39.297 89 Division St at Potrero Ave 37.769218 -122.407646 43 San Francisco Public Library (Grove St at Hyde... 37.778768 -122.415929 2928 Subscriber 1993 Male No 23 23
In [25]:
plt.hist(data=bike_rental_cut_duration_outliers, x='end_hour', color=base_color, bins=hour_bins)
plt.title('Distribution of end hours')
plt.xlabel('End hours')
plt.ylabel('Counts');

The end hours show the same two peaks, from 8:00 to 9:00 and from 17:00 to 18:00 (morning and evening).

histogram of month

In [26]:
bike_rental_cut_duration_outliers['month'] = bike_rental_cut_duration_outliers.start_time.dt.month
bike_rental_cut_duration_outliers.head()
Out[26]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip start_hour end_hour month
0 453 2018-01-31 23:53:53.632 2018-02-01 00:01:26.805 110 17th & Folsom Street Park (17th St at Folsom St) 37.763708 -122.415204 134 Valencia St at 24th St 37.752428 -122.420628 3571 Subscriber 1988 Male No 23 0 1
1 180 2018-01-31 23:52:09.903 2018-01-31 23:55:10.807 81 Berry St at 4th St 37.775880 -122.393170 93 4th St at Mission Bay Blvd S 37.770407 -122.391198 1403 Subscriber 1980 Male No 23 23 1
2 996 2018-01-31 23:34:56.004 2018-01-31 23:51:32.674 134 Valencia St at 24th St 37.752428 -122.420628 4 Cyril Magnin St at Ellis St 37.785881 -122.408915 3675 Subscriber 1987 Male Yes 23 23 1
3 825 2018-01-31 23:34:14.027 2018-01-31 23:47:59.809 305 Ryland Park 37.342725 -121.895617 317 San Salvador St at 9th St 37.333955 -121.877349 1453 Subscriber 1994 Female Yes 23 23 1
4 432 2018-01-31 23:34:26.484 2018-01-31 23:41:39.297 89 Division St at Potrero Ave 37.769218 -122.407646 43 San Francisco Public Library (Grove St at Hyde... 37.778768 -122.415929 2928 Subscriber 1993 Male No 23 23 1
In [27]:
month_bins = np.arange(1, 13+1, 1)
plt.hist(data=bike_rental_cut_duration_outliers, x='month', color=base_color, bins=month_bins)
plt.title('Distribution of month')
plt.xlabel('Month')
plt.ylabel('Counts');

There are more rentals from May to October than from November to April.

countplot of bike share for all trip

In [28]:
sns.countplot(data=bike_rental_cut_duration_outliers, x='bike_share_for_all_trip', color=base_color)
plt.title('Distribution of bike share for all trip');

Bivariate plots

To start, I want to look at the pairwise correlations between features in the data.

In [29]:
numeric_vars = ['duration_sec', 'start_hour', 'end_hour', 'member_birth_year', 'month'] 
categoric_vars = ['user_type', 'member_gender', 'bike_share_for_all_trip']
In [30]:
# correlation heatmap plot between numeric variables
plt.figure(figsize = [8, 5]) 
sns.heatmap(bike_rental_cut_duration_outliers[numeric_vars].corr(), annot = True, fmt = '.2f',
            cmap = 'vlag_r', center = 0)
plt.show()

As expected, start_hour and end_hour are highly correlated, but the other numeric variables show little or no correlation with each other. To inspect the relationships visually, let's draw scatter plots.

In [31]:
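# correlation scatter plot between numeric variables
# np.random.choice returns row positions, so select with .iloc (position-based) rather than .loc (label-based)
samples = np.random.choice(bike_rental_cut_duration_outliers.shape[0], 500, replace=False)
bike_samples = bike_rental_cut_duration_outliers.iloc[samples, :]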

g = sns.PairGrid(data=bike_samples, vars=numeric_vars)
g = g.map_diag(plt.hist)
g.map_offdiag(plt.scatter);

The scatter plots confirm that start_hour and end_hour are strongly related.
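
A note on the sampling step: np.random.choice over shape[0] returns row positions, while .loc looks rows up by index label. The two agree only when the index runs 0..n-1 without gaps, which may not hold after outlier rows were dropped (whether the index was reset is not shown in this part). The sketch below uses a small made-up frame to show two position-safe alternatives.

```python
import numpy as np
import pandas as pd

# Made-up frame; after filtering, the index labels have gaps.
df = pd.DataFrame({'duration_sec': np.arange(10) * 100})
filtered = df[df['duration_sec'] % 200 == 0]   # keeps labels 0, 2, 4, 6, 8

rng = np.random.default_rng(0)
positions = rng.choice(filtered.shape[0], 3, replace=False)

# .iloc selects by position, so any integer in range(len(filtered)) is valid.
by_position = filtered.iloc[positions]

# DataFrame.sample draws random rows in a single call.
by_sample = filtered.sample(n=3, random_state=0)

print(by_position)
print(by_sample)
```

Either form returns exactly the requested number of existing rows, regardless of how the index is labeled.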

Next, let's look at how duration_sec relates to the categorical variables.

In [32]:
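# plot matrix of numeric features against categorical features
# np.random.choice returns row positions, so select with .iloc rather than .loc
samples = np.random.choice(bike_rental_cut_duration_outliers.shape[0], 1000, replace=False)
bike_samples = bike_rental_cut_duration_outliers.iloc[samples, :]

def boxgrid(x, y, **kwargs):
    # draw a single-color box plot; x and y are passed by keyword, which recent seaborn releases expect
    default_color = sns.color_palette()[0]
    sns.boxplot(x=x, y=y, color=default_color)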

plt.figure(figsize=[10, 10])
g = sns.PairGrid(data=bike_samples, y_vars='duration_sec', x_vars=categoric_vars, height=3, aspect=1.5)
g.map(boxgrid);
<Figure size 720x720 with 0 Axes>

It shows that Customers tend to ride longer than Subscribers. Trip durations for female riders are slightly longer than for male riders. Likewise, trips that are not part of bike share for all (No) are slightly longer than trips that are (Yes).

Let's look at each relationship in more detail.

violin plot (duration_sec against member_gender)

In [33]:
sns.violinplot(data=bike_rental, x='member_gender', y='duration_sec', color=base_color)
plt.title('duration seconds vs member gender');

There are so many outliers above the upper limit that the shape of the distribution is hard to read, so I plot the data with the outliers removed.

In [34]:
sns.violinplot(data=bike_rental_cut_duration_outliers, x='member_gender', y='duration_sec', color=base_color)
plt.title('duration seconds vs member gender except outliers');

Female riders ride longer than male riders.

In [35]:
sns.violinplot(data=bike_rental_cut_duration_outliers, x='user_type', y='duration_sec', color=base_color)
plt.title('duration seconds vs user type');

Customers ride longer than subscribers.
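
The visual comparison can be backed up with summary statistics per group. A minimal sketch, using a small made-up frame named `trips` with the same column names as the notebook (the values are illustrative only and are not results from the real data):

```python
import pandas as pd

# Made-up trips; only the column names match the notebook.
trips = pd.DataFrame({
    'user_type': ['Subscriber', 'Subscriber', 'Subscriber', 'Customer', 'Customer'],
    'duration_sec': [300, 450, 600, 900, 1500],
})

# Median is less sensitive to long-trip outliers than the mean, so both are shown.
summary = trips.groupby('user_type')['duration_sec'].agg(['median', 'mean', 'count'])
print(summary)
```

On the real data, the same call with bike_rental_cut_duration_outliers in place of `trips` would give the per-group medians behind the violin plots.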

Let's look at relationships between the three categorical features.

In [36]:
plt.figure(figsize = [8, 10])

plt.subplot(3, 1, 1)
sns.countplot(data=bike_rental_cut_duration_outliers, x='user_type', hue='member_gender', palette='Blues')

ax = plt.subplot(3, 1, 2)
sns.countplot(data=bike_rental_cut_duration_outliers, x='user_type', hue='bike_share_for_all_trip', palette='Blues')

ax = plt.subplot(3, 1, 3)
sns.countplot(data=bike_rental_cut_duration_outliers, x='bike_share_for_all_trip', hue='member_gender', palette='Blues')
ax.legend(ncol=2)
Out[36]:
<matplotlib.legend.Legend at 0x1f21b37d780>

There are more subscribers than customers, and more male riders than female riders. Trips that are not part of bike share for all outnumber trips that are.

duration_sec against member_birth_year

regplot

In [37]:
sns.regplot(data=bike_rental, x='member_birth_year', y='duration_sec', 
            fit_reg=False, scatter_kws={'alpha': 0.2});
plt.title('duration seconds vs member birth year');

The scatter plot makes it hard to see the relationship between member birth year and duration seconds, so I'll plot a heatmap instead.

heatmap

In [38]:
birth_bins_x = np.arange(1900, bike_rental_cut_duration_outliers.member_birth_year.max()+5, 5)
birth_bins_y = np.arange(0, bike_rental_cut_duration_outliers.duration_sec.max()+200, 200)
plt.hist2d(data=bike_rental_cut_duration_outliers, x='member_birth_year', y='duration_sec', bins=[birth_bins_x, birth_bins_y], 
           cmap = 'viridis_r')
plt.colorbar()
plt.title('duration seconds vs member birth year');
plt.xlabel('member birth year')
plt.ylabel('duration seconds');

The densest region corresponds to members born in the 1990s who ride for about 500 seconds.

heatmap(duration_sec against start_hour)

In [39]:
# +2 so the edges run up to max+1 and the last hour gets its own bin
hour_bins_x = np.arange(0, bike_rental_cut_duration_outliers.start_hour.max()+2, 1)
hour_bins_y = np.arange(0, bike_rental_cut_duration_outliers.duration_sec.max()+200, 200)
plt.hist2d(data=bike_rental_cut_duration_outliers, x='start_hour', y='duration_sec', bins=[hour_bins_x, hour_bins_y],
          cmap='viridis_r')
plt.colorbar()
plt.title('duration seconds vs start hours');
plt.xlabel('start hours')
plt.ylabel('duration seconds');

Rides are most concentrated at 8:00 in the morning and 17:00 in the afternoon, with durations of about 300 to 500 seconds.

heatmap (duration_sec against month)

In [40]:
# use separate names for the month bins so hour_bins_x and hour_bins_y stay valid for the later start_hour heatmaps
month_bins_x = np.arange(0, bike_rental_cut_duration_outliers.month.max()+2, 1)
month_bins_y = np.arange(0, bike_rental_cut_duration_outliers.duration_sec.max()+200, 200)
plt.hist2d(data=bike_rental_cut_duration_outliers, x='month', y='duration_sec', bins=[month_bins_x, month_bins_y],
          cmap='viridis_r', cmin=1000)
plt.colorbar()
plt.title('duration seconds vs month');
plt.xlabel('month')
plt.ylabel('duration seconds');

As expected, rides are most concentrated from May to October.

heatmap (member_gender against user_type)

In [41]:
type_counts = bike_rental_cut_duration_outliers.groupby(['user_type', 'member_gender']).size()
type_counts = type_counts.reset_index(name='count')
type_counts = type_counts.pivot(index='member_gender', columns='user_type', values='count')
sns.heatmap(type_counts, annot = True, fmt = 'd', cmap='viridis_r')
plt.title('member gender vs user type');

There are about three times as many male subscribers as female subscribers.
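
The ratio can also be read off a contingency table rather than estimated from the heatmap annotations. A minimal sketch with made-up rows follows; the frame `trips` and its counts are hypothetical and do not reproduce the real table.

```python
import pandas as pd

# Made-up rows; only the column names match the notebook.
trips = pd.DataFrame({
    'user_type': ['Subscriber'] * 8 + ['Customer'] * 4,
    'member_gender': ['Male'] * 6 + ['Female'] * 2 + ['Male'] * 2 + ['Female'] * 2,
})

# Rows are genders, columns are user types, cells are trip counts.
counts = pd.crosstab(trips['member_gender'], trips['user_type'])

# Male-to-female ratio within each user type.
ratio = counts.loc['Male'] / counts.loc['Female']
print(counts)
print(ratio)
```

pd.crosstab produces the same gender-by-user-type layout as the groupby and pivot steps in the cell above, in a single call.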

Talk about some of the relationships you observed in this part of the investigation. How did the feature of interest vary with other features in the dataset?

There are numeric variables and categorical variables. The numeric variables are duration_sec, member_birth_year, month, and start_hour, and the categorical variables are user_type, member_gender, and bike_share_for_all_trip. The variable of interest is duration_sec because it is closely related to rental fees.

member_birth_year: Rides are most concentrated among members born in the 1990s who ride for about 500 seconds
month: Rides are most concentrated from May to October
start_hour: Rides are most concentrated at 8:00 in the morning and 17:00 in the afternoon, lasting about 300 to 500 seconds
user_type: Customers tend to ride longer than Subscribers
member_gender: Trip durations for Female are slightly longer than for Male
bike_share_for_all_trip: Trip durations for No are slightly longer than for Yes

Multivariate plots

duration_sec, start_hour, member_gender

In [42]:
g = sns.FacetGrid(data=bike_rental_cut_duration_outliers, hue='member_gender', height=7)
g.map(plt.scatter, 'start_hour', 'duration_sec')
g.add_legend();
plt.title('')
Out[42]:
Text(0.5, 1.0, '')

The scatter plot is hard to read, so I'll use a heatmap for each gender.

In [43]:
bike_rental_cut_duration_outliers_male = bike_rental_cut_duration_outliers.query('member_gender == "Male"')
bike_rental_cut_duration_outliers_female = bike_rental_cut_duration_outliers.query('member_gender == "Female"')

plt.figure(figsize = [12, 5])

plt.subplot(1, 2, 1)
plt.hist2d(data=bike_rental_cut_duration_outliers_male, x='start_hour', y='duration_sec', bins=[hour_bins_x, hour_bins_y],
          cmap='viridis_r', cmin=50)
plt.colorbar()
plt.xlabel('start hours of male')
plt.ylabel('duration seconds');

plt.subplot(1, 2, 2)
plt.hist2d(data=bike_rental_cut_duration_outliers_female, x='start_hour', y='duration_sec', bins=[hour_bins_x, hour_bins_y],
          cmap='viridis_r', cmin=50)
plt.colorbar()
plt.xlabel('start hours of female')
plt.ylabel('duration seconds');

As the bivariate plots showed, trip durations for female riders are slightly longer than for male riders. These heatmaps add another piece of information: male riders ride at dawn more than female riders.

duration_sec, start_hour, user_type

In [44]:
bike_rental_cut_duration_outliers_sub = bike_rental_cut_duration_outliers.query('user_type == "Subscriber"')
bike_rental_cut_duration_outliers_cus = bike_rental_cut_duration_outliers.query('user_type == "Customer"')

plt.figure(figsize = [12, 5])

plt.subplot(1, 2, 1)
plt.hist2d(data=bike_rental_cut_duration_outliers_sub, x='start_hour', y='duration_sec', bins=[hour_bins_x, hour_bins_y],
          cmap='viridis_r', cmin=50)
plt.colorbar()
plt.xlabel('start hours of Subscriber')
plt.ylabel('duration seconds');

plt.subplot(1, 2, 2)
plt.hist2d(data=bike_rental_cut_duration_outliers_cus, x='start_hour', y='duration_sec', bins=[hour_bins_x, hour_bins_y],
          cmap='viridis_r', cmin=50)
plt.colorbar()
plt.xlabel('start hours of Customer')
plt.ylabel('duration seconds');

Subscribers ride more regularly at around 8:00 and 17:00 than Customers. In the Customer graph, this pattern disappears.
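
Because there are far more subscribers than customers, raw counts on two separate color scales can be hard to compare. One option is to compare the share of each group's trips that start in each hour. A minimal sketch with made-up rows follows; the frame `trips` and its values are hypothetical.

```python
import pandas as pd

# Made-up rows; only the column names match the notebook.
trips = pd.DataFrame({
    'user_type': ['Subscriber'] * 6 + ['Customer'] * 4,
    'start_hour': [8, 8, 8, 17, 17, 12, 10, 12, 14, 17],
})

# normalize='columns' divides each column by its total, so every user type sums to 1.
share = pd.crosstab(trips['start_hour'], trips['user_type'], normalize='columns')
print(share)
```

Each column then describes the hourly profile of one user type independent of how many trips that group made.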

duration_sec, month, member_gender

In [45]:
# point plot of 2 numeric variables and 1 categorical variable
sns.pointplot(data=bike_rental_cut_duration_outliers, x='month', y='duration_sec', hue='member_gender',
              palette = 'Blues', linestyles='', dodge=0.4)
plt.title('duration seconds vs month based on gender');

It shows that female riders tend to ride longer than male riders, especially in the summer.

duration_sec, month, user_type

In [46]:
# point plot of 2 numeric variables and 1 categorical variable
sns.pointplot(data=bike_rental_cut_duration_outliers, x='month', y='duration_sec', hue='user_type',
              palette = 'Blues', linestyles='', dodge=0.4)
plt.title('duration seconds vs month based on user type');

It shows that Customers tend to ride longer than Subscribers, especially in the summer.

Talk about some of the relationships you observed in this part of the investigation. How did the feature of interest vary with other features in the dataset?

The main findings are much the same as in the bivariate plots, which showed that trip durations for female riders are slightly longer than for male riders. The multivariate plots add that male riders ride at dawn more than female riders.
Also, subscribers ride more regularly at around 8:00 and 17:00 than customers. Finally, female riders tend to ride longer than male riders, and customers longer than subscribers, especially in the summer.