ASSIGNMENT 2 : COMPLETING THE TEST

Idea

Younger persons tend to ride bikes longer than older persons.

\(H_0\) : \(\mu_{\mathrm{young}} \le \mu_{\mathrm{old}}\)

\(H_a\) : \(\mu_{\mathrm{young}} > \mu_{\mathrm{old}}\)

Here \(\mu_{\mathrm{young}}\) and \(\mu_{\mathrm{old}}\) denote the mean trip durations of the two age groups.

Null hypothesis

The average trip duration over one month for riders younger than 25 years is less than or equal to the average trip duration for riders older than 25 years.

We download the CitiBike data to test the idea.

In [91]:
    from __future__ import print_function, division
    import pylab as pl
    import pandas as pd
    import numpy as np
    import os
    import json
    from scipy import stats
    from statsmodels.stats.proportion import proportions_ztest

    %pylab inline

    # the data directory must be given through the PUIDATA environment variable
    if os.getenv('PUIDATA') is None:
        print ("Must set env variable PUIDATA")

    #s = json.load( open(os.getenv('PUI2016') + "/fbb_matplotlibrc.json") )
    #pl.rcParams.update(s)
    np.random.seed(2002)

    Populating the interactive namespace from numpy and matplotlib

In [30]:
    def getCitiBikeCSV(datestring):
        print ("Downloading", datestring)
        ### First I will check that it is not already there
        if not os.path.isfile(os.getenv("PUIDATA") + "/" + datestring + "-citibike-tripdata.csv"):
            if os.path.isfile(datestring + "-citibike-tripdata.csv"):
                # if in the current dir just move it
                if os.system("mv " + datestring + "-citibike-tripdata.csv " + os.getenv("PUIDATA")):
                    print ("Error moving file!, Please check!")
            #otherwise start looking for the zip file
            else:
                if not os.path.isfile(os.getenv("PUIDATA") + "/" + datestring + "-citibike-tripdata.zip"):
                    if not os.path.isfile(datestring + "-citibike-tripdata.zip"):
                        os.system("curl -O https://s3.amazonaws.com/tripdata/" + datestring + "-citibike-tripdata.zip")
                    ### To move it I use the os.system() functions to run bash commands with arguments
                    os.system("mv " + datestring + "-citibike-tripdata.zip " + os.getenv("PUIDATA"))
                ### unzip the csv
                os.system("unzip " + os.getenv("PUIDATA") + "/" + datestring + "-citibike-tripdata.zip")
                ## NOTE: old csv citibike data had a different name structure.
                if '2014' in datestring:
                    os.system("mv " + datestring[:4] + '-' + datestring[4:] +
                              "\\ -\\ Citi\\ Bike\\ trip\\ data.csv " + datestring + "-citibike-tripdata.csv")
                os.system("mv " + datestring + "-citibike-tripdata.csv " + os.getenv("PUIDATA"))
        ### One final check:
        if not os.path.isfile(os.getenv("PUIDATA") + "/" + datestring + "-citibike-tripdata.csv"):
            print ("WARNING!!! something is wrong: the file is not there!")
        else:
            print ("file in place, you can continue")

Acquiring the bike data for October 2016 and moving it to the PUIDATA directory.

In [31]:
    datestring = '201610'
    getCitiBikeCSV(datestring)

    Downloading 201610
    file in place, you can continue

In [35]:
    df = pd.read_csv(os.getenv("PUIDATA") + "/" + datestring + '-citibike-tripdata.csv')

In [36]:
    df.shape

Out[36]:
    (1573872, 15)

In [37]:
    df.head()

Out[37]:
       Trip Duration           Start Time            Stop Time  Start Station ID  \
    0            328  2016-10-01 00:00:07  2016-10-01 00:05:35               471
    1            398  2016-10-01 00:00:11  2016-10-01 00:06:49              3147
    2            430  2016-10-01 00:00:14  2016-10-01 00:07:25               345
    3            351  2016-10-01 00:00:21  2016-10-01 00:06:12              3307
    4           2693  2016-10-01 00:00:21  2016-10-01 00:45:15              3428

            Start Station Name  Start Station Latitude  Start Station Longitude  \
    0  Grand St & Havemeyer St               40.712868               -73.956981
    1          E 85 St & 3 Ave               40.778012               -73.954071
    2          W 13 St & 6 Ave               40.736494               -73.997044
    3   West End Ave & W 94 St               40.794165               -73.974124
    4          8 Ave & W 16 St               40.740983               -74.001702

       End Station ID               End Station Name  End Station Latitude  \
    0            3077           Stagg St & Union Ave             40.708771
    1            3140                1 Ave & E 78 St             40.771404
    2             470                W 20 St & 8 Ave             40.743453
    3            3357       W 106 St & Amsterdam Ave             40.800836
    4            3323  W 106 St & Central Park West             40.798186

       End Station Longitude  Bike ID   User Type  Birth Year  Gender
    0             -73.950953    25254  Subscriber      1992.0       1
    1             -73.953517    17810  Subscriber      1988.0       2
    2             -74.000040    20940  Subscriber      1965.0       1
    3             -73.966449    19086  Subscriber      1993.0       1
    4             -73.960591    26502  Subscriber      1991.0       1

In [38]:
    df.columns

Out[38]:
    Index([u'Trip Duration', u'Start Time', u'Stop Time', u'Start Station ID',
           u'Start Station Name', u'Start Station Latitude',
           u'Start Station Longitude', u'End Station ID', u'End Station Name',
           u'End Station Latitude', u'End Station Longitude', u'Bike ID',
           u'User Type', u'Birth Year', u'Gender'],
          dtype='object')

Keeping the columns that will be used to test our hypothesis.
Only trip duration, birth year and gender are relevant for this.

In [39]:
    df.drop(['Start Time', 'Stop Time', 'Start Station ID',
             'Start Station Name', 'Start Station Latitude',
             'Start Station Longitude', 'End Station ID', 'End Station Name',
             'End Station Latitude', 'End Station Longitude', 'Bike ID',
             'User Type'], axis=1, inplace=True)

In [40]:
    df.head()

Out[40]:
       Trip Duration  Birth Year  Gender
    0            328      1992.0       1
    1            398      1988.0       2
    2            430      1965.0       1
    3            351      1993.0       1
    4           2693      1991.0       1

In [41]:
    print(max(df['Trip Duration']))

    8933552

In [10]:
    len(df)

Out[10]:
    1573872

Dropping all the rows with an empty field in one or more columns.

In [42]:
    df = df.dropna()
    len(df)

Out[42]:
    1406424

Keeping the observations with birth year > 1966 (people younger than 50 years of age).

In [43]:
    df = df[df['Birth Year'] > 1966]
    len(df)

Out[43]:
    1153343

In [44]:
    # df0: born after 1991 (younger than 25 in 2016); df1: born before 1991 (older than 25)
    # rows with Birth Year == 1991 fall in neither group
    df0 = df[df['Birth Year'] > 1991]
    df1 = df[df['Birth Year'] < 1991]

In [45]:
    print(len(df0), len(df1))

    132726 973270

In [17]:
    df0.head()

Out[17]:
        Trip Duration  Birth Year  Gender
    0             328      1992.0       1
    3             351      1993.0       1
    5             513      1995.0       1
    9             269      1993.0       2
    22            755      1996.0       1

Removing impractically long trips from the data: only trips shorter than 18000 seconds (5 hours) are kept.

In [46]:
    df0 = df0[(df0['Trip Duration'] < 18000)]
    df1 = df1[(df1['Trip Duration'] < 18000)]

Plotting the whole data set for visualisation.

In [69]:
    # DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
    df2 = pd.concat([df0, df1])

In [75]:
    df2.head()

Out[75]:
        Trip Duration  Birth Year  Gender
    0             328      1992.0       1
    3             351      1993.0       1
    5             513      1995.0       1
    9             269      1993.0       2
    22            755      1996.0       1

In [104]:
    plt.figure(figsize=(10,10))
    plt.plot(df2['Birth Year'], df2['Trip Duration'], '.')
    plt.ylim(0, 30000)
    plt.xlabel('Birth year')
    plt.ylabel('trip duration')
    plt.title('Plotting trip duration with the ages of people')
    plt.plot()

Out[104]:
    []

In [62]:
    trips1 = df0['Trip Duration'].values
    print(trips1)
    print(trips1.sum()/len(trips1))

    [ 328  351  513 ..., 1716 1463 1889]
    729.136518514

In [81]:
    len(df0)

Out[81]:
    132656

In [102]:
    plt.figure(figsize=(10,10))
    plt.hist(df0['Trip Duration'], bins=1000)
    plt.xlim(0,7500)
    plt.xlabel('Trip Duration')
    plt.ylabel('Frequency')
    plt.title('Histogram of trip duration of younger persons (<25 years)')
Out[102]:
    <matplotlib.text.Text at 0x7f962ac416d0>

In [108]:
    print('Mean of young age group trip duration:', df0['Trip Duration'].sum()/len(df0))

    Mean of young age group trip duration: 729.136518514

In [103]:
    plt.figure(figsize=(10,10))
    plt.hist(df1['Trip Duration'], bins=1000)
    plt.xlim(0,7500)
    plt.xlabel('Trip Duration')
    plt.ylabel('Frequency')
    plt.title('Histogram of trip duration of older persons (>25 years)')

Out[103]:
    <matplotlib.text.Text at 0x7f9612616a90>

In [106]:
    print('Mean of old age group trip duration:', df1['Trip Duration'].sum()/len(df1))

    Mean of old age group trip duration: 751.582487934

Welch's t-test

In [97]:
    # older group first, so a positive statistic means the older mean is larger
    stats.ttest_ind(df1['Trip Duration'].values, df0['Trip Duration'].values, equal_var=False)

Out[97]:
    Ttest_indResult(statistic=12.507329800214857, pvalue=7.0584133715259319e-36)

The p-value is far below any conventional significance level, so this two-sided test rejects the hypothesis that the two age groups have the same mean trip duration. The direction of the difference, however, is opposite to our idea: the t statistic is positive with the older group passed first, and the sample means agree (about 751.6 seconds for older riders against 729.1 seconds for younger riders). The data therefore give no support to the alternative hypothesis that younger persons ride longer, and we cannot reject the null hypothesis \(H_0\). In this sample, older riders tend to take the longer trips.
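The test above is two-sided, while \(H_a\) is directional. As a minimal sketch (not part of the original notebook), the reported statistic and p-value can be converted into a one-sided p-value for "young > old". The helper name one_sided_p_greater is hypothetical, and the conversion relies on the symmetry of the t distribution.

```python
def one_sided_p_greater(t_stat, p_two_sided):
    """One-sided p-value for H_a: mean(first) > mean(second).

    Hypothetical helper: t_stat must come from a two-sided t-test in which
    the sample expected to be larger was passed first.
    """
    if t_stat > 0:
        # the observed difference points in the direction of H_a
        return p_two_sided / 2.0
    # the observed difference points away from H_a
    return 1.0 - p_two_sided / 2.0


# Values reported in Out[97], where the OLDER group was passed first.
t_old_first = 12.507329800214857
p_two_sided = 7.0584133715259319e-36

# H_a says young > old, so flip the sign to put the young group first.
p_young_longer = one_sided_p_greater(-t_old_first, p_two_sided)
print(p_young_longer)  # prints 1.0
```

A one-sided p-value of essentially 1 says the same thing as the comparison of the means: there is no evidence for \(H_a\). Recent SciPy releases (1.6 and later) also accept alternative='greater' in stats.ttest_ind, which should give the one-sided result directly when the young group is passed first.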