\documentclass[10pt]{article}
\usepackage{fullpage}
\usepackage{setspace}
\usepackage{parskip}
\usepackage{titlesec}
\usepackage[section]{placeins}
\usepackage{xcolor}
\usepackage{breakcites}
\usepackage{lineno}
\usepackage{hyphenat}
\PassOptionsToPackage{hyphens}{url}
\usepackage[colorlinks = true,
linkcolor = blue,
urlcolor = blue,
citecolor = blue,
anchorcolor = blue]{hyperref}
\usepackage{etoolbox}
\makeatletter
\patchcmd\@combinedblfloats{\box\@outputbox}{\unvbox\@outputbox}{}{%
\errmessage{\noexpand\@combinedblfloats could not be patched}%
}%
\makeatother
\usepackage[round]{natbib}
\let\cite\citep
\renewenvironment{abstract}
{{\bfseries\noindent{\abstractname}\par\nobreak}\footnotesize}
{\bigskip}
\titlespacing{\section}{0pt}{*3}{*1}
\titlespacing{\subsection}{0pt}{*2}{*0.5}
\titlespacing{\subsubsection}{0pt}{*1.5}{0pt}
\usepackage{authblk}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{tabulary}
\usepackage{booktabs,array,multirow}
\usepackage{amsfonts,amsmath,amssymb}
\providecommand\citet{\cite}
\providecommand\citep{\cite}
\providecommand\citealt{\cite}
% You can conditionalize code for latexml or normal latex using this.
\newif\iflatexml\latexmlfalse
\providecommand{\tightlist}{\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}%
\AtBeginDocument{\DeclareGraphicsExtensions{.pdf,.PDF,.eps,.EPS,.png,.PNG,.tif,.TIF,.jpg,.JPG,.jpeg,.JPEG}}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\begin{document}
\title{Women actually bike more in the morning: Citibike data analysis}
\author[1]{Zhiao Zhou}%
\affil[1]{NYU Center for Urban Science \& Progress}%
\vspace{-1em}
\date{\today}
\begingroup
\let\center\flushleft
\let\endcenter\endflushleft
\maketitle
\endgroup
\sloppy
\section*{Citibike Dataset Analysis (For NYU CUSP PUI2017
HW7)}
{\label{327201}}
\subsection*{\textless{}Zhiao Zhou,
zz1749\textgreater{}}
{\label{480246}}\par\null
\subsection*{Abstract}
{\label{757756}}
The New York Times has reported that bike-share membership in NYC skews
male: only about a third of members were women, who cared more about
safety and convenience. However, it also mentioned that quite a few
women liked biking to work \citep{gap}. It is therefore interesting
to find out whether the ratio of men's morning rides (the commuting
period for most people) to men's rides over the whole day is smaller than
the corresponding ratio for women, which would help balance the gender
disparity.
We carried out a z-test for two proportions in an IPython notebook to
test this hypothesis, using a sample of the 201706 (June 2017) public
datasets of Citibike, the most popular bike-sharing system in NYC. The
z-score was 9.9977 and the p-value was $7.7958\times10^{-24}$, so we
reject the null hypothesis in favor of our alternative hypothesis that
women bike proportionally more in the morning. This is useful for future
analysis, since the existing gender disparity seems to result from a lack
of infrastructure and safety for women.
\subsection*{Introduction}
{\label{179927}}
First launched in 2013, Citibike now has a total of 706 stations and
12,000 bikes, which makes it the biggest bike-sharing system in the
USA \citep{wikipedia}. However, Citibike has been struggling to figure
out why men far outnumber women among its users, with men riders double
the number of women riders. Sarah M. Kaufman, the assistant director of
tech programming at the Rudin Center for Transportation at NYU, said that
women are early indicators of a successful bike system: if a system has
more women riders, it is likely to be convenient and
safe~\citep{fitzsimmons2015}. The same phenomenon emerged in Chicago and
Washington, where bike-sharing systems attracted more men. What triggers
this gender disparity is still unresolved. Citibike has tried introducing
new stylish bikes and adding new stations to woo women.

Interestingly, there seemed to be a number of women who loved to commute
by shared bike. If we could show that women in fact bike proportionally
more than men in the morning, the company could focus more on service for
women during the commute. This hypothesis had not been tested, yet it is
easy to test with a z-test. Figure 1 shows my null hypothesis, its
corresponding mathematical expression, and my significance level.
\par\null\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/snipaste-20171107-112033/snipaste-20171107-112033}
\caption{{Null hypothesis and the significance level chosen
{\label{719456}}%
}}
\end{center}
\end{figure}
The z-test for two proportions supported our hypothesis. Although men
riders outnumber women riders, women take a larger share of their rides
in the morning, which Citibike could take into account if it tries to
attract more women subscribers.
\subsection*{Data}
{\label{798963}}
The data used for the test was the 201706
Citibike~\href{https://datahub.cusp.nyu.edu/dataset}{dataset within the
CUSP data facility (DF)}, which is easy to access from the NYU
CUSP compute platform. I then used an IPython notebook to process the
data (the original notebook file can be accessed by clicking on the upper
left side of Figure 2).
\par\null\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/snipaste-20171107-022335/snipaste-20171107-022335}
\caption{{The columns of the original datasets
{\label{203457}}%
}}
\end{center}
\end{figure}
The original dataset contained many columns (Fig. 2) that we did not
need, so I dropped most of them and kept only the two that were needed
(Fig. 3), since we were analyzing the fraction of rides taken in the
morning by gender.
\par\null\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/snipaste-20171107-023323/snipaste-20171107-023323}
\caption{{Dropping useless columns
{\label{284507}}%
}}
\end{center}
\end{figure}
The starttime column was stored as strings, which are hard to process,
so we converted it to the datetime type, which provides accessors for
extracting the hour, minute, second and so on (Fig. 4). I also renamed
that column to ``date'' for clarity.
\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/snipaste-20171107-023850/snipaste-20171107-023850}
\caption{{Converting string of dates into datetime
{\label{425552}}%
}}
\end{center}
\end{figure}
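The cleaning steps described above can be sketched as follows. This is a
minimal illustration on a made-up three-row table, not the notebook code
itself: only the column names starttime and gender come from the text,
while tripduration and all the values are assumptions.

\begin{verbatim}
import pandas as pd

# Hypothetical three-row stand-in for the raw trip table. Only
# "starttime" and "gender" are named in the text; "tripduration"
# is an assumed extra column, and all values are made up.
raw = pd.DataFrame({
    "tripduration": [634, 1205, 410],
    "starttime": ["2017-06-01 07:15:02", "2017-06-01 18:40:11",
                  "2017-06-02 09:05:47"],
    "gender": [2, 1, 2],
})

# Keep only the two columns needed for the analysis.
data = raw[["starttime", "gender"]].copy()

# Parse the strings into datetime values, then rename the column.
data["starttime"] = pd.to_datetime(data["starttime"])
data = data.rename(columns={"starttime": "date"})

# The .dt accessor now exposes hour, minute, second and so on.
hours = data["date"].dt.hour
\end{verbatim}

Working on a copy of the two selected columns leaves the raw table
untouched, which makes it easy to rerun the cleaning step.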
At this point the data were clean, so we could move on to the
methodology.
\subsection*{Methodology}
{\label{681628}}
First, we plotted, for each gender, the fraction of Citibike rides
starting in each of the 24 hours of the day, to get a first impression of
whether our hypothesis made sense. The result, Fig. 5, shows that from
5:00 to 12:00 women seemed to bike proportionally more, so our hypothesis
was plausible and a statistical test was warranted. This plot is also why
I finally set the morning period to 7:00 to 12:00, a choice that was
questioned in the peer
review.\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/snipaste-20171107-094414/snipaste-20171107-094414}
\caption{{Distribution of Citibike users' frequency of rides by gender in June
2017, normalized
{\label{739306}}%
}}
\end{center}
\end{figure}
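A normalized distribution like the one in Fig. 5 could be computed along
the following lines. This is a sketch on made-up data, assuming a cleaned
table with a datetime column date and a gender column coded as in the
Citibike data (1 for men, 2 for women); the helper name hourly\_fraction
is hypothetical.

\begin{verbatim}
import pandas as pd

# Made-up toy data standing in for the cleaned trip table.
data = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-06-01 07:10", "2017-06-01 08:30", "2017-06-01 18:00",
        "2017-06-01 07:45", "2017-06-01 17:20", "2017-06-01 18:40",
        "2017-06-01 19:05",
    ]),
    "gender": [2, 2, 2, 1, 1, 1, 1],  # 1 = men, 2 = women
})

def hourly_fraction(df, gender_code):
    """Fraction of one gender's rides starting in each hour."""
    rides = df.loc[df["gender"] == gender_code, "date"]
    counts = rides.groupby(rides.dt.hour).count()
    # Dividing by the total makes the two genders comparable
    # even though their ride counts differ.
    return counts / counts.sum()

frac_w = hourly_fraction(data, 2)
frac_m = hourly_fraction(data, 1)
\end{verbatim}

Each returned series is indexed by hour of day and sums to one, so the
two curves can be drawn on the same axes.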
Both parametric and non-parametric tests could be used here. I found the
recommendation in the review by my peer Unisse Chua (uc288) very useful
and reasonable. Since the samples came from the same population of
Citibike users with paired data, and there were at least 30 observations
per sample, we could apply a z-test for two proportions to test the
hypothesis. However, since the comparison was between ratios and against
an a priori expectation, based on the chart from the slides the test
could also be the chi-squared goodness-of-fit test with Yates's
correction, or Fisher's exact test. A chi-square test for equality of two
proportions is equivalent to a (two-sided) z-test, and a z-test function
was conveniently available in Python, so I used a z-test.
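
For reference, the z-test for two proportions can be written out as
follows. The notation is introduced here for illustration only and is an
assumption, not taken from the notebook: $x_w$ and $x_m$ are the numbers
of morning rides by women and men, and $n_w$ and $n_m$ are their total
numbers of rides. The hypotheses are stated as they are described in the
text.
\begin{align*}
H_0 &: p_w \le p_m, \qquad H_a : p_w > p_m, \qquad \alpha = 0.05, \\
\hat{p}_w &= \frac{x_w}{n_w}, \qquad \hat{p}_m = \frac{x_m}{n_m}, \qquad
\hat{p} = \frac{x_w + x_m}{n_w + n_m}, \\
z &= \frac{\hat{p}_w - \hat{p}_m}
{\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_w} + \frac{1}{n_m}\right)}}.
\end{align*}
For large samples $z$ is approximately standard normal when
$p_w = p_m$, so the one-sided p-value is $1 - \Phi(z)$, where $\Phi$ is
the standard normal distribution function. In the two-sided case, $z^2$
equals the Pearson chi-square statistic of the corresponding
$2 \times 2$ table (without continuity correction), which is the
equivalence mentioned above.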
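First, I calculated the number of morning rides and the total number of
rides over the whole day for each gender, using the following code
(Fig. 6; the code for men is analogous):
\begin{verbatim}
# Rides by women (gender == 2), counted per start hour
counts_w = data.date[data.gender == 2].groupby([data.date.dt.hour]).count()
norm_w = counts_w.sum()  # total number of rides by women
# Label-based slice, inclusive at both ends: hours 5 through 12
w_morning = counts_w.loc[5:12].sum()
\end{verbatim}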
\par\null\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/snipaste-20171107-030514/snipaste-20171107-030514}
\caption{{Calculation of necessary arguments for the test
{\label{636050}}%
}}
\end{center}
\end{figure}
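Since the same calculation is repeated for men, it could be wrapped in a
small helper. The sketch below runs on made-up data; the function name
morning\_and\_total and its parameters are hypothetical, and the default
hours simply mirror the inclusive slice from 5 to 12 used in the code
above.

\begin{verbatim}
import pandas as pd

def morning_and_total(df, gender_code, first_hour=5, last_hour=12):
    """Return (morning rides, total rides) for one gender.

    Both hour bounds are inclusive, like the slice .loc[5:12].
    """
    hours = df.loc[df["gender"] == gender_code, "date"].dt.hour
    morning = int(hours.between(first_hour, last_hour).sum())
    return morning, int(hours.size)

# Made-up toy data standing in for the cleaned trip table.
data = pd.DataFrame({
    "date": pd.to_datetime([
        "2017-06-01 07:10", "2017-06-01 08:30", "2017-06-01 18:00",
        "2017-06-01 07:45", "2017-06-01 17:20", "2017-06-01 18:40",
        "2017-06-01 19:05",
    ]),
    "gender": [2, 2, 2, 1, 1, 1, 1],  # 1 = men, 2 = women
})

w_morning, norm_w = morning_and_total(data, 2)
m_morning, norm_m = morning_and_total(data, 1)
\end{verbatim}

Keeping the hour bounds as parameters makes it easy to check how
sensitive the result is to the definition of the morning period.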
Then I used the~\textbf{proportions\_ztest} function imported
from~\textbf{statsmodels.stats.proportion}, which returns the z
statistic and the p-value given the counts and total numbers of
observations of the two samples. Here \textbf{value} is the difference
between the proportions under the null hypothesis, and
\textbf{alternative} selects a two-sided or a one-sided test:
\begin{verbatim}
from statsmodels.stats.proportion import proportions_ztest

# alternative='larger' tests whether the first proportion exceeds the second
stat, pval = proportions_ztest(counts, nobs, value=0, alternative='larger')
\end{verbatim}
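To make explicit what such a one-sided test computes, here is a
hand-rolled sketch using only the Python standard library. It is not the
statsmodels implementation; the function name is hypothetical, it assumes
the pooled-variance form of the statistic, and the counts are made up
rather than taken from the Citibike data.

\begin{verbatim}
import math

def two_proportion_ztest_larger(count1, nobs1, count2, nobs2):
    """One-sided pooled z-test of H_a: p1 > p2. Returns (z, p)."""
    p1 = count1 / nobs1
    p2 = count2 / nobs2
    # Pooled proportion under the null hypothesis p1 == p2
    pooled = (count1 + count2) / (nobs1 + nobs2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / nobs1 + 1 / nobs2))
    z = (p1 - p2) / se
    # Upper-tail probability of the standard normal distribution
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Made-up counts, for illustration only (not the Citibike results)
z, p = two_proportion_ztest_larger(450, 1000, 400, 1000)
\end{verbatim}

Putting the women's counts first means a positive z-score points in the
direction of the alternative hypothesis.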
\subsection*{Conclusions}
{\label{746968}}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.70\columnwidth]{figures/snipaste-20171107-030152/snipaste-20171107-030152}
\caption{{Result of the Z-test
{\label{207410}}%
}}
\end{center}
\end{figure}
We can see from Fig. 7 that the p-value is far smaller than our
significance level of 5\%, so we can confidently reject the null
hypothesis and conclude that the ratio of men's morning rides to men's
rides over the whole day is smaller than the ratio of women's morning
rides to women's rides over the whole day. This was an unexpected
outcome, since people tend to think that men bike more than women do.

The z-test thus gave a highly significant result. Citibike could
consider adding more stations in office areas or reducing the rental
rate in the morning. One possible reason for this pattern is that women
are more concerned about safety after the morning, since road harassment
and reckless driving are still fairly common in the city.

This mini-project also has some weaknesses that future studies should
address. First, the sample covers only one month, so it could be subject
to seasonal effects. Second, the sample was still not big enough.
Nevertheless, the analysis fills a gap.
\selectlanguage{english}
\FloatBarrier
\bibliographystyle{plainnat}
\bibliography{bibliography/converted_to_latex.bib%
}
\end{document}