\documentclass[10pt]{article}
\usepackage{fullpage}
\usepackage{setspace}
\usepackage{parskip}
\usepackage{titlesec}
\usepackage[section]{placeins}
\usepackage{xcolor}
\usepackage{breakcites}
\usepackage{lineno}
\usepackage{hyphenat}
\PassOptionsToPackage{hyphens}{url}
\usepackage[colorlinks = true,
linkcolor = blue,
urlcolor = blue,
citecolor = blue,
anchorcolor = blue]{hyperref}
\usepackage{etoolbox}
\makeatletter
\patchcmd\@combinedblfloats{\box\@outputbox}{\unvbox\@outputbox}{}{%
\errmessage{\noexpand\@combinedblfloats could not be patched}%
}%
\makeatother
\usepackage[round]{natbib}
\let\cite\citep
\renewenvironment{abstract}
{{\bfseries\noindent{\abstractname}\par\nobreak}\footnotesize}
{\bigskip}
\titlespacing{\section}{0pt}{*3}{*1}
\titlespacing{\subsection}{0pt}{*2}{*0.5}
\titlespacing{\subsubsection}{0pt}{*1.5}{0pt}
\usepackage{authblk}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{tabulary}
\usepackage{booktabs,array,multirow}
\usepackage{amsfonts,amsmath,amssymb}
\providecommand\citet{\cite}
\providecommand\citep{\cite}
\providecommand\citealt{\cite}
% You can conditionalize code for latexml or normal latex using this.
\newif\iflatexml\latexmlfalse
\providecommand{\tightlist}{\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}%
\AtBeginDocument{\DeclareGraphicsExtensions{.pdf,.PDF,.eps,.EPS,.png,.PNG,.tif,.TIF,.jpg,.JPG,.jpeg,.JPEG}}
\usepackage[utf8]{inputenc}
\usepackage[greek,english]{babel}
\begin{document}
\title{Exploratory Analysis: Are `Customer' More Likely to Use CitiBike during
Working Hours than `Subscriber'?}
\author[1]{Lingyi Zhang}%
\affil[1]{NYU Center for Urban Science \& Progress}%
\vspace{-1em}
\date{\today}
\begingroup
\let\center\flushleft
\let\endcenter\endflushleft
\maketitle
\endgroup
\selectlanguage{english}
\begin{abstract}
CitiBike is a privately owned public bicycle sharing system. In this
study, we analyze whether `Customer' riders are more likely than
`Subscriber' riders to use CitiBike during working hours (9:00--17:00).
Using a chi-square test and a one-tailed Z-test, we find that the
percentage of trips started during working hours is significantly higher
for `Customer' than for `Subscriber'.%
\end{abstract}%
\sloppy
\section*{1. Introduction}
{\label{353859}}
CitiBike is a privately owned public bicycle sharing system serving New
York City and Jersey City, New Jersey~\cite{wikipedia}. It is the
nation's largest bike share program, with 10,000 bikes and 600 stations.
There are two user types: `Customer' (mainly visitors and tourists) and
`Subscriber' (mainly New York locals)~\cite{nyc}. The question we
want to answer in this article is whether `Customer' riders are more
likely than `Subscriber' riders to use CitiBike during working hours
(9:00--17:00). The answer can help CitiBike's operating company (NYC Bike
Share, LLC) shape its sales and operations strategies. For example,
if we want to offer targeted services or advertisements to
`Customer' riders, this analysis indicates the time window in which
they are most likely to be riding.
\par\null
\section*{2. Data}
{\label{426590}}
We used the January 2016 dataset from
\href{https://s3.amazonaws.com/tripdata/index.html}{CitiBike
Tripdata}. We converted the ``starttime'' column to a datetime type,
which is easy to process in pandas. We then aggregated the trip counts
by hour and calculated, separately for ``Subscriber'' and ``Customer'',
the percentage of trips started during working hours (9:00--17:00) and
resting hours (18:00--8:00) (Table~1).\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=1.00\columnwidth]{figures/Fig1/Fig1}
\caption{{Normalized hourly distribution of CitiBike trips by user type. The
figure gives a first impression of how usage patterns differ between the
two user types. ``Customer'' has a higher share of
trips from 10:00 to 17:00 and from 0:00 to 3:00 than
``Subscriber'', while ``Subscriber'' has a higher share right
before and after working hours.
{\label{122465}}%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{table}[h!]
\centering
\large\begin{tabulary}{1.0\textwidth}{CCC}
& working hours & resting hours \\
Subscriber & 0.55 & 0.45 \\
Customer & 0.79 & 0.21 \\
\end{tabulary}
\caption{{Proportion of trips started during working hours and resting
hours for `Subscriber' and `Customer'.
{\label{992813}}%
}}
\end{table}
\section*{3. Methodology}
{\label{457088}}
In both tests below, we set the significance level to \(\alpha = 0.05\).
Following
\href{https://github.com/lingyielia/PUI2017_lz1714/blob/master/HW3_lz1714/CitibikeReview_fb55.md}{Federica's
suggestion}, we used a chi-square test, which is appropriate
for testing hypotheses about proportions.

We also ran a one-tailed Z-test, because when the sample size is large
enough (both \(np\) and \(n(1-p)\) greater than or equal to 5),
the binomial distribution is well approximated by the normal
distribution~\citep{boslaugh2012}.

We did not use the
t-test~\href{https://github.com/lingyielia/PUI2017_lz1714/blob/master/HW3_lz1714/CitibikeReview_ixx200.md}{suggested
by Ian}: the sample is large enough for a Z-test, so we used
that instead.
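
As a rough check of the sample-size condition above, the rounded
proportions in Table~1 can be combined with the group sizes from the
contingency table in Section 3.1 (484935 `Subscriber' trips and 24543
`Customer' trips). The symbols \(n\) and \(\hat{p}\) are our own
notation, and because the proportions are rounded to two decimals the
products below are approximations rather than exact counts:
\[
\begin{aligned}
n_{\mathrm{Cust}}\,\hat{p}_{\mathrm{Cust}} &\approx 24543 \times 0.79 \approx 19389, &
n_{\mathrm{Cust}}\,(1-\hat{p}_{\mathrm{Cust}}) &\approx 24543 \times 0.21 \approx 5154,\\
n_{\mathrm{Subs}}\,\hat{p}_{\mathrm{Subs}} &\approx 484935 \times 0.55 \approx 266714, &
n_{\mathrm{Subs}}\,(1-\hat{p}_{\mathrm{Subs}}) &\approx 484935 \times 0.45 \approx 218221.
\end{aligned}
\]
All four quantities are far above 5, so the normal approximation appears
reasonable for both user types.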
\subsection*{3.1 Chi-square test}
{\label{145990}}
\textbf{Null hypothesis:}~The percentage of trips started during
working hours is the same for `Subscriber' as for `Customer'; that is,
riding during working hours is independent of user type.
\[H_0 : \frac{\mathrm{Cust}_{\mathrm{WorkingTime}}}{\mathrm{Cust}_{\mathrm{All}}} = \frac{\mathrm{Subs}_{\mathrm{WorkingTime}}}{\mathrm{Subs}_{\mathrm{All}}}\]
\[H_a : \frac{\mathrm{Cust}_{\mathrm{WorkingTime}}}{\mathrm{Cust}_{\mathrm{All}}} \neq \frac{\mathrm{Subs}_{\mathrm{WorkingTime}}}{\mathrm{Subs}_{\mathrm{All}}}\]
Here, Cust\textsubscript{WorkingTime} is the number of `Customer' trips
started during working hours, Subs\textsubscript{WorkingTime} is the
number of `Subscriber' trips started during working hours, and
Cust\textsubscript{All} and Subs\textsubscript{All} are the numbers of
`Customer' and `Subscriber' trips over the whole
day.\selectlanguage{english}
\begin{table}[h!]
\centering
\normalsize\begin{tabulary}{1.0\textwidth}{CCCC}
 & working hours & resting hours & total \\
subscriber & \(0.55 \times 484935\) & \(0.45 \times 484935\) & 484935 \\
customer & \(0.79 \times 24543\) & \(0.21 \times 24543\) & 24543 \\
total & 286782 & 222696 & 509478 \\
\end{tabulary}
\caption{{Contingency table for the chi-square test.
{\label{502452}}%
}}
\end{table}
Using~\href{https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html\#scipy.stats.chi2_contingency}{scipy.stats.chi2\_contingency},
the chi-square test statistic is 5198.869 with
\(p < 0.001\) (reported as 0.000 after rounding). We therefore reject
the null hypothesis: the percentage of trips started during working
hours is not the same for `Subscriber' and `Customer'.
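
For reference, the statistic behind this test is Pearson's chi-square,
sketched here in notation that is ours rather than taken from the
original analysis: \(O_{ij}\) is the observed count in row \(i\) and
column \(j\) of the contingency table, \(R_i\) and \(C_j\) are the row
and column totals, and \(N\) is the grand total. Under independence the
expected count is \(E_{ij} = R_i C_j / N\), and
\[
\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2}\frac{(O_{ij}-E_{ij})^2}{E_{ij}},
\qquad \mathrm{df} = (2-1)(2-1) = 1 .
\]
For example, the expected number of `Customer' trips during working
hours is \(24543 \times 286782 / 509478 \approx 13815\). With one degree
of freedom, the critical value at \(\alpha = 0.05\) is about 3.84, far
below the reported statistic. Note that scipy.stats.chi2\_contingency
applies Yates' continuity correction to \(2 \times 2\) tables by default,
so its output can differ slightly from the uncorrected formula above.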
\subsection*{3.2 One-tailed Z-test}
{\label{441865}}
\textbf{Null hypothesis:} The percentage of trips started during
working hours is the same or lower for `Customer' than for `Subscriber'.
\[H_0 : \frac{\mathrm{Cust}_{\mathrm{WorkingTime}}}{\mathrm{Cust}_{\mathrm{All}}} \leq \frac{\mathrm{Subs}_{\mathrm{WorkingTime}}}{\mathrm{Subs}_{\mathrm{All}}}\]
\[H_a : \frac{\mathrm{Cust}_{\mathrm{WorkingTime}}}{\mathrm{Cust}_{\mathrm{All}}} > \frac{\mathrm{Subs}_{\mathrm{WorkingTime}}}{\mathrm{Subs}_{\mathrm{All}}}\]
Using~\href{http://www.statsmodels.org/dev/generated/statsmodels.stats.proportion.proportions_ztest.html}{statsmodels.stats.proportion.proportions\_ztest},
the Z-test statistic is 72.386 with~\(p < 0.001\) (reported as 0.000
after rounding). We therefore reject the null hypothesis: the percentage
of trips started during working hours is higher for `Customer' than for
`Subscriber'.~\href{https://github.com/lingyielia/PUI2017_lz1714/blob/master/HW7_lz1714/Assignment1.ipynb}{{[}Source
code{]}}
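
For reference, a two-sample Z-test for proportions with a pooled
standard error can be written as follows. The notation is ours:
\(\hat{p}_{\mathrm{Cust}}\) and \(\hat{p}_{\mathrm{Subs}}\) are the
observed working-hour proportions, \(n_{\mathrm{Cust}}\) and
\(n_{\mathrm{Subs}}\) are the trip totals, and \(\hat{p}\) is the pooled
proportion. We assume this pooled form matches what proportions\_ztest
computed here; we have not checked it against the source notebook.
\[
z = \frac{\hat{p}_{\mathrm{Cust}} - \hat{p}_{\mathrm{Subs}}}
{\sqrt{\hat{p}\,(1-\hat{p})\left(\frac{1}{n_{\mathrm{Cust}}} + \frac{1}{n_{\mathrm{Subs}}}\right)}},
\qquad
\hat{p} = \frac{\mathrm{Cust}_{\mathrm{WorkingTime}} + \mathrm{Subs}_{\mathrm{WorkingTime}}}
{\mathrm{Cust}_{\mathrm{All}} + \mathrm{Subs}_{\mathrm{All}}} .
\]
From the totals in the contingency table,
\(\hat{p} = 286782 / 509478 \approx 0.563\). For a one-tailed test at
\(\alpha = 0.05\), the null hypothesis is rejected when \(z > 1.645\),
a threshold the reported statistic exceeds by a wide margin.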
\section*{4. Conclusions}
{\label{508246}}
`Customer' riders are more likely than `Subscriber' riders to use
CitiBike during working hours. A plausible explanation is that most
`Customer' riders are tourists, while most `Subscriber' riders belong to
the local labor force: during working hours, a higher proportion of
`Subscriber' riders are in the office, while `Customer' riders are
touring the city.

The implication is that advertisements targeting `Customer' riders are
best displayed during working hours, when a higher proportion of
`Customer' riders will see them and a lower proportion of `Subscriber'
riders will be disturbed by information that is of no use to them. As
for the type of advertisement, promoting annual memberships would not be
a good choice, because most `Customer' riders are only visiting
temporarily. Promotions for short-term discounts, such as the one-day or
three-day pass, might be more feasible.

A weakness of this analysis is that it does not account for
seasonality: conclusions drawn from January data might not apply to
summer.
\selectlanguage{english}
\FloatBarrier
\bibliographystyle{plainnat}
\bibliography{bibliography/converted_to_latex.bib%
}
\end{document}