\documentclass{article}
\usepackage{fullpage}
\usepackage{parskip}
\usepackage{titlesec}
\usepackage{xcolor}
\usepackage[colorlinks = true,
linkcolor = blue,
urlcolor = blue,
citecolor = blue,
anchorcolor = blue]{hyperref}
\usepackage[natbibapa]{apacite}
\usepackage{eso-pic}
\renewenvironment{abstract}
{{\bfseries\noindent{\abstractname}\par\nobreak}\footnotesize}
{\bigskip}
\titlespacing{\section}{0pt}{*3}{*1}
\titlespacing{\subsection}{0pt}{*2}{*0.5}
\titlespacing{\subsubsection}{0pt}{*1.5}{0pt}
\usepackage{authblk}
\usepackage{graphicx}
\usepackage[space]{grffile}
\usepackage{latexsym}
\usepackage{textcomp}
\usepackage{longtable}
\usepackage{tabulary}
\usepackage{booktabs,array,multirow}
\usepackage{amsfonts,amsmath,amssymb}
\usepackage{placeins}
\providecommand\citet{\cite}
\providecommand\citep{\cite}
\providecommand\citealt{\cite}
% You can conditionalize code for latexml or normal latex using this.
\newif\iflatexml\latexmlfalse
\providecommand{\tightlist}{\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}%
\AtBeginDocument{\DeclareGraphicsExtensions{.pdf,.PDF,.eps,.EPS,.png,.PNG,.tif,.TIF,.jpg,.JPG,.jpeg,.JPEG}}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\begin{document}
\title{\section*{Exploring the relationship between the usage of the CitiBike(s)
when used by Customers and
Subscribers}\label{exploring-the-relationship-between-the-usage-of-the-citibikes-when-used-by-customers-and-subscribers}}
\author[ ]{Achilles Edwin Alfred Saxby}
\author[ ]{Anastasia Shegay}
\author[ ]{Priyanshi Singh}
\author[ ]{Aaron D'Souza}
\author[ ]{Vishwajeet Shelar}
\author[ ]{Akshay Penmatcha}
\affil[ ]{}
\vspace{-1em}
\date{}
\begingroup
\let\center\flushleft
\let\endcenter\endflushleft
\maketitle
\endgroup
\section*{Co-Authors (Team)}\label{co-authors-team}
\begin{itemize}
\tightlist
\item
Anastasia Shegay\\
\item
Priyanshi Singh\\
\item
Aaron D'Souza\\
\item
Vishwajeet Shelar\\
\item
Akshay Penmatcha\\
\item
Achilles Saxby\\
\end{itemize}
\section*{Abstract}\label{auto-label-section-570691}
This report is based on Homework\_3 (Assignment 2) of PUI2016. The goal
is to test whether Customers (one-time users) and Subscribers (repeat
users on weekly, monthly, or yearly plans) differ in how long they use
CitiBikes, measured by the MTD (Mean Trip Duration). We compute the mean
trip duration for the days of a single week and check whether there is a
noticeable difference between the means for customers and
subscribers.\\
A hypothesis is stated~and tested using the following test, which the
peer reviewers suggested to the co-authors listed above:\\
\begin{quote}
\emph{Two-sided t-test (the variances of the two samples are not equal)}\\
\end{quote}
Additional tests were carried out as instructed by the reviewers, mainly
the instructor, Federica Bianco. The tests and their results indicate
that there is indeed a difference.\\
\section*{Keywords}\label{keywords}
Two-sided t-test, CitiBike Data, Data Wrangling, Null Hypothesis,
Alternate Hypothesis, Statistical Significance Level\\
\section*{Hypothesis}\label{hypothesis}
\subsubsection*{Null Hypothesis}\label{null-hypothesis}
The mean trip duration of one-time users (Customers) over a week is
less than or equal to the mean trip duration of Subscribers over the
same week.\\
\(H_0: T_{Customer}\le T_{Subscriber}\)\\
\subsubsection*{Alternate Hypothesis}\label{alternate-hypothesis}
The mean trip duration of one-time users (Customers) over a week is
greater than the mean trip duration of Subscribers over the same
week.\\
\(H_a: T_{Customer}>T_{Subscriber}\)\\
\subsubsection*{Statistical Significance
Level}\label{statistical-significance-level}
We choose a significance level \(\alpha\) before running the test: the
null hypothesis is rejected only if the p-value of the test falls below
\(\alpha\).\\
\(\alpha = 0.05\) (5\%)\\
\section*{Data}\label{data}
The data used in this experiment was downloaded using the code provided
by Federica Bianco in the skeleton program.~\\
The data comes from the
\href{https://www.citibikenyc.com/system-data}{CitiBike\_Data\_Website},~from
which we took the trip records needed to compute the MTD (Mean Trip
Duration) for both groups of CitiBike users:
Customers (one-time users) and Subscribers (repeat users).~\\
This data helps us learn which group takes the most trips, takes the
longest or shortest trips, or spends more time riding.\\
The idea is to find out how and why the two groups differ in this
respect:\\
\begin{itemize}
\tightlist
\item
Only the data relevant to the question being tested was selected for
analysis.\\
\item
The analysis was performed in the~IPython notebook attached to this
paper.\\
\item
Python tools such as Pandas (DataFrames), the DateTime format, and
Matplotlib were used to clean, organize, select, analyze, plot,
and visualize the data.\\
\end{itemize}
\section*{Analysis}\label{analysis}
Before analyzing the data, a null and an alternate hypothesis were
formulated and a statistical significance level was chosen. The data was
first downloaded and stored using the instructor's code, then
tabulated, cleaned, and reshaped to answer the question posed by the
experiment. The analysis leads us to reject the null hypothesis, which
supports the alternate hypothesis; this is the core outcome of the
experiment.\\
The analysis is conducted mainly with Pandas DataFrames in Python, which
are used to compute the mean trip duration of the two groups in the
data, Customers and Subscribers. The figures are then plotted with
Matplotlib to show the cleaned data for a specific week (here the first
week of June 2016) for the two groups of CitiBike users.\\
After the analysis and visualization, the data is subjected to a
two-sided~t-test. A two-sample t-test compares the means of two
populations by statistical examination of samples drawn from them. It is
commonly used with small sample sizes (like the ones tested here) and
mainly for testing the difference between samples when the variances
of the two normal distributions are not known. Hence we first check
whether the standard deviations of the two samples are known, so as to
proceed with the right form of the t-test.\\
After these checks,~the two-sided t-test for samples whose
variances are not known was applied to the data.\\\selectlanguage{english}
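For reference, the unequal-variance form of the two-sample t-test is
usually called Welch's t-test. The following is a sketch of its
statistic for the two groups; the symbols \(s\) (sample standard
deviation) and \(n\) (sample size) are introduced here for illustration
and are not taken from the notebook:
\[
t=\frac{T_{Customer}-T_{Subscriber}}
{\sqrt{\dfrac{s_{Customer}^{2}}{n_{Customer}}+\dfrac{s_{Subscriber}^{2}}{n_{Subscriber}}}}
\]
The degrees of freedom are then approximated by the
Welch--Satterthwaite formula:
\[
\nu\approx\frac{\left(\dfrac{s_{Customer}^{2}}{n_{Customer}}+\dfrac{s_{Subscriber}^{2}}{n_{Subscriber}}\right)^{2}}
{\dfrac{\left(s_{Customer}^{2}/n_{Customer}\right)^{2}}{n_{Customer}-1}+\dfrac{\left(s_{Subscriber}^{2}/n_{Subscriber}\right)^{2}}{n_{Subscriber}-1}}
\]
A two-sided test reports the probability of observing a value at least
as extreme as \(|t|\) in either direction. Because the hypotheses of
this experiment are directional, a positive \(t\) together with a
two-sided p-value below \(\alpha\) also implies rejection of \(H_0\) at
the same level.\\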
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.28\columnwidth]{figures/3/3}
\caption{{\hypertarget{auto-label-caption-188242}{}
\textbf{The Samples from the Population - Number of Customers and
Subscribers}%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.28\columnwidth]{figures/2/2}
\caption{{\hypertarget{auto-label-caption-383944}{}
\textbf{The distribution of the Subscribers' to Customers' Mean Trip
Duration during the week}%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.28\columnwidth]{figures/1/1}
\caption{{\hypertarget{auto-label-caption-485399}{}
\textbf{The distribution of the Subscribers' to Customers' Mean Trip
Duration during the weekends}%
}}
\end{center}
\end{figure}\selectlanguage{english}
\begin{figure}[h!]
\begin{center}
\includegraphics[width=0.07\columnwidth]{figures/jupyter-logo/jupyter-logo}
\caption{{\hypertarget{auto-label-caption-725168}{}
\textbf{IPython Notebook Attached with Image above}%
}}
\end{center}
\end{figure}
\section*{Results/Conclusion}\label{resultsconclusion}
The IPython notebook is attached to this paper so that the research,
data analysis, and testing procedures can be reproduced.\\
After the data analysis and hypothesis testing, the results obtained
supported the hypothesis the experiment set out to test.\\
The t-test returned a p-value close to zero
(\(1.20026487\times10^{-6}\)~to be exact), well below the significance
level chosen at the start of the test, so \(H_0\)~is rejected. We
conclude that Customers, who are mostly one-time users, ride the
CitiBike longer than Subscribers, who are mainly repeat users with a
subscription.\\
In addition, other visualizations compare the two groups on weekdays and
on weekends, and they show the same result as anticipated in the
experiment.\\
We also checked the effect size between Customers and Subscribers. The
effect size is very large, so the difference between the two mean trip
durations is very large.\\
Three points about the effect size are worth noting:\\
\begin{enumerate}
\tightlist
\item
The effect size shows how different the sample means of Customers and
Subscribers are.\\
\item
The effect size calculation normally uses the standard deviation of
the population. That is not available here, so the standard deviations
of the two samples (Customers and Subscribers) are used instead.\\
\item
Since there is a large number of data points, we can assume that the
data used here follows a Gaussian distribution, and we can therefore
use an effect size measure (Cohen's d) to quantify the size of the
difference.\\
\end{enumerate}
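The effect size discussed above can be written out as follows. This is a
sketch of one common form of Cohen's d, which uses the pooled standard
deviation of the two samples; whether the notebook pools the standard
deviations in exactly this way is an assumption, and the symbols \(s\),
\(n\), and \(s_{p}\) are introduced here for illustration:
\[
d=\frac{T_{Customer}-T_{Subscriber}}{s_{p}},\qquad
s_{p}=\sqrt{\frac{(n_{Customer}-1)\,s_{Customer}^{2}+(n_{Subscriber}-1)\,s_{Subscriber}^{2}}{n_{Customer}+n_{Subscriber}-2}}
\]
By Cohen's conventional benchmarks, \(d\approx0.2\) is a small effect,
\(d\approx0.5\) a medium effect, and \(d\approx0.8\) or more a large
effect.\\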
The reasons behind the result may vary. For example, Customers may be
new to the CitiBike service and may search longer for stations at which
to dock the bikes, whereas Subscribers usually already know where the
stations are. Subscribers may also have a destination set before
starting the ride so as not to waste time, which may not be true for
many Customers.\\
Further research can draw on data from different months to compare the
groups and draw firmer conclusions about the reason behind the
difference in the MTD (Mean Trip Duration).\\
\section*{References}\label{references}
\begin{enumerate}
\tightlist
\item
GitHub, Federica Bianco: instructor's code to download, store, and use
the data.\\[2\baselineskip]
\item
Hypothesis testing, t-test, and statistical tests from the text
\emph{Statistics in a Nutshell}, published by O'Reilly.\\[2\baselineskip]
\item
Principles of Urban Informatics classroom sessions at NYU-CUSP under
the guidance of Federica Bianco (instructor).\\[2\baselineskip]
\item
GitHub repositories of the co-authors (team members):
\begin{itemize}
\tightlist
\item
\href{https://github.com/as10790/PUI2016_as10790}{Anastasia\_Shegay}
\item
\href{https://github.com/priyanshi09/PUI2016_ps3369}{Priyanshi\_Singh}
\item
\href{https://github.com/vishelar/PUI2016_vys217}{Vishwajeet\_Shelar}
\item
\href{https://github.com/aaron-15/PUI2016_ajd629}{Aaron\_D'Souza}
\item
\href{https://github.com/akpen/PUI2016_akp418}{Akshay\_Penmatcha}
\item
\href{https://github.com/achillessaxby/PUI2016_aes807}{Achilles\_Saxby}
\end{itemize}
\end{enumerate}
\selectlanguage{english}
\FloatBarrier
\end{document}