WiFind: Analyzing Wi-Fi Density around NYCHA Housing Projects

Executive Summary

Today, online access is crucial to people who need to improve their education, employment status, and business opportunities. However, not everyone is able to connect to the Internet for these purposes. Wi-Fi, or wireless Internet, has become a widespread way to connect to the Internet and a utility for many households. Although Wi-Fi costs have remained competitive, accessibility to Wi-Fi still remains a challenge and an interesting research topic.

The motivation behind our project is to explore the difference in Wi-Fi density among public and private residential areas across New York City. We focused on five New York City Housing Authority (NYCHA) [1] public housing project areas, and collected Wi-Fi data for all of them and their adjacent private residential neighborhoods. We then set out to explore the correlation between median household income and Wi-Fi densities among these two groups (public and private). All data was collected at the census block group level.

We found a difference in the shape of the distributions for open networks among public and private housing areas. However, there is no significant difference in the shape of the distributions for closed networks in public and private housing areas. While this insight into our data is interesting, our small dataset imposes many limitations. More data collection and a larger sample size are recommended to help answer, with statistical significance, whether there is a difference in Wi-Fi density between public and private housing areas for both open and closed networks.

Research question & Hypothesis

Definition

Wi-Fi density is the count of unique networks divided by the population, computed at the census block group level.
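The definition can be illustrated with a minimal Python 3 sketch. The function name `wifi_density`, the block group labels, and all the numbers are hypothetical and are not taken from our data; the choice to return `None` for an unpopulated block group is also an assumption.

```python
def wifi_density(unique_network_count, population):
    """Return unique networks per resident for one census block group."""
    if population <= 0:
        # Assumption: density is treated as undefined when nobody lives there.
        return None
    return float(unique_network_count) / population


# Hypothetical block groups: (unique networks observed, resident population)
block_groups = {"bg_a": (120, 1500), "bg_b": (45, 900), "bg_c": (10, 0)}

densities = {
    name: wifi_density(count, population)
    for name, (count, population) in block_groups.items()
}
```

Under these made-up inputs, `bg_a` has 120 networks for 1,500 residents, so its density is 120 / 1500 = 0.08 networks per person.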

Research Question

Is there a difference in Wi-Fi density between public housing projects and their adjacent private residential areas?

Hypothesis

  1. The distributions of Wi-Fi density are different among public residential areas and private ones.

  2. There exists a correlation between median household incomes and Wi-Fi densities among studied areas.

Literature review & Previous work

Literature Review

The study “The Failure of Public Wi-Fi” written by Eric M. Fraser (2007) [2] explained why major public Wi-Fi projects were destined for failure. Public Wi-Fi was supposed to be irreplaceable in connectivity, allowing residents to get an Internet connection anywhere. It was meant to bridge the digital divide between those who have access to technology and those who do not.

However, even though Wi-Fi has spread widely and become integrated into people’s daily lives, major public Wi-Fi projects collapsed in 2007 and comprehensive citywide public Wi-Fi networks could not be delivered, even in New York City. On one hand, the technical and regulatory limitations together require access points at least every few hundred feet outside and closer indoors; this requires high-touch, high-density installations and leads to high cost. As a result, municipalities could at best roll out costly but limited networks. On the other hand, users also lacked motivation to connect to the limited networks that municipalities delivered. For outdoor wireless access, cellular companies could offer high-speed 3G wireless data networks using technologies better suited for widespread coverage. Understanding the strengths and limitations of municipal Wi-Fi can strengthen our understanding of the need for public Wi-Fi within the scope of our research [2].

According to a report from the New York City Comptroller's Office (Scott Stringer, 2014) [3], households with slow or no access to high-speed Internet are more likely to live in neighborhoods across the Bronx, Central Brooklyn, East Harlem and Lower East Side Manhattan, all of which have lower income levels. Neighborhoods with the most high-speed Internet access are concentrated in upscale parts of Manhattan, which have higher income levels. The report also stated that 27% of households in New York City endure slow download speeds. The gap is most obvious in the Bronx, where 33.3% of households lack Internet access at home, while the number is much lower in Manhattan (21%).

As stated in the report, online access is crucial to people who need to improve their education, employment status and business opportunities, and a lack of access therefore makes people’s lives much more difficult in many ways.

The paper ‘Wifi Hotspots in 100 U.S. Cities and Policy Implications’ written by Youngsun Kwon and Hong-Kyu Lee (2009)[4] argues for the importance of Internet accessibility in society and it shows the number of Wi-Fi hotspots in 100 U.S. cities is significantly related to a city’s population, population density, household density, and household median income. The paper recommended further research on analyzing Wi-Fi counts as a function of accessibility and usability in low-income communities as well as on the characteristics of public Wi-Fi usage and users, which is highly linked to our research topic.

Regarding low-income communities and policy implications, the authors recommended analyzing higher-resolution data. Compared to the data set used in the paper, our project has more precise Wi-Fi location information, which improves the accuracy of our analysis. The authors used a regression model to estimate Wi-Fi counts from population data at the city level, whereas our project is able to collect location data at the latitude and longitude level. As a result, we are better suited to analyze the relationship between Wi-Fi distribution and income.

The study ‘The Emerging Ethics of Humancentric GPS Tracking and Monitoring’ written by Katina Michael, Andrew McNamee and M.G. Michael (2016) [5] provides a four-point ethics-based conceptual model. This model addresses ethical concerns around 1) human purpose, 2) morality, 3) justice, and 4) principles. Keeping these four points in mind, the authors recommend addressing ethical issues by thinking critically about participants, people’s concerns, and cultural values, among others.

When applying this conceptual thought process to our research project, we identified one major ethical problem in our analysis. Our project identifies Wi-Fi hotspots and their signal strengths throughout New York City. We are able to collect location-based Wi-Fi signal strength data via a mobile app. This raises the risk that app users could be re-identified from data available via our website. One can obtain data by device or MAC address and easily trace a user's movements throughout New York City. Withholding these specific variables from our open data can easily fix this problem.

Previous work

As part of the 2017 summer capstone projects, team WiFind is tasked with advancing the work of past cohorts in developing a robust framework for collecting and analysing Wi-Fi signal strengths across New York City. Past cohorts have contributed to this mission by developing a mobile application that collects Wi-Fi signal strengths from users as they commute in their cities. The app detects how fast the user moves and adjusts the data collection speed accordingly. As Figure 1 shows, past cohorts have also developed a comprehensive website where the collected data is visualized via maps, available for download via API, and ready for analysis.
WiFind Interface

Data Description

Data Collection

In late June 2017, our team traversed five different public housing project neighborhoods and their adjacent private housing neighborhoods to collect Wi-Fi signal strength data.

The team used an Android application called WiFind to collect data. The app transmits the collected data to a CUSP server, where it is stored in a MySQL database. Once equipped with the app, every team member went to their assigned housing project neighborhoods and walked around the housing project block to collect data.

Once we collected our data, we used SQL to extract records by device type from our MySQL database. Afterwards, we used a Linux command to store the records in CSV format. There are five data files, each containing data for a different housing project neighborhood.

The figure below shows our five targeted study areas.

Research areas

Datasets Description

Our project used four datasets to explore our research question. The first one, the Wi-Fi dataset, contains location, device information, and Wi-Fi information. We obtained the Wi-Fi dataset from our MySQL database, which was deployed on an NYU CUSP server. Table one shows detailed feature descriptions.
Summary of Wi-Fi data set

The second dataset used contained population counts of NYC by census block group. We downloaded the dataset from https://www.socialexplorer.com/. Table two shows detailed feature descriptions.

Summary of population data set

The third dataset contained median household income of NYC by census block group. We downloaded the dataset from https://www.socialexplorer.com/. Table three shows detailed feature descriptions.

Summary of median household income data set

Methodology

Visualization pipeline

Phase 1 - Input: There are two ways to obtain input data. One is to extract data from the WiFind MySQL database, which is located on a CUSP server. The other is to fetch data through the WiFind API.

Phase 2 - Pre-processing: To understand the patterns of Wi-Fi signal density around target research areas, we use the count of unique “Basic Service Set Identifier” (bssid) values as the measurement for drawing choropleth maps. In addition, to find the spatial patterns of open Wi-Fi density, we filter the selected data by their “Service Set Identifier” (ssid). The list of open Wi-Fi ssid values was provided by last year’s team and updated by our team.
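The pre-processing step above can be sketched in a few lines of Python 3. This is only an illustration under stated assumptions: the record layout, the helper `unique_bssids`, and the contents of `OPEN_SSIDS` are hypothetical stand-ins, not the actual WiFind schema or the curated open-network list.

```python
# Hypothetical scan records: one row per sighting of an access point.
records = [
    {"bssid": "aa:bb:cc:00:00:01", "ssid": "LinkNYC Free Wi-Fi"},
    {"bssid": "aa:bb:cc:00:00:01", "ssid": "LinkNYC Free Wi-Fi"},  # repeat sighting
    {"bssid": "aa:bb:cc:00:00:02", "ssid": "HomeNet-5G"},
    {"bssid": "aa:bb:cc:00:00:03", "ssid": "attwifi"},
]

# Illustrative stand-in for the curated list of open-network ssid values.
OPEN_SSIDS = {"LinkNYC Free Wi-Fi", "attwifi"}


def unique_bssids(rows):
    """Collapse repeated sightings so each access point is counted once."""
    return {row["bssid"] for row in rows}


all_networks = unique_bssids(records)
open_networks = unique_bssids(r for r in records if r["ssid"] in OPEN_SSIDS)

print(len(all_networks), len(open_networks))
```

Counting distinct bssid values rather than rows keeps an access point that was seen many times along a walk from inflating the density.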

Phase 3 - Geo-processing: First, we drew a grid map as a basemap covering all target areas. The default size of each grid cell is 50 x 50 ft. Next, we spatially joined all data records (points) with the grid map (polygons) and calculated the number of unique bssid values for each grid cell.
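For a regular, axis-aligned grid, the point-to-cell assignment can be approximated without a polygon join by integer-dividing projected coordinates by the cell size. The Python 3 sketch below uses that simpler technique in place of a geopandas spatial join. It assumes coordinates are already projected to feet and that the default cell has a 50 ft edge; the function names and sample points are hypothetical.

```python
import math
from collections import defaultdict

CELL_SIZE_FT = 50.0  # assumed edge length of one grid cell, in feet


def cell_index(x_ft, y_ft, origin_x=0.0, origin_y=0.0, size=CELL_SIZE_FT):
    """Map a projected point (in feet) to the (column, row) of its grid cell."""
    col = int(math.floor((x_ft - origin_x) / size))
    row = int(math.floor((y_ft - origin_y) / size))
    return col, row


def unique_bssids_per_cell(points):
    """points: iterable of (x_ft, y_ft, bssid). Returns {cell: unique count}."""
    cells = defaultdict(set)
    for x_ft, y_ft, bssid in points:
        cells[cell_index(x_ft, y_ft)].add(bssid)
    return {cell: len(bssids) for cell, bssids in cells.items()}


# Hypothetical observations in projected feet.
points = [
    (10.0, 20.0, "ap1"),
    (40.0, 45.0, "ap2"),
    (12.0, 22.0, "ap1"),   # repeat sighting in the same cell
    (60.0, 20.0, "ap1"),   # same access point seen from the next cell
    (125.0, 75.0, "ap3"),
]
counts = unique_bssids_per_cell(points)
```

Note that an access point detected from two cells is counted once in each, which is consistent with counting unique bssid values per grid cell.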

Phase 4 - Plot: Carto is one of the most popular geospatial visualization tools today. To better present the Wi-Fi signal density data after geoprocessing, we customized a Carto map template. In addition, we connected to the Carto server, then uploaded and updated the dataset via the Carto SQL API. With the template, choropleth maps can be generated automatically for the corresponding input data.

Phase 5 - Output: The outputs include maps of all detected Wi-Fi density and of open Wi-Fi density. Each map is published on its own public Carto page. The outputs also include brief introductions and basic comparisons between Wi-Fi density around housing project areas and their adjacent residential areas. To build this automatic visualization pipeline, we mainly used libraries such as pandas, geopandas and shapely, together with the Carto SQL API, in Python 2.7.

Phase 6 - Analysis: We take the processed data and run statistical analysis to answer our research question. Our findings are discussed in detail in section six.

Results

Visualization

We explored five public housing projects and their adjacent private housing areas, and visualized the collected data in grid maps. Taking one project as an example, Figure six shows the counts of Wi-Fi access points around both public and private residential areas in Chelsea. Open Wi-Fi access points make up only a fraction of all Wi-Fi access points.

Counts of Wi-Fi access points around Chelsea. (a) All Wi-Fi (b) Open Wi-Fi

Differences of normalized counts in different areas (public and private)

The following figure shows the differences in normalized counts among our five targeted areas.
Bar Plot of Network Counts by Targeted Study Area

When we initiated our research, we decided to target five different public residential areas and five adjacent private residential areas. The above plot displays normalized Wi-Fi counts for each target area, broken down into four groups: open Wi-Fi counts in private residential areas, open Wi-Fi counts in public residential areas, closed Wi-Fi counts in private residential areas, and closed Wi-Fi counts in public residential areas. Please note that Wi-Fi counts are normalized by the population within the given area at the census block group level.

The following figure shows the differences of normalized counts in log scale among different areas. In terms of the normalized count of open Wi-Fi access points, both the public housing areas and private housing areas get more open networks per capita than private ones.

Boxplots of Different Wi-Fi Access Counts in Log Scale
Comparison of Wi-Fi Access Counts by Area

Results of Fittings

Below you’ll find two plots displaying Wi-Fi density (controlled by population) among public residential areas and private residential areas for both open and closed Wi-Fi networks. The first plot (figure 11) shows a moderate positive correlation between median household income and Wi-Fi density for open networks among public residential areas and private residential areas. The p-value for this correlation is statistically significant at a 0.05 alpha value threshold.

The second plot (figure 12) shows a weak to moderate positive correlation between median household income and Wi-Fi density for closed networks among public residential areas and private residential areas. The p-value for this correlation is not statistically significant at a 0.05 alpha value threshold.

We can interpret these moderate relationships as follows: as median household income increases, Wi-Fi density also tends to increase. However, other influences may be in play, such as commercial activity and external factors we don’t have data to account for. More data collection is recommended.
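A regression in log scale of the kind described above can be sketched in plain Python 3. The sketch fits a least-squares line to log(density) against log(income) and reports the Pearson correlation. The function names and the synthetic data are assumptions made for illustration; they are not our measurements, and the sketch does not compute a p-value (a library routine such as `scipy.stats.linregress` reports one alongside the slope and correlation).

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def loglog_fit(incomes, densities):
    """Least-squares line log(density) = intercept + slope * log(income)."""
    lx = [math.log(v) for v in incomes]
    ly = [math.log(v) for v in densities]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    sxx = sum((x - mean_x) ** 2 for x in lx)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, pearson_r(lx, ly)


# Synthetic data that follows density = 0.0001 * sqrt(income) exactly.
incomes = [10000.0, 40000.0, 90000.0, 160000.0]
densities = [0.01, 0.02, 0.03, 0.04]
slope, intercept, r = loglog_fit(incomes, densities)
```

Because the synthetic data is an exact power law, the fitted slope is 0.5 and the correlation is 1 up to rounding; real block group data would scatter around the line and give a weaker correlation.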

Regression Analysis in Log Scale for Open Wi-Fi Access Points in Public and Private Residential Census Block Groups
Regression Analysis in Log Scale for Closed Wi-Fi Access Points in Public and Private Residential Census Block Groups

Factoring for Building Height

Data was collected at ground level, yet both population and Wi-Fi networks exist at different altitudes. To account for this limitation, we multiplied Wi-Fi counts by the mean number of floors for buildings in each census block group. The p-value for the resulting correlation is not statistically significant at a 0.05 alpha value threshold. As you can see below, adjusting for building height to approximate the networks we could not observe did not improve our results. Data collection at different altitudes is therefore recommended.
Regression Analysis in Log Scale of Wi-Fi Access Points for Public and Private Residential Census Block Groups Factoring for Building Height
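The building height adjustment is a simple scaling, shown here as a Python 3 sketch. The function name and the numbers are hypothetical; the only part taken from our method is the multiplication of the ground-level count by the mean number of floors in the census block group.

```python
def height_adjusted_count(ground_level_count, mean_floors):
    """Scale a ground-level network count by the mean number of floors."""
    return ground_level_count * mean_floors


# Hypothetical block group: 40 networks observed from the street,
# buildings averaging 6 floors, and 1,200 residents.
adjusted_count = height_adjusted_count(40, 6.0)
adjusted_density = adjusted_count / 1200
```

With these made-up inputs the adjusted count is 40 x 6 = 240 networks, or 0.2 networks per person. The adjustment assumes every floor carries as many networks as were visible at street level, which is a strong simplification.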

Statistical Test Results

As mentioned before, our unit of analysis is a census block group, and our results above do show a moderate positive relationship between median household income and normalized Wi-Fi counts. We also discussed factoring in building height by multiplying the private network counts by the mean number of floors for each census block group. We did this only for private networks since it is highly unlikely that open networks exist at higher altitudes.

As you saw in figure 10, our data does not follow a normal distribution, so a parametric test (as first assumed) would not help in answering our research question. After careful research and study of our data, we found the Kolmogorov-Smirnov test to be the most adequate for our research question and well suited to the nature of our data. However, more data collection is recommended.

Data Distributions: Below you’ll notice two plots (figures 14 and 15). Figure 14 shows the distributions for open Wi-Fi networks. Figure 15 shows the distributions for private networks. To test whether our independent samples for open and private networks come from similar distributions, we use the Kolmogorov-Smirnov two-sample test (KS test). One KS test is performed for each plot (i.e. open Wi-Fi distributions and private Wi-Fi distributions).
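The two-sample KS statistic is the largest vertical gap between the empirical cumulative distribution functions of the two samples. The Python 3 sketch below computes only that statistic, using a hypothetical helper `ks_statistic` and made-up density values. It does not compute a p-value; a library routine such as `scipy.stats.ks_2samp` returns both the statistic and the p-value.

```python
def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    gap = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Step past every value equal to x in both samples (handles ties).
        while i < n and a[i] == x:
            i += 1
        while j < m and b[j] == x:
            j += 1
        gap = max(gap, abs(float(i) / n - float(j) / m))
    return gap


# Hypothetical normalized open Wi-Fi counts per census block group.
public_areas = [0.01, 0.02, 0.03, 0.04]
private_areas = [0.03, 0.05, 0.06, 0.07]
d = ks_statistic(public_areas, private_areas)
```

For these illustrative samples the gap peaks at 0.04, where all four public values but only one of the four private values have been passed, giving a statistic of 1.0 - 0.25 = 0.75.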

First KS test. Ho: Open normalized Wi-Fi counts for public areas and open normalized Wi-Fi counts for non-public areas come from the same distribution.

Ha: Open normalized Wi-Fi counts for public areas and open normalized Wi-Fi counts for non-public areas do not come from the same distribution.

After running the KS test, the p-value returned (statistic=0.46, p-value=0.04) was less than the 0.05 alpha value threshold. We can therefore reject the null hypothesis and conclude that open normalized Wi-Fi counts for public areas and open normalized Wi-Fi counts for non-public areas do not come from the same distribution (accepting a 5% risk of a type one error when the null hypothesis is true).

Open Wi-Fi Distributions for Public and Private Housing Areas

Second KS test. Ho: Private normalized Wi-Fi counts for public areas and private normalized Wi-Fi counts for non-public areas come from the same distribution.

Ha: Private normalized Wi-Fi counts for public areas and private normalized Wi-Fi counts for non-public areas do not come from the same distribution.

After running the KS test, the p-value returned (statistic=0.21, p-value=0.82) was not less than the 0.05 alpha value threshold. We therefore cannot reject the null hypothesis: we have no evidence that closed normalized Wi-Fi counts for public areas and closed normalized Wi-Fi counts for non-public areas come from different distributions (this could be a type two error, whose probability the alpha threshold does not bound). Please note, building height was accounted for in this statistical test.

In summary, we ran a test to see if the observed differences between Wi-Fi density distributions in public and private housing areas were statistically significant. We ran that test once for open Wi-Fi networks and once for closed networks. Our final results show that open networks in public and private residential areas have different distributions, while no such difference was detected for closed networks. Although this insight is interesting and helps our research, it does not answer our research question. In order to test for a statistical difference between public and private Wi-Fi counts for open and closed networks, more data is needed.

Closed Wi-Fi Distributions for Public and Private Housing Areas

Conclusion and Implications

Based on the analysis, our team has two primary conclusions in response to our two hypotheses. Regarding hypothesis one, ‘The distribution of Wi-Fi densities are different among public residential areas and private ones’, our team found that there is a significant difference in the shape of open Wi-Fi density distributions between public and private housing areas, while there is no significant difference for closed Wi-Fi density distributions. Regarding hypothesis two, ‘There exists a correlation between median household income and Wi-Fi densities around studied areas’, our team found a moderate positive relationship. More data collection is recommended.

Further research is needed to characterize the specific difference in the distribution of Wi-Fi density between public and private residential areas. Future researchers should collect more data, which may reveal a correlation between Wi-Fi densities and demographic features.

Meanwhile, as stated in the literature review, the difficulties in expanding open networks in New York City are mainly due to the technical and regulatory limitations faced by municipalities and the mismatch between the high cost and the low expected utilization rate. We recommend that potential open Wi-Fi vendors (e.g. Link-NYC) take these factors into consideration and strengthen cooperation with both government and the private sector while installing open Wi-Fi kiosks and expanding across the city.

Limitations

There are two main limitations in our project. First, as our team only collected data in five public residential areas and their adjacent private residential neighborhoods, the sample size of our data is limited, and the regression and statistical test results may be less reliable. Second, as Wi-Fi data was collected at ground level while both population and Wi-Fi networks exist at different altitudes, our analysis does not fully account for Wi-Fi networks on higher building floors.

Contribution

Dongjie was responsible for writing the Python scripts to build the visualization pipeline. He also worked with Xiaomeng to clean the collected Wi-Fi data and plot maps.

Kai was responsible for data collection and for finding relevant data sets at the block group level. He also wrote the mid-term and final reports.

Christian was responsible for the statistical analysis, quality-checked the collected data, and ensured all moving elements came together coherently.

Jie was responsible for data collection, the literature review, and the ethics report, as well as writing for the mid-term and final reports.

References

[1] NYC 311. (2017). New York City Housing Authority (NYCHA) Public Housing. [online] Available at: http://www1.nyc.gov/nyc-resources/service/2286/new-york-city-housing-authority-nycha-public-housing [Accessed 30 Jul. 2017].

[2] Fraser, E. M. (2007). The failure of public Wi-Fi. Journal of Technology Law & Policy, 14.

[3] Smith, G. (2017). Poor NYC areas have slow or no access to Internet: report. [online] NY Daily News. Available at: http://www.nydailynews.com/new-york/poor-nyc-areas-slow-no-access-internet-report-article-1.2036599 [Accessed 21 Jul. 2017].

[4] Kwon, Y. and Lee, H.-K. (2009). Wifi Hotspots in 100 U.S. Cities and Policy Implications. [online] Available at: http://koasas.kaist.ac.kr/bitstream/10203/223002/1/43737.pdf [Accessed 21 Jul. 2017].

[5] Michael, K., McNamee, A. and Michael, M.G. (2016). The Emerging Ethics of Humancentric GPS Tracking and Monitoring. [online] Available at: http://ro.uow.edu.au/cgi/viewcontent.cgi?article=10215&context=infopapers [Accessed 21 Jul. 2017].

[Code Reference]

Dongjie: https://github.com/djfan/wifind

Christian: https://github.com/cor215/WiFind2017

Xiaomeng: https://github.com/xd515/Wifind_file

Kai: https://github.com/kq320/capstone
