As part of the urban metabolism, city buildings consume resources and use energy, affecting the surrounding air by emitting plumes of pollution. Plumes observed in Manhattan range from water vapor emitted by the steam vents of heating and cooling systems to CO2 and dangerous chemical compounds (e.g. ammonia, methane). City agencies are interested in detecting and tracking these plumes because they provide evidence of urban activity and of the cultivation of living and working spaces, and because tracking them can support the provision of services while monitoring environmental impacts. The Urban Observatory at New York University’s Center for Urban Science and Progress (CUSP-UO) continuously images the Manhattan skyline at 0.1 Hz, and daytime images can be used to detect and characterize plumes from buildings in the scene. This project built and trained a deep convolutional neural network to detect and track these plumes in near real-time. The project created a large training set of over 1,100 actual plumes, together with sources of contamination such as clouds, shadows and lights, and applied the relevant network architecture to train the model. The trained convolutional neural network was applied to archival Urban Observatory data from two time periods, 26th October-31st December 2013 and 1st January-13th March 2015, to generate detections of building plume activity during those periods. Buildings with high plume ejection rates were identified, and all plumes could be classified by their color (i.e. carbon vs water vapor). The final result was a set of detections of the plumes emitted during the time periods that the dataset spans.
Abstract
Recently many women have come forward telling their stories of sexual assault to raise awareness and empower other women to do the same. Regardless, society is still a toxic place for victims of sexual assault and often blames women for the assault that they endured. This research looks at the language used in tweets surrounding the Harvey Weinstein scandal and looks specifically at tweets about him and a few of the women that came forward. A heuristic is developed to measure the level of sexism contained in a tweet. The tweets are clustered using their GloVe word embeddings and the groups of terms are visualized.
Introduction
In the past few months, a barrage of sexual assault allegations have been flying left and right, and Men’s Rights Activists have been crying out about how they are being victimized and are painting the women coming forward as predatory (oh the irony). They point to the fact that so many women are coming forward at once as evidence that it is some crusade to unjustly attack men. However, the answers to why this is happening now and all at once can’t be seen through the lens of male victimhood. The main reason that so many women are coming forward now is just that so many other women are doing it, so it’s more difficult for public perception to focus its crosshairs on each individual woman. Forever, society has used slut shaming and victim blaming as a way to rationalize assault and to shame victims into silence. But just because more women are feeling empowered to come forward, that doesn’t mean that the victim blaming has suddenly stopped. This analysis looks into attempting to extract and measure the language that is used to talk about both the women and the assaulter in the wake of sexual assault allegations.
Specifically, this paper looks into tweets referencing Harvey Weinstein and the victims of sexual assault who came forward between the dates 10/5/2017 and 10/28/2017.
Methodology
Data
The data used for this project are various collections of tweets. First, a sample of tweets is drawn from the course of 2015 and 2016 to establish a baseline for word frequencies, so that common words can be normalized by their typical frequency and do not dominate the results. Next, tweets were pulled from Twitter search results containing the names Harvey Weinstein, Annabella Sciorra, Zoe Brock, Asia Argento, and Louisette Geiss, between the dates 10/5/2017 and 10/28/2017.
Processing
The natural language library spaCy was used for text processing and tokenization, as well as for extracting word embeddings. The embeddings used are GloVe word embeddings, provided by the Stanford NLP Group. These embeddings are 384-dimensional vectors that give a compressed semantic representation of a word based on word co-occurrences in the training dataset. The embeddings are a fascinating and rich representation of words, allowing for some interesting semantic manipulations of word relationships (e.g. \(vector[\text{'bird'}] - vector[\text{'air'}] + vector[\text{'water'}] \simeq vector[\text{'fish'}]\)). Sentiment is computed using the VADER Sentiment Intensity Analyzer provided by the Python library NLTK, a rule-based model specifically attuned to sentiments expressed in social media.
A metric that represents the level of sexism in a block of text is difficult to achieve. For the purposes of this research, sexism is gauged using two factors: the gender polarity and the text's sentiment. The gender polarity is a measure of how closely a word is associated with either the word "woman" or the word "man".
The metric is calculated as follows: \(\text{gender polarity} = \ln\left(\frac{\text{cosine distance}(\text{word}, \text{"woman"})}{\text{cosine distance}(\text{word}, \text{"man"})}\right)\) where a polarity greater than zero is more closely associated with women and a value less than zero is more closely associated with men. The sentiment score is calculated as the compound polarity score from the VADER Sentiment Analyzer. The sexism score is then calculated as follows: \(\text{sexism score} = -\,\text{gender polarity} \times \frac{\text{word sentiment} + \text{sentence sentiment} + \text{tweet sentiment}}{3}\) where values greater than zero are considered sexist and values less than zero are considered not sexist.
This sexism score only accounts for malevolent sexism towards women and benevolent sexism towards men, meaning that it won’t measure the inverse. This is a known limitation, but it cannot be addressed without a more complex model and a generous amount of training data.
Analysis
The data is first preprocessed and tokenized using spaCy and the sexism scores are calculated. N-grams are extracted for groups of 1, 2, and 3 words and the occurrences of each n-gram are counted. The counts are then normalized by the counts for that n-gram found in the baseline dataset. Terms that do not appear in the baseline dataset are assigned a baseline occurrence of 1. The most frequent terms are then extracted and their word embeddings are compressed into 2 dimensions using t-SNE dimensionality reduction. This allows the data to be plotted as a scatter plot and also improves clustering, as clustering algorithms typically do not perform well on high-dimensional data. The n-grams are then clustered on the 2-dimensional representation using K-Means with 10 clusters.
Results
Conclusions
One major limitation in this research is the quantification of sexism. The current implementation tries to create a heuristic that approximates certain forms of sexism; however, it has several drawbacks.
First, it operates under the assumption that any tweet posted about a given person in this timeframe refers to the sexual assault allegations, and that a tweet carries some degree of sexist charge if it uses gender-directed terms (terms that have a gender polarity) and has a strong sentiment. If either the gender polarity or the sentiment is close to zero, then the phrase would not be considered sexist; however, non-sexist tweets do not always have a gender polarity or sentiment close to zero. As was mentioned at the CUSP Hackathon, this method does not account for a negative tweet that is expressing sympathy towards women. To handle such cases, a context-aware model, such as an LSTM neural network, could be used to provide a better sexism metric. This would require a very large dataset given the complexity of the problem, which would be a considerable feat to assemble.
References
https://github.com/t-davidson/hate-speech-and-offensive-language
https://github.com/pinkeshbadjatiya/twitter-hatespeech
http://www.markhneedham.com/blog/2015/01/19/pythonnltk-finding-the-most-common-phrases-in-how-i-met-your-mother/
http://sacraparental.com/2016/05/14/everyday-misogyny-122-subtly-sexist-words-women/
https://www.wired.com/story/machines-taught-by-photos-learn-a-sexist-view-of-women/
Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
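The gender polarity and sexism score defined in the Methodology can be sketched in a few lines of Python. This is a minimal sketch, not the project's code: the three-dimensional vectors are made-up stand-ins for GloVe embeddings, the sentiment values are hypothetical stand-ins for VADER compound scores, and the function names are invented for illustration. It also assumes that the ratio is taken over cosine similarities, so that a positive polarity means the word is closer to "woman", matching the sign convention stated in the text.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def gender_polarity(word_vec, woman_vec, man_vec):
    """Log-ratio of similarity to "woman" over similarity to "man".

    ASSUMPTION: cosine similarity is used so that values > 0 lean
    towards "woman". Both similarities must be positive for the
    logarithm to be defined.
    """
    to_woman = cosine_similarity(word_vec, woman_vec)
    to_man = cosine_similarity(word_vec, man_vec)
    return math.log(to_woman / to_man)


def sexism_score(polarity, word_sent, sentence_sent, tweet_sent):
    """Negated polarity times the mean of the three sentiment scores."""
    mean_sentiment = (word_sent + sentence_sent + tweet_sent) / 3
    return -polarity * mean_sentiment


# Hypothetical toy embeddings (real GloVe vectors have far more dimensions).
woman = [1.0, 0.2, 0.0]
man = [0.2, 1.0, 0.0]
word = [0.9, 0.3, 0.1]

polarity = gender_polarity(word, woman, man)

# Hypothetical sentiment values in [-1, 1], standing in for VADER output.
score = sexism_score(polarity, word_sent=-0.5, sentence_sent=-0.4, tweet_sent=-0.6)
```

With these toy inputs the word leans towards "woman" and the sentiment is negative, so the score comes out positive, which the heuristic reads as sexist.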
Ben Steers bensteers bs3639
Problem Description
Excessive noise pollution is harmful to health (citation needed). For that reason, gaining an understanding of the level of noise in a city context can provide information about the health of citizens who live and (attempt to) sleep there. A predictive noise model will be built from traffic and business data to estimate noise levels where no sensors are available to measure them. This model attempts to estimate noise levels using measures of just anthrophonic and automotive noise sources, ignoring other sources like biophonic and geophonic noise.
Data
For ground truth noise levels, SONYC sound pressure level (SPL) data in the form of LAeq (equivalent A-weighted sound level) will be used. This data is densely available around Washington Square Park, but is also available in various other locations around Manhattan and Brooklyn. Historical data is available for approximately the last two years, but the amount of data available depends on the specific deployment date. Once I finally get access to the SPL data server, I can get exact timeframes for each sensor. Traffic counts will be used as a measure of traffic flow and are available via NYC Open Data.
Business locations will be gathered using Yelp data, based on the assumption that business density can be used as a proxy for estimating the level of human activity. Because business density is a constant value regardless of the time of day/year, business customer flows may be incorporated using Google Maps popular times data to give temporal characteristics to the business activity.
Analysis
A regression model will be constructed to estimate the sound level on an hourly basis using the data described above. Because it is anticipated that the data being used may not be able to provide an accurate measure at hourly intervals, another regression will be attempted to predict only the maximum LAeq for that day.
References
King et al. performed a statistical assessment of road traffic noise to approximate Lmax over a defined time period, for which Monte Carlo analysis was used.
Deliverable
The end deliverable of this project will be a predictive noise model that can estimate the noise level based on traffic and business data. The results of the noise model will be displayed on a map showing locations of higher and lower noise levels.
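The hourly regression described under Analysis could look like the following sketch. It fits an ordinary least squares model by solving the normal equations in plain Python. The feature values, the coefficients used to generate the synthetic LAeq values, and the function names are all invented for illustration, and the real model may well need different features or a different estimator.

```python
def fit_ols(rows, targets):
    """Fit y ~ b0 + b1*x1 + ... by solving (X^T X) b = X^T y."""
    # Prepend a constant 1.0 to each row for the intercept term.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    n = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; cannot fit")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


def predict(coefs, row):
    """Apply fitted coefficients (intercept first) to one feature row."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Hypothetical hourly observations: (traffic count, nearby business count).
features = [(100, 2), (400, 5), (250, 10), (800, 3), (600, 12), (50, 0)]
# Synthetic LAeq values in dB, generated from a made-up linear rule.
laeq = [55 + 0.01 * t + 0.5 * b for t, b in features]

coefs = fit_ols(features, laeq)
estimate = predict(coefs, (500, 8))  # location with no sensor
```

Because the synthetic targets are exactly linear in the features, the fit recovers the generating coefficients. Real SPL data would be noisy, and the fallback mentioned above (predicting only the daily maximum LAeq) would use the same fitting step with one target value per day.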
Abstract
Citi Bike is a bike sharing company operating in NYC that has made its bike usage data publicly available. This article investigates whether the relative usage by customers on the weekend, compared to their usage during the week, is higher than that of subscribers. Pearson's chi-squared test is used to compare the usage frequencies.
Introduction
Citi Bike is a privately owned bike sharing company that operates in New York City and New Jersey. The service can be used either as pay-per-ride or via a subscription. The two types of users are classed as customers and subscribers, respectively. The research question is based on the idea that subscribers tend to use Citi Bike for commuting, whereas pay-per-ride customers use it more for leisure, which would concentrate customer usage on the weekends.
Data
The Citi Bike data is published monthly. The data can be found at https://s3.amazonaws.com/tripdata/%Y%m-citibike-tripdata.zip, replacing the date format codes with the desired year and month. For this research, May through August 2015 was used. Each row in the dataset represents a single trip taken by a user. The relevant columns in the dataset are the date that the ride was taken and the user type, either customer or subscriber. The distributions of Citi Bike usage on each day of the week for each user type are given in Figure 1 below.
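The URL template in the Data section uses strftime-style format codes (%Y for the four-digit year, %m for the two-digit month). As a minimal sketch, the four monthly URLs for May through August 2015 can be generated as below. The helper name is invented for illustration, and whether the files are still hosted at this address is not checked here.

```python
from datetime import date

# URL template quoted in the Data section; %Y and %m are strftime codes.
URL_TEMPLATE = "https://s3.amazonaws.com/tripdata/%Y%m-citibike-tripdata.zip"


def monthly_urls(year, first_month, last_month):
    """Return one download URL per month in the inclusive range."""
    return [
        date(year, month, 1).strftime(URL_TEMPLATE)
        for month in range(first_month, last_month + 1)
    ]


urls = monthly_urls(2015, 5, 8)  # May through August 2015
```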