Abstract

User feedback is critical to improving the quality of open data. However, the informal methods used to collect feedback make it difficult to translate this information into formal design requirements. In this paper, we explore manual and automated classification of a corpus of user comments collected from data.gov, an open data portal providing access to thousands of datasets published by city, state, and federal government agencies in the USA. We inductively build a classification of user-reported quality issues, manually apply this classification to all issues reported in 2015 and 2016, and then attempt to build a model to automate the classification of these issues. Our results indicate that most reported issues deal with broken links and unavailable data. Compared to manual classification, automated classification struggles to accurately identify issues based on the length and content of their free-text descriptions. We conclude with future directions for improving open data quality through the solicitation of user feedback.

Introduction 

Over the last decade, open data initiatives have demonstrably increased access to public sector information (Davies, 2010); have been shown to correlate strongly with more efficient and impactful basic science research (Piwowar and Vision, 2013); and can act as a driver of private sector innovation in critical areas of the economy such as healthcare and renewable energy (Chan, 2013). Yet many previous studies have shown that open data remains difficult for lay users to discover (Chun et al., 2010), access (Janssen et al., 2012), and interpret \cite{braunschweig_state_2012}. Barriers to the meaningful use of open data include poor metadata quality (), insufficient provenance information (), and data portals that are not designed to effectively match data producers with data consumers (). Collectively, these barriers to use can be described as a function of the overall quality of open data (Martin, Rosario, & Perez, 2016). In the following paper we do not seek to define “data quality” in a broad sense, as this is a rather relative phenomenon (for an extended discussion of this point, see ). Instead, we want to better understand how user feedback can be systematically gathered and analyzed in order to improve the quality of open data. Formally stated, our research questions ask: What do users report as issues with data quality on Data.gov? And how can this feedback best be classified to improve requirements engineering for data portals?

Research Design

The research design presented here builds on an existing literature on modeling user feedback in software engineering. Notably, manual and automated classification methods have been used to improve requirements engineering for mobile application development \cite{carreno_analysis_2013}. Our work similarly attempts to inductively build a classification scheme for quality issues reported on data.gov, manually classify issues with this scheme, and then use simple machine learning algorithms to automate classification tasks.
Data
Users of data.gov file a quality issue by submitting a web form linked from the landing page of a dataset. Reported issues are then posted to data.gov, and an email is sent to the issue reporter notifying them of the status of the issue. Each report of a data quality issue therefore includes the following information: an identifier for the issue, a timestamp indicating when the issue was reported, a title describing the issue, the status of the reported issue (open or closed), and a free-text description of the issue. The data in this study include all data quality issues filed by users of Data.gov in 2015 and 2016 (n = 955).
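For illustration, a single reported issue might be represented in R as a one-row data frame; the field values below are invented, but the fields mirror the information listed above.

# One reported issue, represented as a single-row data frame.
# Values are invented for illustration; only the field names mirror
# what a data.gov issue report contains.
example_issue <- data.frame(
  id          = "issue-00001",
  reported    = as.Date("2015-03-14"),
  title       = "Dataset link returns a 404 error",
  status      = "open",
  description = "The download link on the landing page no longer works.",
  stringsAsFactors = FALSE
)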
Extraction
Issue metadata were scraped from data.gov using the ‘rvest’ package in R \cite{wickham_rvest:_2015}. The data were then normalized (e.g. duplicate reports removed, dates translated to a consistent format) and the free-text descriptions of the reported issues were subset by year (2015 and 2016) for further analysis (described in detail below).
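A minimal sketch of this extraction step is shown below. The URL and CSS selectors are placeholders (the actual markup of the data.gov issue pages is not reproduced here), but the rvest calls and the normalization steps follow the procedure described above.

library(rvest)

# Placeholder URL and selectors; the real data.gov issue markup may differ.
page <- read_html("https://www.data.gov/issues/")

titles       <- page %>% html_nodes(".issue-title") %>% html_text(trim = TRUE)
dates        <- page %>% html_nodes(".issue-date") %>% html_text(trim = TRUE)
descriptions <- page %>% html_nodes(".issue-description") %>% html_text(trim = TRUE)

issues <- data.frame(title = titles, reported = dates,
                     description = descriptions, stringsAsFactors = FALSE)

# Normalization: remove duplicate reports and parse dates
issues <- issues[!duplicated(issues), ]
issues$reported <- as.Date(issues$reported, format = "%B %d, %Y")  # assumed date format

# Subset by year for the analyses described below
issues_2015 <- issues[format(issues$reported, "%Y") == "2015", ]
issues_2016 <- issues[format(issues$reported, "%Y") == "2016", ]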
Manual Classification
Two graduate students and the lead author developed a coding scheme to classify the quality issues reported by users. Using an inductive approach to classification, a subset of the reported issues (n = 30) was openly coded with short, free-text descriptions such as “Broken link” to describe issues that noted a 404 error. After a first round of open coding, we compared our descriptions and developed a six-category classification (category titles and definitions are described in Table 1). Using these categories, we then coded a second subset of issues (n = 30) and compared our results using Fleiss’ kappa, as implemented by the ‘irr’ package in R (Fleiss and Cohen, 1973; \cite{gamer_irr:_2012}). Coding in the second round achieved an inter-rater reliability score indicating strong agreement (K = 0.89); however, our disagreements resulted in the addition of a new category. A third round of coding was then conducted (n = 30) to ensure that no additional categories were needed. After comparing results (K = 0.94), no new categories were required. The two graduate students then used the categorization scheme from Table 1 to manually classify the remaining issues (n = 865).
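The agreement calculation can be reproduced with the ‘irr’ package; the sketch below uses an invented coding matrix with one row per issue and one column per coder.

library(irr)

# Hypothetical coding matrix: one row per issue in a coded subset,
# one column per coder, cells holding the assigned category label.
codes <- data.frame(
  coder_1 = c("Broken Link", "Data Availability", "Obsolescence"),
  coder_2 = c("Broken Link", "Data Availability", "Incorrect Metadata"),
  coder_3 = c("Broken Link", "Data Availability", "Obsolescence"),
  stringsAsFactors = FALSE
)

# Fleiss' kappa across the three coders
kappam.fleiss(codes)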
Automatic Classification
Manually classified issues were used as a training set to explore automated classification using machine learning. We divided the classified issues into two separate datasets by year and used the 2015 manual classifications as a training set. We imported the training set into Weka, removed commonly occurring words using the stoplist found in the ClassRainbow library, and transformed the issue descriptions into attributes using the ‘StringToWordVector’ filter, retaining only the top 300 tokenized words. We then analyzed the training set using ZeroR (as a baseline), JRip, and NaiveBayes classifiers. Finally, we imported the 2016 issues dataset and ran all three classifiers over it.
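Although this step was carried out in the Weka workbench itself, the same train-on-2015 / test-on-2016 comparison can be sketched from R using the ‘RWeka’ interface. The file names below are hypothetical, and the sketch assumes the free-text descriptions have already been converted to word-count attributes with the StringToWordVector filter.

library(RWeka)

# Hypothetical ARFF files containing one row per issue: a 'category' class
# attribute plus token attributes produced by StringToWordVector.
train_2015 <- read.arff("issues_2015_vectorized.arff")
test_2016  <- read.arff("issues_2016_vectorized.arff")

# ZeroR and NaiveBayes are not pre-registered in RWeka, so we build
# interfaces to the underlying Weka classes.
ZeroR      <- make_Weka_classifier("weka/classifiers/rules/ZeroR")
NaiveBayes <- make_Weka_classifier("weka/classifiers/bayes/NaiveBayes")

baseline <- ZeroR(category ~ ., data = train_2015)   # majority-class baseline
rules    <- JRip(category ~ ., data = train_2015)    # JRip ships with RWeka
bayes    <- NaiveBayes(category ~ ., data = train_2015)

# Evaluate each classifier on the held-out 2016 issues
evaluate_Weka_classifier(baseline, newdata = test_2016)
evaluate_Weka_classifier(rules,    newdata = test_2016)
evaluate_Weka_classifier(bayes,    newdata = test_2016)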

Results

Table 1. Classification categories for user-reported data quality issues.

Broken Link: The link appearing on data.gov no longer resolves to the resource described.
Incorrect Link: An incorrect link was provided for the dataset.
Data Availability: Data are no longer available from the data provider.
Data Validity: Data are incorrect, corrupt, or invalid in some way.
Incorrect Metadata: Metadata are incorrect or incomplete.
Obsolescence: Data are outdated, or a newer (updated) version of the dataset should be available.
Lack of Documentation: There is a lack of information available for properly using the data (e.g. codebooks, provenance information, etc.).
Unclear: The filed issue does not appear to clearly deal with a data quality issue.