Abstract
User feedback is critical to improving the quality of
open data. However, informal methods used to collect feedback make it difficult
to translate this information into formal design requirements. In this paper, we explore manual and automated
classification for a corpus of user comments collected from data.gov, an open
data portal providing access to thousands of datasets published by city, state,
and federal government agencies in the USA. We inductively build a
classification of user reported quality issues, manually apply this
classification to all issues reported in 2015 and 2016, and then attempt to build
a model to automate the classification of these issues. Our results indicate
that most reported issues deal with broken links and unavailable data.
Compared to manual classification, automated classification struggles to
accurately identify issues based on the length and content of the free-text
descriptions. We conclude with future directions for improving open data
quality based on the solicitation of user feedback.
Introduction
Over the last decade, open data initiatives have demonstrably increased access to public sector information (Davies, 2010); have been shown to correlate strongly with more efficient and impactful basic science research (Piwowar and Vision, 2013); and can act as a driver of private sector innovation in critical areas of our economy such as healthcare and renewable energy (Chan, 2013). Yet many previous studies have shown that open data remains difficult for lay users to discover (Chun et al., 2010), access (Janssen et al., 2012), and interpret \cite{braunschweig_state_2012}. Barriers to the meaningful use of open data include poor metadata quality (), insufficient provenance information (), and data portal designs that fail to effectively match data producers with data consumers (). Collectively, these
barriers to use can be described as a function of the overall quality of open
data (Martin, Rosario, & Perez, 2016). In the following paper we do not seek to define “data quality” in a broad sense, as this is a rather relative phenomenon (for an extended discussion of this point, see ). Instead, we want to better understand how user feedback can be systematically gathered and analyzed in order to improve the quality of open data. Formally stated, our research question asks simply: What do users report as issues with data quality on Data.gov? And how can this feedback best be classified to improve requirements engineering for data portals?
Research Design
The research design presented here builds on an existing literature on modeling user feedback in software engineering. Notably,
manual and automated classification methods have been used to improve
requirements engineering for mobile application development \cite{carreno_analysis_2013}. Our work similarly attempts to inductively build a
classification scheme for quality issues reported on data.gov, manually classify
issues with this scheme, and then use simple machine learning algorithms to automate
classification tasks.
Data
Users of data.gov file a quality issue by submitting a
web-form linked to on the landing page of a dataset. Reported issues are then
posted to data.gov, and an email is sent to the issue reporter notifying them
of the status of the issue. Each report of a data quality issue therefore includes
the following information: an identifier for the issue, a timestamp indicating
when the issue was reported, a title describing the issue, the status of the
reported issue (open or closed), and a free text description of the issue. The
data in this study include all data quality issues filed by users of Data.gov in
2015 and 2016 (n = 955).
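For illustration, a single reported issue can be represented as a record with these five fields. The following is a minimal sketch in R; the values are hypothetical and are not drawn from actual data.gov reports.

# A hypothetical issue record with the five fields described above; the
# values are invented for illustration only.
example_issue <- list(
  id          = "issue-00123",                       # identifier for the issue
  reported_at = "2015-06-01",                        # timestamp of the report
  title       = "Broken download link",              # title describing the issue
  status      = "open",                              # status: open or closed
  description = "The CSV link returns a 404 error."  # free-text description
)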
Extraction
Issue metadata were scraped from data.gov using the ‘rvest’ package in R \cite{wickham_rvest:_2015}. The data were then normalized (e.g. duplicate reports removed, dates translated to a consistent format) and the free-text descriptions of the reported issues were subset by year (2015 and 2016) for further analysis (described in detail below).
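The extraction and normalization steps can be sketched roughly as follows. This is an illustrative sketch, not the exact code used: the listing URL, CSS selectors, and assumed date format are hypothetical, since they depend on the markup of the data.gov issue pages at the time of collection.

library(rvest)

# Hypothetical listing URL and CSS selectors; the real data.gov markup differs.
page <- read_html("https://www.data.gov/issues/")

issues <- data.frame(
  id          = page %>% html_nodes(".issue-id") %>% html_text(trim = TRUE),
  reported_at = page %>% html_nodes(".issue-date") %>% html_text(trim = TRUE),
  title       = page %>% html_nodes(".issue-title") %>% html_text(trim = TRUE),
  status      = page %>% html_nodes(".issue-status") %>% html_text(trim = TRUE),
  description = page %>% html_nodes(".issue-description") %>% html_text(trim = TRUE),
  stringsAsFactors = FALSE
)

# Normalization: remove duplicate reports and parse dates into a consistent format.
issues <- issues[!duplicated(issues$id), ]
issues$reported_at <- as.Date(issues$reported_at, format = "%B %d, %Y")

# Subset the free-text descriptions by year for further analysis.
issues_2015 <- subset(issues, format(reported_at, "%Y") == "2015")
issues_2016 <- subset(issues, format(reported_at, "%Y") == "2016")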
Manual Classification
Two graduate students and the lead author developed a coding
scheme to classify the quality issues reported by users. Using an inductive approach to classification, a subset of the reported issues (n = 30) was openly coded with
short, free-text descriptions such as “Broken link” to describe issues that
noted a 404 error. After a first round of open coding, we compared our
descriptions and developed a six-category classification (category titles and definitions are described in Table 1). Using these categories, we then coded a second subset
of issues (n = 30) and compared our results using Fleiss' Kappa (Fleiss and Cohen, 1973), as implemented in the ‘irr’ package in R \cite{gamer_irr:_2012}. Coding in the second round achieved an inter-rater reliability score indicating strong agreement (K = 0.89). However, our disagreements resulted in the addition of a
new category. A third round of coding was then conducted (n=30) in order to
ensure that no additional categories were needed. After comparing results (K = 0.94), no new categories were required. The two graduate students then used the categorization scheme from Table 1 to manually classify the remaining issues (n = 865).
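Inter-rater agreement of the kind reported above can be computed with the ‘irr’ package roughly as follows. This is a minimal sketch: the ratings below are invented example codes (rows are issues, columns are the three coders), not the actual coding data.

library(irr)

# Hypothetical category codes assigned by the three coders to the same issues.
ratings <- data.frame(
  coder1 = c("Broken Link", "Data Validity", "Broken Link",    "Obsolescence"),
  coder2 = c("Broken Link", "Data Validity", "Incorrect Link", "Obsolescence"),
  coder3 = c("Broken Link", "Data Validity", "Broken Link",    "Obsolescence")
)

# Fleiss' Kappa for agreement among more than two raters.
kappam.fleiss(ratings)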
Automatic Classification
Manually classified issues were used as a training set to
explore automated classification using machine learning. We divided classified issues into two separate
datasets by year, and used 2015 manual classifications as a training set. We
imported the training set into Weka, removed commonly occurring words using the stoplist found in the ClassRainbow library, and transformed the issue descriptions into attributes using the ‘StringToWordVector’ filter, retaining only the top 300 tokenized words. We then analyzed the training set using ZeroR (as a baseline), JRip, and NaiveBayes classifiers. Finally, we imported the 2016 issues dataset and ran all three classifiers over this dataset.
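An equivalent workflow can be sketched in R with the RWeka package, under the assumption that the issue descriptions have already been converted to word-vector attributes (e.g. with the StringToWordVector filter) and exported as ARFF files. The file names and the class attribute name (‘category’) are hypothetical.

library(RWeka)

# Hypothetical ARFF files: word-vector attributes plus a nominal class attribute.
train <- read.arff("issues_2015.arff")   # manually classified 2015 issues
test  <- read.arff("issues_2016.arff")   # 2016 issues used for evaluation

# JRip is exported by RWeka; ZeroR and NaiveBayes are wrapped by Weka class name.
ZeroR      <- make_Weka_classifier("weka/classifiers/rules/ZeroR")
NaiveBayes <- make_Weka_classifier("weka/classifiers/bayes/NaiveBayes")

models <- list(
  zeror = ZeroR(category ~ ., data = train),
  jrip  = JRip(category ~ ., data = train),
  nb    = NaiveBayes(category ~ ., data = train)
)

# Evaluate each classifier, trained on the 2015 issues, against the 2016 issues.
lapply(models, evaluate_Weka_classifier, newdata = test)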
Results
Table 1. Categories used to classify user-reported data quality issues on data.gov.
Category | Definition
Broken Link | The link appearing on data.gov no longer resolves to the resource described
Incorrect Link | An incorrect link was provided for a dataset
Data Availability | Data are no longer available from the data provider
Data Validity | Data are incorrect, corrupt, or invalid in some way
Incorrect Metadata | Metadata are incorrect or incomplete
Obsolescence | Data are either outdated, or a newer (updated) version of the dataset should be available
Lack of Documentation | There is a lack of information available for properly using the data (e.g. codebooks, provenance information, etc.)
Unclear | The issue filed does not appear to clearly deal with a data quality issue