Johnny Takes on Stats for Informatics: Simple steps in Hypothesis Tests
Dahlia Hegab, Sonny Lin and Vijay Krishna Palepu
At many universities, graduate students, researchers, and faculty engage in research projects without a firm grasp of statistical concepts and methods. We introduce a novel application titled Johnny Takes on Stats for Informatics, which attempts to define succinctly and clearly the statistical concepts that are crucial to conducting research accurately. We draw on storytelling narratives, such as those used in Rapunsel (www.rapunsel.org) and Alice 3D (www.alice.org), while integrating a JS application designed to walk researchers through the process of using online tools (here we use Google Trends) and data set files to analyze statistical information found online. We cite potential causes of concern in using data sets from Google Trends (and similar data set providers) and discuss future considerations for statistical learning applications such as Johnny Takes on Stats for Informatics.
The purpose of this narrative tutorial on statistical methods is to establish a simplified way for students to learn statistical concepts that are grounded in real-life examples. Alongside the narrative, we present an interface that allows students to interact with data sets that are accessible online. We integrate this interface with information designed to clarify concepts in statistics, while demonstrating how to use those concepts in conjunction with research protocols and technologies like Google Trends (www.google.com/trends/).
The goal is to give students interested in doing research a fun way to quickly and clearly understand the critical concepts in statistics that they need to keep experimenting with their own data sets (or online ones), so that they can interpret future data sets in meaningful and statistically sound ways, while also walking through real-life example data sets that they can access and manipulate for future use. This tutorial is also designed to help students gain the foundation
Our application, titled “Johnny Takes on Stats for Informatics”, is an extension of earlier works that create storytelling modules to teach complex subject material. Using a simplified methodology of storytelling instructional guides, inspired by instructional applications like Alice 3D and websites like killmath.com and http://vectors.usc.edu, we create a story-based online application to teach statistical concepts that are considered challenging by individuals first learning statistical methods in a research setting.
Our application begins with a narrative introducing our protagonist, Johnny. Johnny is an Informatics student, new to research and statistical methods, who must interpret data findings from a research project he is working on. After briefly introducing him, we discuss his research question, clarifying that every research study must state a clear research question that contributes to scholarly knowledge. Johnny's research question is: “people who update their Twitter accounts when using mobile devices update them more than people who update their Twitter statuses using computers and laptops.”
Relying on Schuyler Huck’s Reading Statistics and Research, we explain that this research question is actually an example of an alternative hypothesis (the hypothesis that the researcher proposes as the explanation for a particular phenomenon). Although many conference and journal publications frame their studies around rejecting (or failing to reject) a hypothesis, in statistical terms the more relevant focus is the null hypothesis, since rejecting it lends support to the alternative hypothesis. We give an example of the null hypothesis in this scenario, which can be summarized as the opposite viewpoint of the alternative hypothesis. In this instance, the null hypothesis would be: “people who update their Twitter accounts when using mobile devices do not update their Twitter statuses more than people who update their Twitter statuses using computers and laptops.” In frequentist methods, researchers try to reject the null hypothesis by obtaining a p-value less than .05 (more on this later) in order to present a statistically significant result.
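The decision rule described above can be sketched in a few lines of JS. This is only an illustrative sketch and not the code of our application: the function name decide, the labels it returns, and the default threshold of .05 are assumptions made for this example.

```javascript
// Hypothetical helper illustrating the frequentist decision rule.
// The name `decide` and the returned labels are assumptions for this sketch.
const H0 = "Mobile users do not update their Twitter statuses more than computer/laptop users.";
const Ha = "Mobile users update their Twitter statuses more than computer/laptop users.";

function decide(pValue, alpha = 0.05) {
  // A p-value is a probability, so anything outside [0, 1] is an input error.
  if (!(pValue >= 0 && pValue <= 1)) {
    throw new RangeError("p-value must lie between 0 and 1");
  }
  // Reject H0 only when the p-value is strictly below the threshold.
  return pValue < alpha ? "reject H0" : "fail to reject H0";
}

console.log(H0);
console.log(Ha);
console.log(decide(0.03)); // reject H0
console.log(decide(0.20)); // fail to reject H0
```

Note that the rule never "accepts" either hypothesis: a p-value at or above the threshold only means that the data do not give enough evidence to reject H0.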
We explain the significance of using certain terminology when framing a research question, noting that emphasis should be placed on the influence of the variable(s) rather than on wording that tries to establish direct causation. Since many factors can affect the scenario in question, we go on to explain ways to isolate and test the variable of interest so that a convincing argument can be made from our findings in the data.
We then outline how Johnny should run his study. The goal is to demonstrate a framework for collecting data in a verifiable way, so that statistical findings can be interpreted easily from it. In order to isolate the variable(s) of interest, we suggest that Johnny create two groups of participants to be evaluated: a control group and an “altered” group (the only group that contains the variable of interest). Here the variable of interest is updating Twitter statuses through mobile devices. Building on Schuyler Huck’s Reading Statistics and Research, we create a univariate study, in which only one variable of interest is present for evaluation. In future work and iterations, however, the application could be extended to show, and manipulate, the inclusion or removal of additional independent and dependent variables of interest.
We then present the data from each of the two groups (control and “altered”), making a point of detailing the methodology we use to evaluate it. We extend concepts posed by Andrew Vickers in What is a P-Value Anyway, noting that the best evaluation strategy for interpreting this data is to compare the medians of the two data sets. We agree with Vickers’ approach because the median keeps potential outliers from severely distorting our understanding of the distribution of the data, so we can see the general trend in the figures presented. Ordering the data sequentially in a data set with 15 figures in each set, we take the averages of the medians in each set (figures 7 and 8 sequentially), before obtaining our findings. Results show that users of mobile devices update their Twitter statuses more often than those who use laptops and computers to update their Twitter statuses.
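The median comparison can be illustrated with a short JS sketch. The two arrays below are invented placeholder numbers, not the figures from Johnny's study, and the function name median is an assumption for this example. With 15 values in a set, the median is simply the 8th value after sorting.

```javascript
// Returns the median of a non-empty array of numbers without mutating it.
function median(values) {
  if (values.length === 0) {
    throw new Error("cannot take the median of an empty data set");
  }
  // Sort numerically; the default sort would compare values as strings.
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Odd length: the middle value. Even length: average of the two middle values.
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Invented placeholder data (15 values per group), NOT the study's data.
// The value 95 plays the role of an outlier in the mobile group.
const mobile = [95, 3, 5, 6, 7, 7, 8, 8, 9, 9, 10, 11, 12, 12, 14];
const desktop = [1, 2, 2, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9, 10];

console.log(median(mobile));  // 9
console.log(median(desktop)); // 5
```

In this invented example the outlier 95 pulls the mean of the mobile group well above most of its values, while the median stays at 9, which is the behaviour that motivates comparing medians.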
We implement visualizations of this data set using two JS libraries, D3.js (http://d3js.org/) and Crossfilter (http://square.github.io/crossfilter/). D3.js is a JS library for creating data-driven visualizations, while Crossfilter is a library that helps users explore and interact with multivariate data sets.
In this section we describe the design process that guided development and the key decisions that resulted from it.
To begin developing the tool, we first prototyped our user interface by creating low-fidelity mockups. The central notion behind this approach was to draw out the fundamental ideas that we wanted to capture in our application. Although our initial tool of choice for creating mockups was pen and paper, we quickly switched to the Balsamiq plug-in for Google Drive. The plug-in provided us with a web-based interface for creating the mockups in a collaborative setting. This made it easy to share the mockups as we created them, allowing us to critique and refine our ideas. Apart from easy collaboration, a web-based interface had the advantage of zero installation overhead: since the plug-in worked in a web browser, we never had any issues installing it. This was especially important because we were all using different operating systems and their associated developer tool sets.
We very quickly gravitated towards the idea of building our tool as a web-based application. A major reason was our confidence and expertise in developing web-based applications. In addition, our experience prototyping the UI with a web-based interface showed that the same advantages of web-based technologies could be exploited in developing and deploying the application itself. Since the underlying technologies and standards (HTML, CSS, and JS) are universal, we could not only develop using our own stacks of development tools (editors, IDEs, web browsers, etc.), but also deploy the application with essentially zero overhead, again because of the universal presence and use of these technologies. This further prompted us to render the application solely in the client-side web browser, with perhaps a basic file-hosting capability on the server, thus simplifying deployment by reducing the dependence on a central resource-serving entity.
Some of our initial designs presented a number of different inputs required from the user, largely concerning the data and the specifications of the test itself. These different kinds of inputs, along with a whole host of results and their discussion that would need to be rendered, resulted in a very cluttered UI design. To avoid the clutter, a key decision we made was to guide the user through a series of steps that lead to the execution of a simple hypothesis test. This served two goals. Firstly, it separates the different user inputs into well-defined stages (or steps, as we call them in our application), thus streamlining the UI itself. Secondly, and more importantly, it creates a basic walk-through of a simple hypothesis test, which, in our opinion, might add pedagogical value to the application.
The results that we present as part of the t-test (hypothesis test) are by and large numeric. However, we recognized that it is important to contextualize these numbers so that users of the application can make sense of the results. We focused on two basic approaches to provide this context. Firstly, we provide users with notification-based discussion points about the implications of the results they obtain, given their inputs for the data and the test itself. Secondly, we display the research question (RQ) and hypotheses (H0 and Ha) once again when the results are presented in the final step. Note that the RQ, H0 and Ha are first presented in the first step of the application.
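To give a sense of the kind of numeric result that needs contextualizing, the following JS sketch computes a two-sample t statistic. It assumes Welch's unequal-variance form of the t-test, which is a choice made for this example only; the text above does not specify which variant the application uses. The function names are hypothetical, and turning the statistic into a p-value would additionally require the CDF of the t distribution, which is omitted here.

```javascript
// Arithmetic mean of a non-empty array of numbers.
function mean(xs) {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Unbiased sample variance (divides by n - 1); needs at least two values.
function sampleVariance(xs) {
  const m = mean(xs);
  return xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1);
}

// Welch's t statistic and its approximate degrees of freedom
// (Welch-Satterthwaite). Assumed variant, for illustration only.
function welchT(a, b) {
  if (a.length < 2 || b.length < 2) {
    throw new Error("each group needs at least two observations");
  }
  const va = sampleVariance(a) / a.length; // squared standard error, group a
  const vb = sampleVariance(b) / b.length; // squared standard error, group b
  const t = (mean(a) - mean(b)) / Math.sqrt(va + vb);
  const df = (va + vb) ** 2 /
    (va ** 2 / (a.length - 1) + vb ** 2 / (b.length - 1));
  return { t, df };
}

// Toy input chosen so that the arithmetic is easy to check by hand.
const result = welchT([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]);
console.log(result.t, result.df);
```

A bare pair of numbers like this says little to a newcomer, which is why the application pairs such results with discussion points and a restatement of the RQ, H0 and Ha.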