Jianguo Xia

The initial motivation for developing MetaboAnalyst was to save myself time. I started my PhD with Dr. David Wishart at the University of Alberta. During that period, the main focus of the lab was, of course, the Human Metabolome Database (HMDB). The development of a metabolomics core facility was also proceeding at full speed. As part of my PhD training, I was involved in a metabolomics study on urine samples from cancer cachexia patients. At that time, the only bioinformatics tool for metabolomics data analysis was a commercial software package, SIMCA-P (Umetrics). We purchased a copy of the tool, which came with a comprehensive manual. Although I could perform some “standard” data analysis to produce the numbers and graphics seen in many metabolomics publications, I soon realized its limitations: many approaches I wanted to try were not supported. I then experimented with Weka (https://www.cs.waikato.ac.nz/ml/weka/), a widely used Java-based machine learning tool, for classification and regression analysis. However, it lacked many features specifically needed for metabolomics data analysis. In the end, I taught myself R to perform data analysis. This worked well for a short time: I analyzed the data the way I wanted, generated impressive graphics, and produced analysis reports using Sweave and LaTeX. However, the process soon became less enjoyable when more collaborators requested that their data be analyzed in a similar fashion. A better way would be to let someone else in the lab do it. The best way would be to let researchers analyze their own data, since most of them are highly educated and understand the basic principles behind most analysis methods. At that time, I was the only one in the lab who knew R and statistics. How could I let other people with some basic knowledge perform the same analysis I would do? In 2008, I started thinking seriously about developing a biologist-friendly tool for metabolomics data analysis.
One of the advantages of being last in the “omics” race is the benefit of hindsight. Many of the approaches developed in other omics fields are not domain-specific and can be adapted for metabolomics. For instance, the GenePattern tool suite \citep{Reich_2006} developed by the Broad Institute gave me a lot of inspiration. Other important considerations were that the tool should be web-based, respond in real time, and be implemented in the languages I knew (Perl, Java and R). During a lab meeting in the summer of 2008, I proposed this idea to David. He was a bit uncertain, as he knew that I had no formal training in developing web-based applications (note: I obtained my MSc in Immunology after graduating from a 5-year Medicine program). I was very enthusiastic and said I could get this done by the end of the year. He smiled and encouraged me to pursue this direction. As most analysis methods and graphics were already implemented in R, the key challenge was to put these functions on the web through a user-friendly interface. I wanted to use a technology that would not expire soon. The Perl CGI-based web framework was losing ground at that time. Java had a lot to offer in terms of web frameworks. However, many of them were too “heavy” for me to learn in a short time. Eventually, I chose the then relatively new JavaServer Faces (JSF) technology. The next technical challenge was how to communicate efficiently between R and Java while dealing with concurrency (i.e. supporting multiple users performing data analysis at the same time). Rserve (https://www.rforge.net/Rserve), developed by Simon Urbanek, came to my rescue. I spent around three months completing the first prototype, which captured all the steps I would take in a metabolomics data analysis. The web interface was designed to be quite “conversational” and acted as a playground in which users could freely explore many useful statistical analysis methods once their data had passed certain sanity checks, processing and normalization.
MetaboAnalyst (version 1.0) was published in 2009 in Nucleic Acids Research \citep{Xia_2009}. It enables a researcher with a basic understanding of metabolomics and statistics to perform data analysis and generate a comprehensive analysis report. It was also heavily used by other members of our metabolomics group and saved me a lot of time. My next focus was on functional analysis of metabolomics data. Using the same infrastructure, I developed tools for metabolite set enrichment analysis \citep{Xia_2010}, metabolomic pathway analysis \citep{12235}, as well as time-series data analysis \citep{Xia2011}. They were eventually merged under the umbrella of MetaboAnalyst (version 2.0) for ease of use and convenience of maintenance \citep{Xia_2012}. While I was pursuing my PhD in bioinformatics for metabolomics, the next-generation sequencing revolution was in full swing. In 2012, I received two postdoctoral fellowships, from the Canadian Institutes of Health Research (CIHR) and the Killam Trust, to work on next-generation sequencing in Bob Hancock’s laboratory at the University of British Columbia (UBC). While I was at UBC, MetaboAnalyst saw a steady increase in user traffic, and I felt obligated to maintain it and to keep addressing user requests. For instance, I added a biomarker analysis module to support a variety of common approaches that clinicians would like to perform. With growing popularity came signs of performance issues: many colleagues experienced significantly slow responses when they used MetaboAnalyst for teaching in a large class. I eventually decided to re-implement the software completely, with a particular focus on addressing the performance bottlenecks in both the Java and R functions. I also switched to Google Compute Engine (GCE) for hosting the web application. The result is MetaboAnalyst 3.0 \citep{Xia_2015}. The impact of this update turned out to be very significant.
Google Analytics showed that the number of submitted analysis jobs jumped from 500-800 per day to 5,000-8,000 per day, and server downtime was also reduced significantly. We are actively developing MetaboAnalyst 4.0 at the time of writing. The key features will be more transparent and reproducible analysis, better support for untargeted metabolomics, and integration with other omics through advanced statistics and network analysis.