Introduction

High Content Screening (HCS) is a method used in biological research and drug discovery to measure the phenotypic effect of substances such as small molecules, peptides and RNAi on cells \cite{Taylor2007}. Phenotypic changes of interest may include changes in cell morphology, or increases or decreases in the production of proteins. Traditional assay formats quantify one parameter, whereas an HCS assay is able to quantify hundreds of parameters. Modern HCS platforms can process a 384-well plate in minutes \cite{battich2013image}. The high-throughput nature of HCS has allowed it to become a means of carrying out reverse genetics on a large scale.

In a typical HCS workflow, cells are incubated with the substance and, after a period of time, images of the cells are obtained by means of high-throughput microscopy. The most common analysis technique involves labeling proteins of interest with a fluorescent tag and quantifying the phenotypic changes of interest using digital image analysis techniques. This automated analysis then allows for the identification of molecules affecting cell phenotype, which in turn can yield new drug targets and novel insights into the biology of cell pathways.

HCS has been extraordinarily successful both in drug discovery \cite{swinney2011were} and biological research, and as a result the number of papers documenting HCS research is increasing each year \cite{Singh2014}. Despite these successes and the increased adoption of HCS, it is believed that the information gained from these experiments is lagging behind its potential \cite{Singh2014}. One hypothesised reason for this lag is a focus on only one or two parameters in an HCS experiment, which far underutilises the ability of HCS to quantify hundreds of parameters from cellular images. The lag is also attributed to a lack of robustness in the statistical analysis of the results obtained. This has led to the belief that potentially valuable information has been needlessly discarded in some HCS experiments.

Our hypothesis is that the best way to harness this lost information is through clustering the images according to their phenotypic similarity. We hypothesise that this approach will enable the identification of unexpected or outlier phenotypes that may have been missed using the typical approach of focusing on only a small number of parameters. While it has been shown that clustering these images is possible, no tool currently exists that enables biologists and bioinformaticians to easily carry out such an analysis on an HCS dataset. Our project aims to address this gap by developing such a tool.

We describe our project, which currently has the working title HUSC (High Content Screening Unsupervised Clustering). HUSC will be an analysis pipeline that clusters HCS images according to their phenotypic similarity, allows for interactive exploration of the data and integrates gene function annotation tools. The data exploration and visualisation tools are intended to help bioinformaticians identify features of interest for future experiments, while the gene function annotation tools are intended to help biologists and bioinformaticians identify novel biochemical pathways and gene function relationships based on how the cell phenotypes are clustered together.

In this proposal, we briefly discuss HCS hardware and some of the software that is currently in frequent use for the analysis of data obtained from HCS. The image analysis and data clustering techniques are then discussed, followed by a mock-up of the web interface that HUSC will integrate.