Masters Proposal


High Content Screening (HCS) is a method used in biological research and drug discovery to measure the phenotypic effect of substances such as small molecules, peptides and RNAi on cells (Bray 2004). Phenotypic changes of interest may include changes in cell morphology, or increases or decreases in the production of proteins. Whereas traditional assay formats quantify a single parameter, a HCS assay is able to quantify hundreds of parameters. Modern HCS platforms can process a 384-well plate in minutes (Battich 2013). The high-throughput nature of HCS has allowed it to become a means of carrying out reverse genetics on a large scale.

In a typical HCS workflow, cells are incubated with the substance and, after a period of time, images of the cells are obtained by means of high-throughput microscopy. The most common analysis technique involves labelling proteins of interest with a fluorescent tag, after which phenotypic changes of interest are quantified using digital image analysis techniques. This automated analysis then allows for the identification of molecules affecting cell phenotype, which in turn can yield new drug targets and novel insights into the biology of cell pathways.

HCS has been extraordinarily successful in both drug discovery (Swinney 2011) and biological research, and as a result the number of papers documenting HCS research is increasing each year (Singh 2014). Despite these successes and the increased adoption of HCS, it is believed that the information gained from these experiments is lagging behind (Singh 2014). One hypothesised reason for this lag is a focus on only one or two parameters in a HCS experiment, which greatly underutilises the ability of HCS to quantify hundreds of parameters from cellular images. The lag is also attributed to a lack of robustness in the statistical analysis of the results obtained. This has led to the belief that potentially valuable information has been needlessly discarded in some HCS experiments.

Our hypothesis is that the best way to harness this lost information is through clustering the images according to their phenotypic similarity. We hypothesise that this approach will enable the identification of unexpected or outlier phenotypes that may have been missed using the typical approach of focusing on only a small number of parameters. While it has been shown that clustering these images is possible, there does not currently exist a tool that enables biologists and bioinformaticians to easily carry out such an analysis on a HCS dataset. Our project aims to address this gap by developing such a tool.

We describe our project, which currently has the working title HUSC (High Content Screening Unsupervised Clustering). HUSC will be an analysis pipeline that clusters HCS images together according to their phenotypic similarity, allows for interactive exploration of the data and integrates gene function annotation tools. The data exploration and visualisation tools are provided to help bioinformaticians identify features of interest for future experiments, while the gene function annotation tools are provided to help biologists and bioinformaticians identify novel biochemical pathways and gene function relationships based on how the cell phenotypes are clustered together.

In this proposal, we briefly discuss HCS hardware and some of the software that is currently in frequent use for the analysis of data obtained from HCS. The image analysis and data clustering techniques are then discussed, followed by a mock-up of the web interface that HUSC will integrate.

High Content Screening


A typical HCS platform consists of an automated microscope and a robotic arm for automatic sample preparation. Modern HCS platforms date back to 1997, when they were first introduced by Cellomics, Inc. (Taylor 2010).

Cell phenotypes are altered by means of gene silencing, a technique used to reduce expression of a particular gene (Hannon 2002). RNAi involves the delivery of double-stranded RNA molecules that activate the multi-protein RNA-induced silencing complex (RISC). When activated, RISC will silence a particular gene. RNAi is a popular means of gene silencing as it is able to target genes with a high degree of specificity, but it is also susceptible to off-target effects, where genes are unintentionally silenced (Jackson 2010). Much research goes into the identification and management of off-target effects.

A typical HCS assay is designed by treating cells in a 384-well plate with RNAi or other interfering agents. A number of cells will be designated as control samples. After some period of incubation, images of the cells are obtained and properties of interest are extracted with automated image analysis. Hundreds of parameters can be obtained from these images. It is the combination of fast imaging and automated analysis that gives HCS its power. Changes in cell phenotype measured in HCS include, but are not limited to, intracellular translocation, organelle structure changes, morphology changes and cell subpopulation distribution (Buchser 2004).

Modern HCS platforms include bundled software for both the storage/retrieval of image data and its subsequent analysis. There also exist free and open source packages for image analysis. In the following sections we briefly discuss some of the software currently available.



A HCS platform will include bundled proprietary software. Examples of such software include Thermo-Scientific’s HCS Studio 2.0, Molecular Devices’ MetaXpress, Becton Dickinson’s (BD) Pathway Software and Perkin Elmer’s Opera. Data obtained from a HCS platform need not be analysed with the bundled proprietary software: free and open source packages are also available, including CellProfiler (Carpenter 2006), PhenoRipper (Rajaram 2012) and HCS-Analyzer (Ogier 2012). Of particular interest is the PhenoRipper package, which takes a similar approach to our project in that it is built for the purpose of clustering imaging data according to phenotypic similarity, as opposed to quantifying parameters of interest. CellProfiler and PhenoRipper are examined in the following sections.


CellProfiler (Carpenter 2006) is a free and open source tool designed to enable biologists with no training in computer programming or image analysis to derive quantitative measurements from HCS experiments. CellProfiler is able to handle the image analysis stage of a HCS experiment from preprocessing of imaging data through to exporting the data in a spreadsheet format for analysis.

CellProfiler is accompanied by a companion tool, CellProfiler Analyst (Jones 2008). CellProfiler Analyst was built with the goal of giving researchers an easy-to-use method to visualise and filter the large amount of data generated in a HCS experiment.

CellProfiler already provides an easy-to-use tool to obtain and process HCS data; however, attempting to cluster images using CellProfiler is a more difficult task. The user must choose image features that will allow a clustering algorithm to group the images by phenotype (clustering algorithms are discussed in detail in Section \ref{sec:cluster}). This is certainly possible with the right domain knowledge and image analysis expertise, but would need to be repeated for each dataset a user wishes to cluster. Furthermore, clustering of the image data requires the application of clustering algorithms, which are not included in CellProfiler Analyst. The data would need to be analysed externally using R or Python packages.
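
To illustrate what such an external analysis might involve, the following is a minimal sketch in Python that reads a small per-image feature table, standardises the features and groups the images with k-means. It is an illustration under stated assumptions rather than a proposed implementation: the column names and values are invented and do not come from a real CellProfiler export, the helper functions are hypothetical, and k-means is used only as a simple stand-in for whichever clustering algorithm is eventually chosen.

```python
import csv
import io
import math

# Hypothetical per-image feature table. The column names and values are
# invented for illustration and are not taken from a real CellProfiler run.
EXAMPLE_CSV = """ImageNumber,Mean_Cell_Area,Mean_Nucleus_Intensity
1,210.0,0.31
2,205.0,0.29
3,198.0,0.33
4,420.0,0.72
5,415.0,0.69
6,431.0,0.75
"""


def load_features(handle):
    """Read the table and return (image ids, rows of float features)."""
    ids, rows = [], []
    for record in csv.DictReader(handle):
        ids.append(record.pop("ImageNumber"))
        rows.append([float(value) for value in record.values()])
    return ids, rows


def standardise(rows):
    """Z-score each column so that no single feature dominates the distance."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by 0.
    spreads = [math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1.0
               for col, m in zip(columns, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, spreads)] for row in rows]


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=50):
    """Lloyd's k-means with deterministic farthest-point seeding."""
    centres = [points[0]]
    while len(centres) < k:
        # Seed the next centre with the point farthest from all current centres.
        centres.append(max(points,
                           key=lambda p: min(distance(p, c) for c in centres)))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each image joins its nearest centre.
        labels = [min(range(k), key=lambda j: distance(p, centres[j]))
                  for p in points]
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


ids, rows = load_features(io.StringIO(EXAMPLE_CSV))
labels = kmeans(standardise(rows), k=2)
for image_id, label in zip(ids, labels):
    print(image_id, label)
```

Even in this toy case the user has to decide which columns to include, how to scale them and how many clusters to request, and these choices would have to be revisited for every new dataset.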


PhenoRipper (Rajaram 2012) is a free and open source tool that groups images obtained from microscopy experiments together according to phenotypic similarity. This process is relatively easy to carry out, and using the software does not require any knowledge of image analysis. PhenoRipper uses a visual codebook technique to characterise images (this method is discussed in greater detail in Section \ref{sec:vizcodebooks}).

The images are plotted together on a two- or three-dimensional plot by means of dimensionality reduction techniques such as Principal Components Analysis and Multidimensional Scaling. While this does offer a way to estimate the similarity and dissimilarity of phenotypes, no clustering algorithm is used to group the images together. This method also relies heavily upon the human eye, and may be unsuitable for image datasets numbering in the thousands.
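
As a rough illustration of the dimensionality reduction step, the sketch below projects a small set of per-image feature vectors onto their first two principal components using power iteration with deflation. It is not PhenoRipper's implementation: the feature values and function names are hypothetical, and in practice a numerical library would normally be used in place of this hand-written routine.

```python
import math

# Hypothetical feature vectors, one row per image (values are invented).
FEATURES = [
    [1.0, 2.0, 0.5],
    [1.2, 1.9, 0.4],
    [0.9, 2.1, 0.6],
    [5.0, 6.1, 0.5],
    [5.2, 5.8, 0.6],
    [4.9, 6.0, 0.4],
]


def centre(rows):
    """Subtract the column means so that every feature has mean zero."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of rows that are already centred."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenvector(matrix, steps=500):
    """Power iteration: repeatedly apply the matrix and renormalise."""
    d = len(matrix)
    # Fixed, non-uniform start vector; assumed not to be orthogonal to the
    # dominant eigenvector, which holds for the example data used here.
    vec = [float(i + 1) for i in range(d)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(steps):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the eigenvalue for the (unit-length) vector.
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(d))
                for i in range(d))
    return value, vec


def pca_project(rows, n_components=2):
    """Project rows onto their leading principal components."""
    centred = centre(rows)
    cov = covariance(centred)
    d = len(cov)
    axes = []
    for _ in range(n_components):
        value, vec = top_eigenvector(cov)
        axes.append(vec)
        # Deflation: remove the component just found so that the next
        # power iteration converges to the following component.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    return [[sum(x * a for x, a in zip(row, axis)) for axis in axes]
            for row in centred]


coords = pca_project(FEATURES)
for point in coords:
    print([round(c, 3) for c in point])
```

The output is only a set of coordinates for plotting. Deciding which images belong together is still left to whoever inspects the plot.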