The following is just a rough list of my immediate and stretch goals for the upcoming project:
use Ensembl Compara to determine orthologs and paralogs for zebrafish and mouse
Stick to the pipeline outlined in the Vilella et al. paper
use Gene Ontology to obtain Biological Process and Molecular Function info for mouse and zebrafish?
use same cutoffs to include only experimentally inferred annotations
rework the Clark code and then use it on my data set
create similar graphs and compare the results to the Clark paper
my theory: a purely mouse-to-zebrafish comparison should eliminate the experimental bias found in human vs. mouse comparisons, since mouse and zebrafish can be used for more similar experiments
Find RNA-seq data to work with (if it's already out there) as a further check
Fully eliminate authorship bias
normalize measures of function similarity with respect to background similarity
estimate frequencies of GO terms separately for each species?
Find a way to incorporate Phenoscape data into the comparison
find a good source of similar data for mice
figure out how to accurately and consistently compare features in an automated fashion
The above goals were created in early January 2014; they changed during the course of the
project. The final goals, set around early February, were:
Obtain a sample set of genes that relate a mouse ortholog to a set of zebrafish paralogs that resulted from the teleost duplication
obtain a full set of that data (possibly from Yves Van De Peer)
use Phenoscape to obtain ontological annotations for each gene
use scripts from Prishanti to calculate the functional similarity between orthologs and each paralog set, as well as between the paralogs.
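As a rough illustration of the last goal, here is a minimal Python sketch of one way functional similarity could be scored. It is not Prishanti's scripts, which are not shown here. It assumes each gene has a set of ontology term IDs and uses the Jaccard index as a stand-in similarity measure. The function names, gene names and term IDs are all made up.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two annotation sets (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compare_family(annotations, ortholog, paralogs):
    """Score the ortholog against each paralog, and the paralogs against each other."""
    ortho_vs_para = {
        p: jaccard(annotations[ortholog], annotations[p]) for p in paralogs
    }
    para_vs_para = {
        (p, q): jaccard(annotations[p], annotations[q])
        for p, q in combinations(sorted(paralogs), 2)
    }
    return ortho_vs_para, para_vs_para


# Hypothetical annotations: gene -> set of ontology term IDs.
example_annotations = {
    "mouse_gene": {"T1", "T2", "T3"},
    "zfish_a": {"T1", "T2"},
    "zfish_b": {"T3", "T4"},
}
ortho_scores, para_scores = compare_family(
    example_annotations, "mouse_gene", ["zfish_a", "zfish_b"]
)
```

A set-overlap score like this ignores the ontology hierarchy, so it is only a placeholder for whatever measure the real scripts use.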
I have had all of the mouse-to-zebrafish orthologs for a while. Recently I did
a bit of processing and narrowed the list down to just one-to-one orthologs, since
that seems to be what the Clark paper wants. This was done using editOrtholgs.pl. I have since
undone that change but kept the script in case it, or something like it, needs to be done again.
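For reference, the one-to-one narrowing can be sketched in Python roughly as follows. This is not the contents of editOrtholgs.pl. It assumes the orthologs are available as (mouse gene, zebrafish gene) pairs, and the function name and gene IDs are made up.

```python
from collections import Counter


def one_to_one(pairs):
    """Keep only pairs whose mouse gene and zebrafish gene each occur exactly once."""
    pairs = list(set(pairs))  # ignore exact duplicate rows
    mouse_counts = Counter(m for m, _ in pairs)
    fish_counts = Counter(z for _, z in pairs)
    return sorted(
        (m, z) for m, z in pairs
        if mouse_counts[m] == 1 and fish_counts[z] == 1
    )


# Hypothetical (mouse, zebrafish) ortholog pairs.
example_pairs = [
    ("mA", "zA"),   # one-to-one: kept
    ("mB", "zB1"),  # one-to-many: dropped
    ("mB", "zB2"),
    ("mC", "zC"),   # many-to-one with mD: dropped
    ("mD", "zC"),
]
kept = one_to_one(example_pairs)
```

Counting both columns matters here: counting only the mouse side would let many-to-one pairs such as mC/mD through.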
This is the section I have had the most issues with. I initially relied on a fetch-all-paralogs
method from Ensembl, but that cut out before it was done. So I went in and pulled paralogs for each
unchecked gene individually, thinking that I would add this new set to my pre-existing set. There
were two issues with that idea that I discovered recently. First, that method was also interrupted
prematurely by the server cutting my connection. Second, and more worrying, this gene-based paralog fetching
seemed to pull far more paralogs per gene. That worried me, since I want a consistent method of evaluating
what is and what is not a paralog. I wrote a test script that should allow me to limit
paralogs to only the most highly related pair. That script, zebraTopInParalogs.pl, was tested and works,
but it has proved unnecessary. On further checking, the fetch-by-gene method used in ___paralogsExtra.pl returns the
same paralog sets as the fetch-all method used in ___paralogs.pl. This was verified by spot-checking
several genes pulled from the fetch-all method.
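The spot check could be automated along these lines. This is a Python sketch under the assumption that each method's output has been loaded into a dictionary mapping a gene ID to its paralog IDs. The function name, key names and IDs are hypothetical.

```python
def compare_paralog_sets(fetch_all, fetch_by_gene, genes):
    """Return the genes whose paralog sets differ between the two methods."""
    mismatches = {}
    for gene in genes:
        a = set(fetch_all.get(gene, ()))
        b = set(fetch_by_gene.get(gene, ()))
        if a != b:
            mismatches[gene] = {"only_fetch_all": a - b, "only_by_gene": b - a}
    return mismatches


# Hypothetical results from the two methods: gene -> paralog IDs.
fetch_all = {"g1": {"p1", "p2"}, "g2": {"p3"}}
fetch_by_gene = {"g1": {"p2", "p1"}, "g2": {"p3", "p4"}}
diff = compare_paralog_sets(fetch_all, fetch_by_gene, ["g1", "g2"])
```

An empty result for the checked genes would mean the two methods agree on them. Running it over every gene present in both outputs would turn the spot check into a full check.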
I used the Gene Ontology website to get their annotation guide and the annotations for
mice and zebrafish. I also went through the annotations and selected only those
that have experimental support and focus on the processes outlined in the Clark paper, using
clean___Ontology.pl. All of this data and the scripts are stored in my Ontology folder.
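A filter of this kind might look like the Python sketch below. It is not the contents of clean___Ontology.pl. It assumes the annotations are in the tab-separated GAF format, where column 7 is the evidence code and column 9 is the aspect, and it keeps the six classic experimental evidence codes. Keeping only the Biological Process (P) and Molecular Function (F) aspects is my assumption about what "the processes outlined in the Clark paper" means. The example rows are placeholders.

```python
# Classic GO experimental evidence codes.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
# Assumption: keep Biological Process (P) and Molecular Function (F) only.
KEPT_ASPECTS = {"P", "F"}


def filter_gaf_lines(lines, codes=EXPERIMENTAL_CODES, aspects=KEPT_ASPECTS):
    """Return (symbol, GO ID, evidence, aspect) for rows passing both filters."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and GAF header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # malformed row
        symbol, go_id, evidence, aspect = fields[2], fields[4], fields[6], fields[8]
        if evidence in codes and aspect in aspects:
            kept.append((symbol, go_id, evidence, aspect))
    return kept


# Placeholder rows (not real annotations), first nine GAF columns only.
example_lines = [
    "!gaf-version: 2.0",
    "DB\tID1\tgeneA\t\tGO:0000001\tREF:1\tIDA\t\tP",
    "DB\tID2\tgeneB\t\tGO:0000002\tREF:2\tIEA\t\tP",
    "DB\tID3\tgeneC\t\tGO:0000003\tREF:3\tIMP\t\tC",
]
kept_annotations = filter_gaf_lines(example_lines)
```

In the example only geneA survives: geneB is electronically inferred (IEA) and geneC is a Cellular Component annotation.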
There has been a lot of hurry-up-and-wait for a long time, so I've been trying to prep myself for
future steps while I wait for programs to finish. There has been a lot of learning and relearning
Ensembl's API. I have also been learning Python and practicing with it for when I get the scripts.
I have also been reading papers on Phenoscape and Phenex, and am starting the process of learning all
the languages and APIs I'll need to work with Phenoscape.