ROUGH DRAFT authorea.com/3982

# Orthology conjecture rotation project

The following is just a rough list of my immediate and stretch goals for the upcoming project:

# Primary goals

• use Ensembl Compara to determine orthologs and paralogs for zebrafish and mouse

• Stick to the pipeline outlined in the Vilella et al. paper

• use Gene Ontology to obtain biological process and molecular function info for mouse and zebrafish?

• use same cutoffs to include only experimentally inferred annotations

• rework the Clark code and then use it on my data set

• create similar graphs and compare results to the Clark paper

• my theory: a purely mouse-to-zebrafish comparison should eliminate the experimental bias found in human vs. mouse comparisons, since mouse and zebrafish can be used for more similar experiments

# Stretch goals

• Find RNA-seq data to work with (if it's already out there) as a further check

• Fully eliminate authorship bias

• normalize measures of function similarity with respect to background similarity

• estimate frequencies of GO terms separately for each species?

• Find a way to incorporate Phenoscape data into the comparison

• find good source of similar data for mice

• figure out how to accurately and consistently compare features in an automated fashion

# Goal changes

The above goals were created in early January 2014; they changed during the course of the project. The final goals, set around early February, were:

• Obtain a sample set of genes that relate a mouse ortholog to a set of zebrafish paralogs that resulted from the teleost duplication

• obtain a full set of that data (possibly from Yves Van de Peer)

• use Phenoscape to obtain ontological annotations for each gene

• use scripts from Prishanti to calculate the functional similarity between orthologs and each paralog set, as well as between the paralogs.
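The last goal above, comparing each ortholog to its paralog set and the paralogs to each other, can be sketched as follows. This is only an illustration: it uses plain Jaccard similarity over sets of annotation terms, which is an assumption on my part and not necessarily the measure used in Prishanti's scripts. A semantic similarity measure that uses the ontology structure would replace `jaccard` here. The gene names and term labels are made-up placeholders.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two collections of annotation terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # two unannotated genes: treat as no evidence of similarity
    return len(a & b) / len(a | b)


def compare_ortholog_to_paralogs(annotations, ortholog, paralogs):
    """Score the ortholog against each paralog, and each pair of paralogs.

    annotations: dict mapping gene name -> set of annotation terms
    ortholog:    the mouse gene (hypothetical name in the example below)
    paralogs:    list of zebrafish paralogs of that ortholog
    """
    ortholog_vs_paralog = {
        p: jaccard(annotations[ortholog], annotations[p]) for p in paralogs
    }
    paralog_vs_paralog = {
        (p, q): jaccard(annotations[p], annotations[q])
        for p, q in combinations(paralogs, 2)
    }
    return ortholog_vs_paralog, paralog_vs_paralog


# Made-up example: one mouse gene and two zebrafish paralogs.
annotations = {
    "mouse_gene": {"term1", "term2", "term3"},
    "fish_a": {"term1", "term2"},
    "fish_b": {"term3", "term4"},
}
orth_scores, para_scores = compare_ortholog_to_paralogs(
    annotations, "mouse_gene", ["fish_a", "fish_b"]
)
print(orth_scores)
print(para_scores)
```

In this toy case the two paralogs share no terms with each other, but each shares some with the mouse gene, which is the kind of pattern the ortholog versus paralog comparison is meant to pick up.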

# Feb 6 Update

## Orthologs

I have had all of the mouse-to-zebrafish orthologs for a while. Recently I did a bit of processing and narrowed the list down to just one-to-one orthologs, since that seems to be what the Clark paper wants. This was done using editOrtholgs.pl. I have since undone that change but kept the script in case it, or something like it, needs to be done again.
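For reference, the narrowing to one-to-one orthologs can be sketched in Python as below. This is not editOrtholgs.pl itself, just a minimal illustration that assumes the orthologs are available as (mouse gene, zebrafish gene) pairs. The function name and gene names are hypothetical.

```python
from collections import Counter


def one_to_one_orthologs(pairs):
    """Keep only pairs whose mouse gene and zebrafish gene each occur once."""
    pairs = list(dict.fromkeys(pairs))  # drop exact duplicate rows, keep order
    mouse_counts = Counter(mouse for mouse, _ in pairs)
    fish_counts = Counter(fish for _, fish in pairs)
    return [
        (mouse, fish)
        for mouse, fish in pairs
        if mouse_counts[mouse] == 1 and fish_counts[fish] == 1
    ]


# Made-up example pairs.
pairs = [
    ("mouseA", "fishA"),   # one-to-one: kept
    ("mouseB", "fishB1"),  # one-to-many: dropped
    ("mouseB", "fishB2"),
    ("mouseC", "fishC"),   # many-to-one: dropped
    ("mouseD", "fishC"),
]
one_to_one = one_to_one_orthologs(pairs)
print(one_to_one)  # [('mouseA', 'fishA')]
```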

## Paralogs

This is the section I have had the most issues with. I initially relied on a fetch-all-paralogs method from Ensembl, but that cut out before it was done. So I went in and pulled paralogs for each unchecked gene individually, planning to add this new set to my pre-existing set. I recently discovered two issues with that idea. First, that method was also interrupted prematurely when the server cut my connection. Second, and more serious, the gene-based paralog fetching seemed to pull far more paralogs per gene. That worried me, since I want a consistent method of evaluating what is and is not a paralog. To address this I wrote a test script, zebraTopInParalogs.pl, that limits paralogs to only the most highly related pair. The script was tested and works, but it has proved unnecessary: the fetch-by-gene method used in ___paralogsExtra.pl returns the same paralog sets as the fetch-all method used in ___paralogs.pl. This was verified by spot-checking several genes pulled from the fetch-all method.
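The spot check described above could be automated along these lines. This is a sketch only, assuming the output of each method has already been loaded into a dictionary mapping a gene to its list of paralogs. The function name, dictionary layout, and gene names are hypothetical and do not reflect the actual output format of the Perl scripts.

```python
def compare_paralog_sets(fetch_all, fetch_by_gene, genes_to_check):
    """Return the genes whose paralog sets differ between the two methods."""
    mismatches = {}
    for gene in genes_to_check:
        from_all = set(fetch_all.get(gene, ()))
        from_gene = set(fetch_by_gene.get(gene, ()))
        if from_all != from_gene:
            mismatches[gene] = {
                "only_fetch_all": from_all - from_gene,
                "only_fetch_by_gene": from_gene - from_all,
            }
    return mismatches


# Made-up example: geneA agrees (order does not matter), geneB does not.
fetch_all = {"geneA": ["geneA_p1", "geneA_p2"], "geneB": ["geneB_p1"]}
fetch_by_gene = {
    "geneA": ["geneA_p2", "geneA_p1"],
    "geneB": ["geneB_p1", "geneB_p2"],
}
mismatches = compare_paralog_sets(fetch_all, fetch_by_gene, ["geneA", "geneB"])
print(mismatches)
```

An empty result over the checked genes would correspond to the outcome reported above, where both methods returned the same paralog sets.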

## Annotations

I used the Gene Ontology website to get their annotation guide and the annotations for mouse and zebrafish. I also went through the annotations and kept only those that have experimental support and that focus on the processes outlined in the Clark paper, using clean___Ontology.pl. All of this data and the scripts are stored in my Ontology folder.
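A minimal Python sketch of this kind of filtering is shown below. It is not clean___Ontology.pl. It assumes the annotations are in the tab-delimited GAF format distributed by the Gene Ontology, where column 7 holds the evidence code and column 9 the aspect (P, F, or C). It treats the GO experimental evidence codes (EXP, IDA, IPI, IMP, IGI, IEP) as "experimental support" and keeps only the biological process and molecular function aspects. The exact cutoffs in the Clark paper may differ, and the sample rows are made-up placeholders.

```python
# GO experimental evidence codes; assumed here to define "experimental support".
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
# P = biological process, F = molecular function (C = cellular component is dropped).
ASPECTS = {"P", "F"}


def filter_gaf_lines(lines, codes=EXPERIMENTAL_CODES, aspects=ASPECTS):
    """Return the GAF rows (as column lists) that pass both filters."""
    kept = []
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header/comment lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # malformed row
        evidence_code, aspect = cols[6], cols[8]
        if evidence_code in codes and aspect in aspects:
            kept.append(cols)
    return kept


def make_row(symbol, evidence_code, aspect):
    """Build a made-up GAF row; every identifier here is a placeholder."""
    return "\t".join([
        "DB", "ID:" + symbol, symbol, "", "GO:0000000", "REF:0",
        evidence_code, "", aspect, "", "", "gene", "taxon:0", "20140101", "DB",
    ])


sample = [
    "!gaf-version: 2.0",
    make_row("GeneA", "IDA", "P"),  # experimental, biological process: kept
    make_row("GeneB", "IEA", "F"),  # electronic annotation: dropped
    make_row("GeneC", "IMP", "C"),  # cellular component: dropped
]
kept = filter_gaf_lines(sample)
print([row[2] for row in kept])  # ['GeneA']
```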

## General Work

There has been a lot of hurry-up-and-wait for a long time, so I've been trying to prep for future steps while I wait for programs to finish. There has been a lot of learning and relearning of Ensembl's API. I have also been learning Python and practicing with it for when I get the scripts. I have also been reading papers on Phenoscape and Phenex, and I am starting to learn the languages and APIs I'll need to work with Phenoscape.

# Zebrafish duplication bibliography

Blomme, T., Vandepoele, K., De Bodt, S., Simillion, C., Maere, S., & Van de Peer, Y. (2006). The gain and loss of genes during 600 million years of vertebrate evolution. Genome Biology, 7(5), R43. doi:10.1186/gb-2006-7-5-r43

Brunet, F. G., Roest Crollius, H., Paris, M., Aury, J. M., Gibert, P., Jaillon, O., et al. (2006). Gene loss and evolutionary rates following whole-genome duplication in teleost fishes. Molecular Biology and Evolution, 23(9), 1808-1816. doi:10.1093/molbev/msl049

Chain, F. J., Ilieva, D., & Evans, B. J. (2008). Duplicate gene evolution and expression in the wake of vertebrate allopolyploidization. BMC Evolutionary Biology, 8, 43. doi:10.1186/1471-2148-8-43

Christoffels, A., Brenner, S., & Venkatesh, B. (2006). Tetraodon genome analysis provides further evidence for whole-genome duplication in the ray-finned fish lineage. Comparative Biochemistry and Physiology Part D: Genomics & Proteomics, 1(1), 13-19. doi:10.1016/j.cbd.2005.06.001

Christoffels, A., Koh, E. G., Chia, J. M., Brenner, S., Aparicio, S., & Venkatesh, B. (2004). Fugu genome analysis provides evidence for a whole-genome duplication early during the evolution of ray-finned fishes. Molecular Biology and Evolution, 21(6), 1146-1151. doi:10.1093/molbev/msh114

de Souza, F. S., Bumaschny, V. F., Low, M. J., & Rubinstein, M. (2005). Subfunctionalization of expression and peptide domains following the ancient duplication of the proopiomelanocortin gene in teleost fishes. Molecular Biology and Evolution, 22(12), 2417-2427. doi:10.1093/molbev/msi236

Li, C., Orti, G., Zhang, G., & Lu, G. (2007). A practical approach to phylogenomics: The phylogeny of ray-finned fish (Actinopterygii) as a case study. BMC Evolutionary Biology, 7, 44. doi:10.1186/1471-2148-7-44

McClintock, J. M., Carlson, R., Mann, D. M., & Prince, V. E. (2001). Consequences of Hox gene duplication in the vertebrates: An investigation of the zebrafish Hox paralogue group 1 genes. Development (Cambridge, England), 128(13), 2471-2484.

Ouedraogo, M., Bettembourg, C., Bretaudeau, A., Sallou, O., Diot, C., Demeure, O., et al. (2012). The duplicated genes database: Identification and functional annotation of co-localised duplicated genes across genomes. PLoS ONE, 7(11), e50653. doi:10.1371/journal.pone.0050653

Robinson-Rechavi, M., Marchand, O., Escriva, H., Bardet, P. L., Zelus, D., Hughes, S., et al. (2001). Euteleost fish genomes are characterized by expansion of gene families. Genome Research, 11(5), 781-788. doi:10.1101/gr.165601

Semon, M., & Wolfe, K. H. (2007). Rearrangement rate following the whole-genome duplication in teleosts. Molecular Biology and Evolution, 24(3), 860-867. doi:10.1093/molbev/msm003

Semon, M., & Wolfe, K. H. (2007). Reciprocal gene loss between Tetraodon and zebrafish after whole genome duplication in their ancestor. Trends in Genetics, 23(3), 108-112. doi:10.1016/j.tig.2007.01.003

Steinke, D., Hoegg, S., Brinkmann, H., & Meyer, A. (2006). Three rounds (1R/2R/3R) of genome duplications and the evolution of the glycolytic pathway in vertebrates. BMC Biology, 4, 16. doi:10.1186/1741-7007-4-16

Taylor, J. S., Braasch, I., Frickey, T., Meyer, A., & Van de Peer, Y. (2003). Genome duplication, a trait shared by 22000 species of ray-finned fish. Genome Research, 13(3), 382-390. doi:10.1101/gr.640303

Van de Peer, Y. (2004). Computational approaches to unveiling ancient genome duplications. Nature Reviews Genetics, 5(10), 752-763. doi:10.1038/nrg1449

Van de Peer, Y. (2004). Tetraodon genome confirms Takifugu findings: Most fish are ancient polyploids. Genome Biology, 5(12), 250. doi:10.1186/gb-2004-5-12-250

Van de Peer, Y., Taylor, J. S., & Meyer, A. (2003). Are all fishes ancient polyploids? Journal of Structural and Functional Genomics, 3(1-4), 65-73.

Woods, I. G., Wilson, C., Friedlander, B., Chang, P., Reyes, D. K., Nix, R., et al. (2005). The zebrafish gene map defines ancestral vertebrate chromosomes. Genome Research, 15(9), 1307-1314. doi:10.1101/gr.4134305

Taylor, J. S., Braasch, I., Frickey, T., Meyer, A., & Van de Peer, Y. (2003) is where I got my sample set.

Blomme, T., Vandepoele, K., De Bodt, S., Simillion, C., Maere, S., & Van de Peer, Y. (2006) is a possible source of a full set of zebrafish paralogs created by the teleost duplication.