phenoscape

As of this writing, Jim is still compiling a starter package for me so I can learn how Phenoscape is set up and how to use OWL to access it in the way that best suits my project. Prishanti will also be giving me scripts after the lab meeting. In the meantime I have been reading up on OWL and working through the main OWL tutorial, though it is a bit vague. I expect to get a lot more help from Jim's scripts, since they will have examples specific to my research.

ensembl access

Every now and again I mess with Ensembl's API to learn about gene trees and things like that, but my scripts are always very slow, so I got in contact with Steven Fishback, one of the guys who oversees killdevil. We did a bit of brainstorming and now I have several ways I could speed up data retrieval. The current ideas are:

- download the Ensembl database to killdevil
- run scripts from the login node
- use useastdb.ensembl.org as the host rather than ensembldb.ensembl.org

I did not attempt the first idea. Running scripts on the login node does seem to speed them up, but it prevents running multiple scripts in parallel on the queues. Changing the host also seems to be very effective, although I have had sporadic issues with useastdb not connecting properly and reporting that it can't find very simple databases. When this occurs I switch back to the traditional ensembldb host.
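The manual switch between hosts could be automated with a small fallback helper. The following is only a sketch in Python, not the Ensembl API itself: `connect_with_fallback` and the `connect` callable are hypothetical names, and `connect` stands in for whatever call actually opens the database connection. Only the two host names come from the notes above.

```python
from typing import Callable, Sequence, Tuple, TypeVar

Conn = TypeVar("Conn")

# Host names taken from the notes above; the order encodes preference
# (faster mirror first, traditional host as the fallback).
ENSEMBL_HOSTS = ("useastdb.ensembl.org", "ensembldb.ensembl.org")


def connect_with_fallback(
    connect: Callable[[str], Conn],
    hosts: Sequence[str] = ENSEMBL_HOSTS,
) -> Tuple[str, Conn]:
    """Try each host in order; return (host, connection) for the first success.

    `connect` is a hypothetical callable supplied by the caller. It should
    raise an exception when the host is unreachable or a database is missing.
    """
    errors = []
    for host in hosts:
        try:
            return host, connect(host)
        except Exception as exc:  # broad on purpose: failure modes vary by driver
            errors.append(f"{host}: {exc}")
    # Every host failed, so report all of the collected errors at once.
    raise ConnectionError("all hosts failed: " + "; ".join(errors))
```

Returning the host alongside the connection makes it easy to log which server actually answered, which should help track how often useastdb is misbehaving.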

teleost duplication

I think I have found some good papers/datasets. This paper, http://genome.cshlp.org/content/13/3/382.full, is specifically about identifying ~50 paralog pairs in zebrafish from the teleost duplication. Table 1 lists all of these genes. This seems like a great test set that I could use for my first forays into Phenoscape. Furthermore, they have three different methods for identifying paralogs resulting from the teleost duplication. Every gene was identified by at least one method, but some were identified by multiple. It may be worth limiting the dataset based on how the genes were identified to see if that affects their functional similarity scores.

I also found http://genomebiology.com/2006/7/5/R43, which is a very good overview of the three WGDs in vertebrates, with special focus on the teleost WGD. Judging by the experiments they perform and their methods, they almost certainly have a very simple way of determining which paralogs are a result of the teleost WGD, or have that dataset laid out already. However, that data is not posted online. The paper states that if I want the data I would have to email them. So now the question is: do I take my chances with them and hope they respond promptly, or do I keep looking to see if there is a more easily accessible dataset?

Side note: I have also found two databases that specialize in duplicated genes, but they do not assign duplications to any specific time, so they don't appear any more useful than Ensembl.
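Limiting the Table 1 pairs by how they were identified could look something like the sketch below. Everything here is an assumption for illustration: the `ParalogPair` structure, the function names, and the placeholder gene and method labels are hypothetical and are not taken from the paper.

```python
from typing import Dict, FrozenSet, List, NamedTuple, Sequence


class ParalogPair(NamedTuple):
    gene_a: str
    gene_b: str
    methods: FrozenSet[str]  # labels of the methods that identified this pair


# Placeholder records only; real entries would be transcribed from Table 1.
EXAMPLE_PAIRS = [
    ParalogPair("geneA_1", "geneA_2", frozenset({"method_1"})),
    ParalogPair("geneB_1", "geneB_2", frozenset({"method_1", "method_2"})),
    ParalogPair("geneC_1", "geneC_2",
                frozenset({"method_1", "method_2", "method_3"})),
]


def filter_by_support(pairs: Sequence[ParalogPair],
                      min_methods: int = 1) -> List[ParalogPair]:
    """Keep pairs identified by at least `min_methods` methods."""
    return [p for p in pairs if len(p.methods) >= min_methods]


def group_by_support(pairs: Sequence[ParalogPair]) -> Dict[int, List[ParalogPair]]:
    """Bucket pairs by how many methods identified them."""
    groups: Dict[int, List[ParalogPair]] = {}
    for pair in pairs:
        groups.setdefault(len(pair.methods), []).append(pair)
    return groups
```

Grouping rather than only filtering would let the functional similarity scores be compared across support levels in a single pass.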

clark scripts

I went over the scripts that Clark sent us and noticed something interesting: he only works with homologs that have over 50 percent identity. This is set in MAIN_homology_maker.m in the fish folder. I'm not sure whether that changes anything, but it seemed noteworthy.
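To reproduce that cutoff in my own scripts, a filter along these lines might do. This is a Python sketch of the idea only, not a translation of Clark's MATLAB code: the record layout and the name `filter_by_identity` are assumptions, and "over 50 percent" is read here as strictly greater than 50.

```python
from typing import Iterable, List, Tuple

# Assumed record layout: (gene_id, homolog_id, percent_identity)
Homolog = Tuple[str, str, float]


def filter_by_identity(homologs: Iterable[Homolog],
                       min_identity: float = 50.0) -> List[Homolog]:
    """Keep homologs whose percent identity is strictly above `min_identity`."""
    return [h for h in homologs if h[2] > min_identity]
```

Keeping the threshold as a parameter makes it easy to check whether the results are sensitive to the 50 percent choice.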