Corey Hudson edited Methods.md  over 10 years ago

Commit id: 701bd172ed699158003dad3bb9bc9726be16a1e6


A broad survey of translation initiation motifs was carried out using both the TAIR10 _Arabidopsis thaliana_ genome build \cite{eyer_Muller_Ploetz_et_al__2007} and the IRGSP 1.0 Japanese Rice Genome \cite{rtz_Tanaka_Wu_Zhou_et_al__2013}. To capture translation initiation motifs as well as possible leader peptide sequences, the gene description GFF files were used to extract the 25 bases before and the 18 bases after the start codon of each gene in each genome build. Because of differences in the definition of gene, CDS, mRNA and similar terms, some classes of records did not give the expected start codon in the first three bases and were rejected.

###Chloroplast survey and motif extraction

Because of the small number of genes in each chloroplast genome, a broad collection of motifs was also extracted from chloroplasts. The GenBank chromosome sequences were scraped from the Chloroplast DB webpage \cite{Cui_2006}, and motifs were extracted from them using BioPython \cite{cock2009biopython}. This yielded 11810 initiation motifs from 109 organisms. The start codon position was consistent across these motifs, with translation initiation generally occurring 8 nucleotides downstream of the ribosome binding site, as expected.

###Transcriptome Data

As we could find no publicly available matched proteome/transcriptome datasets, we obtained 10 arrays each from _Arabidopsis_ leaf and rice leaf. All replicate arrays for noon-time leaf expression in adult plants were obtained from Gene Expression Omnibus \cite{ippy_Sherman_Holko_et_al__2012} via GEOSearch \cite{vis_Stephens_Meltzer_Chen_2008}.
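
For illustration, the flanking-window extraction described for the nuclear genome survey (25 bases before and 18 bases after the start codon, with rejection of records that lack the expected start codon) might be sketched as follows. This is a minimal sketch, not the code used in the study: the function names, the `start_codons` parameter, and the choice to count the 18 bases from the end of the start codon are all assumptions, and coordinates are taken to be 1-based and inclusive, as in GFF.

```python
# Illustrative sketch only. All names here are hypothetical and are not
# taken from the original analysis.

UPSTREAM = 25    # bases kept before the start codon
DOWNSTREAM = 18  # bases kept after the start codon (assumed: after its 3rd base)

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_initiation_motif(chrom_seq, cds_start, cds_end, strand,
                             start_codons=("ATG",)):
    """Return upstream + start codon + downstream bases, or None if rejected.

    cds_start and cds_end are 1-based inclusive coordinates of the coding
    feature on the chromosome, as given in a GFF record.
    """
    if strand == "+":
        # The start codon begins at cds_start (0-based index cds_start - 1).
        lo = cds_start - 1 - UPSTREAM
        hi = cds_start - 1 + 3 + DOWNSTREAM
    elif strand == "-":
        # On the minus strand the start codon ends at cds_end, and
        # "upstream" lies at higher chromosome coordinates.
        lo = cds_end - 3 - DOWNSTREAM
        hi = cds_end + UPSTREAM
    else:
        return None

    # Reject windows that would run off either end of the chromosome.
    if lo < 0 or hi > len(chrom_seq):
        return None

    window = chrom_seq[lo:hi]
    if strand == "-":
        window = reverse_complement(window)
    window = window.upper()

    # Reject records whose first three coding bases are not a start codon.
    if window[UPSTREAM:UPSTREAM + 3] not in start_codons:
        return None
    return window
```

Each accepted motif is 46 bases long (25 + 3 + 18), with the start codon at positions 26 to 28. A record is returned as `None` either when its window would extend past the end of the sequence or when the expected start codon is absent, which mirrors the rejection of inconsistent record classes described above.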