Molecular tools for synthetic biology in plants: a first generation open bioinformatics workshop.

Ron Shigeta,
Niranjana Nagarajan,
Shriram Bharath,
Wilifred Tang,
Tony Hecht,
Alex Alekseyenko,
Bryce Wolfe,
Corey Hudson,
Jamey Kain,
Urvish Parikh,
Scott Fay,
Kyle Taylor

Loosely sponsored by Counter Culture Labs and Berkeley Bio Labs

Please address correspondence to

Abstract Synthetic biology has had profound effects on human life. It has provided more effective anti-malarial medicine, cheaper insulin, new useful bio-materials, and greener biofuels. However, much remains to be learned in order to synthesize proteins more efficiently. To explore the potential of the DIY biology movement to engage in meaningful synthetic biology bioinformatics research, we developed a bioinformatics workshop to study determinants of protein expression levels in plants. We extracted possible ribosome binding and translation initiation sequences and looked for correlations with experimentally determined protein levels, using publicly available data sets for the widely studied plants _Oryza sativa_ and _Arabidopsis thaliana_. The working group was open to the public and met every other week for 3 hours, typically starting with a short, relevant presentation followed by hands-on data work. We aim to develop, experimentally validate, and publish our consensus sequences, anticipating that our work will be useful for plant synthetic biology research. We hope our experience will serve as a model for future community projects that serve the dual purpose of educating curious members of the public while also generating useful scientific results.


Advances in sequencing technology have produced an avalanche of biological data over the past 12 years. The bottleneck in discovery has consequently shifted from data generation to data analysis, suggesting that much data is not used to its full potential (Lockhart 2000).

Crowdsourcing is one technique to gain more insight from existing biological data. Putting the diverse eyes and hands of the general public to the purpose of bioinformatics is not new (Good 2013) (Marbach 2012); examples include protein (Lane 2012) and RNA folding [], and both paid (Ingenuity® Systems) and unpaid (Hingamp 2008) curation of literature.

Rather than approach a problem strictly as professionals, we developed an Open Source DIY workshop where scientists and the public worked together to tackle a synthetic biology project resulting in a publishable outcome. The problem to be solved would need data from completely open sources and not require difficult analysis. A modest goal was set to do a survey of plant translation initiation motifs, aiming to create an open source parts list for controlling translation in metabolic engineering and synthetic biology. Working meetings were posted through Counter Culture Labs and Berkeley Bio Labs (groups with >100 members each) on and met every week or two over three months.

Plants offer many advantages as systems for fine-tuned biological engineering [e.g., modification to enhance production of economically valuable terpenoids (Moses 2013), modification of lignin biosynthesis to expedite biofuel synthesis (Li 2008)]. There is a paucity of published information, however, on how to control sets of genes working in concert. Use of small sequence motifs as ribosome binding site (RBS) parts for synthetic biology has been proposed in bacteria [ (Salis 2009) see also:] and similar parts have been produced for yeast []. Estimates for RBS parts in prokaryotic systems show that the translation level of a gene can be shifted by more than an order of magnitude, indicating their potential utility in synthetic biology projects. Generating an estimate of the regulatory power of plant translation initiation motifs was thus seen as a useful goal for our project.

In most eukaryotic plant genes the 5' cap of the mRNA transcript acts as the ribosome binding site and the Kozak sequence acts as the signal for translation initiation. Due to the bacterial origins of the chloroplast, transcripts of genes encoded within the chloroplast genome contain distinct consensus sequences in comparison to transcripts from the nucleus. Instead of the 5' cap there is a short motif called the Shine-Dalgarno sequence, where the ribosome binds and then initiates translation, generally 8 nucleotides downstream, though this distance varies. Although there has been some experimental work on ribosome binding sites and Kozak sequences in plants [refs, perhaps Lutcke et al EMBO J 1987], genomic-scale surveys have not been performed.

Here we use publicly available, combined RNA- and protein expression data for both nuclear and chloroplast genes to estimate the power of the ribosome binding and translation initiation sequence motifs to initiate translation. These are initial results; experimental confirmation of the motifs will follow.


Plant Genome Survey and motif extraction.
A broad survey of the translation initiation motifs from both the TAIR10 Arabidopsis thaliana genome build (Swarbreck 2007) and the IRGSP 1.0 Japanese Rice Genome (Kawahara 2013) was carried out. In order to capture translation initiation motifs as well as possible leader peptide sequences, the gene description GFF files were used to extract the 25 bases before and 18 bases after the start codon of each gene for each genome build. The terms "CDS" or "mRNA" were used to extract protein coding regions. With the data from the rice genome, we were unable to separate coding sequences in all three possible reading frames; we therefore excluded coding sequences that did not initiate with the canonical "ATG" start codon.
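The window extraction described above can be sketched in a few lines of Python. This is a minimal illustration, not the workshop's actual script: the function names are hypothetical, coordinates are assumed to be 1-based inclusive as in GFF, and "18 bases after the start codon" is assumed to mean the 18 bases following the ATG.

```python
# Hypothetical helper names; only the 25/18 window sizes come from the text.
UPSTREAM = 25      # bases kept before the start codon
DOWNSTREAM = 18    # bases kept after the start codon (assumed: after the ATG)
START_LEN = 3

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement an upper- or lower-case DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def initiation_window(chrom_seq, cds_start, cds_end, strand):
    """Return upstream + start codon + downstream for one gene.

    cds_start and cds_end are the 1-based inclusive coordinates of the
    first coding feature. Returns None if the window runs off the sequence.
    """
    if strand == "+":
        codon = cds_start - 1                  # 0-based index of the start codon
        lo = codon - UPSTREAM
        hi = codon + START_LEN + DOWNSTREAM
    else:
        codon = cds_end - START_LEN            # start codon sits at the high end
        lo = codon - DOWNSTREAM
        hi = codon + START_LEN + UPSTREAM
    if lo < 0 or hi > len(chrom_seq):
        return None
    window = chrom_seq[lo:hi].upper()
    return window if strand == "+" else reverse_complement(window)


def has_canonical_start(window):
    """True if the extracted window carries ATG at the expected position."""
    return window[UPSTREAM:UPSTREAM + START_LEN] == "ATG"
```

Filtering the extracted windows with `has_canonical_start` mirrors the exclusion of sequences that do not begin with ATG.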

Chloroplast survey and motif extraction.
Because of the small number of genes in the chloroplast, a broad collection of motifs was also extracted for chloroplasts. The GenBank chromosome sequences were scraped from the ChloroplastDB webpage (Cui 2006) and used to extract motifs using BioPython (Cock 2009). This yielded 11810 initiation motifs from 109 organisms, which gave good consistency in the start codon, with translation initiation generally occurring 8 nucleotides downstream of the ribosome binding site, as expected.

Transcriptome Data
As we could find no publicly available matched proteome / transcriptome datasets, we obtained 10 arrays each from Arabidopsis leaf and rice leaf. All replicate arrays for noon-time leaf expression in adult plants were obtained from Gene Expression Omnibus (Barrett 2012), via GEOSearch (Zhu 2008).

Table [1] lists the data sets that were chosen in the Workshop session.


Overall, 10 arrays including replicates of four separate measurements were downloaded as Affymetrix Arabidopsis Genome ATH1 and Rice Genome Array CEL files. These were scaled with the MAS5 algorithm (Lim 2007) using the 'affy' Bioconductor library (Gautier 2004) in R (R Core Team 2013). The resulting data frame was reduced to mean, median and standard deviation estimates for each probe set, separately for rice and Arabidopsis. The mean measurement standard deviation was 71% and the median differed from the mean by 19%, indicating a sample variance that is acceptable where a doubling of intensities is considered significant.
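The reduction of replicate arrays to per-probe-set summaries was done in R; the sketch below shows the same idea in Python for illustration. The function name, the input layout (a dict of probe set identifier to intensities), and the reading of the percentages as relative to the mean are assumptions, not details from the workshop scripts.

```python
import statistics


def summarize_probe_sets(intensities):
    """Reduce replicate intensities to mean, median and SD per probe set.

    intensities: dict mapping a probe set identifier to a list of scaled
    intensities, one per array (hypothetical input layout).
    """
    summary = {}
    for probe_id, values in intensities.items():
        mean = statistics.mean(values)
        median = statistics.median(values)
        # Sample standard deviation needs at least two arrays.
        sd = statistics.stdev(values) if len(values) > 1 else 0.0
        summary[probe_id] = {
            "mean": mean,
            "median": median,
            "sd": sd,
            # Relative spread, assumed to be what the percentages report.
            "cv": sd / mean if mean else float("nan"),
            "median_vs_mean": abs(median - mean) / mean if mean else float("nan"),
        }
    return summary
```

Averaging `cv` and `median_vs_mean` over all probe sets would give summary figures of the kind quoted above, under the stated assumption about how those percentages were defined.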

Proteome Data
The Rice Proteome Project has a comprehensive set of quantitative proteome estimates from 2D SDS-PAGE gels, covering different stages of plant growth and portions of the plant as well as an organelle survey. Quantitation from gel densitometry, MASCOT scores and UniProt associations were downloaded as tables (Tanaka 2004).

For Arabidopsis, only a few hundred measurements were found across multiple sources; these did not cover the organelles explicitly and represented less than 10% of the known leaf proteome. As the data proved to be inadequate for this study, the Arabidopsis survey had to be set aside.

UniProt identifiers were mapped to rice probe set identifiers. Of the 554 UniProt identifiers (split among 123 chloroplast, 235 mature leaf, and 196 seedling leaf probes), 236 could be mapped directly to probe sets using the Rice Coexpression Database (Sato 2013). The remaining UniProt identifiers were mapped manually by BLASTP searches of the UniProt sequences against the Oryza sativa Nipponbare reference genome (Kawahara 2013).

Translation Initiation Estimation for motifs
Relating each rice gene to its protein in the proteome set and to its microarray probe set required several data sources. The probe set annotation data for the Rice IVT Expression Array was extracted from the Probe Set Annotation CSV file provided by Affymetrix (Liu 2003). Because UniProt accessions drift over time, the Rice Proteome, which was generated circa 2004, had no protein accessions that are still current in UniProt. Data relationships to gene names and probe sets were therefore assembled through multiple processes. A review of archival Rice Genome Array annotations recovered about 50% of the probe set mappings we needed; the rest were recovered manually by searches of and if necessary, BLAST alignment of nucleotide sequences against the Rice Genome at

For genes which had proteome protein concentration estimates, the translational coefficient, \(\Theta\), for a given gene was estimated as the ratio of the protein to the mean RNA concentration as estimated by the microarray intensity.

\[\Theta = \frac{[\mathrm{Protein}]}{[\mathrm{mRNA}]} \tag{1}\]

In order to reduce the influence of outliers, the mRNA concentration was taken as the median of the microarray values from the 10 data sets. This reduced the range of the values by 2 logs compared to taking the mean microarray probe set intensity.
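As a minimal sketch of this estimate, assuming one protein abundance value and a list of replicate array intensities per gene (the function name and the toy numbers are hypothetical), the ratio with a median denominator can be written as:

```python
import statistics


def translational_coefficient(protein_abundance, array_intensities):
    """Estimate Theta = [Protein] / [mRNA] for one gene.

    The mRNA level is taken as the median of the replicate microarray
    intensities, which damps the effect of a single outlying array.
    """
    mrna = statistics.median(array_intensities)
    if mrna <= 0:
        raise ValueError("mRNA estimate must be positive")
    return protein_abundance / mrna


# Toy values only: one outlying array barely moves the median-based estimate.
replicates = [10.0, 12.0, 11.0, 500.0]
theta_median = translational_coefficient(50.0, replicates)
theta_mean = 50.0 / statistics.mean(replicates)
```

With these toy values the mean-based ratio is more than ten times smaller than the median-based one, which illustrates why a single outlying array can stretch the range of the estimates.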


Genome Surveys
Surveys of the nuclear chromosomes of Arabidopsis thaliana and Oryza sativa japonica yielded thousands of sequence motifs. A conventional logo survey (Crooks 2004) shows the expected Kozak sequence in the nuclear genes (see Figure 1). In the case of the chloroplast chromosome, since the Shine-Dalgarno sequence does not have a fixed location with respect to the start codon (Hirose 2004), the sequence logo does not show any appreciable signal (see Figure 2).

Figure 1: Sequence Logo of Chromosome 1 of Oryza sativa japonica, derived from 2134 sequences, restricted to those initiating with an ATG codon. This logo shows a canonical Kozak motif surrounding the initiating ATG. The X-axis represents the nucleotide position 20 bases upstream and 20 bases downstream of the ATG initiation codon. Some information in the wobble bases (third position) shows in the coding portion of the sequence. The other chromosomes were similar.

Figure 2: Sequence logo of chloroplast translation initiation motifs. The X-axis represents the nucleotide position 20 bases upstream and 20 bases downstream of the ATG initiation codon. Bias in the wobble base of the codons is much more pronounced in this logo since only 81 sequences were available to analyze in Arabidopsis chloroplasts.
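The height of each column in a sequence logo reflects the information content at that position. The sketch below computes it as the maximum entropy for a four-letter alphabet minus the observed entropy, in bits. It is an illustration only: the function name is hypothetical, and it omits the small-sample correction that logo tools may apply, which matters for small sets such as the 81 chloroplast sequences.

```python
import math
from collections import Counter


def information_content(motifs, alphabet="ACGT"):
    """Per-position information content (bits) of equal-length motifs.

    Uses log2(len(alphabet)) minus the Shannon entropy of the observed
    base frequencies; no small-sample correction is applied.
    """
    length = len(motifs[0])
    if any(len(m) != length for m in motifs):
        raise ValueError("all motifs must have the same length")
    max_bits = math.log2(len(alphabet))
    bits = []
    for pos in range(length):
        # Count only unambiguous bases at this position.
        counts = Counter(m[pos] for m in motifs if m[pos] in alphabet)
        total = sum(counts.values())
        if total == 0:
            bits.append(0.0)
            continue
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        bits.append(max_bits - entropy)
    return bits
```

A fully conserved position such as the T and G of the start codon scores 2 bits, while a position with evenly mixed bases scores 0.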

Translation Initiation Estimates

The relative power estimates for the proteome-to-transcript ratio span about 12 natural-log units (see Figure 3 below), a roughly 165,000-fold range. The average value on this log scale is -2.4, and the distribution is asymmetrical, with a greater range for enhancements to protein production (\(\Theta\) > 1).

The correlation between mRNA and protein abundance turned out to be poor: the mRNA and proteome scores had a correlation of 0.12. Several factors are therefore likely to influence both of the quantities that go into \(\Theta\), which indicates that the model is too simple.

Figure 3: Histogram of relative chloroplast protein to mRNA abundance ratio (Translation Initiation Power). Using the median value of the microarray, the ratio varied 59,000-fold.
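Two of the quantities reported here are simple to reproduce: the fold range implied by a spread of natural-log values, and the correlation between mRNA and protein scores. The sketch below assumes a Pearson correlation, which the text does not specify, and uses hypothetical function names.

```python
import math


def fold_range(log_values):
    """Fold change between the largest and smallest value, given natural logs."""
    return math.exp(max(log_values) - min(log_values))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

For reference, a spread of 12 natural-log units corresponds to exp(12), about 1.63e5, and a spread of 11 units to exp(11), about 5.99e4, which are close to the fold ranges quoted in the text and in the caption of Figure 3.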

Next Steps

Though the workshop has performed some novel analyses, this is preliminary work. It is clear that the estimate of translation initiation has a tremendous amount of uncertainty associated with it. Microarray probe sets are not directly comparable with each other, as the specific sequences of the probes vary in their target affinity.

An abundance of cell processes can affect the actual amount of protein produced compared to the mRNA reported by a microarray. Just a few of these are nonsense-mediated decay, inhibitory RNA, post-translational editing, protein sorting amongst cellular compartments, and secondary structure in the mRNA.

Still, for the largest and smallest \(\Theta\) values, the values determined might give some correlation with strong and weak translation. We will next test the leader sequences with the largest translation initiation power in vivo in collaboration with the Glowing Plant project. To this end we will take the motifs for the 10 largest and some smaller \(\Theta\) values and install each sequence into a plasmid that can be validated in a plant cell by quantitation of fluorescence from a GFP versus a control construct with its current constitutive motif sequence. When their relative strengths have been determined, the parts themselves will be placed in the GoldenBraid public repository (Sarrion-Perdigones 2013).

The collection of motifs will also enable us to examine chloroplast Shine-Dalgarno sequences and their relative effects on translation.

Open Workshop

One of us (RS) initiated the Open Workshop as an experiment to bring together the populations of curious laymen, experienced wet biologists, and software engineering talent in the East Bay area, and all three of these groups were represented in the attendees of the workshop. In addition, several working bioinformaticians contributed.

The project was structured to give an introduction and purpose to looking at a variety of publicly available biological data. The first five meetings were each spent on a category of biological data: chromosomal sequences; individual open reading frames; microarray data; quantitative proteomics data from 2D Polyacrylamide Gel Electrophoresis; and quantitative proteome data from Gas Chromatography / Mass Spectrometry. In each of these sessions, data was gathered from public sources and participants worked with it hands-on. Attendance ranged from 25 to 30 participants. As a public-scientific interface event, hands-on work with data and computers was quite engaging, and several useful scripts were written to process the data in Python and R.

The following two months of biweekly data analysis sessions were less well attended, with an average of 2-4 participants. Possible reasons for this decline include lack of understanding of the subject matter or of the technical skills needed to fully participate, inability to commit to an extended project, and unclear direction or incentives to continue. The more open-ended nature of data interpretation and analysis is also difficult to fit into an introductory course; newcomers to biology found it hard to engage with these tasks.

Future workshops may be structured into beginner, intermediate and advanced levels that would be more accessible to participants from diverse educational backgrounds and will likely be shorter in length to reduce attrition. Another idea is to take on a project with the sole goal of doing that project, rather than anticipating publishable results. As an experiment, the workshop did succeed in bringing together a range of talent and covered a broad set of biological data.

Slides for these sessions are available at When we have completed screening out parts, scripts, data collected and analysis for this project will be made available here:

The authors would like to thank SudoRoom, a tech Maker space in downtown Oakland, CA, for physically hosting the workshop. We would also like to thank the many other individuals who came to the workshop at one time or another. We didn't get everyone's full name, but thanks include: Felicia Betancourt, Ryan Behthencourt, Jack Cunha, Ruchira Datta, Timon D'Essarviard-Aatenhejm, Cristina Deptula, A Dangerfield, N Lynne Fix, Brian Gordon, Carl Gorringe, Louis Huang, Rajat Jain, Matt Jungert, Patrick O'Connor, Marcus Owens, Ken Ozburn, Barry Levine, Thomas Levine, Troy Massey, Ahnon Milman, Anthony Repetto, Johan Sosa, Nick Steigmann, Sasha Tocryani, Joseph Walsh, Heather Wilson and Kate Wright.


  1. David J. Lockhart, Elizabeth A. Winzeler. Genomics, gene expression and DNA arrays. Nature 405, 827-836. Nature Publishing Group, 2000.

  2. B. M. Good, A. I. Su. Crowdsourcing for bioinformatics. Bioinformatics 29, 1925-1933. Oxford University Press, 2013.

  3. Daniel Marbach, James C Costello, Robert Küffner, Nicole M Vega, Robert J Prill, Diogo M Camacho, Kyle R Allison, Andrej Aderhold, Richard Bonneau, et al. Wisdom of crowds for robust gene network inference. Nature Methods 9, 796-804. Nature Publishing Group, 2012.

  4. Thomas J Lane, Diwakar Shukla, Kyle A Beauchamp, Vijay S Pande. To milliseconds and beyond: challenges in the simulation of protein folding. Current Opinion in Structural Biology. Elsevier, 2012.

  5. Pascal Hingamp, Céline Brochier, Emmanuel Talla, Daniel Gautheret, Denis Thieffry, Carl Herrmann. Metagenome annotation using a distributed grid of undergraduate students. PLoS Biology 6, e296. Public Library of Science, 2008.

  6. Tessa Moses, Jacob Pollier, Johan M Thevelein, Alain Goossens. Bioengineering of plant (tri) terpenoids: from metabolic engineering of plants to synthetic biology in vivo and in vitro. New Phytologist. Wiley Online Library, 2013.

  7. Xu Li, Jing-Ke Weng, Clint Chapple. Improvement of biomass through lignin modification. The Plant Journal 54, 569-581. Wiley Online Library, 2008.

  8. Howard M Salis, Ethan A Mirsky, Christopher A Voigt. Automated design of synthetic ribosome binding sites to control protein expression. Nature Biotechnology 27, 946-950. Nature Publishing Group, 2009.

  9. D. Swarbreck, C. Wilks, P. Lamesch, T. Z. Berardini, M. Garcia-Hernandez, H. Foerster, D. Li, T. Meyer, R. Muller, L. Ploetz, et al. The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Research 36, D1009-D1014. Oxford University Press, 2007.

  10. Y. Kawahara, M. de la Bastide, J. P. Hamilton, H. Kanamori, W. R. McCombie, S. Ouyang, D. C. Schwartz, T. Tanaka, J. Wu, S. Zhou, K. L. Childs, R. M. Davidson, H. Lin, L. Quesada-Ocampo, B. Vaillancourt, H. Sakai, S. S. Lee, J. Kim, H. Numa, T. Itoh, C. R. Buell, T. Matsumoto. Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice 6, 4. Springer (Biomed Central Ltd.), 2013.

  11. L. Cui. ChloroplastDB: the Chloroplast Genome Database. Nucleic Acids Research 34, D692-D696. Oxford University Press, 2006.

  12. Peter JA Cock, Tiago Antao, Jeffrey T Chang, Brad A Chapman, Cymon J Cox, Andrew Dalke, Iddo Friedberg, Thomas Hamelryck, Frank Kauff, Bartek Wilczynski, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422-1423. Oxford University Press, 2009.

  13. T. Barrett, S. E. Wilhite, P. Ledoux, C. Evangelista, I. F. Kim, M. Tomashevsky, K. A. Marshall, K. H. Phillippy, P. M. Sherman, M. Holko, et al. NCBI GEO: archive for functional genomics data sets–update. Nucleic Acids Research 41, D991-D995. Oxford University Press, 2012.

  14. Y. Zhu, S. Davis, R. Stephens, P. S. Meltzer, Y. Chen. GEOmetadb: powerful alternative search engine for the Gene Expression Omnibus. Bioinformatics 24, 2798-2800. Oxford University Press, 2008.

  15. Wei Keat Lim, Kai Wang, Celine Lefebvre, Andrea Califano. Comparative analysis of microarray normalization procedures: effects on reverse engineering gene networks. Bioinformatics 23, i282-i288. Oxford University Press, 2007.

  16. Laurent Gautier, Leslie Cope, Benjamin M. Bolstad, Rafael A. Irizarry. affy—analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20, 307-315. Oxford University Press, 2004.

  17. R Core Team. R: A Language and Environment for Statistical Computing. 2013.

  18. N. Tanaka, M. Fujita, H. Handa, S. Murayama, M. Uemura, Y. Kawamura, T. Mitsui, S. Mikami, Y. Tozawa, T. Yoshinaga, et al. Proteomics of the rice cell: systematic identification of the protein populations in subcellular compartments. Molecular Genetics and Genomics 271, 566-576. Springer-Verlag, 2004.

  19. Y. Sato, H. Takehisa, K. Kamatsuki, H. Minami, N. Namiki, H. Ikawa, H. Ohyanagi, K. Sugimoto, B. Antonio, Y. Nagamura. RiceXPro Version 3.0: expanding the informatics resource for rice transcriptome. Nucleic Acids Research 41, D1206-D1213. Oxford University Press, 2013.

  20. G. Liu. NetAffx: Affymetrix probesets and annotations. Nucleic Acids Research 31, 82-86. Oxford University Press, 2003.

  21. G. E. Crooks. WebLogo: A Sequence Logo Generator. Genome Research 14, 1188-1190. Cold Spring Harbor Laboratory Press, 2004.

  22. T. Hirose. Functional Shine-Dalgarno-Like Sequences for Translational Initiation of Chloroplast mRNAs. Plant and Cell Physiology 45, 114-117. Oxford University Press, 2004.

  23. A. Sarrion-Perdigones, M. Vazquez-Vilar, J. Palaci, B. Castelijns, J. Forment, P. Ziarsolo, J. Blanca, A. Granell, D. Orzaez. GoldenBraid 2.0: A Comprehensive DNA Assembly Framework for Plant Synthetic Biology. Plant Physiology 162, 1618-1631. American Society of Plant Biologists, 2013.
