MaizeGDB Genome Assembly and Annotation Manifesto
If you need to document involvement of MaizeGDB in your planned
assembly or annotation efforts, contact Carson Andorf
for a letter of collaboration.
B73 GENOME ASSEMBLY
It is imperative that the community work from the same genome coordinate system
across projects in order to allow the data generated by various groups to be fully
leveraged and displayed in a comparable manner. Like many Model Organism Databases,
MaizeGDB is charged with facilitating this process and is committed to releasing official
genome assemblies as they are made available.
B73 RefGen_v2 was released as the default view of the assembly at MaizeGDB. This
version was calculated by the Maize Genome Sequencing Consortium and became available
via GenBank on December 7th, 2012. The project record is
The next version, B73 RefGen_v3, became the default assembly view of the MaizeGDB
Genome Browser in April 2013. RefGen_v3 was not a global re-assembly. B73 RefGen_v3
used Roche/454 reads produced from a whole genome shotgun (WGS) sequencing library
to capture missing gene space within and between the original BACs. The 454 reads
were assembled into contigs with ABySS and aligned to the B73 RefGen_v2 assembly
to identify new contiguous pieces of DNA sequence that were not already represented
in the v2 assembly. In addition, ~65,000 full-length cDNAs (FLcDNAs, from the Maize
Full Length cDNA project) were aligned to both the B73
RefGen_v2 contigs and the new contigs. B73 RefGen_v3 was the final product of the
Maize Genome Sequencing Consortium.
An entirely new assembly of the maize genome (B73 RefGen_v4) was constructed from
PacBio Single Molecule Real-Time (SMRT) sequencing at approximately 60-fold coverage
and scaffolded with the aid of a high-resolution whole-genome restriction (optical)
map. This new assembly was constructed without the assistance of the BAC physical
map that had been used to guide the previous v1-v3 assemblies. The pseudomolecules
of maize B73 RefGen_v4 were assembled nearly end-to-end, representing a 52-fold
improvement in average contig size relative to the previous reference (B73 RefGen_v3).
Additional information on this assembly can be found here. B73 RefGen_v4 was funded
by the NSF IOS #1112127
award to Gramene.
ADDITIONAL GENOME ASSEMBLIES
Initially, the B73 genome was the only reference quality genome assembly available
for maize due to the high costs of sequencing and assembling a large (~2.1 Gb) genome.
More recently, as sequencing and assembling costs for a large genome have dropped,
a number of maize research groups have constructed reference quality genome assemblies
for some of the more widely used maize inbred lines. Detailed information on those
genome assemblies can be found here.
GENOME ASSEMBLY AND GENE MODEL NOMENCLATURE
A well-developed nomenclature system is necessary to prevent confusion and to relay
as much information as possible without being overly cumbersome. A nomenclature
system needs to account for species-specific information so that the exact inbred
line used and project-specific metadata can be accessed easily. The change from
GRMZM IDs to the new nomenclature was made for several reasons. The main reason
is to indicate which maize line the models are derived from. This is particularly
important in maize, which is well documented to contain substantial presence/absence
variation (PAV) and copy number variation (CNV) across inbred lines. To make this
transition easier, older maize nomenclature is retained as a synonym and can be
used to look up gene models at MaizeGDB. Note also that gene model names in maize
DO NOT CONVEY ORDER ALONG THE PSEUDOMOLECULE. Specific details on the current maize
nomenclature standards in use can be found here.
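Because gene model names do not convey order along the pseudomolecule, any ordering of models has to come from their coordinates. The following is a minimal sketch of that idea, not a MaizeGDB tool: the `GeneModel` record, the function `models_in_position_order`, and the model names and coordinates are all hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple


class GeneModel(NamedTuple):
    """Hypothetical minimal record for a gene model on one assembly."""
    name: str
    chrom: str
    start: int
    end: int


def models_in_position_order(models: List[GeneModel], chrom: str) -> List[GeneModel]:
    """Return the models on one chromosome ordered by coordinate, not by name."""
    on_chrom = [m for m in models if m.chrom == chrom]
    return sorted(on_chrom, key=lambda m: (m.start, m.end))


# Made-up names and coordinates (assumptions, not real maize gene models).
models = [
    GeneModel("model_0001", "chr1", 5000, 7000),
    GeneModel("model_0002", "chr1", 100, 400),
    GeneModel("model_0003", "chr1", 500, 900),
    GeneModel("model_0004", "chr2", 50, 80),
]

by_position = [m.name for m in models_in_position_order(models, "chr1")]
by_name = sorted(m.name for m in models if m.chrom == "chr1")
```

In this made-up data the two orderings differ, which is the situation the warning above describes: sorting by name would place `model_0001` first even though it lies furthest along the chromosome.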
MAIZEGDB ACCEPTS FUNCTIONAL ANNOTATION
Functional annotation can mean different things to different people. It generally
involves attaching information regarding gene product identity, biological or biochemical
function, expression, regulation, and interactions to a genomic DNA sequence. Are
you generating RNAseq data and wish for that to be aligned to assemblies to show
that the genes in a particular region are expressed? Do you have a mutation for
a gene that is mapped to a genome assembly and the mutant phenotype is known? Have
you experimentally determined the temporal and spatial regulation of a small group
of transcription factors? MaizeGDB is interested in both small and large functional
annotation data sets determined by either in silico analysis or experimental validation.
Contact us at MaizeGDB to find out how
your functional annotations can be included in the MaizeGDB resource.
In addition to the types of functional annotations already described, we at MaizeGDB
accept functional annotations that are based upon assignment of terms from the
Gene Ontologies (GO; http://www.geneontology.org) to gene structures. When GO terms
are assigned to a particular gene, standard Evidence Codes are required to document
how the inference of function was made. For example, an annotation that was made
on the basis of a published, peer-reviewed experiment would have the evidence code
EXP, whereas an annotation made on the basis of an enzyme assay would have the evidence
code IDA. Evidence Codes used by the Gene Ontology Consortium are documented on the Gene Ontology website.
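As an illustration of how evidence codes let a reader separate experimentally supported annotations from purely in silico ones, here is a minimal sketch. The two-way grouping is a simplification assumed for this example (the Gene Ontology Consortium defines finer categories), the function `evidence_class` is hypothetical, and the gene model names are placeholders rather than real maize identifiers.

```python
# Simplified grouping of GO evidence codes, assumed for illustration only;
# the Gene Ontology Consortium defines finer-grained categories.
EXPERIMENTAL = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
IN_SILICO = {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA", "IEA"}


def evidence_class(code: str) -> str:
    """Return 'experimental', 'in_silico', or 'other' for a GO evidence code."""
    code = code.strip().upper()
    if code in EXPERIMENTAL:
        return "experimental"
    if code in IN_SILICO:
        return "in_silico"
    return "other"


# Hypothetical annotations: (placeholder gene model, GO term, evidence code).
annotations = [
    ("gene_model_A", "GO:0003700", "IDA"),
    ("gene_model_B", "GO:0003700", "IEA"),
    ("gene_model_C", "GO:0016301", "EXP"),
]

# Keep only annotations backed by experimental evidence.
experimental_only = [a for a in annotations if evidence_class(a[2]) == "experimental"]
```

Codes outside both sets, such as author statements (TAS, NAS), fall into `other` in this sketch; a real submission pipeline would likely treat each category explicitly.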
SUGGESTED GUIDELINES FOR RESEARCH GROUPS PLANNING TO
SEQUENCE, ASSEMBLE, AND ANNOTATE A MAIZE GENOME FOR SUBMISSION TO MAIZEGDB
A plan for providing documentation that is complete, accurate, and timely.
A centrally accessible plan should be made available at the time that your project
begins and include a timeline for data delivery. Functional and structural annotation
should be provided with standard evidence codes, clearly discriminating annotation
with experimental evidence from purely in silico analyses.
A plan for developing a close working relationship with MaizeGDB as the ultimate
disseminators of the information. Assemblies and annotations should be delivered
to MaizeGDB regularly and in a timely fashion. MaizeGDB can display the deliverable
dates so as to keep the community informed. Ideally, these dates should be known
in advance, and should be adhered to if at all possible. MaizeGDB will create a
genome assembly webpage to display your project metadata for your genome assembly.
It is understood that delays can occur. The intent here is to make the process more
transparent to the research community.
A mechanism for interacting with the maize community directly and with a single
voice. Maize researchers form a vibrant community with members at all
levels in both the public and private sectors. A bidirectional means of communicating
with the maize community should be deployed at the start of the project so that
the maize community can both absorb and respond to new project information quickly.
The goal is to provide all community members with the same information at the same
time so that they can plan their research activities accordingly. This can be accomplished
in many ways (FAQs, blogs, social media, conferences, etc.) and all options should
be considered so as to reach the largest number of stakeholders.
A robust way to capture genome assembly and annotation information from the
community. For any genome assembly, researchers often have high-quality structural
and functional annotations for their genes of interest, both stored on lab computers
and documented in publications. Researchers are usually willing to share this information
freely, but currently, there is no robust means to capture it. Groups developing
genome assemblies are encouraged to work with MaizeGDB to develop a plan for collection
of high value annotations that are specific to their assemblies. All annotation
submitted by community members for specific genome assemblies will be vetted by
MaizeGDB curators and then be incorporated into the assembly, with an indication
of who provided the data. It is expected that while there will be comparatively
little data entering the assembly process in this way, these data would be of very