FAIR Access to Data
What is it? “FAIR” is the acronym for Findable, Accessible, Interoperable and Reusable, and applies to scientific DATA used by people and computers. The FAIR data principles were established in 2016. Each word has a precise meaning. More information is available at the FAIR website, from the publication, the initiative page, and the 2018 editorial.
The FAIR principles recognize that:
- published data should be stored where it can be found and identified accurately using globally unique persistent identifiers;
- data can be retrieved using standard protocols (for example, through a web interface);
- data is in a standard format such that it can be used in other applications; and
- data is licensed properly to allow reuse.
At MaizeGDB, we value data highly and believe that it is the responsibility of every maize researcher to make their publicly funded data FAIR. Here we outline some basic guidelines for good data management. Because of the wonderful history of cooperation in the maize research and breeding community (Kass et al., 2005, Maize COOP information, and Coe, 2001) we hope our community will lead the worldwide charge to FAIR data! We are always happy to answer your questions on these issues.
Why is this important? Published data continues to increase dramatically in volume and complexity. Data sets in individual publications are now routinely so large that computer analysis is required. For data to be “machine readable” (that is, the data can be manipulated by a computer program), new standards are being established so that data is BOTH human and machine readable. Also, journal publishers often do not accept large data sets, so other data repositories must be used, and it can be difficult for submitters to find the right database for their data. It can also be difficult for researchers to find data associated with a publication when that data is not in the supplementary material and persistent globally unique identifiers are not specified in the published article.
Why should YOUR Data be FAIR? If you make your data FAIR, it will be more visible, easier to reuse, and more frequently cited. If others make their data FAIR, it will help the entire community harvest the vast depth and breadth of maize data, and discovery will proceed more quickly, especially as new analysis methods come online.
Start with these 8 Simple Things:
1. Understand the FAIR data principles, and make your data FAIR
Start by simply reading this article: https://www.nature.com/articles/sdata201618. It is also very important to budget time for data management, just as you budget time for the other aspects of your research. Do not skimp on this increasingly important step!
2. Understand “Machine Readable”
This simply means that data is in a format that can be read and processed by a computer without human intervention. For example, image files (such as TIFF, JPEG, or BMP) and PDFs are NOT machine readable. Word documents are also generally NOT machine readable (text mining remains in its infancy). Formats such as spreadsheets with column headers that can be exported as comma-separated values (CSV), or standard formats for specific data types like FASTA, FASTQ, BED, GFF3, BAM, SAM, VCF, etc., ARE machine readable. Repositories often describe the machine-readable file formats they accept. For example, see the NCBI SRA File Format Guide. If you have questions about what file formats MaizeGDB accepts, please contact Maggie Woodhouse ([email protected]) or Ethy Cannon ([email protected]). Also, remember that computers are good at exact matches; for example, “lg1” does NOT equal “liguleless1” and “Chr1” does NOT equal “1” to a computer.
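As a minimal sketch of what machine readable means in practice, the Python snippet below parses a small CSV table with a header row and shows why exact matching matters. The column names, the sample rows, and the `normalize_chromosome` helper are hypothetical and chosen only for illustration; they are not a MaizeGDB format.

```python
import csv
import io

# Hypothetical example table; the column names and values are illustrative only.
CSV_TEXT = """gene,chromosome,start
lg1,Chr1,100
liguleless1,1,100
"""


def read_rows(text):
    """Parse CSV text with a header row into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def normalize_chromosome(name):
    """Strip an optional 'chr' prefix (any case) so that 'Chr1' and '1' compare equal."""
    name = name.strip()
    if name.lower().startswith("chr"):
        return name[3:]
    return name


rows = read_rows(CSV_TEXT)

# To a computer, the two chromosome labels are different strings.
print(rows[0]["chromosome"] == rows[1]["chromosome"])  # False

# After explicit normalization they match.
print(
    normalize_chromosome(rows[0]["chromosome"])
    == normalize_chromosome(rows[1]["chromosome"])
)  # True
```

The point of the sketch is that a program can read the table without human help only because it has a header row and a consistent delimiter, and that any equivalence between labels has to be stated explicitly.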
3. Put your data in the right database
There are excellent stable repositories for many types of scientific data, and emerging databases for newer types of data. Data should go into the correct repository, and then it can be pulled into MaizeGDB for further curation and use with our tools. Wherever you deposit data, get a DOI (or other persistent, globally unique identifier) and put it in your publication.
DNA/RNA/Protein Sequences, genome assemblies should go to NCBI, EBI
NCBI (US), EBI (Europe), and DDBJ (Asia) provide stable, long-term storage
for DNA, RNA and protein sequence data and create stable identifiers for
datasets. These three organizations share sequence data on a daily basis,
so data deposited at one is available at all. Each has multiple
sub-databases; for example, NCBI has SRA and GEO for unmapped and mapped
SNPs: All maize SNPs should be submitted to EVA at EBI.
Genome Assemblies: Please submit genome assemblies to EBI or NCBI Genomes. We understand this can take some time to complete. We can help, so please do not be tempted to simply submit contigs to GenBank.
Protein/Proteomics/Metabolomics: Explore UniProt, MassIVE, MetaboLights, PeptideAtlas, and PRIDE. Metabolomics data should be submitted following the MSI guidelines. Submit proteomics data to members of ProteomeXchange, following the MIAPE recommendations.
General Repositories: Dryad, Figshare, CyVerse
Not sure? Nature provides an excellent list of data repositories and recommendations (https://www.nature.com/sdata/policies/repositories), as does PLOS ONE. The re3data.org and FAIRsharing.org websites have extensive lists of databases, resources, and repositories. If you are still unsure where to submit data, or need help submitting, please ask anyone at MaizeGDB. If your journal article refers to data NOT published with your article, please make sure to obtain a persistent identifier for your data and include it, along with the location of the data, in your article.
4. Attach complete and detailed metadata to your data sets, and use accepted file formats
When you deposit data, you are asked for information about your data
(metadata). Please give this the same careful attention you give to your
bench work and analysis. Datasets that are not adequately described are not
reusable or reproducible, and raise questions about the carefulness and
accuracy of the research. You should supply enough metadata so that your
experiment can be reproduced. Be sure to use community standards for your
data, such as MIxS (Minimum Information about any (x) Sequence) or
MIAPPE (Minimum Information About a Plant
Phenotyping Experiment). These standards will inform you on what
information to provide and the accepted file formats for your type of data.
Standards can also be found at data repositories.
Lastly, use ontology terms to describe your data. Ontologies provide a powerful organizing framework for data, and help data to be machine readable.
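One way to make metadata completeness routine is to check each record against a list of required fields before deposit. The sketch below assumes a hypothetical set of field names; a real checklist such as MIxS or MIAPPE defines its own fields, so treat `REQUIRED_FIELDS` and the sample record as placeholders.

```python
# Hypothetical required fields; a real checklist (for example MIxS or MIAPPE)
# defines its own field names, so these are placeholders only.
REQUIRED_FIELDS = ("organism", "genotype", "tissue", "collection_date")


def missing_metadata(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent, None, or blank in a record."""
    missing = []
    for field in required:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Illustrative record: 'tissue' is blank and 'collection_date' is absent.
sample = {"organism": "Zea mays", "genotype": "B73", "tissue": ""}
print(missing_metadata(sample))  # ['tissue', 'collection_date']
```

Running a check like this before submission flags gaps while the details of the experiment are still easy to recover.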
5. Do not rename genes that already have names
Once upon a time, the name of a maize gene was its unique, persistent identifier. But now, renaming of genes that already have names is a big problem. Multiple names for the same gene make it difficult to find all the information for that gene. Even worse, when the same name is used for different genes, how can a human, much less a computer, know they are different genes? Please look up your gene at MaizeGDB before assigning a name, and follow the maize nomenclature guidelines.
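To illustrate why multiple names are a problem for software, the sketch below resolves gene names through an explicit synonym table. The table is a hypothetical two-entry example built from the “lg1” and “liguleless1” names mentioned earlier, not an export from MaizeGDB, and `canonical_symbol` is an illustrative helper.

```python
# Illustrative synonym table mapping each known name to one canonical symbol.
# The entries are examples only, not data exported from MaizeGDB.
SYNONYMS = {
    "lg1": "lg1",
    "liguleless1": "lg1",
}


def canonical_symbol(name, table=SYNONYMS):
    """Return the canonical symbol for a name, or None if the name is unknown."""
    return table.get(name.strip())


print(canonical_symbol("liguleless1"))  # lg1
print(canonical_symbol("newgene1"))     # None
```

Every new name for an existing gene adds an entry that someone has to curate in a table like this; a name that is reused for a different gene cannot be represented in it at all.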
6. Let MaizeGDB know about your work
Please let us know about your publications, and provide links to your data. If you have published on a gene or genes, or if you have a dataset that will be useful to others, let us know. Review the MaizeGDB pages on genes or other information that you study, and let us know if corrections should be made. We want to get it right! Submission templates can be found on our “Contribute Data to MaizeGDB” page.
7. Be open to culture change
Graduate students and postdocs are generally aware of the FAIR data principles, as they will be required to abide by these new standards throughout their careers. Making data FAIR is a bit of a culture change in laboratory science, and will take some time to get used to. You could think of sharing data the way we share seeds: documenting digital data is as important as documenting the crosses that went into generating germplasm. The benefits of having so much public data at your fingertips will revolutionize how we do science, and will contribute towards accelerating discovery. Embrace the change. Incorporating FAIR data principles can help you get higher-rated grants and publications. MaizeGDB is happy to work with you on your data management plans for grant submissions.
8. Ask questions
Asking questions is a great way to find out things you want to know. At MaizeGDB, we do not know all the answers, but we want to learn them with you! Please contact us.