Impact Challenge Day 12: Make your data discoverable on a data repository

Data is second only to journal articles in terms of importance to science communication and publishing–it’s the rocks from which diamonds are refined. And as a researcher, chances are you’ve got research data lying around on your hard drive or server.

Yet a lot of research data never sees the light of day. It used to be difficult to make data available to others, so researchers didn’t unless required to by journals or funder mandates.

But new research has found that by putting your research data online, you’ll become up to 30% more highly cited than if you kept your data hidden. Open research data also leads to more replicable studies, and is important to the quality of science overall. And advancements in technology have made it easier than ever to cheaply preserve and make your data Open Access.

In today’s challenge, we’ll share three easy ways to make your data available online: Open Repositories (ORs) like Figshare and Zenodo; Disciplinary Repositories (DRs) like Dryad and ICPSR; and Institutional Repositories (IRs).

Why post to a data repository?

A common way for many researchers to share their data over the years has been to submit it as a supplementary file to a journal article. But publishers are beginning to encourage scientists to deposit their data to repositories instead.

Publishers recognize that repositories of all persuasions are fantastic places to post your research data. That’s because of two standard features for most repositories: high-quality preservation options and persistent identifiers for your data.

Preservation is a no-brainer–if you’re entrusting your data to a repository, you want to know that it will be around until you decide to remove it.

Persistent identifiers are important because they allow your data to be found if the URL for your data changes, or it’s transferred to another repository when your repository is shuttered, and so on. And with persistent identifiers like DOIs, it’s easy to track citations, shares, mentions, and other reuse and discussion of your data on the Web.
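To make the idea of a persistent identifier concrete, here is a minimal Python sketch that turns a DOI, in any of the forms people tend to paste it, into a link through the public doi.org resolver. The function name doi_to_url and the example DOI are hypothetical, chosen only for illustration; the point is that the link goes through the resolver rather than to a repository's own URL, so it should keep working even if the data moves.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"

# Prefixes that often get pasted along with the DOI itself.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi: str) -> str:
    """Return a doi.org resolver URL for a DOI written in any common form.

    Hypothetical helper for illustration; it checks only that the
    identifier starts with "10.", as all DOIs do.
    """
    cleaned = doi.strip()
    for prefix in _PREFIXES:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):].strip()
            break
    if not cleaned.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Keep "/" readable; percent-encode anything unsafe in a URL path.
    return DOI_RESOLVER + quote(cleaned, safe="/")


if __name__ == "__main__":
    # "10.1234/example.5678" is a made-up DOI, not a real dataset.
    print(doi_to_url("doi:10.1234/example.5678"))
```

Citing the resolver link rather than the repository's landing page is what lets a citation survive a URL change.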

There are several different types of repository that can host your data depending upon your institution and discipline. Let’s dig into the different types of repositories and what each does best.

Figshare, Zenodo, and other open repositories

Open repositories like Figshare and Zenodo are repositories that anyone can use, regardless of institutional affiliation, to preserve any type of scholarly output they want. Here are specific advantages and disadvantages of two open repositories.

Pros

Figshare offers free deposits for open data up to 250 MB in file size. They issue persistent identifiers called DOIs for datasets. Users can “version” their data as simply as uploading updated files, and can easily embed Figshare datasets in other websites and blogs by copying and pasting a simple code. Other users can comment on datasets and download citation files to their reference managers for later use. Figshare offers preservation backed by CLOCKSS, a highly-trusted, community-governed archive used by repositories around the world. And you get basic information about the number of views and shares on social media your dataset has gotten to date.

Zenodo also offers free data deposits and issues DOIs for your datasets. Much like Figshare, the non-profit makes citation information for datasets available in BibTeX, EndNote, and a variety of other library and reference manager formats. Users can add highly detailed metadata for their files–much more than Figshare currently allows–which can aid in discoverability. Other Zenodo users can comment on your files. And best of all, Zenodo makes it easy to sign up with your ORCID identifier or GitHub account. (If you don’t have either yet, no worries! We’re going to cover them in upcoming challenges.)

Both repositories have open APIs, making them very interoperable with other systems, and they are both user-friendly and fun to use.
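As a rough illustration of what an open API makes possible, the Python sketch below builds a search request against Zenodo's public records endpoint and pulls titles and DOIs out of the response. The function names are hypothetical, and the response shape shown in the comments is an assumption based on how the API responds at the time of writing; check Zenodo's API documentation before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ZENODO_RECORDS = "https://zenodo.org/api/records"


def build_search_url(query: str, size: int = 10) -> str:
    """Build a search URL using the `q` and `size` query parameters."""
    return f"{ZENODO_RECORDS}?{urlencode({'q': query, 'size': size})}"


def extract_titles(response: dict) -> list:
    """Return (title, doi) pairs from a parsed search response.

    Assumed shape (verify against the API docs):
    {"hits": {"hits": [{"doi": ..., "metadata": {"title": ...}}]}}
    Missing keys yield None rather than raising.
    """
    records = response.get("hits", {}).get("hits", [])
    return [(r.get("metadata", {}).get("title"), r.get("doi")) for r in records]


def search(query: str, size: int = 10) -> list:
    """Run the search over the network and return (title, doi) pairs."""
    with urlopen(build_search_url(query, size), timeout=30) as resp:
        return extract_titles(json.load(resp))


if __name__ == "__main__":
    # Prints the request URL only; call search(...) to hit the network.
    print(build_search_url("ocean temperature", size=5))
```

The URL building and the parsing are kept separate from the network call so that each piece can be checked without an internet connection.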

Cons

For some, Figshare’s funding model is a serious drawback: it’s a for-profit company funded by Digital Science, whose parent company, Macmillan Publishers, is the keeper of the Nature Publishing Group empire.

Zenodo’s preservation plan is less robust than Figshare’s, and currently Zenodo can only host files 2 GB or less in size. Zenodo also lacks public pageview and download statistics, meaning that you can’t track the popularity or reuse of the data you submit to the archive.

Dryad, ICPSR, and other disciplinary repositories

Disciplinary repositories offer a way to share specialized research data with relevant communities. They offer many of the same features as IRs and ORs, but often with special features for disciplinary data.

Pros

Disciplinary repositories like KNB and ICPSR often allow users to use subject-specific metadata schema that enhance discoverability. They are focal points for their disciplines, meaning that your data will more likely be seen by those who understand it. Repositories like those in the DataONE network are interoperable with the software that you and other researchers already use to collect and analyze data, making it super easy to deposit data as part of your regular workflow. Depending on the repository, they might offer DOIs for data you’ve deposited.

Cons

Not all disciplinary repositories allow you to deposit large datasets. Some do not offer DOIs. And occasionally, grant-funded subject repositories that don’t have sustainable business models shut down after their funding runs out.

Protein Data Bank, GenBank & other datatype-specific repositories

In some disciplines, entire repositories exist just for data of particular formats. Some examples include the RCSB’s Protein Data Bank for 3D shapes of proteins, nucleic acids, and complex assemblies; GenBank for DNA sequences; and EMDataBank for 3D electron microscopy density maps, atomic models, and associated metadata.

Pros

If there’s a repository for the datatype you work with, your best bet is often to deposit it there. By virtue of being a hub for disciplinary data, datatype repositories are often frequented by others in your field who are doing similar research–an ideal audience of those you’d want to see and reuse your data. Datatype repositories often offer highly-specific metadata and search options, making it easy for others in your field to find your data.

Cons

Datatype repositories cater to a very small subset of data formats, and can sometimes lack linkages to the publications and other datasets that give them much-needed context. Some datatype repositories are inactive, having been abandoned after their funding ran out, or because of a lack of use by other scientists, or for a host of other reasons. Be careful to check whether the datatype repository you’re interested in using is regularly updated.

Institutional repositories

Institutional repositories are platforms where a university’s faculty and graduate students can preserve their research data and other scholarly outputs.

Pros

Institutional repositories are often free to use, allow for the addition of both basic and complex data descriptions, and usually issue persistent identifiers called Handles that others can use to cite and find your data easily. (Currently, IRs that mint DOIs are harder to come by.) Some IRs even offer unlimited data storage, meaning you can store your terabytes’ worth of data for free.

And by virtue of being backed by a university and administered by librarians, they’ve got a degree of trust that money can’t buy; many universities have been around for a hundred or more years, librarians have been stewards of the scholarly record since the times of the Ancient Library of Alexandria, and both will likely be around long after the Googles of the world have been shuttered.

Cons

What IRs offer in trust, they lack in flexibility and control. Many IRs have strict requirements for who can sign up and deposit research data, what data formats they’ll support over time, and if and how you can edit your files and descriptive information.

Other issues with many (but not all) institutional repositories include their lack of features for collaboration, inability to “version” datasets, unclear licensing advice for open data needs, and a lack of APIs for interoperability with other systems. Many also only use a very general metadata standard, Dublin Core, and don’t support domain- or datatype-specific metadata fields and controlled vocabularies.
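To show just how general Dublin Core is, here is a small Python sketch that builds a record from the standard's fifteen core elements and rejects anything else. The function name, the "record" wrapper element, and the sample values are hypothetical; real repositories wrap these elements in their own record formats.

```python
import xml.etree.ElementTree as ET

# Namespace of the fifteen-element Dublin Core Metadata Element Set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dublin_core_record(fields: dict) -> str:
    """Serialize a mapping of Dublin Core element names to an XML string.

    The "record" root element is a placeholder for illustration only.
    """
    root = ET.Element("record")
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            # There is no slot for domain-specific fields in this element set.
            raise ValueError(f"not a Dublin Core element: {name!r}")
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Sample values are made up.
    print(dublin_core_record({
        "title": "Example survey data",
        "creator": "A. Researcher",
        "type": "Dataset",
    }))
```

A field such as a sampling depth or a gene name has no home among these fifteen elements, which is the limitation described above.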

Perhaps the biggest drawback? No one goes to IRs looking for data, so you’re entirely reliant on search engines for discoverability.

Data repository limitations

In addition to some of the drawbacks addressed above, the biggest limitation to the idea of making your data openly available is that not everyone can do it! If you work with sensitive data–defined by ANDS as “data that can be used to identify an individual, species, object, or location that introduces a risk of discrimination, harm, or unwanted attention”–you often can’t post your data openly online.

That said, some repositories like ICPSR do index sensitive data, making it available to registered users. The availability of a metadata record alone can sometimes be enough to cite sensitive data, and so it’s possible that you can still get cited, even if your data isn’t open access. But we don’t recommend keeping your data behind a login or other barrier if you don’t have to.

Unsure if your data is “sensitive”? Check out Purdue University Library’s guide on sensitive data, which can help you identify it and all applicable laws and regulations.

Homework

For today’s homework, we’re going to get your data online.

Register for an Open Repository

Explore data hosted on Figshare and Zenodo, then choose and sign up for an account on the platform of your choice. Deposit at least one data set to the service. It can be a copy of supplementary data you’ve posted alongside a journal article, raw data, or data from a dead-end project you’ve never published.

Be sure to add as much descriptive information as possible during the deposit. It’ll make your data more useful to those who look at it, and also more “Googleable”–both repositories are well-indexed by search engines.

Choose a disciplinary repository

There are thousands of repositories where you could possibly deposit data from your field. Ask a trusted colleague for a recommendation or check out the Re3Data guide for a comprehensive list of subject repositories.

Once you’ve found one that suits your needs, register for it and deposit a dataset or two.

Explore relevant datatype-specific repositories

Ask a colleague or your advisor what the best repositories are for the data formats you tend to create. Sign up for each that you think will be the most relevant to your work, explore some of the other datasets on the site, and deposit a dataset or two of your own. And just like you did for the previous two deposits, make sure you add great descriptive information, which can help others understand your data.

Got an idea of what repository you like best? Great! Next time you’ve got a dataset that you want to share with the world, do it!

Tomorrow, we’ll explore GitHub for sharing your scientific code and data.

9 thoughts on “Impact Challenge Day 12: Make your data discoverable on a data repository”

  1. Iara Vidal says:

    I’ve been wondering about something for a long time: is it worth/valuable to share the results of a bibliographic search? My thesis data is basically a reference list of articles with “altmetric*” in their titles, abstracts, and/or keywords, collected from several databases (there’s also a list of the respective authors and their institutions). It is, of course, already outdated. Would it be valuable to share it? If so, in what format(s)?

    • Yes, it would be for two main reasons: reproducibility of the research in your thesis, and to provide a snapshot of the data from a particular point in time (future studies might reuse the data to compare/contrast, etc). Plus, there’s the possibility for reuse in the classroom.

      As for formats, it depends. What format is it in now?
