Tracking the impacts of data – beyond citations

This post was originally published on the e-Science Community Blog, a great resource for data management librarians.

"How to find and use altmetrics for research data" text in front of a beaker filled with green liquid

How can you tell if data has been useful to other researchers?

Tracking how often data has been cited (and by whom) is one way, but data citations only tell part of the story, part of the time. (The part that gets published in academic journals, if and when those data are cited correctly.) What about the impact that data has elsewhere?

We’re now able to mine the Web for evidence of diverse impacts (bookmarks, shares, discussions, citations, and so on) for diverse scholarly outputs, including data sets. And that’s great news, because it means that we now can track who’s reusing our data, and how.

All of this is still fairly new, however, which means that you likely need a primer on data metrics beyond citations. So, here you go.

In this post, I’ll give an overview of the different types of data metrics (including citations and altmetrics), the “flavors” of data impact, and specific examples of data metric indicators.

What do data metrics look like?

There are two main types of data metrics: data citations and altmetrics for data. Each type is important in its own right, and each offers a way to understand different dimensions of impact.

Data citations

Much like traditional, publication-based citations, data citations are an attempt to track data’s influence and reuse in scholarly literature.

The reason why we want to track scholarly data influence and reuse? Because “rewards” in academia are traditionally counted in the form of formal citations to works, printed in the reference list of a publication.

Data is often cited in one of two ways: by citing the data package directly (usually by pointing to the repository where the data is hosted), or by citing a “data paper” that describes the dataset. A data paper functions primarily as detailed metadata, with the added benefit of being in a format that’s much more appealing to many publishers.

In the rest of this post, I’m going to mostly focus on metrics other than citations, which are being written about extensively elsewhere. But first, here’s some basic information on data citations that can help you understand how data’s scholarly impacts can be tracked.

How data packages are cited

Much like how citations to publications differ depending on whether you’re using Chicago style or APA style formatting, citations to data tend to differ according to the community of practice and the recommended citation style of the repository that hosts the data. But there is a core set of minimum elements that should appear in any data citation. Jon Kratz has compiled these “core elements” (as well as “common elements”) over on the DataPub blog. The core elements include:

  • Creator(s): Essential, of course, to publicly credit the researchers who did the work. One complication here is that datasets can have large (into the hundreds) numbers of authors, in which case an organizational name might be used.

  • Date: The year of publication or, occasionally, when the dataset was finalized.

  • Title: As is the case with articles, the title of a dataset should help the reader decide whether your dataset is potentially of interest. The title might contain the name of the organization responsible, or information such as the date range covered.

  • Publisher: Many standards split the publisher into separate producer and distributor fields. Sometimes the physical location (City, State) of the organization is included.

  • Identifier: A Digital Object Identifier (DOI), Archival Resource Key (ARK), or other unique and unambiguous label for the dataset.

Arguably the most important principle? The use of a persistent identifier (PID) like a DOI, ARK, or Handle. PIDs are important for two reasons: even if the data’s URL changes, others will still be able to access it; and PIDs give citation aggregators like the Data Citation Index and Impactstory.org an easy, unambiguous way to pick out “mentions” in online forums and journals.
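To see why a PID is so convenient for aggregators, here’s a minimal sketch (my own illustration, not any aggregator’s actual code) that resolves a DOI to structured metadata using standard DOI content negotiation. The DOI shown is a placeholder; substitute your dataset’s DOI.

```python
import json
import urllib.request

doi = "10.5061/dryad.example"  # placeholder DOI – substitute your dataset's DOI

# doi.org supports content negotiation: asking for CSL JSON returns structured
# metadata for both Crossref- and DataCite-registered DOIs.
req = urllib.request.Request(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
)
with urllib.request.urlopen(req) as response:
    metadata = json.loads(response.read().decode("utf-8"))

print(metadata.get("title"))      # dataset title
print(metadata.get("publisher"))  # hosting repository / publisher
```

One lookup, one unambiguous record – which is exactly what makes PID-based tracking tractable at scale.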

It’s worth noting, however, that as few as 25% of journal articles formally cite data. (Sad, considering that so many major publishers have signed on to FORCE11’s data citation principles, which include the need to cite data packages in the same manner as publications.) Instead, many scholars reference data packages only in their Methods section, forgoing formal citations, which makes text mining necessary to retrieve mentions of those data.
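For a sense of what that text mining involves, here’s a rough sketch that scans a snippet of Methods-section text for DOI-like strings. The sample text is invented, and the regex is a common heuristic pattern, not an official standard.

```python
import re

# Invented sample text standing in for a Methods section.
methods_text = """
Data were obtained from the XYZ repository,
https://doi.org/10.5061/dryad.example, and reanalysed in R.
"""

# Common heuristic pattern for DOIs (prefix 10.xxxx followed by a suffix).
doi_pattern = re.compile(r"\b10\.\d{4,9}/[-._;()/:a-z0-9]+", re.IGNORECASE)

print(doi_pattern.findall(methods_text))  # -> ['10.5061/dryad.example']
```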

How to track citations to data packages

When you want to track citations to your data packages, the best option is the Data Citation Index (DCI). The DCI functions similarly to Web of Science: if your institution has a subscription, you can search the Index for citations in the literature that reference data from a number of well-known repositories, including ICPSR, ANDS, and PANGAEA.

Here’s how: log in to the DCI and head to the home screen. In the Search box, type your name or the dataset’s DOI. Find the dataset in the search results, then click on it to be taken to the item record page. On the item record, find and click the “Create Citation Alert” button on the right-hand side of the page, where you’ll also find a list of articles that reference that dataset. Now you have a list of the articles that reference your data to date, and you’ll also receive automated email alerts whenever someone new references your data.

Another option comes from CrossRef Search. This experimental search tool works for any dataset that has a DataCite DOI and is referenced in the scholarly literature indexed by CrossRef. (DataCite issues DOIs for Figshare, Dryad, and a number of other repositories.) Right now, the search is a very rough one: you’ll need to view the entire list of DOIs, then use your browser’s find function (Ctrl+F or Command+F) to check the list for your specific DOI. It’s not perfect–in fact, sometimes it’s entirely broken–but it does provide a view into your data citations not entirely available elsewhere.
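If you’d rather not scroll through the list by hand, a few lines of code can automate the Ctrl+F step: download the list page and check whether your DOI appears in it. Both values below are placeholders – the experimental tool’s address changes, so substitute the list page you actually use and your own DOI.

```python
import urllib.request

LIST_URL = "https://example.org/datacite-dois-cited-in-crossref"  # placeholder URL
MY_DOI = "10.5061/dryad.example"                                  # placeholder DOI

# Fetch the full list and do a simple case-insensitive substring check.
with urllib.request.urlopen(LIST_URL) as response:
    page = response.read().decode("utf-8", errors="replace")

if MY_DOI.lower() in page.lower():
    print(f"{MY_DOI} appears in the list – your data has been cited.")
else:
    print(f"{MY_DOI} was not found in the list.")
```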

How data papers are cited

Data papers tend to be cited like any other paper: by recording the authors, title, journal of publication, and any other information required by the citation style you’re using. Data papers are also often cited using persistent identifiers like DOIs, which are assigned by publishers.

How to find citations for data papers

To find citations to data papers, search databases like Scopus and Web of Science just as you would for any traditional publication. Here’s how to track citations in Scopus and Web of Science.

There’s no guarantee that your data paper is included in those databases, though, since data paper journals are still a niche publication type in some fields and thus aren’t tracked by some major databases. It’s smart to follow up your database search with a Google Scholar search, too.

Altmetrics for data

Citations are good for tracking the impact of your data in the scholarly literature, but what about other types of impact, among other audiences like the public and practitioners?

Altmetrics are indicators of the reuse, discussion, sharing, and other interactions humans can have with a scholarly object. These interactions tend to leave traces on the scholarly web.

Altmetrics are so broadly defined that they include pretty much any type of indicator sourced from a web service. For the purposes of this post, we’ll separate out citations from our definition of altmetrics, but note that many altmetrics aggregators tend to include citation data.

There are two main types of altmetrics for data: repository-sourced metrics (which often measure not only researchers’ impacts, but also repositories’ and curators’ impacts), and social web metrics (which more often measure other scholars’ and the public’s use and other interactions with data).

First, let’s discuss the nuts and bolts of data altmetrics. Then, we’ll talk about services you can use to find altmetrics for data.

Altmetrics for how data is used on the social web

Data packages can be shared, discussed, bookmarked, viewed, and reused using many of the same services that researchers use for journal articles: blogs, Twitter, social bookmarking sites like Mendeley and CiteULike, and so on. There are also a number of services that are specific to data, and these tend to be repositories with altmetric “indicators” particular to that platform.

For an in-depth look at data metrics and altmetrics, I recommend reading Costas et al.’s report, “The Value of Research Data” (2013). Below, I’ve created a basic chart of various altmetrics for data and what they can likely tell us about the use of data.

Quick caveat: there’s been little research done into altmetrics for data. (DataONE, PLOS, and the California Digital Library are in fact the first organizations to do major work in this area, and they were recently awarded a grant to do proper research that will likely confirm or refute much of the list below. Keep an eye out for future news from them.) The metrics and their meanings listed below are, at best, estimations based on experience with both research data and altmetrics.

Repository- and publisher-based indicators

Note that some of the repositories below are primarily used for software, but can sometimes be used to host data, as well.

| Web Service | Indicator | What it might tell us | Reported on |
|---|---|---|---|
| GitHub | Stars | Akin to “favoriting” a tweet or underlining a favorite passage in a book, GitHub stars may indicate that someone who has viewed your dataset wants to remember it for later reference. | GitHub, Impactstory |
| GitHub | Watched repositories | A user is interested enough in your dataset (stored in a “repository” on GitHub) that they want to be informed of any updates. | GitHub, PlumX |
| GitHub | Forks | A user has adapted your code for their own uses, meaning they likely find it useful or interesting. | GitHub, Impactstory, PlumX |
| SourceForge | Ratings & recommendations | What do others think of your data? And do they like it enough to recommend it to others? | SourceForge, PlumX |
| Dryad, Figshare, and most institutional and subject repositories | Views & downloads | Is there interest in your work, such that others are searching for and viewing descriptions of it? And are they interested enough to download it for further examination and possible future use? | Dryad, Figshare, and IR platforms; Impactstory (for Dryad & Figshare); PlumX (for Dryad, Figshare, and some IRs) |
| Figshare | Shares | Implicit endorsement. Do others like your data enough to share it with others? | Figshare, Impactstory, PlumX |
| PLOS | Supplemental data views, figure views | Are readers of your article interested in the underlying data? | PLOS, Impactstory, PlumX |
| Bitbucket | Watchers | A user is interested enough in your dataset that they want to be informed of any updates. | Bitbucket |
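If you want to pull the GitHub numbers in the table above yourself, here’s a short sketch against GitHub’s public REST API, which reports stars, watchers, and forks for any public repository. The repository name is a placeholder.

```python
import json
import urllib.request

repo = "your-username/your-dataset-repo"  # placeholder – substitute a real public repository

# The public repository endpoint returns counts for several of the indicators above.
with urllib.request.urlopen(f"https://api.github.com/repos/{repo}") as response:
    data = json.loads(response.read().decode("utf-8"))

print("Stars:   ", data["stargazers_count"])   # bookmarked for later reference
print("Watchers:", data["subscribers_count"])  # users notified of any updates
print("Forks:   ", data["forks_count"])        # copies adapted for others' own use
```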

Social web-based indicators

| Web Service | Indicator | What it might tell us | Reported on |
|---|---|---|---|
| Twitter | Tweets that include links to your product | Others are discussing your data–maybe for good reasons, maybe for bad ones. (You’ll have to read the tweets to find out.) | PlumX, Altmetric.com, Impactstory |
| Delicious, CiteULike, Mendeley | Bookmarks | Bookmarks may indicate that someone who has viewed your dataset wants to remember it for later reference. Mendeley bookmarks may be an indicator of later citations (as they are for articles). | Impactstory, PlumX; Altmetric.com (CiteULike & Mendeley only) |
| Wikipedia | Mentions (sometimes also called “citations”) | Do others think your data is relevant enough to include in Wikipedia encyclopedia articles? | Impactstory, PlumX |
| ResearchBlogging, Science Seeker | Blog post mentions | Is your data being discussed in your community? | Altmetric.com, PlumX, Impactstory |
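To get a feel for one of these indicators, here’s a rough sketch that uses the public MediaWiki search API to look for English Wikipedia articles mentioning a dataset’s DOI. The DOI is a placeholder, and a dedicated aggregator will do this far more reliably; this is just to show that the underlying signal is an ordinary web search.

```python
import json
import urllib.parse
import urllib.request

doi = "10.5061/dryad.example"  # placeholder DOI – substitute your dataset's DOI

# Query the English Wikipedia full-text search for the DOI string.
params = urllib.parse.urlencode({
    "action": "query",
    "list": "search",
    "srsearch": f'"{doi}"',
    "format": "json",
})
url = f"https://en.wikipedia.org/w/api.php?{params}"
with urllib.request.urlopen(url) as response:
    results = json.loads(response.read().decode("utf-8"))

for hit in results["query"]["search"]:
    print(hit["title"])  # articles that mention the DOI
```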

How to find altmetrics for data packages and papers

Aside from looking at each platform that offers altmetrics indicators, consider using an aggregator, which will compile them from across the web. Most altmetrics aggregators can track altmetrics for any dataset that either has a DOI or is included in a repository connected to the aggregator. Each aggregator tracks slightly different metrics, as we discussed above. For a full list of metrics, visit each aggregator’s site.

Impactstory easily tracks altmetrics for data uploaded to Figshare, GitHub, Dryad, and PLOS journals. Connect your Impactstory account to Figshare and GitHub and it will auto-import the products stored there and find altmetrics for them. To find metrics for Dryad datasets and PLOS supplementary data, provide DOIs when adding products one-by-one to your profile, and the associated altmetrics will be imported. Here’s an example of what altmetrics for a dataset stored on Dryad look like on Impactstory.

PlumX tracks similar metrics, and offers the added benefit of tracking altmetrics for data stored in institutional repositories as well. If your university subscribes to PlumX, contact the PlumX team about getting your data included in your researcher profile. Here’s what altmetrics for a dataset stored on Figshare look like on PlumX.

Altmetric.com can track metrics for any dataset that has a DOI or Handle. To track metrics for your dataset, you’ll need either an institutional subscription to Altmetric or the Altmetric bookmarklet, which you can use on the item page for your dataset on a website like Figshare or in your institutional repository. Here’s what altmetrics for a dataset stored on Figshare look like on Altmetric.com.
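If you’d like to check a single DOI programmatically, Altmetric also exposes a free public API for basic DOI lookups (no key needed at the time of writing). Here’s a minimal sketch with a placeholder DOI; a 404 response simply means Altmetric has no record for that DOI yet.

```python
import json
import urllib.error
import urllib.request

doi = "10.5061/dryad.example"  # placeholder DOI – substitute your dataset's DOI
url = f"https://api.altmetric.com/v1/doi/{doi}"

try:
    with urllib.request.urlopen(url) as response:
        data = json.loads(response.read().decode("utf-8"))
    print("Altmetric score:", data.get("score"))
    print("Tweets:         ", data.get("cited_by_tweeters_count"))
    print("Blogs:          ", data.get("cited_by_feeds_count"))
except urllib.error.HTTPError as err:
    # Altmetric returns 404 when it has no record of the DOI.
    print("No Altmetric record found." if err.code == 404 else err)
```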

Flavors of data impact

While scholarly impact is very important, it’s far from the only type of impact one’s research can have. Both data citations and altmetrics can be useful in illustrating these other flavors of impact. Take the following scenarios, for example.

Useful for teaching

What if your field notebook data was used to teach undergraduates how to use and maintain their own field notebooks, and to collect data with them? Or what if a longitudinal dataset you created was used to help graduate students learn the programming language R? These examples are fairly common in practice, and yet they’re often not counted when considering impacts. Potential impact metrics could include full-text mentions in syllabi, views and downloads in Open Educational Resource repositories, and GitHub forks.

Reuse for new discoveries

Researcher, open data advocate, and Impactstory co-founder Heather Piwowar once noted, “the potential benefits of data sharing are impressive: less money spent on duplicate data collection, reduced fraud, diverse contributions, better tuned methods, training, and tools, and more efficient and effective research progress.” If those outcomes aren’t indicative of impact, I don’t know what is! Potential impact metrics could include data citations in the scholarly literature, GitHub forks, and blog post and Wikipedia mentions.

Curator-related metrics

Could a view-to-download ratio be an indicator of how well a dataset has been described, or of how usable a repository’s UI is? Or of the overall appropriateness of the dataset for inclusion in the repository? Weber et al. (2013) recently proposed a number of indicators that could get at these and other curatorial impacts on research data, indicators closely related to those previously proposed by Ingwersen and Chavan (2011) for the GBIF repository. Potential impact metrics could include those proposed by Weber et al. and Ingwersen & Chavan, as well as a repository-based view-to-download ratio (a toy calculation is sketched below).
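As that toy calculation shows, the ratio itself is trivial to compute once a repository exposes its usage counts; the numbers below are invented.

```python
# Invented usage counts – substitute the statistics your repository platform reports.
views = 1250
downloads = 310

view_to_download_ratio = views / downloads if downloads else float("inf")
print(f"View-to-download ratio: {view_to_download_ratio:.1f}")
# A lower ratio (more downloads per view) may suggest that the record's
# description is doing a good job of converting viewers into users.
```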

Ultimately, more research is needed into altmetrics for datasets before these flavors–and others–are accurately captured.

Now that you know about data metrics, how will you use them?

Some options include: in grant applications, your tenure and promotion dossier, and to demonstrate the impacts of your repository to administrators and funders. I’d love to talk more about this on Twitter or in the comments below.

Recommended reading

  • Piwowar HA, Vision TJ. (2013) Data reuse and the open data citation advantage. PeerJ 1:e175 doi: 10.7717/peerj.175

  • CODATA-ICSTI Task Group. (2013). Out of Cite, Out of Mind: The current state of practice, policy, and technology for the citation of data [report]. doi:10.2481/dsj.OSOM13-043

  • Costas, R., Meijer, I., Zahedi, Z., & Wouters, P. (2013). The Value of research data: Metrics for datasets from a cultural and technical point of view. Copenhagen, Denmark. Knowledge Exchange. www.knowledge-exchange.info/datametrics
