How big does our text-mining training set need to be?

We got some great feedback from reviewers of our new Sloan grant, including a suggestion that we be more transparent about our process over the course of the grant. We love that idea, and you’re now reading part of our plan for how to do that: we’re going to be blogging a lot more about what we learn as we go.

A big part of the grant is using machine learning to automatically discover mentions of software use in the research literature. It’s going to be a really fun project because we’ll get to play around with some of the very latest in ML, which is currently The Hotness everywhere you look. And we’re learning a lot as we go. One of the first questions we’ve tackled (also in response to some good reviewer feedback) is: how big does our training set need to be? The machine learning system needs to be trained to recognize software mentions, and to do that we need to give it a set of annotated papers where we, as humans, have marked what a software mention looks like (and doesn’t look like). That training set is called the gold standard. It’s what the machine learning system learns from. Below is copied from one of our reviewer responses:

We came up with the number of articles to annotate through a combination of theory, experience, and intuition.  As usual in machine learning tasks, we considered the following aspects of the task at hand:

  • prevalence: the number of software mentions we expect in each article
  • task complexity: how much do software-mention words look like other words we don’t want to detect
  • number of features: how many different clues will we give our algorithm to help it decide whether each word is a software mention (e.g., is it a noun, is it in the Acknowledgements section, is it a mix of uppercase and lowercase, etc.)
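To make the idea of a "feature" concrete, here is a minimal sketch of two of the clues named in the list above, computed for a single word. The function name, the dictionary keys, and the example inputs are hypothetical and chosen only for illustration; the part-of-speech clue is left out because it would need a tagger, and this is not the feature set the project actually uses.

```python
def token_features(token, section):
    """Return a few simple clues about one word (illustrative only)."""
    has_upper = any(c.isupper() for c in token)
    has_lower = any(c.islower() for c in token)
    return {
        # True for words like "ImageJ" (and also for any capitalized word)
        "mixed_case": has_upper and has_lower,
        # True when the word appears in the Acknowledgements section
        "in_acknowledgements": section.lower() == "acknowledgements",
    }


print(token_features("ImageJ", "Methods"))
```

Each such clue becomes one input to the learning algorithm, which is why the number of clues matters for the sample-size estimate.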

None of these aspects are clearly understood for this task at this point (one outcome of the proposed project is that we will understand them better once we are done, for future work), but we do have rough estimates.  Software mention prevalence will be different in each domain, but we expect roughly 3 mentions per paper, very roughly, based on previous work by Howison et al. and others.  Our estimate is that the task is moderately complex, based on the moderate f-measures achieved by Pan et al. and Duck et al. with hand-crafted rules.  Finally, we are planning to give our machine learning algorithm about 100 features (50 automatically discovered/generated by word2vec, plus 50 standard and rule-based features, as we discuss in the full proposal).

We then used these estimates.  As is common in machine learning sample size estimation, we started by applying a rule of thumb for the number of articles we’d have to annotate if we were to use the simplest algorithm, a multiple linear regression.  A standard rule of thumb (see https://en.wikiversity.org/wiki/Multiple_linear_regression#Sample_size) is that 10-20 datapoints are needed for each feature used by the algorithm, which, at the low end, implies we’d need 100 features * 10 datapoints = 1000 datapoints.  At 3 datapoints (software mentions) per article, this rule of thumb suggests we’d need 333 articles per domain.
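The arithmetic above can be written out as a short sketch. The function name and parameter names are hypothetical; the inputs (100 features, 10-20 datapoints per feature, 3 mentions per article) come from the text. The high-end figure is not computed in the text; it is simply the same rule applied with 20 datapoints per feature.

```python
def articles_needed(n_features, datapoints_per_feature, mentions_per_article):
    """Rule-of-thumb number of articles to annotate, rounded to the nearest article."""
    datapoints = n_features * datapoints_per_feature
    return round(datapoints / mentions_per_article)


# Low and high ends of the 10-20 datapoints-per-feature rule of thumb.
low = articles_needed(100, 10, 3)
high = articles_needed(100, 20, 3)
print(low, high)  # prints: 333 667
```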

From there we modified our estimate based on our specific machine learning circumstance.  Conditional Random Fields (our intended algorithm) is a more complex algorithm than multiple linear regression, which might suggest we’d need more than 333 articles.  On the other hand, our algorithm will also use “negative” datapoints inherent in the article (all the words in the article that are *not* software mentions, annotated implicitly as not software mentions) to help learn information about what is predictive of being vs not being a software mention — the inclusion of this kind of data for this task means our estimate of 333 articles is probably conservative and safe.
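As a rough illustration of the implicit "negative" datapoints described above, the sketch below labels every word in a sentence, marking anything a human did not annotate as a non-mention. The example sentence, the label names, and the function name are all hypothetical, and a real sequence-labeling setup would likely use a richer tagging scheme.

```python
def label_tokens(tokens, mention_indices):
    """Label annotated positions as SOFTWARE and every other word as O (not a mention)."""
    return ["SOFTWARE" if i in mention_indices else "O" for i in range(len(tokens))]


# Hypothetical sentence in which a human annotator marked only "ImageJ" (position 5).
tokens = "We analyzed the images with ImageJ .".split()
labels = label_tokens(tokens, {5})

positives = labels.count("SOFTWARE")
negatives = labels.count("O")
print(positives, negatives)  # prints: 1 6
```

Even this one annotated mention comes with six words that the algorithm can learn from as examples of what a software mention does not look like.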

Based on this, as well as reviewing the literature for others who have done similar work (Pan et al. used a gold standard of 386 papers to learn their rules, Duck et al. used 1479 database and software mentions to train their rule weighting, etc), we determined that 300-500 articles per domain was appropriate. We also plan to experiment with combining the domains into one general model — in this approach, the domain would be added as an additional feature, which may prove more powerful overall. This would bring all 1000-1500 articles to the test set.

Finally, before proposing 300-500 articles per domain, we did a gut-check on whether the proposed annotation burden was a reasonable amount of work and cost for the value of the task, and we felt it was.

References

Duck, G., Nenadic, G., Filannino, M., Brass, A., Robertson, D. L., & Stevens, R. (2016). A Survey of Bioinformatics Database and Software Usage through Mining the Literature. PLOS ONE, 11(6), e0157989. http://doi.org/10.1371/journal.pone.0157989

Howison, J., & Bullard, J. (2015). Software in the scientific literature: Problems with seeing, finding, and using software mentioned in the biology literature. Journal of the Association for Information Science and Technology (JASIST), Article first published online: 13 MAY 2015. http://doi.org/10.1002/asi.23538

Pan, X., Yan, E., Wang, Q., & Hua, W. (2015). Assessing the impact of software on science: A bootstrapped learning of software entities in full-text papers. Journal of Informetrics, 9(4), 860–871. http://doi.org/10.1016/j.joi.2015.07.012

Comparing Sci-Hub and oaDOI

Nature writer Richard Van Noorden recently asked us for our thoughts about Sci-Hub, since in many ways it’s quite similar to our newest project, oaDOI. We love the idea of comparing the two, and thought he had (as usual) good questions. His recent piece on Sci-Hub founder Alexandra Elbakyan quotes some of our responses to him; we’re sharing the rest below:

Like many OA advocates, we see lots to admire in Sci-Hub.

First, of course, Sci-Hub is making actual science available to actual people who otherwise couldn’t read it. Whatever else you can say about it, that is a Good Thing.

Second, Sci-Hub helps illustrate the power of universal OA. Imagine a world where when you wanted to read science, you just…did? Sci-Hub gives us a glimpse of what that will look like, when universal, legal OA becomes a reality. And that glimpse is powerful, a picture that’s worth a thousand words.

Finally, we suspect and hope that Sci-Hub is currently filling toll-access publishers with roaring, existential panic. Because in many cases that’s the only thing that’s going to make them actually do the right thing and move to OA models.

All this said, Sci-Hub is not the future of scholarly communication, and I think you’d be hard pressed to find anyone who thinks it is. The future is universal open access.

And it’s not going to happen tomorrow. But it is going to happen. And we built oaDOI to be a step along that path. While we don’t have the same coverage as Sci-Hub, we are sustainable and built to grow, along with the growing percentage of articles that have open access versions. And as you point out, we offer a simple, straightforward way to get fulltext.

That interface was not exactly inspired by Sci-Hub, but rather I think an example of convergent evolution. The current workflow for getting scholarly articles is, in many cases, absolutely insane. Of course this is the legacy of a publishing system that is built on preventing people from reading scholarship, rather than helping them read it. It doesn’t have to be this hard. Our goal at oaDOI is to make it less miserable to find and read science, and in that we’re quite similar to Sci-Hub. We just think we’re doing it in a way that’s more powerful and sustainable over the long term.

Better than a free Ferrari: Why the coming altmetrics revolution needs librarians

This post was originally published as the foreword to Meaningful Metrics: A 21st Century Librarian’s Guide to Bibliometrics, Altmetrics, and Research Impact [paywall, embargoed for 6mo]. It’s also persistently archived on figshare.

A few days ago, we were speaking with an ecologist from Simon Fraser University here in Vancouver, about an unsolicited job offer he’d recently received. The offer included an astonishing inducement: anyone from his to-be-created lab who could wangle a first or corresponding authorship of a Nature paper would receive a bonus of one hundred thousand dollars.

Are we seriously this obsessed with a single journal? Who does this benefit? (Not to mention, one imagines the unfortunate middle authors of such a paper, trudging to a rainy bus stop as their endian-authoring colleagues roar by in jewel-encrusted Ferraris.)  Although it’s an extreme case, it’s sadly not an isolated one. Across the world, A Certain Kind of administrator is doubling down on 20th-century, journal-centric metrics like the Impact Factor.

That’s particularly bad timing, because our research communication system is just beginning a transition to 21st-century communication tools and norms. We’re increasingly moving beyond the homogeneous, journal-based system that defined 20th century scholarship.

Today’s scholars increasingly disseminate web-native scholarship. For instance, Jason’s 2008 tweet coining the term “altmetrics” is now more cited than some of his peer-reviewed papers. Heather’s openly published datasets have gone on to fuel new articles written by other researchers. And like a growing number of other researchers, we’ve published research code, slides, videos, blog posts, and figures that have been viewed, reused, and built upon by thousands all over the world. Where we do publish traditional journal papers, we increasingly care about broader impacts, like citation in Wikipedia, bookmarking in reference managers, press coverage, blog mentions, and more. You know what’s not capturing any of this? The Impact Factor.

Many researchers and tenure committees are hungry for alternatives, for broader, more diverse, more nuanced metrics. Altmetrics are in high demand; we see examples at Impactstory (our altmetrics-focused non-profit) all the time. Many faculty share how they are including downloads, views, and other alternative metrics in their tenure and promotion dossiers, and how evaluators have enthused over these numbers. There’s tremendous drive from researchers to support us as a nonprofit, from faculty offering to pay hundreds of extra dollars for profiles, to a Senegalese postdoc refusing to accept a fee waiver. Other altmetrics startups like Plum Analytics and Altmetric.com can tell you similar stories.

At higher levels, forward-thinking policy makers and funders are also seeing the value of 21st-century impact metrics, and are keen to realize their full potential. We’ve been asked to present on 21st-century metrics at the NIH, NSF, the White House, and more. It’s not these folks who are driving the Impact Factor obsession; on the contrary, we find that many high-level policy-makers are deeply disappointed with 20th-century metrics as we’ve come to use them. They know there’s a better way.

But many working scholars and university administrators are wary of the growing momentum behind next-generation metrics. Researchers and administrators off the cutting edge are ill-informed, uncertain, afraid. They worry new metrics represent Taylorism, a loss of rigor, a loss of meaning. This is particularly true among the majority of faculty who are less comfortable with online and web-native environments and products. But even researchers who are excited about the emerging future of altmetrics and web-native scholarship have a lot of questions. It’s a new world out there, and one that most researchers are not well trained to negotiate.

We believe librarians are uniquely qualified to help. Academic librarians know the lay of the land, they keep up-to-date with research, and they’re experienced providing leadership to scholars and decision-makers on campus. That’s why we’re excited that Robin and Rachel have put this book together. To be most effective, librarians need to be familiar with the metrics research, which is currently advancing at breakneck speed. And they need to be familiar with the state of practice–not just now, but what’s coming down the pike over the next few years. This book, with its focus on integrating research with practical tips, gives librarians the tools they need.

It’s an intoxicating time to be involved in scholarly communication. We’ve begun to see the profound effect of the Web here, but we’re just at the beginning. Scholarship is on the brink of a Cambrian explosion, a breakneck flourishing of new scholarly products, norms, and audiences. In this new world, research metrics can be adaptive, subtle, multi-dimensional, responsible. We can leave the fatuous, ignorant use of Impact Factors and other misapplied metrics behind us. Forward-thinking librarians have an opportunity to help shape these changes, to take their place at the vanguard of the web-native scholarship revolution. We can make a better scholarship system, together. We think that’s even better than that free Ferrari.

Impact Challenge Day 16: Post your preprints

Today, we’ll expand on self-archiving your articles to cover how you can make your article preprints available online.

“Publishing” your preprints has been popular in disciplines like physics for a while, and it’s starting to catch on in other fields, too. It’s easy to see why: publishing preprints gets your work out right away, while still letting you publish the formally peer-reviewed version later. That has some big advantages:

  • You establish intellectual precedence for your ideas
  • You can start accumulating citations right away
  • You can get early feedback from colleagues
  • It helps research in your field move more quickly

In today’s challenge, we’ll correct some common misconceptions about sharing preprints, and discuss your options for where to post them. Let’s get down to it!

Preprints – facts vs. fiction

FACT: Posting preprints makes your research freely available to all

You can get the “prestige” of publishing with certain toll-access journals while still archiving your work in places where the public and other scholars can access it. That access means that others can cite your work before it’s been formally published, getting you more citations. (More on that in a moment.) More importantly, that access fulfills your duty to science and humankind: to advance knowledge for all.

FICTION: Journals won’t publish your work if it’s already been posted online

It’s a common misconception that if you post your preprints online before they’ve been published, most journals won’t allow you to publish them formally, citing “prior publication.” As ecologist Ethan White points out,

The vast majority of publication outlets do not believe that preprints represent prior publication, and therefore the publication ethics of the broader field of academic publishing clearly allows this. In particular Science, Nature, PNAS, the Ecological Society of America, the Royal Society, Springer, and Elsevier all generally allow the posting of preprints.

And some publishers (PLOS, PeerJ, and eLife, among others) even encourage the posting of preprints! You can check this list of preprint policies or Sherpa/Romeo to find out what the policies are for your journal of choice. If you’re still unsure, contact your journal’s editors for more information.

FACT: Preprints can accumulate citations that traditional articles can’t

A major advantage of preprints is the speed with which they can accumulate citations. Scientists report having their preprints cited in articles that come out before their own are formally published, and citing others’ preprints ahead of those articles’ formal publication. Would you prefer that others didn’t cite your preprint, and waited for the final copy? That’s as easy as adding a warning to the header of your article (as we see here and here).

FICTION: You’ll get scooped

Some worry that if their results are online before publication, others will be able to scoop them by publishing a similar study. Yet, researchers share their work all the time at conferences without similar worries, and in fact having a digital footprint that proves you’ve established intellectual precedence can prevent scooping.

As paleontologist Mike Taylor points out, “I can’t think of anyone who would be barefaced enough to scoop [something] that had already been published on arXiv…If they did, the whole world would know unambiguously exactly what had happened.”

FACT: Preprints can advance science much more rapidly than traditional publishing can

By posting your preprints, others can more quickly build upon your work, accelerating science and discovery.  After all, it can take years for papers to be published after their acceptance. And that can lead to situations like Mike Taylor’s:

We wrote the bulk of the neck-anatomy paper back in 2008 — the year that we first submitted it to a journal. In the four years since then, all the observations and deductions that it contains have been unavailable to the world. And that is stupid.

Preprints will help you avoid four-year (!) publication delays.

FACT: Preprints aren’t rigorously peer reviewed

It’s 100% true that most preprints aren’t peer reviewed beyond a simple sanity check before going online for the world to see. It’s possible that the lack of peer review means that incorrect results could get circulated, leading to confusion or misinformation down the line. (Of course, peer-reviewed work is also often retracted or modified after publication–no one’s perfect ;))   A great tool to manage the versions of a paper, including preprints, is CrossMark, which was invented to provide an easy-to-find breadcrumb trail that leads from the preprint to the peer-reviewed paper to any subsequent, corrected versions of the paper.

FACT: Feedback on your work, before you submit it

If you’re posting your work to a disciplinary preprint server where your colleagues are likely to read it, you can benefit from your community’s constructive feedback ahead of submitting your article for publication. As genomics researcher Nick Loman explains,

[I find very useful] the benefits of publishing to a self-selected audience who are genuinely interested in this subject, and actively wish to read and critique such papers out of professional curiosity, not just because they are lucky/unlucky enough to be selected as peer reviewers.

And even if your work is already in press, you can get feedback on your soon-to-be-published work immediately, rather than months (or years) later when the paper is finally published.

Where to post preprints

Options abound for posting your preprints. Note that some of the following options are considered commercial repositories, and thus might not be eligible for use under some publishers’ conditions.

Figshare

A popular, discipline-agnostic, commercial repository that’s free to use and has a CLOCKSS-backed preservation strategy. Figshare issues DOIs for content it hosts, offers altmetrics (views and shares) to help you track the readership and interest in your preprint, and requires CC-BY licenses for publicly accessible preprints. Figshare’s commenting feature allows for easy, public feedback on your work.

One downside to Figshare is that it’s easy for your preprint to get lost in the mix amongst all the other data, posters, and other scholarly outputs that are shared on the site, from many different disciplines. It’s also a for-profit venture, meaning it wouldn’t meet the non-commercial requirement that some journals have for preprints.

PeerJ PrePrints

A preprint server for the biomedical sciences that’s closely integrated with the Open Access journal, PeerJ. PeerJ PrePrints is free to use and popular in the Open Science community due to its sleek submission interface and the availability of altmetrics. PeerJ PrePrints also offers a commenting feature for feedback.

Like Figshare, PeerJ PrePrints will not meet the “non-commercial” requirement that some journals have for how preprints are shared.

ArXiv

ArXiv is one of the oldest and most famous preprint servers, and it serves mostly the physics, maths, and computational science communities. It’s a non-profit venture run by Cornell University Library, meaning it meets the “non-commercial” requirement of some publishers. By virtue of being a disciplinary repository, it’s a good place to post your work so that others in your field will read it.

Two drawbacks of ArXiv are that it’s not often used by those outside of physics and its other core disciplines, and that it doesn’t offer altmetrics, making it impossible to know the extent to which your work has been viewed and downloaded on the platform.

Ethan White has a great list of preprint servers on his blog; check it out for more preprint server options.

Homework

For today’s homework, you’re going to do some due diligence. Use this list of preprint policies and Sherpa/Romeo or rchive.it to learn what journals in your discipline allow pre-publication archiving, and do some thinking on how you can share your next study prior to publication. That way, when you write your next article, you’ve got a preprint server in mind for it, so you can share it as quickly as possible.

And if you didn’t finish uploading preprints for articles you’ve already published (your homework from yesterday), upload them today. The more content you’ve got online and freely available, the more everyone benefits!

Tomorrow: ORCID identifiers to collect and claim your articles, datasets, and more. Stay tuned!

Impactstory Advisor of the Month: Keita Bando (November 2014)

Our November Impactstory Advisor of the Month is Keita Bando!

[Photo: Keita Bando]

Keita is one of Japan’s best-known Open Access and altmetrics advocates. As a co-founder of MyOpenArchive (a precursor to Figshare), Keita’s been at the vanguard of scholarly communication for much of his career.

We chatted with Keita this week to learn more about his current Mendeley2ORCID project, his advisory roles at other scholcomm startups like Figshare and Mendeley, and his vision for what part librarians can play in helping researchers navigate the ever-expanding world of scholarly communication.

Tell us a bit about your role at MyOpenArchive.

Back in 2007, on the assumption that “there must be a host of unnamed research papers around the world,” we started MyOpenArchive, a web repository where you can post and share your academic works.

Here’s an example of why we started MyOpenArchive: peer-reviewed papers in journals and cited sources are only “a handful of jewels,” so to speak. But these jewels could potentially be found anywhere in academia: graduate school papers, internal research, and so on. Before we started MyOpenArchive, tons of knowledge could not be found outside of institutions. So we created a place where you can post and share all knowledge: MyOpenArchive.

We advocated for open access locally in Japan in the first three years of MyOpenArchive’s life [1][2][3], and in the latter three years we advocated internationally. [4][5][6] These six years of advocacy made substantial impact on researchers who hope to be more open and cited. We closed MyOpenArchive in 2013. [7]

Since the closing of MyOpenArchive, my primary role has been that of an Open Access advocate at large.

Why did you initially decide to join Impactstory?

I encountered altmetrics around 2010. As an Open Access advocate, I support Jason Priem et al’s “altmetrics: a manifesto” and I am convinced that we can use altmetrics to change the open access movement.

I marveled to find how Total-Impact (now ImpactStory) and the Mendeley API-powered ReaderMeter work as altmetrics tools.

It is Total-Impact (ImpactStory) that inspired me to take part in the vanguard of open access, and to be engaged also as a Mendeley Advisor.

Why did you decide to become an Advisor?

Because I was absorbed in altmetrics as a concept, as a tool, and a community of research, I devoted myself to posting some of the first research on altmetrics in Japanese. Also I talked at academic institutional programs, including an event for librarians, and took part in several other open access advocacy events. [8][9][10]

So becoming an Impactstory Advisor was a natural fit, given my interests and advocacy.

I also volunteer my time to support Mendeley, figshare and ORCID as an advisor or an ambassador.[11]

All of my many advisory roles essentially represent my immersion in academic communications around open access and altmetrics. It has been my visiting card at public meetings, to be able to say that “As an advisor I consider altmetrics as…” and so on.

What’s your favorite Impactstory feature?

Above all, I love the T-shirt for advisors! (Look at the picture, please!) 🙂

Sorry, just kidding! (But I love it!)

I found the API-enabled, third-party integration features amazingly helpful. For instance, figshare, GitHub, slideshare are well-integrated with Impactstory.

I am particularly interested in Impactstory’s ORCID integration. Recently I posted on my personal blog how I love how well Mendeley, ORCID and ImpactStory are integrated. I was amazed that a lot of researchers found the post useful, and we had a good discussion.[12]

“Two things you have to do when you publish your academic work,” I suggested. “1. Manage the works you published and 2. Manage the altmetrics”.

First, register your publication on Mendeley. Next, add it on ORCID. But don’t waste your time by doing both manually. (As a researcher, you don’t have enough time, right?) We created Mendeley2ORCID so you don’t have to do it manually.[13] The service allows users to sync “My publications” on Mendeley to “My Works” on ORCID.[14]

It takes a few minutes to sync, then you can sync ORCID to ImpactStory. Once you sync that, you can see how your research works’ altmetrics work.

One important thing to mention: works you add should have a DOI (or ArXiv ID, PubMed ID, etc). If you don’t have one, you can get a DOI by adding it to Figshare. Using a DOI makes it easy to measure social media impacts and citations.

The Mendeley2ORCID syncer is one reason I enjoy taking part in these four advisory/ambassador programs.

You blog and tweet a lot about changes in publishing, digital preservation, and open science. In your expert opinion, what’s the single biggest challenge facing scholarly communication today?

In general, scholars don’t have enough time for scholarly communication–I mean, time to learn how to integrate the many tools out there.

First, there are literature management services like Mendeley and Readcube. Next, there are document collaboration services like writelatex. Then, you’ve got altmetrics services like Altmetric and ImpactStory. Furthermore, you’ve got PeerJ and eLife as OA journals, figshare as a repository and also ORCID.

These scholarly communication tools need some kind of instructors. I suggest librarians and research administrators could be lead instructors. Scholars and research administrators could work together, joining forces to enter the emerging era of scholarly communication. Because a great deal of academic web services are born and fade all the time, it’s hard to keep track of how we can share our academic publications.

That is the long-term vision I hope to make happen. At least we could change how librarians help with scholarly communication tools.

ImpactStory is the key when you enter the emerging era of scholarly communication.

Thanks, Keita!

As a token of our appreciation for Keita’s outreach efforts, we’re sending him an Impactstory t-shirt of his choice from our Zazzle store.

Keita is just one part of a growing community of Impactstory Advisors. Want to join the ranks of some of the Web’s most cutting-edge researchers and librarians? Apply to be an Advisor today!