Green Open Access comes of age

This morning David Prosser, executive director of Research Libraries UK, tweeted, “So we have @unpaywall, @oaDOI_org, PubMed icons – is the green #OA infrastructure reaching maturity?”

We love this observation, and not just because two of the three projects he mentioned are from us at Impactstory 😀. We love it because we agree: Green OA infrastructure is at a tipping point where two decades of investment, a slew of new tools, and a flurry of new government mandates are about to make Green OA the scholarly publishing game-changer.

A lot of folks have suggested that Sci-Hub is scholarly publishing’s “Napster moment,” where the internet finally disrupts a very resilient, profitable niche market. That’s probably true. But just as the music industry shut down Napster, Elsevier will likely be able to shut down Sci-Hub. They’ve got both the money and the legal (though not moral) high ground, and that’s a tough combo to beat.

But the future is what comes after Napster. It’s in the iTunes and the Spotifys of scholarly communication. We’ve built something to help to create this future. It’s Unpaywall, a browser extension that instantly finds free, legal Green OA copies of paywalled research papers as you browse–like a master key to the research literature. If you haven’t tried it yet, install Unpaywall for free and give it a try.

Unpaywall has reached 5,000 active users in our first ten days of pre-release.

But Unpaywall is far from the only indication that we’re reaching a Green OA inflection point. Today is a great day to appreciate this, as there’s amazing Green OA news everywhere you look:

  • Unpaywall reached the 5000 Active Users milestone. We’re now delivering tens of thousands of OA articles to users in over 100 countries, and growing fast.
  • PubMed announced Institutional Repository LinkOut, which links every PubMed article to a free Green copy in institutional repositories where available. This is huge, since PubMed is one of the world’s most important portals to the research literature.
  • The Open Access Button announced a new integration with interlibrary loan that will make it even more useful for researchers looking for open content. Along with the interlibrary loan request, they send instructions to authors to help them self-archive closed publications.

Over the next few years, we’re going to see an explosion in the amount of research available openly, as government mandates in the US, UK, Europe, and beyond take force. As that happens, the raw material is there to build completely new ways of searching, sharing, and accessing the research literature.
We think Unpaywall is a really powerful example: When there’s a big Get It Free button next to the Pay Money button on publisher pages, it starts to look like the game is changing. And it is changing. Unpaywall is just the beginning of the amazing open-access future we’re going to see. We can’t wait!

How to smash an interstellar paywall

Last month, hundreds of news outlets covered an amazing story: seven Earth-sized planets were discovered, orbiting a nearby star. It was awesome. Less awesome: the paper with the details, published in the journal Nature, was paywalled. People couldn’t read it.

That’s messed up. We’re working to fix it, by releasing our new free Chrome extension Unpaywall. Using Unpaywall, you can get access to the article, and millions like it, instantly and legally. Let’s learn more.

First, is this really a problem? Surely Google can find the article. I mean, there might be aliens out there. We need to read about this. Here we go, let’s Google for “seven terrestrial planets nature article.” Great, there it is, first result. Click, and…

What, thirty-two bucks to read!? Well that’s that, I quit.

Or maybe there are some ways around the paywall? Well, you can know someone with access. My pal Cindy Wu helped her journal club out this way, offering on Twitter to email them a copy of the paper. But you have to follow Cindy on Twitter for that to work.

Or you could know the right places to look for access. Astronomers generally post their papers on a free web server called the ArXiv, and sure enough if you search there, you’ll find the Nature paper.  But you have to know about ArXiv for that to work. And check out those Google search results again: ArXiv doesn’t appear.

Most people don’t know Cindy, or ArXiv. And no one’s paying $32 for an article. So the knowledge in this paper, and thousands of papers like it, is locked away from the taxpayers who funded it. Research becomes the private reserve of those privileged few with the money, experience, or connections to get access.

We’re helping to change that.

Install our new, free Unpaywall Chrome extension and browse to the Nature article. See that little green tab on the right of the page? It means Unpaywall found a free version, the one the authors posted to ArXiv. Click the tab. Read for free. No special knowledge or searches or emails or anything else needed. 

Today you’ll find Unpaywall’s green tab on ten million articles, and that number is growing quickly thanks to the hard work of the open-access movement. Governments in the US, UK, Europe, and beyond are increasingly requiring that taxpayer-funded research be publicly available, and as they do Unpaywall will get more and more effective.

Eventually, the paywalls will all fall. Till then, we’ll be standing next to ‘em, handing out ladders. Together with millions of principled scientists, libraries, techies, and activists, we’re helping make scholarly knowledge free to all humans. And whoever else is out there 😀 👽.

Introducing Unpaywall: unlock paywalled research papers as you browse

Last Friday night we tweeted about a new Chrome extension we’ve been working on. It’s called Unpaywall, and it links you to free fulltext as you browse research articles. Hit a paywall? No problem: click the green tab and read it free.

Unpaywall is powered by an index of over ten million legally-uploaded, open-access resources, and it delivers. For example, in a set of 11k recent cancer research articles covered in mainstream media, Unpaywall users were able to read around half of them for free–even without any subscription, and even though most of them were paywalled.

So far the response to Friday’s tweet has been amazing — 500 retweets, and in just a few days we’ve gotten more than 1500 installations: Hockey stick growth!  🙂

 

And we’ve also gotten rave reviews, like this one from Sarah:

Why the excitement?  Finding free, legal, open access is now super easy — it happens automatically.  With the Unpaywall extension, links to open access are automatically available as you browse.

This is useful for researchers like Ethan.  It’s also really helpful for people outside academia, who don’t enjoy the expensive subscription benefits of institutional libraries. This is especially true for nonprofits:

…. and folks working to communicate scholarship to a broader audience:

Go give it a try and see what you think! The official release is April 4th, but you can already install it, learn more, and follow @unpaywall. We’d love your help to spread the word about Unpaywall to your friends and colleagues. Together we can accelerate toward a future of full #openaccess for all!

 

 

 

behind the scenes: cleaning dirty data

Dirty Data.  It’s everywhere!  And that’s expected and ok and even frankly good imho — it happens when people are doing complicated things, in the real world, with lots of edge cases, and moving fast.  Perfect is the enemy of good.

Thanks http://www.navigo.com.au/2015/05/cleaning-out-the-closet-how-to-deal-with-dirty-data/ for the image

Alas it’s definitely behind-the-scenes work to find and fix dirty data problems, which means none of us learn from each other in the process.  So — here’s a quick post about a dirty data issue we recently dealt with 🙂  Hopefully it’ll help you feel comradery, and maybe help some people using the BASE data.

We traced some oaDOI bugs to dirty records from PMC in the BASE open access aggregation database.

Most PMC records in BASE are really helpful — they include the title, author, and link to the full text resource in PMC.  For example, this record lists valid PMC and PubMed urls:

and this one lists the PMC and DOI urls:

The vast majority of PMC records in BASE look like this.  So until last week, to find PMC article links for oaDOI we looked up article titles in BASE and used the URL listed there to point to the free resource.

But!  We learned!  There is sometimes a bug!  This record has a broken PMC url: it lists http://www.ncbi.nlm.nih.gov/pmc/articles/PMC with no PMC id in it (see, look at the URL: there’s nothing about it that points to a specific article, right?).  To get the PMC link you’d have to follow the PubMed link and then click through to PMC from there (the PMC page does exist; here’s the one we wish the BASE record had pointed to).

That’s some dirty data.  And it gets worse.  Sometimes there is no PubMed link at all, like this one (correct PMC link exists):

and sometimes there is no valid URL, so there’s really no way to get there from here:

(pretty cool that PMC lists this article from 1899, eh?  Edge cases for papers published more than 100 years ago seem fair, I’ve gotta admit 🙂 )

Anyway.  We found that this dirty PMC data in BASE is infrequent, but common enough to cause more bugs than we’re comfortable with.  To work around the dirty data we’ve added a step: oaDOI now uses the DOI->PMCID lookup file offered by PMC to find PMC articles we might otherwise miss.  It adds a bit more complexity, but it’s worth it in this case.
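Here’s a rough sketch of that workaround in Python. To be clear, this is an illustration and not oaDOI’s actual code: the record layout, the function names, and the shape of the lookup table are all made up for the example. The idea is just the order of operations: trust a BASE URL only when it names a specific article, and otherwise fall back to the DOI->PMCID lookup.

```python
import re

# A PMC article URL is only useful if it ends in a specific id, e.g. .../PMC1234567
PMC_URL_PATTERN = re.compile(r"ncbi\.nlm\.nih\.gov/pmc/articles/(PMC\d+)/?$")
PMC_ARTICLE_BASE = "http://www.ncbi.nlm.nih.gov/pmc/articles/"


def pmcid_from_url(url):
    """Return the PMC id embedded in a URL, or None if the URL names no article."""
    match = PMC_URL_PATTERN.search(url or "")
    return match.group(1) if match else None


def find_pmc_link(base_record, doi_to_pmcid):
    """Prefer a clean URL from the BASE record; otherwise fall back to the DOI lookup.

    base_record: dict with "urls" (list of str) and optional "doi" (hypothetical shape).
    doi_to_pmcid: dict built from the DOI->PMCID lookup file (hypothetical shape).
    """
    for url in base_record.get("urls", []):
        pmcid = pmcid_from_url(url)
        if pmcid:
            return PMC_ARTICLE_BASE + pmcid

    # Dirty record: no usable PMC URL, so try the lookup table instead.
    doi = (base_record.get("doi") or "").lower()
    pmcid = doi_to_pmcid.get(doi)
    return PMC_ARTICLE_BASE + pmcid if pmcid else None
```

A record whose only URL is the bare http://www.ncbi.nlm.nih.gov/pmc/articles/PMC fails the first check, so it gets resolved through the lookup table whenever its DOI is listed there.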

 

 

So, that’s This Week In Dirty Data from oaDOI!  🙂  Tune in next week for, um, something else 🙂

And don’t forget Open Data Day is Saturday March 4, 2017.   Perfect is the enemy of the good — make it open.

oaDOI integrated into the SFX link resolver

We’re thrilled to announce that oaDOI is now available for integration with the SFX link resolver. SFX, like other OpenURL link resolvers, makes sure that when library users click a link to a scholarly article, they are directed to a copy the library subscribes to, so they can read it.

But of course, sometimes the library doesn’t subscribe. This is where oaDOI comes to the rescue. We check our database of over 80 million articles to see if there’s a Green Open Access version of that article somewhere. If we find one, the user gets directed there so they can read. Adding oaDOI to SFX is like adding ten million open-access articles to a library’s holdings, and it results in a lot more happy users, and a lot more readers finding full text instead of paywalls. Which is kind of our thing.

The best part is, it’s super easy to set up, and of course completely free. Since SFX is used today by over 2000 institutions, we’re really excited about how big a difference this can make.

Edited March 28, 2017. There are now over 600 libraries worldwide using the oaDOI integration, and we’re handling over a million requests for fulltext every day.

 

How big does our text-mining training set need to be?

We got some great feedback from reviewers of our new Sloan grant, including a suggestion that we be more transparent about our process over the course of the grant. We love that idea, and you’re now reading part of our plan for how to do that: we’re going to be blogging a lot more about what we learn as we go.

A big part of the grant is using machine learning to automatically discover mentions of software use in the research literature. It’s going to be a really fun project because we’ll get to play around with some of the very latest in ML, which is currently The Hotness everywhere you look. And we’re learning a lot as we go. One of the first questions we’ve tackled (also in response to some good reviewer feedback) is: how big does our training set need to be? The machine learning system needs to be trained to recognize software mentions, and to do that we need to give it a set of annotated papers where we, as humans, have marked what a software mention looks like (and doesn’t look like). That training set is called the gold standard. It’s what the machine learning system learns from. Below is copied from one of our reviewer responses:

We came up with the number of articles to annotate through a combination of theory, experience, and intuition.  As usual in machine learning tasks, we considered the following aspects of the task at hand:

  • prevalence: the number of software mentions we expect in each article
  • task complexity: how much do software-mention words look like other words we don’t want to detect
  • number of features: how many different clues will we give our algorithm to help it decide whether each word is a software mention (eg is it a noun, is it in the Acknowledgements section, is it a mix of uppercase and lowercase, etc)
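To make “features” a bit more concrete, here is a toy Python sketch of what a few of those clues might look like for a single word. The function name and feature names are invented for illustration, and the part-of-speech tag is passed in by hand rather than computed; this is not the project’s actual feature set.

```python
def token_features(token, pos_tag, section):
    """Turn one word into a dict of simple clues a sequence model could learn from."""
    has_upper = any(ch.isupper() for ch in token)
    has_lower = any(ch.islower() for ch in token)
    return {
        # Assumes Penn Treebank-style tags, where noun tags start with "NN".
        "is_noun": pos_tag.startswith("NN"),
        "in_acknowledgements": section.lower() == "acknowledgements",
        "mixed_case": has_upper and has_lower,
    }
```

Each word in an article would get a dict like this, and the algorithm learns which combinations of clues tend to go with software mentions.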

None of these aspects are clearly understood for this task at this point (one outcome of the proposed project is that we will understand them better once we are done, for future work), but we do have rough estimates.  Software mention prevalence will be different in each domain, but we expect roughly 3 mentions per paper, very roughly, based on previous work by Howison et al. and others.  Our estimate is that the task is moderately complex, based on the moderate f-measures achieved by Pan et al. and Duck et al. with hand-crafted rules.  Finally, we are planning to give our machine learning algorithm about 100 features (50 automatically discovered/generated by word2vec, plus 50 standard and rule-based features, as we discuss in the full proposal).

We then used these estimates.  As is common in machine learning sample size estimation, we started by applying a rule-of-thumb for the number of articles we’d have to annotate if we were to use the most simple algorithm, a multiple linear regression.  A standard rule of thumb (see https://en.wikiversity.org/wiki/Multiple_linear_regression#Sample_size) is 10-20 datapoints are needed for each feature used by the algorithm, which implies we’d need 100 features * 10 datapoints = 1000 datapoints.  At 3 datapoints (software mentions) per article, this rule of thumb suggests we’d need 333 articles per domain.  
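The arithmetic behind that rule of thumb is easy to replay. The small Python sketch below is just a calculator for it (the function name is ours, and rounding to the nearest whole article is an assumption); it also shows the upper end of the 10-20 range, which the text above doesn’t work out.

```python
def articles_needed(n_features, datapoints_per_feature, mentions_per_article):
    """Rule-of-thumb estimate: total datapoints needed, divided by mentions per article."""
    datapoints = n_features * datapoints_per_feature
    return round(datapoints / mentions_per_article)


low = articles_needed(100, 10, 3)   # 10 datapoints per feature
high = articles_needed(100, 20, 3)  # 20 datapoints per feature
print(low, high)  # prints "333 667"
```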

From there we modified our estimate based on our specific machine learning circumstance.  Conditional Random Fields (our intended algorithm) is a more complex algorithm than multiple linear regression, which might suggest we’d need more than 333 articles.  On the other hand, our algorithm will also use “negative” datapoints inherent in the article (all the words in the article that are *not* software mentions, annotated implicitly as not software mentions) to help learn information about what is predictive of being vs not being a software mention — the inclusion of this kind of data for this task means our estimate of 333 articles is probably conservative and safe.

Based on this, as well as reviewing the literature for others who have done similar work (Pan et al. used a gold standard of 386 papers to learn their rules, Duck et al. used 1479 database and software mentions to train their rule weighting, etc), we determined that 300-500 articles per domain was appropriate. We also plan to experiment with combining the domains into one general model — in this approach, the domain would be added as an additional feature, which may prove more powerful overall. This would bring all 1000-1500 articles to the test set.

Finally, before proposing 300-500 articles per domain, we did a gut-check whether the proposed annotation burden was a reasonable amount of work and cost for the value of the task, and we felt it was.

References

Duck, G., Nenadic, G., Filannino, M., Brass, A., Robertson, D. L., & Stevens, R. (2016). A Survey of Bioinformatics Database and Software Usage through Mining the Literature. PLOS ONE, 11(6), e0157989. http://doi.org/10.1371/journal.pone.0157989

Howison, J., & Bullard, J. (2015). Software in the scientific literature: Problems with seeing, finding, and using software mentioned in the biology literature. Journal of the Association for Information Science and Technology (JASIST), Article first published online: 13 MAY 2015. http://doi.org/10.1002/asi.23538

Pan, X., Yan, E., Wang, Q., & Hua, W. (2015). Assessing the impact of software on science: A bootstrapped learning of software entities in full-text papers. Journal of Informetrics, 9(4), 860–871. http://doi.org/10.1016/j.joi.2015.07.012

Comparing Sci-Hub and oaDOI

Nature writer Richard Van Noorden recently asked us for our thoughts about Sci-Hub, since in many ways it’s quite similar to our newest project, oaDOI. We love the idea of comparing the two, and thought he had (as usual) good questions. His recent piece on Sci-Hub founder Alexandra Elbakyan quotes some of our responses to him; we’re sharing the rest below:

Like many OA advocates, we see lots to admire in Sci-Hub.

First, of course, Sci-Hub is making actual science available to actual people who otherwise couldn’t read it. Whatever else you can say about it, that is a Good Thing.

Second, Sci-Hub helps illustrate the power of universal OA. Imagine a world where when you wanted to read science, you just…did? Sci-Hub gives us a glimpse of what that will look like, when universal, legal OA becomes a reality. And that glimpse is powerful, a picture that’s worth a thousand words.

Finally, we suspect and hope that Sci-Hub is currently filling toll-access publishers with roaring, existential panic. Because in many cases that’s the only thing that’s going to make them actually do the right thing and move to OA models.

All this said, Sci-Hub is not the future of scholarly communication, and I think you’d be hard pressed to find anyone who thinks it is. The future is universal open access.

And it’s not going to happen tomorrow. But it is going to happen. And we built oaDOI to be a step along that path. While we don’t have the same coverage as Sci-Hub, we are sustainable and built to grow, along with the growing percentage of articles that have open access versions. And as you point out, we offer a simple, straightforward way to get fulltext.

That interface was not exactly inspired by Sci-Hub, but rather I think an example of convergent evolution. The current workflow for getting scholarly articles is, in many cases, absolutely insane. Of course this is the legacy of a publishing system that is built on preventing people from reading scholarship, rather than helping them read it. It doesn’t have to be this hard. Our goal at oaDOI is to make it less miserable to find and read science, and in that we’re quite similar to Sci-Hub. We just think we’re doing it in a way that’s more powerful and sustainable over the long term.

Collaborating on a $635k grant to improve credit for research software

We’re thrilled to announce Impactstory will be collaborating with James Howison at the University of Texas-Austin on a project to improve research software by helping its creators get proper credit for their work. The project will be funded by a three-year, $635k grant from the Alfred P. Sloan Foundation.

Research software is an essential component of modern science. But the tradition-bound scholarly credit system does not appropriately reward the academic unsung heroes who create research software, putting further development of software-intensive science in jeopardy. Even when software is mentioned, the mentions are often informal, such as URLs in footnotes or just names in text. Howison, working with doctoral student Julia Bullard, found that 63% of mentions in a random sample of 90 biology articles were informal (Howison and Bullard, 2014).

We’re going to help fix that.

We’ll be working with James and his lab to make a huge database of every research software project used in every paper in the biomedicine, astronomy, and economics literatures. This database will be filled in using a deep learning system that’ll automatically extract both formal and informal mentions of software, after being trained on a large, manually-coded gold standard dataset.

We’ll use this database to build and study three cool prototype tools:

  • CiteSuggest will analyze submitted text or code and make recommendations for normalized citations using the software author’s preferred citation,
  • CiteMeAs will help software producers make clear requests for their preferred citations, and
  • Software Impactstory will help software authors demonstrate the scholarly impact of their software in the literature.

We believe these tools will help transform the scholarly reward system into one where software is a first-class research product, and its authors get full academic credit for their work. This in turn will support the software-intensive open science system we need for the future.

The project will build on our experience creating Depsy, a platform to track the scholarly impact of Python and R packages with an emphasis on dependencies, and on James’ extensive experience researching development in open source software and software in science. For lots more detail on the whole thing, check out the submitted proposal (edit Nov 9, 2016:  note this document is not a complete representation of the proposal, since the application and approval process also involved confidential back and forth with reviewers.  The reviewers added great comments and insight that we’re incorporating into the work as we go forward.)

Thank you, Sloan.  Thanks to Program Director Josh Greenberg for his continued advice and encouragement, and to the grant reviewers for well-informed and helpful feedback. And thanks especially to James, who had this idea in the first place, brought us on board, and has been a patient, good-natured, and ingenious collaborator in a lot of hard work already. We can’t wait to get started!

Introducing oaDOI: resolve a DOI straight to OA

Most papers that are free-to-read are available thanks to “green OA” copies posted in institutional or subject repositories.  The fact these copies are available for free is fantastic because anyone can read the research, but it does present a major challenge: given the DOI of a paper, how can we find the open version, given there are so many different repositories?

The obvious answer is “Google Scholar” 🙂  And yup, that works great, and given the resources of Google will probably always be the most comprehensive solution.  But Google’s interface requires an extra search step, and its data isn’t open for others to build tools on top of.

We made a thing to fix that.  Introducing oaDOI:

We look for open copies of articles using the following data sources:

  • The Directory of Open Access Journals to see if it’s in their index of OA journals.
  • CrossRef’s license metadata field, to see if the publisher has reported an open license.
  • Our own custom list of DOI prefixes, to see if it’s in a known preprint repository.
  • DataCite, to see if it’s an open dataset.
  • The wonderful BASE OA search engine to see if there’s a Green OA copy of the article. BASE indexes 90mil+ open documents in 4000+ repositories by harvesting OAI-PMH metadata.
  • Repository pages directly, in cases where BASE was unable to determine openness.
  • Journal article pages directly, to see if there’s a free PDF link (this is great for detecting hybrid OA)
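One way to picture how a list of sources like this can be combined is as a chain of checks. The Python sketch below is only that, a picture: the function names, the result fields, and the example prefix are hypothetical, and the assumption that sources are tried in order and the first hit wins is ours, not a description of oaDOI’s real code.

```python
def find_open_copy(doi, checks):
    """Run each (source_name, check) pair in order and return the first hit.

    Each check is a callable that takes a DOI and returns a URL string or None.
    """
    for source_name, check in checks:
        url = check(doi)
        if url:
            return {"doi": doi, "is_open": True, "url": url, "found_by": source_name}
    return {"doi": doi, "is_open": False, "url": None, "found_by": None}


# Hypothetical stand-ins for two of the data sources listed above.
PREPRINT_PREFIXES = {"10.9999": "https://preprints.example.org/"}


def check_preprint_prefix(doi):
    """Match the DOI prefix (the part before the slash) against a known list."""
    prefix = doi.split("/", 1)[0]
    base = PREPRINT_PREFIXES.get(prefix)
    return base + doi if base else None


def check_nothing(doi):
    """Placeholder for a source that finds no open copy."""
    return None
```

Calling find_open_copy("10.9999/abc", [("journal index", check_nothing), ("doi prefix", check_preprint_prefix)]) falls through the first check and reports the copy found by the second.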

oaDOI was inspired by the really cool DOAI.  oaDOI is a wrapper around the OA detection used by Impactstory. It’s open source of course, can be used as a lookup engine in Zotero, and has an easy and powerful API that returns license data and other good stuff.

Check it out at oadoi.org, let us know what you think (@oadoi_org), and help us spread the word!

What’s your #OAscore?

We’re all obsessed with self-measurement.

We measure how much we’re Liked online. We measure how many steps we take in a day. And as academics, we measure our success using publication counts, h-indices, and even Impact Factors.

But we’re missing something.

As academics, our fundamental job is not to amass citations, but to increase the collective wisdom of our species. It’s an important job. Maybe even a sacred one. It matters. And it’s one we profoundly fail at when we lock our work behind paywalls.

Given this, there’s a measurement that must outweigh all the others we use (and misuse) as researchers: how much of our work can be read?

This Open Access Week, we’re rolling out this measurement on Impactstory. It’s a simple number: what percentage of your work is free to read online? We’d argue that it’s perhaps the most important number associated with your professional life (unless maybe it’s the percentage of your work published with a robust license that allows reuse beyond reading…we’re calculating that too). We’re calling it your Open Access Score.
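The number itself is just a percentage. As a minimal sketch in Python (the paper records and the is_free_to_read field are invented for the example, and this is not the code Impactstory runs):

```python
def open_access_score(papers):
    """Percentage of papers that are free to read, or None if there are no papers."""
    if not papers:
        return None
    open_count = sum(1 for paper in papers if paper["is_free_to_read"])
    return 100.0 * open_count / len(papers)
```

So a researcher with four papers, two of them free to read, would have a score of 50.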

We’d like to issue a challenge to every researcher: find out your open access score, do one thing to raise it, and tell someone you did. It takes ten minutes, and it’s a concrete thing you can do to be proud of yourself as a scholar.

Here’s how to do it:

  1. Make an Impactstory profile. You’ll need a Twitter account and nothing more…it’s free, nonprofit, and takes less than five minutes. Plus along the way you’ll learn cool stuff about how often your research has been tweeted, blogged, and discussed online.
  2. Deposit just one of your papers into an Open Access repository. Again: it’s easy. Here’s instructions.
  3. Once you’re done, update your Impactstory, and see your improved score.
  4. Tweet it. Let your community know you’ve made the world a richer, more beautiful place because you’ve increased the knowledge available to humanity. Just like that. Let’s spread that idea.

Measurement is controversial. It has pros and cons. But when you’re measuring the right things, it can be incredibly powerful. This OA Week, join us in measuring the right things. Find your #OAscore, make it better, tweet it out. If we’re going to measure steps, let’s make them steps that matter.

 

Crossposted on the Open Access Week blog.