How big does our text-mining training set need to be?

We got some great feedback from reviewers of our new Sloan grant, including a suggestion that we be more transparent about our process over the course of the grant. We love that idea, and you’re now reading part of our plan for how to do that: we’re going to be blogging a lot more about what we learn as we go.

A big part of the grant is using machine learning to automatically discover mentions of software use in the research literature. It’s going to be a really fun project because we’ll get to play around with some of the very latest in ML, which is currently The Hotness everywhere you look. And we’re learning a lot as we go. One of the first questions we’ve tackled (also in response to some good reviewer feedback) is: how big does our training set need to be? The machine learning system needs to be trained to recognize software mentions, and to do that we need to give it a set of annotated papers where we, as humans, have marked what a software mention looks like (and doesn’t look like). That training set is called the gold standard. It’s what the machine learning system learns from. The text below is copied from one of our reviewer responses:

We came up with the number of articles to annotate through a combination of theory, experience, and intuition.  As usual in machine learning tasks, we considered the following aspects of the task at hand:

  • prevalence: the number of software mentions we expect in each article
  • task complexity: how much do software-mention words look like other words we don’t want to detect
  • number of features: how many different clues will we give our algorithm to help it decide whether each word is a software mention (eg is it a noun, is it in the Acknowledgements section, is it a mix of uppercase and lowercase, etc)

None of these aspects are clearly understood for this task at this point (one outcome of the proposed project is that we will understand them better once we are done, for future work), but we do have rough estimates.  Software mention prevalence will be different in each domain, but we expect roughly 3 mentions per paper, very roughly, based on previous work by Howison et al. and others.  Our estimate is that the task is moderately complex, based on the moderate f-measures achieved by Pan et al. and Duck et al. with hand-crafted rules.  Finally, we are planning to give our machine learning algorithm about 100 features (50 automatically discovered/generated by word2vec, plus 50 standard and rule-based features, as we discuss in the full proposal).

We then used these estimates.  As is common in machine learning sample size estimation, we started by applying a rule-of-thumb for the number of articles we’d have to annotate if we were to use the most simple algorithm, a multiple linear regression.  A standard rule of thumb (see https://en.wikiversity.org/wiki/Multiple_linear_regression#Sample_size) is 10-20 datapoints are needed for each feature used by the algorithm, which implies we’d need 100 features * 10 datapoints = 1000 datapoints.  At 3 datapoints (software mentions) per article, this rule of thumb suggests we’d need 333 articles per domain.  
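The arithmetic above is simple enough to sketch in a few lines of Python. This is only an illustration of the rule of thumb, not part of our pipeline: the function name is made up for this post, and the second call (20 datapoints per feature) just shows what the upper end of the 10-20 range would imply.

```python
def articles_needed(n_features, datapoints_per_feature, mentions_per_article):
    """Rule-of-thumb estimate of how many articles to annotate for one domain.

    Returns (datapoints required, articles required), with the article
    count rounded to the nearest whole article.
    """
    datapoints = n_features * datapoints_per_feature
    articles = round(datapoints / mentions_per_article)
    return datapoints, articles


# Estimates from the text: about 100 features, about 3 mentions per article.
low_end = articles_needed(100, 10, 3)   # 10 datapoints per feature
high_end = articles_needed(100, 20, 3)  # 20 datapoints per feature (upper end of the rule)

print("low end: ", low_end)
print("high end:", high_end)
```

The low end reproduces the 333 articles per domain quoted above; the high end lands at roughly twice that.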

From there we modified our estimate based on our specific machine learning circumstance.  Conditional Random Fields (our intended algorithm) is a more complex algorithm than multiple linear regression, which might suggest we’d need more than 333 articles.  On the other hand, our algorithm will also use “negative” datapoints inherent in the article (all the words in the article that are *not* software mentions, annotated implicitly as not software mentions) to help learn information about what is predictive of being vs not being a software mention — the inclusion of this kind of data for this task means our estimate of 333 articles is probably conservative and safe.

Based on this, as well as reviewing the literature for others who have done similar work (Pan et al. used a gold standard of 386 papers to learn their rules, Duck et al. used 1479 database and software mentions to train their rule weighting, etc), we determined that 300-500 articles per domain was appropriate. We also plan to experiment with combining the domains into one general model — in this approach, the domain would be added as an additional feature, which may prove more powerful overall. This would bring all 1000-1500 articles to the test set.

Finally, before proposing 300-500 articles per domain, we did a gut-check whether the proposed annotation burden was a reasonable amount of work and cost for the value of the task, and we felt it was.

References

Duck, G., Nenadic, G., Filannino, M., Brass, A., Robertson, D. L., & Stevens, R. (2016). A Survey of Bioinformatics Database and Software Usage through Mining the Literature. PLOS ONE, 11(6), e0157989. http://doi.org/10.1371/journal.pone.0157989

Howison, J., & Bullard, J. (2015). Software in the scientific literature: Problems with seeing, finding, and using software mentioned in the biology literature. Journal of the Association for Information Science and Technology (JASIST), Article first published online: 13 MAY 2015. http://doi.org/10.1002/asi.23538

Pan, X., Yan, E., Wang, Q., & Hua, W. (2015). Assessing the impact of software on science: A bootstrapped learning of software entities in full-text papers. Journal of Informetrics, 9(4), 860–871. http://doi.org/10.1016/j.joi.2015.07.012

Comparing Sci-Hub and oaDOI

Nature writer Richard Van Noorden recently asked us for our thoughts about Sci-Hub, since in many ways it’s quite similar to our newest project, oaDOI. We love the idea of comparing the two, and thought he had (as usual) good questions. His recent piece on Sci-Hub founder Alexandra Elbakyan quotes some of our responses to him; we’re sharing the rest below:

Like many OA advocates, we see lots to admire in Sci-Hub.

First, of course, Sci-Hub is making actual science available to actual people who otherwise couldn’t read it. Whatever else you can say about it, that is a Good Thing.

Second, Sci-Hub helps illustrate the power of universal OA. Imagine a world where when you wanted to read science, you just…did? Sci-Hub gives us a glimpse of what that will look like, when universal, legal OA becomes a reality. And that glimpse is powerful, a picture that’s worth a thousand words.

Finally, we suspect and hope that Sci-Hub is currently filling toll-access publishers with roaring, existential panic. Because in many cases that’s the only thing that’s going to make them actually do the right thing and move to OA models.

All this said, Sci-Hub is not the future of scholarly communication, and I think you’d be hard pressed to find anyone who thinks it is. The future is universal open access.

And it’s not going to happen tomorrow. But it is going to happen. And we built oaDOI to be a step along that path. While we don’t have the same coverage as Sci-Hub, we are sustainable and built to grow, along with the growing percentage of articles that have open access versions. And as you point out, we offer a simple, straightforward way to get fulltext.

That interface was not exactly inspired by Sci-Hub, but rather I think an example of convergent evolution. The current workflow for getting scholarly articles is, in many cases, absolutely insane. Of course this is the legacy of a publishing system that is built on preventing people from reading scholarship, rather than helping them read it. It doesn’t have to be this hard. Our goal at oaDOI is to make it less miserable to find and read science, and in that we’re quite similar to Sci-Hub. We just think we’re doing it in a way that’s more powerful and sustainable over the long term.

Collaborating on a $635k grant to improve credit for research software

We’re thrilled to announce Impactstory will be collaborating with James Howison at the University of Texas-Austin on a project to improve research software by helping its creators get proper credit for their work. The project will be funded by a three-year, $635k grant from the Alfred P. Sloan Foundation.

Research software is an essential component of modern science. But the tradition-bound scholarly credit system does not appropriately reward the academic unsung heroes who create research software, putting further development of software-intensive science in jeopardy. Even when software is mentioned, the mentions are often informal, such as URLs in footnotes or just names in text. Howison, working with doctoral student Julia Bullard, found that 63% of mentions in a random sample of 90 biology articles were informal (Howison and Bullard, 2014).

We’re going to help fix that.

We’ll be working with James and his lab to make a huge database of every research software project used in every paper in the biomedicine, astronomy, and economics literatures. This database will be filled in using a deep learning system that’ll automatically extract both formal and informal mentions of software, after being trained on a large, manually-coded gold standard dataset.

We’ll use this database to build and study three cool prototype tools:

  • CiteSuggest will analyze submitted text or code and make recommendations for normalized citations using the software author’s preferred citation,
  • CiteMeAs will help software producers make clear requests for their preferred citations, and
  • Software Impactstory will help software authors demonstrate the scholarly impact of their software in the literature.

We believe these tools will help transform the scholarly reward system into one where software is a first-class research product, and its authors get full academic credit for their work. This in turn will support the software-intensive open science system we need for the future.

The project will build on our experience creating Depsy, a platform to track the scholarly impact of Python and R packages with an emphasis on dependencies, and on James’ extensive experience researching development in open source software and software in science. For lots more detail on the whole thing, check out the submitted proposal (edit Nov 9, 2016:  note this document is not a complete representation of the proposal, since the application and approval process also involved confidential back and forth with reviewers.  The reviewers added great comments and insight that we’re incorporating into the work as we go forward.)

Thank you, Sloan.  Thanks to Program Director Josh Greenberg for his continued advice and encouragement, and to the grant reviewers for well-informed and helpful feedback. And thanks especially to James, who had this idea in the first place, brought us on board, and has been a patient, good-natured, and ingenious collaborator in a lot of hard work already. We can’t wait to get started!

Introducing oaDOI: resolve a DOI straight to OA

Most papers that are free-to-read are available thanks to “green OA” copies posted in institutional or subject repositories.  The fact these copies are available for free is fantastic because anyone can read the research, but it does present a major challenge: given the DOI of a paper, how can we find the open version, given there are so many different repositories?

The obvious answer is “Google Scholar” 🙂  And yup, that works great, and given the resources of Google will probably always be the most comprehensive solution.  But Google’s interface requires an extra search step, and its data isn’t open for others to build tools on top of.

We made a thing to fix that.  Introducing oaDOI:

We look for open copies of articles using the following data sources:

  • The Directory of Open Access Journals to see if it’s in their index of OA journals.
  • CrossRef’s license metadata field, to see if the publisher has reported an open license.
  • Our own custom list of DOI prefixes, to see if it’s in a known preprint repository.
  • DataCite, to see if it’s an open dataset.
  • The wonderful BASE OA search engine to see if there’s a Green OA copy of the article. BASE indexes 90mil+ open documents in 4000+ repositories by harvesting OAI-PMH metadata.
  • Repository pages directly, in cases where BASE was unable to determine openness.
  • Journal article pages directly, to see if there’s a free PDF link (this is great for detecting hybrid OA)
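One way to picture a lookup over several sources like these is as a cascade: ask each source in turn and stop at the first one that reports an open copy. The sketch below is a toy illustration of that idea only, not oaDOI’s actual code. The checker functions are hypothetical stand-ins that do no network calls, the DOI prefix is made up, and both the ordering and the stop-at-first-hit behavior are assumptions.

```python
from typing import Callable, List, Optional, Tuple

# A checker takes a DOI and returns a URL for an open copy, or None.
Checker = Callable[[str], Optional[str]]


def find_open_copy(doi: str, checkers: List[Tuple[str, Checker]]) -> Optional[Tuple[str, str]]:
    """Try each source in order; return (source name, url) for the first hit."""
    for name, check in checkers:
        url = check(doi)
        if url is not None:
            return name, url
    return None


# --- Hypothetical stand-ins for real data sources (no network access) ---

def check_oa_journal_index(doi: str) -> Optional[str]:
    # A real checker would look the journal up in an index of OA journals.
    return None


def check_preprint_prefix(doi: str) -> Optional[str]:
    # "10.9999/" is a made-up prefix used only for this example.
    known_preprint_prefixes = ("10.9999/",)
    if doi.startswith(known_preprint_prefixes):
        return "https://doi.org/" + doi
    return None


CHECKERS = [
    ("OA journal index", check_oa_journal_index),
    ("preprint prefix list", check_preprint_prefix),
]

print(find_open_copy("10.9999/example.1", CHECKERS))
print(find_open_copy("10.1234/closed.1", CHECKERS))
```

Putting cheap metadata checks before slower page scraping is a natural ordering for a cascade like this, but that ordering is a design choice in the sketch rather than a description of our implementation.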

oaDOI was inspired by the really cool DOAI.  oaDOI is a wrapper around the OA detection used by Impactstory. It’s open source of course, can be used as a lookup engine in Zotero, and has an easy and powerful API that returns license data and other good stuff.

Check it out at oadoi.org, let us know what you think (@oadoi_org), and help us spread the word!

What’s your #OAscore?

We’re all obsessed with self-measurement.

We measure how much we’re Liked online. We measure how many steps we take in a day. And as academics, we measure our success using publication counts, h-indices, and even Impact Factors.

But we’re missing something.

As academics, our fundamental job is not to amass citations, but to increase the collective wisdom of our species. It’s an important job. Maybe even a sacred one. It matters. And it’s one we profoundly fail at when we lock our work behind paywalls.

Given this, there’s a measurement that must outweigh all the others we use (and misuse) as researchers: how much of our work can be read?

This Open Access Week, we’re rolling out this measurement on Impactstory. It’s a simple number: what percentage of your work is free to read online? We’d argue that it’s perhaps the most important number associated with your professional life (unless maybe it’s the percentage of your work published with a robust license that allows reuse beyond reading…we’re calculating that too). We’re calling it your Open Access Score.

We’d like to issue a challenge to every researcher: find out your open access score, do one thing to raise it, and tell someone you did. It takes ten minutes, and it’s a concrete thing you can do to be proud of yourself as a scholar.

Here’s how to do it:

  1. Make an Impactstory profile. You’ll need a Twitter account and nothing more…it’s free, nonprofit, and takes less than five minutes. Plus along the way you’ll learn cool stuff about how often your research has been tweeted, blogged, and discussed online.
  2. Deposit just one of your papers into an Open Access repository. Again: it’s easy. Here’s instructions.
  3. Once you’re done, update your Impactstory, and see your improved score.
  4. Tweet it. Let your community know you’ve made the world a richer, more beautiful place because you’ve increased the knowledge available to humanity. Just like that. Let’s spread that idea.

Measurement is controversial. It has pros and cons. But when you’re measuring the right things, it can be incredibly powerful. This OA Week, join us in measuring the right things. Find your #OAscore, make it better, tweet it out. If we’re going to measure steps, let’s make them steps that matter.

 

Crossposted on the Open Access Week blog.

Data-driven decisions with Net Promoter Score


Today we’re releasing some changes in the way users sign up for Impactstory profiles, based on research we’ve done to learn more about our users. It’s a great opportunity to share a little about what we learned, and to describe the process we used to do this research–both to add some transparency around our decision making, and to maybe help folks looking to do the same sorts of things. There’s lots to share, so let’s get to it:

Meet the Net Promoter Score

As part of our journey to find product-market fit for the Impactstory webapp, we’ve become big fans of the Net Promoter Score (NPS), an increasingly popular way to assess how much value users are getting from one’s product. It’s appealingly simple: we ask users to rate how likely they’d be to recommend Impactstory to a colleague, on a scale of 1-10, and why. Answers of 9-10 are Promoters; answers from 1-6 are Detractors. You subtract %detractors from %promoters and there’s your score.
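For the curious, here is a minimal sketch of that calculation in Python. The function name and the example ratings are invented for illustration; the thresholds are the ones described above (9-10 promoters, 1-6 detractors).

```python
def net_promoter_score(ratings):
    """Compute NPS from a list of 1-10 ratings.

    Promoters rate 9-10, detractors rate 1-6, and 7-8 count toward the
    total but toward neither group. The result ranges from -100 to 100.
    """
    if not ratings:
        raise ValueError("need at least one rating")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)


# Made-up ratings: 4 promoters, 3 detractors, 3 in between.
example_ratings = [10, 9, 9, 8, 7, 7, 6, 5, 10, 3]
print(net_promoter_score(example_ratings))  # 40% promoters - 30% detractors = 10.0
```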

It’s a useful score. It doesn’t measure how much users like you. It doesn’t measure how much they generally support the idea of what you’re doing. It measures how much you are solving real problems for real users, right now. Solving those problems so well that users will put their own reputation on the line and sing your praises to their friends.

Until we’re doing that, we don’t have product-market fit, we aren’t truly making something people want, and we don’t have a sustainable business. Same as any startup.

As a nonprofit, we’ve got lots of people who support what we’re doing and (correctly!) see that we’re solving a huge problem for academia as a whole. So they’ve got lots of good things to say to us. Which: yay. That’s fuel and we love it. But it can disguise the fact that we may not be solving their personal problems. We need to get at that signal, to help us find that all-important product-market fit.

Getting the data

We used Promoter.io to manage creating, sending, and collecting email surveys. It just works and it saved us a ton of time. We recommend it.  Our response rate was 28%, which we figure is pretty good for asking for help via email from people who don’t know you or owe you anything, and without pestering them with any reminders. We sliced and diced users along many dimensions but they all had about the same response rate, which improves robustness of the findings. Since we assume users who have no altmetrics will hate the app, we only sent surveys to users with relatively complete profiles (at least three Impactstory badges).

Once we had responses, we followed up using Intercom, an app that nicely integrates most of our customer communication (feedback, support, etc). We got lots more qualitative feedback this way.

Once we had all our data, we exported the results into a spreadsheet and had us some Pivot Table Fun Time. Here’s the raw data in Google Docs (with identifying attributes removed to protect privacy) in case you’d like to dive into the data yourself.

Finally, we exported loads of user data from our Postgres app database hosted on Heroku. All that got added into the spreadsheet and pivot tables as well.

Here’s what we found

The overall NPS is 26, which is not amazing. But it is good. And encouragingly, it’s much better than we got when we surveyed users about our old, non-free version in March. Getting better is a good sign. We’ll take it.

Users who have made profiles in both versions (new and old) seem to agree. The overall NPS for these users was 58, which is quite strong. In fact, users of the old version were the group with the highest NPS overall in this survey. Since we made a lot of changes in the new app from the old, this wouldn’t have to have been true. It made us happy.

But we wanted more actionable results. So we sliced and diced everyone into subgroups along several dimensions, looking for features that can predict extra-high NPS in future sign-ups.

We found four of these predictive features. As it happens, each predictor changes the NPS of its group by the same amount: your NPS (on average) goes from 15 (ok) to 35 (good) if you

  1. have a Twitter account,
  2. have more than 20 online mentions of some kind (Tweets, Wikipedia, Pinterest, whatever) pointing to your publications,
  3. have made more than 55% of your publications green or gold open access, or
  4. have been awarded more than 6 Impactstory badges.

Of these, (4) is not super useful since it covaries a lot with numbers of mentions (2) and OA percentage (3); after all, we give out badges for both those things. A bit more surprisingly, users who have Twitter are likely to have more mentions per product, and less likely to have blank profiles, meaning Feature 1 accounts for some of the variance in Feature 2. So simply having a Twitter account is one of our best signals that you’ll love Impactstory.
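If you’d like to try this kind of slicing on the raw data yourself, the core of it is just grouping responses by an attribute and computing NPS within each group. Here’s a rough sketch, assuming each response is a dict with a `rating` and a boolean attribute; the field names and the eight responses are made up for illustration and are not our survey data.

```python
from collections import defaultdict


def nps(ratings):
    """NPS on a 1-10 scale: % rating 9-10 minus % rating 1-6."""
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)


def nps_by_group(responses, key):
    """Group responses by the value of `key` and compute NPS per group."""
    groups = defaultdict(list)
    for response in responses:
        groups[response[key]].append(response["rating"])
    return {value: nps(ratings) for value, ratings in groups.items()}


# Toy responses, invented for this example.
toy_responses = [
    {"rating": 10, "has_twitter": True},
    {"rating": 9, "has_twitter": True},
    {"rating": 8, "has_twitter": True},
    {"rating": 6, "has_twitter": True},
    {"rating": 9, "has_twitter": False},
    {"rating": 7, "has_twitter": False},
    {"rating": 5, "has_twitter": False},
    {"rating": 4, "has_twitter": False},
]

print(nps_by_group(toy_responses, "has_twitter"))
```

A pivot table does the same thing; the code version just makes it easy to rerun with a different `key`.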

Surprisingly, having a well-stocked ORCID profile with lots of your works in it doesn’t seem to predict a higher NPS score at all. This was unexpected because we figured the kind of scholcomm enthusiasts who keep their ORCID records scrupulously up-to-date would be more likely to dig the kind of thing we’re doing with Impactstory. Plus they’d have an easier and faster time setting up a profile since their data is super easy for us to import. Good to have the data.

About 60% of responses included qualitative feedback. Analysing these, we found four themes:

  • It should include citations. Makes sense users would want this, given that citations are the currency of academia and all. Alas they ain’t gonna get it, not till someone comes out with an open and complete citation database. Our challenge is to help users be less bummed about this, hopefully by positioning Impactstory as a complement to indexes like Google Scholar rather than a competitor.
  • It’s pretty. That’s good to hear, especially since we want folks to share their profiles, make them part of their online identity. That’s way easier if you think it looks sharp.
  • It’s easy. Also great to hear, because the last version was not very easy, mostly as a result of feature bloat. It hurt to lose some features on this version, so it’s good to see the payoff was there.
  • It puts everything all in one place.  Presumably users were going to multiple places to gather all the altmetrics data that Impactstory puts in one spot. 

Here’s what we did

The most powerful takeaway from all this was that users who have Twitter get more out of Impactstory and like it more. And that makes sense…scholars with Twitter are more likely to be into this whole social media thing, and (in our experience talking with lots of researchers) more ready to believe altmetrics could be a useful tool.

So, we’ll redouble our focus on these users.

The way we’re doing that concretely right away is by changing the signup wizard to start with a “signup with Twitter” button. That’s a big deal because it means you’ll need a Twitter account to sign up, and therefore excludes some potential users. That’s a bummer.

But it’s excluding users who, statistically, are among the least likely to love the app. And it’s making it easier to sign up for the users that are loving Impactstory the most, and most keen to recommend us. That means better word of mouth, a better viral coefficient, and a chance to test a promising hypothesis for achieving product-market fit.

We’re also going to be looking at adding more Twitter-specific features like analysing users’ tweeted content and follower lists. More on that later.

To take advantage of our open-access predictor, we’ll be working hard to reach out to the open access community…we’re already having great informal talks with folks at SPARC and with the OA Button, and are reaching out in other ways as well. More on that later, too.

We’re excited about this approach to user-driven development. It’s something we’ve always valued, but often had a tough time implementing because it has seemed a bit daunting. And honestly, it is a bit daunting. It took a ton of time, and it takes a surprising amount of mental energy to be open-minded in a way that makes the feedback actionable. But overall we’re really pleased with the process, and we’re going to be doing it more, along with these kinds of blog posts to improve the transparency of our decision-making. Looking forward to hearing your thoughts!

Now, a better way to find and reward open access

There’s always been a wonderful connection between altmetrics and open science.

Altmetrics have helped to demonstrate the impact of open access publication. And since the beginning, altmetrics have excited and provoked ideas for new, open, and revolutionary science communication systems. In fact, the two communities have overlapped so much that altmetrics has been called a “school” of open science.

We’ve always seen it that way at Impactstory. We’re uninterested in bean-counting. We are interested in setting the stage for a second scientific revolution, one that will happen when two open networks intersect: a network of instantly-available diverse research products and a network of comprehensive, open, distributed significance indicators.

So along with promoting altmetrics, we’ve also been big on incentives for open access. And today we’re excited that we got a lot better at it.

We’re launching a new Open Access badge, backed by a really accurate new system for automatically detecting fulltext for online resources. It finds not just Gold OA, but also self-archived Green OA, hybrid OA, and born-open products like research datasets.

A lot of other projects have worked on this sticky problem before us, including the Open Article Gauge, OACensus, Dissemin, and the Open Access Button. Admirably, these have all been open-source projects, so we’ve been able to reuse lots of their great ideas.

Then we’ve added oodles of our own ideas and techniques, along with plenty of research and testing. The result? Impactstory is now the best, most accurate way to automatically assess openness of publications. We’re proud of that.

And we know this is just the beginning! Fork our code or send us a pull request if you want to make this even better. Here’s a list of where we check for OA to get you started:

  • The Directory of Open Access Journals to see if it’s in their index of OA journals.
  • CrossRef’s license metadata field, to see if the publisher has uploaded an open license.
  • Our own custom list of DOI prefixes, to see if it’s in a known preprint repo.
  • DataCite, to see if it’s an open dataset.
  • The wonderful BASE OA search engine to see if there’s a Green OA copy of the article.
  • Repository pages directly, in cases where BASE was unable to determine openness.
  • Journal article pages directly, to see if there’s a free PDF link (this is great for detecting hybrid OA)

What’s it mean for you? Well, Impactstory is now a powerful tool for spreading the word about open access. We’ve found that seeing that openness badge–or OH NOES lack of a badge!–on their new profile is powerful for a researcher who might otherwise not think much about OA.

So, if you care about OA: challenge your colleagues to go make a free profile and see how open they really are. Or you can use our API to learn about the openness of groups of scholars (great for librarians, or for a presentation to your department). Just hit the endpoint http://impactstory.org/u/someones_orcid_id to find out the openness stats for anyone.
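If you want to look up a whole group of scholars, a small script can build those URLs for you. The sketch below is an illustration, not official client code. It assumes only the URL pattern given above, checks each ORCID iD’s format and check digit (the standard ISO 7064 MOD 11-2 checksum that ORCID uses) before building a URL, and returns the raw response text, since this post doesn’t spell out the response format. The function names are our own inventions for this example.

```python
import re
import urllib.request

BASE_URL = "http://impactstory.org/u/"  # endpoint pattern from the post
ORCID_PATTERN = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def orcid_check_digit(first_15_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for an ORCID iD."""
    total = 0
    for ch in first_15_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """True if the iD has the right shape and a correct check character."""
    if not ORCID_PATTERN.match(orcid):
        return False
    digits = orcid.replace("-", "")
    return orcid_check_digit(digits[:15]) == digits[15]


def profile_url(orcid: str) -> str:
    """Build the lookup URL for one ORCID iD, rejecting malformed iDs."""
    if not is_valid_orcid(orcid):
        raise ValueError("not a valid ORCID iD: " + repr(orcid))
    return BASE_URL + orcid


def fetch_profile(orcid: str, timeout: float = 10.0) -> str:
    """Fetch the raw response body for one scholar (makes a network call)."""
    with urllib.request.urlopen(profile_url(orcid), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Build URLs for a group without hitting the network.
group = ["0000-0001-6728-7745"]
print([profile_url(orcid) for orcid in group])
```

Validating the iD first means a typo in a spreadsheet of ORCID iDs fails loudly on your side instead of turning into a confusing lookup miss.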

Hit us up with any thoughts or comments, and enjoy!

Why researchers are loving the new Impactstory

We put our heart and soul into the new Impactstory and have been on pins and needles to hear what you think.  Well it’s been a week and the verdict is in — we’re hearing that the new version is awesome, fantastic, and truly excellent, a home run and must-have–an academic profile that’s exciting and relevant.

And so much more. So much more, in fact, that we wanted to take a little break from the frenzied responding, bugfixing, and feature-launching we’ve been doing this week and summarize a bit of what we’ve heard.

What do you like?

A lot of users have appreciated that it now takes seconds and is super easy to set up a profile that’s blazing fast and smooth to use: it’s instant insights about your research.

Unlike speed, beauty is in the eye of the beholder–but our beholders seem delightfully agreed that our new look is great, great, great.  Whether users are calling it fresh or beautifully crafted, or sleek or smooth or snazzy, everyone seems to agree that the new version looks awesome, it looks pretty damn awesome. And we are pretty thrilled to hear that.

They’re enjoying that it’s got some fun 🙂 And, we’re not surprised to hear that people like the new price point of Free, making it easier to recommend to others.  

What’s it good for?

Impactstory helps researchers find impacts of their work beyond just citations. People have found mentions they didn’t know about on Wikipedia, discussion in cool blog posts, and reviews on Faculty of 1000. And not just numbers, but impact across the globe. Not just numbers but connecting with people: for instance user Peter van Heusden tweeted, “Using @Impactstory I discovered someone who is consistently promoting work I’m involved in, but who I had no idea existed!”

All this amounts to more than just a lovely ego boost (although it’s that too!). People are telling us that it’s motivating them to adopt more Open Science practices like uploading research slides to a proper repository, getting an ORCID, adding works to their ORCID profile, and celebrating their non-paper publications.

How are you using it?

People are already sending their Impactstory profiles to their funders, and their funders are loving them.  Researchers have added their new profile to their CV, and are planning on using Impactstory data to define innovative ‘pathway to impact’ for UK grants and in tenure and promotion packets.

Folks are including it in workshops.  And even better — building things with our open data! Check out the ferret.io plugin: it rolled out Impactstory support this week and it’s really cool 🙂

What have we been doing?

We’ve made a bunch of changes this week in response to your feedback:

  • imports all your publications, not just DOIs.  Everything on your ORCID profile now displays in your Impactstory profile, and we’re working on getting more openness and altmetrics data
  • twitter integration
    • connecting twitter updates your profile pic so you don’t have to fight with gravatar
    • you don’t have to enter email manually–even faster signup
    • we’ll be using your twitter feed for achievements in the future
  • there’s a new Open Sesame achievement
  • we changed the scores at the top of the profile beside your picture; they are now counts of your achievements
  • the achievements and the import process are better documented
  • we rolled out dozens of smaller features, usability enhancements, and bugfixes.

What’s next?

We’re on our way to the FORCE16 conference this week.  We’ll be rolling the feedback from the conference along with your continued feedback into continued improvement to the app.

And you?  Join in with everyone showing off their profile, spread the word (this is how we will grow), and if you don’t have a profile, get one, and tell us what you think!

Finally, thanks.

Finally, we’d like to thank the hundreds of passionate people who have helped us with money and with moral support along the way, from our early days till now. It’s safe to say the new Impactstory is a big hit.  It’s our hit, together.

 

The new Impactstory: Better. Freer.

We are releasing a new version of Impactstory!

https://impactstory.org/u/0000-0001-6728-7745

We baked what we’ve learned from hundreds of conversations with researchers into a sleeker, leaner, more useful Impactstory.

Our new Achievements showcase your meaningful accomplishments, not just counts. Our new three-part score helps you track your buzz, engagement, and openness. And next-generation notification emails are improved to tell you what you want to know reliably every week.

And of course we’ve got a slew of other new features as well, including Depsy integration, ORCID sync-on-demand, and full support for mobile.

What’s more, we’re simplifying and streamlining everywhere, eliminating little-used features and doubling down on what users have told us they love. Profile creation is now only via ORCID, we only deal in DOIs, and citation metrics are gone. As a result, creating a profile takes just seconds, our support for diverse research products (preprints, datasets, etc) is bulletproof, and metrics are now consistently clear and up-to-date. Along with a complete code rewrite, these changes make Impactstory faster and more reliable than it’s ever been.

Last but not least, not only are we making Impactstory better: we’re making it cheaper. As in, all the way cheaper. Free!

Why? We heard you love the idea, but not the price–largely because your disciplines or departments aren’t quite ready to use altmetrics for evaluation. We can see this is starting to change, and want to help that change happen as quickly as possible. That means letting as many researchers as possible engage with altmetrics, right now. Free helps that happen.

Alternative sustainability models (like freemium features and new grants) will allow us to continue to build and maintain tools like Impactstory and Depsy to help change how researchers think about understanding and measuring the influence of their work.

Sound good? It is. We think you’ll love it. Go make yourself a profile and see what you learn: https://impactstory.org (and if you’re a current Impactstory subscriber, check your email for migration details).

We think this new Impactstory is the best thing we’ve ever done, and it’s a big step towards creating the open science, altmetrics-powered future we believe in. Thanks for building that future with us. We’re looking forward to hearing what you think!

Let’s value the software that powers science: Introducing Depsy

Today we’re proud to officially launch Depsy, an open-source webapp that tracks research software impact.

We made Depsy to solve a problem:  in modern science, research software is often as important as traditional research papers–but it’s not treated that way when it comes to funding and tenure. There, the traditional publish-or-perish, show-me-the-Impact-Factor system still rules.

We need to fix that. We need to provide meaningful incentives for the scientist-developers who make important research software, so that we can keep doing important, software-driven science.

Lots of things have to happen to support this change. Depsy is a shot at making one of those things happen: a system that tracks the impact of software in software-native ways.

That means not just counting up citations to a hastily-written paper about the software, but actual mentions of the software itself in the literature. It means looking at how software gets reused by other software, even when it’s not cited at all. And it means understanding the full complexity of software authorship, where one project can involve hundreds of contributors in multiple roles that don’t map to traditional paper authorship.

Ok, this sounds great, but how about some specifics. Check out these examples:

  • GDAL is a geoscience library. Depsy finds this cool NASA-funded ice map paper that mentions GDAL without formally citing it. Also check out key author Even Rouault: the project commit history demonstrates he deserves 27% credit for GDAL, even though he’s overlooked in more traditional credit systems.
  • lubridate improves date handling for R. It’s not highly-cited, but we can see it’s making a different kind of impact: it’s got a very high dependency PageRank, because it’s reused by over 1000 different R projects on GitHub and CRAN.
  • BradleyTerry2 implements a probability technique in R. It’s only directly reused by 8 projects—but Depsy shows that one of those projects is itself highly reused, leading to huge indirect impacts. This indirect reuse gives BradleyTerry2 a very high dependency PageRank score, even though its direct reuse is small, and that makes for a better reflection of real-world impact.
  • Michael Droettboom makes small (under 20%) contributions to other people’s research software, contributions easy to overlook. But the contributions are meaningful, and they’re to high-impact projects, so in Depsy’s transitive credit system he ends up as a highly-ranked contributor. Depsy can help unsung heroes like Michael get rewarded.
     

Depsy doesn’t do a perfect job of finding citations, tracking dependencies, or crediting authors (see our in-progress paper for more details on limitations). It’s not supposed to. Instead, Depsy is a proof-of-concept to show that we can do them at all. The data and tools are there. We can measure and reward software impact, like we measure and reward the impact of papers.

Embed impact badges in your GitHub README

Given that, it’s not a question of if research software becomes a first-class scientific product, but when and how. Let’s start having the conversations about when and how (here are some great places for that). Let’s improve Depsy, let’s build systems better than Depsy, and let’s (most importantly) start building the cultural and political structures that can use these systems.

For lots more details about Depsy, check out the paper we’re writing (and contribute!), and of course Depsy itself. We’re still in the early stages of this project, and we’re excited to hear your feedback: hit us up on Twitter, in the comments below, or in the Hacker News thread about this post.

Depsy is made possible by a grant from the National Science Foundation.
Edit, Nov 15 2015: changed the embed image to match the new badge.