Coverage in the Financial Times of OpenAlex and the Sorbonne

The Financial Times recently published an article detailing Sorbonne University’s “radical decision” to switch to OpenAlex for its publication database and bibliometric analytics. The article (behind a paywall, unfortunately 😞) came out a little while ago, but we wanted to highlight it here in case you missed it.

The news comes in the context of “a wider pushback against the current model in academic publishing, where researchers publish and review papers for free but have to buy expensive subscriptions to the journals in which they are published to analyse data relating to their work.” It includes a quote from OurResearch/OpenAlex co-founder and CEO Jason Priem: “We felt there’s a mismatch between the values of the academy and the shareholder boardroom. Research is fundamentally about sharing, while for-profits are fundamentally about capturing and enclosing. We aim to create and sustain research infrastructure that’s truly aligned with . . . the values of the research community.”

Exciting times for OpenAlex and open science!

Jack, Andrew. “Sorbonne’s Embrace of Free Research Platform Shakes up Academic Publishing.” Financial Times, December 27, 2023. https://www.ft.com/content/89098b25-78af-4539-ba24-c770cf9ec7c3.

Assigning Institutions — New England Journal of Medicine Case Study

The New England Journal of Medicine uses a non-standard format when presenting authors and their institutional affiliations, which is a problem when we want to keep track of these links in our data. We developed a custom algorithm to solve this problem, preserving more than a hundred thousand author-institution links.

Linking works, authors, and institutions

Part of a diagram from the OpenAlex docs, showing how authors and institutions are linked to works through authorships.
OpenAlex data has links between works, authors, and institutions.

Works, authors, and institutions are three of the basic entities in the OpenAlex data. Keeping track of the relationships between these entities is one of the core things we do. It’s important that we identify these links correctly, so they can be used for downstream tasks like university research intelligence, ranking, etc. Often, this information comes to us via structured data that is not difficult to ingest. Just as often, however, the data is messy, and using it is not so straightforward.

Affiliation data in the New England Journal of Medicine

Publications from the New England Journal of Medicine (NEJM) are an example of this messiness. Author affiliations in these papers are presented in a format that is human-readable, but not straightforward for a computer to parse automatically. In most other journals, authors are listed alongside their affiliated institutions, and so it is relatively easy for a program to link them together. NEJM does it a different way—as shown in the screenshot of a paper from the journal’s website, institutions are listed together with the initials of the authors, which in turn correspond to the full author names at the top of the paper.

Screenshot of the affiliations of a paper from the New England Journal of Medicine's website.
Author affiliations in NEJM come in a nonstandard format that is not easy for a computer to parse.

We might hope that the structured metadata we get from Crossref would have the data in a more standard format. But alas, this isn’t the case, as shown in the screenshot of data from the Crossref API.

Screenshot of JSON data from the Crossref API
Data about the paper from the Crossref API is also in the nonstandard format.

There are around 170,000 works from this journal. This is a relatively tiny proportion of the total number of works in OpenAlex. However, NEJM is a highly influential journal in medicine, so it’s a priority that we get this right.

Custom OpenAlex solution to assign institutions to NEJM authors

OpenAlex team member Nolan created a bespoke algorithm specifically for NEJM papers to parse the affiliation strings and assign authors to institutions. This rule-based algorithm identifies the author initials that might correspond to the full names, and uses those as a mapping to link each institution to its authors, as shown in the screenshot of the example paper’s data from the OpenAlex API. The full data for this work can be found at https://api.openalex.org/works/W4386208393.
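The core idea can be sketched in a few lines of Python. To be clear, this is an illustrative sketch, not Nolan’s actual implementation: the author names and helper functions are made up, and real NEJM affiliation strings contain ambiguities (shared initials, hyphenated names, middle names that do or don’t appear) that the production algorithm has to handle.

```python
import re

def initials(full_name: str) -> str:
    """Build the initials form NEJM uses, e.g. 'Jane A. Doe' -> 'J.A.D.'"""
    return ".".join(re.findall(r"[A-Z]", full_name)) + "."

def map_initials_to_authors(author_names: list[str]) -> dict[str, str]:
    """Map each initials string back to a full author name, assuming
    (unrealistically) that every mapping is unambiguous."""
    return {initials(name): name for name in author_names}

# Hypothetical author list from a paper's byline
authors = ["Jane A. Doe", "John Q. Public", "Mary Smith"]
mapping = map_initials_to_authors(authors)
# An affiliation string like "Example University (J.A.D., M.S.)" can then
# be split on commas and each initials token looked up in `mapping`.
```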

We have been able to apply this to around 35,000 articles, amounting to 158,000 institutional affiliations. Additionally, we identified about ten thousand raw affiliation strings that we couldn’t match to an institution, but which can still be useful to our users.

The NEJM case is an example of the attention to data and extra effort that is part of the value that OpenAlex hopes to provide. The data can be messy sometimes. It’s our mission to help make sense of it, so the world can have access to high-quality, free and open data.

Screenshot of JSON data from the OpenAlex API
OpenAlex data has institutional affiliations as structured, fully linked data.

Introducing Jason Portenoy, newest full-time team member at OpenAlex

Photo of Jason Portenoy

Hi, I’m Jason Portenoy, and I’m very happy to be joining OurResearch as the newest full-time team member! As a data engineer, I will be focusing my efforts on user engagement and outreach for OpenAlex. It is my responsibility to understand the OpenAlex dataset—its strengths and limitations—and work with the user community to improve it and make it easier to use.

I completed my PhD in Information Science at the University of Washington, studying the use of the scholarly literature as data to curate, explore, and evaluate scientific research. This field—known by various terms including scientometrics, science of science, metascience, and Big Scholarly Data—captivated me from the moment I learned about it. As the scale of scientific output continues to increase well beyond the capacity of any individual to make sense of it, the need for new tools and techniques to help becomes more and more pronounced. Working with Dr. Jevin West at the UW Datalab, I developed these tools and techniques—analyzing and visualizing scholarly data, and building recommender systems to connect scientists to new research and ideas. I extended this work through projects with Semantic Scholar, the Chan Zuckerberg Initiative, and JSTOR.

While working on these tools and analyses, I came to rely on several scholarly data sets, such as Web of Science and Microsoft Academic Graph. Through my experience, I became an advocate for having high-quality, open, and accessible data for researchers and builders to use. A solid foundation of quality data will strengthen all downstream applications, from simple counts and bibliometric statistics, to advanced natural language processing and complex systems approaches.

Joining the OpenAlex team is a fantastic opportunity for me to contribute to the future of scholarly data. When Microsoft decided to end its academic service, many others in the community and I wondered what would come next. It has become clear that OpenAlex will play a key role in the future of this field. I come to this position with technical training as a data engineer and data scientist, as well as experience with scholarly data. My goal is to work with the community of users to continually improve the OpenAlex data and experience. If there’s anything you think I might be able to help with, please let us know!

OpenAlex documentation improvements

It’s a new year and at OurResearch we’re starting off 2023 full steam ahead! We’ve revamped the OpenAlex documentation so that it’s easier to get started, and easier to find the fields and filters that are available in the OpenAlex API. It should take fewer clicks to find what you need.

Poised for growth

The major change we made was to highlight the core entities (works, authors, etc.) in OpenAlex, giving them their own up-front space. OpenAlex grew considerably in 2022, not only in the number of records, but also in the number of ways you can filter, group, and search scholarly data. This new approach provides more room to add and document filters, and lets us better describe the unique search capabilities available for each entity. Overall, it sets us up to grow again in 2023.

Our goal is to maintain friendly and approachable documentation, so hopefully we’ve kept that up as well. If you find something broken, or have some suggested improvements, let us know!

Author search in OpenAlex: improved handling of diacritics within names

We’ve improved the author search feature within OpenAlex, so you get more results when searching for author names that may or may not include diacritics. For example, a search for the name “David Tarragó” will return the same number of results as the version that is converted via Lucene’s ASCII folding filter, which in this case is “David Tarrago”.

When searching with diacritics, results with the queried diacritics are more likely to be ranked towards the top, so the two searches may have slightly different rankings. You can compare the results of the two searches in the API.

These queries return the same number of results, with diacritic and non-diacritic names included. Keep in mind that results are weighted by the author’s works count, so that has an impact on relevance as well.
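For intuition, here is a minimal Python approximation of what the folding step does. The `ascii_fold` helper is illustrative only; Lucene’s actual ASCII folding filter maps many more characters (such as ø and ß) than plain Unicode decomposition covers.

```python
import unicodedata

def ascii_fold(text: str) -> str:
    """Decompose accented characters and drop the combining marks,
    approximating Lucene's ASCII folding filter."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(ascii_fold("David Tarragó"))  # prints: David Tarrago
```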

Why make this change?

When creating the OpenAlex author search capability, it was important for us to honor authors’ names by respecting diacritics, so searching with a diacritic returned results with diacritics. However, this strict approach made it harder to find some authors. We’re comfortable with the compromise of searching with and without diacritics at the same time, while giving priority to the intended search query. Hopefully this improved feature is helpful!

Fetch multiple DOIs in one OpenAlex API request

Did you know that you can request up to 50 DOIs in a single API call? That’s possible due to the OR query in the OpenAlex API and looks like this:

https://api.openalex.org/works?filter=doi:10.3322/caac.21660|https://doi.org/10.1136/bmj.n71|10.3322/caac.21654&mailto=support@openalex.org

We simply separate our DOIs with the pipe symbol ‘|’. That query returns the three works associated with the three DOIs we entered. As you can see in the query, both short-form DOIs and long-form DOIs (as URLs) are supported.

This will save time and resources when requesting many DOIs. The technique works with all IDs in OpenAlex, including OpenAlex IDs and PubMed IDs (PMIDs).

Example with python requests

Let’s write an example Python script to show how we can get DOIs in batches of 50 using requests:

import requests

dois = ["10.3322/caac.21660", "https://doi.org/10.1136/bmj.n71", "10.3322/caac.21654"]
pipe_separated_dois = "|".join(dois)
r = requests.get(f"https://api.openalex.org/works?filter=doi:{pipe_separated_dois}&per-page=50&mailto=support@openalex.org")
works = r.json()["results"]

for work in works:
    print(work["doi"], work["display_name"])

This prints:

https://doi.org/10.3322/caac.21660 Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries
https://doi.org/10.1136/bmj.n71 The PRISMA 2020 statement: an updated guideline for reporting systematic reviews
https://doi.org/10.3322/caac.21654 Cancer Statistics, 2021
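When you have more than 50 DOIs, you can extend the script above by chunking the list into batches of 50 and making one request per batch. Here is a sketch; the `chunked` helper is our own illustration, not part of any library.

```python
import requests

def chunked(seq, size=50):
    """Split a list into consecutive batches of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def fetch_works_by_doi(dois, mailto="support@openalex.org"):
    """Fetch OpenAlex works for an arbitrarily long list of DOIs,
    50 per request (the limit for a single OR filter)."""
    works = []
    for batch in chunked(dois, 50):
        r = requests.get(
            "https://api.openalex.org/works",
            params={
                "filter": "doi:" + "|".join(batch),
                "per-page": 50,
                "mailto": mailto,
            },
        )
        r.raise_for_status()
        works.extend(r.json()["results"])
    return works
```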

Hope this is helpful!

Meet Casey – Now full time with OurResearch

Hi, I’m Casey. I am excited to announce that I am now full time with OurResearch as a software engineer working on OpenAlex and Unpaywall!

My Journey

I freelanced for OurResearch prior to joining full time this summer. With Jason and Heather’s help, I maintained Paperbuzz and Cite-As, and also built out a project to catalog academic journal pricing. Freelancing let me improve my Python and data-management skills so I could tackle bigger projects.

Prior to freelancing I enjoyed a career in the US Air Force, which I am proud of. I’m fortunate to have hundreds of hours as aircrew on multiple aircraft, as well as a variety of technical and leadership assignments. So if you ever want to talk airplanes be ready because I might talk your ear off!

My academic experience comes from my time in university pursuing advanced education.

My Vision with OurResearch

In December I helped build the API and set up Elasticsearch for a project called OpenAlex. That project has continued to grow, and I love to see how many people are using it. My core job with OpenAlex is to provide front-line customer support, as well as maintain and improve the API and search infrastructure. I’m also working on several parts of Unpaywall.

It’s incredible that OurResearch tools are free and openly available. OurResearch shares many of the core values I found in the Air Force: small teams empowered to make decisions, humble and accepting of feedback in order to make things better. That’s why we believe our community of users is invaluable in keeping those tools free, open, and easy to use.

So we will listen to your feedback, fix bugs and implement features quickly, and continue to maintain our documentation so the dataset and APIs are as frictionless as they can be. We welcome and need your help with this mission! So do not hesitate to contact me or the team.

I look forward to improving OpenAlex and Unpaywall, and to meeting those of you using OurResearch products!

– Casey

Unsub – All Publishers Supported

Unsub is a dashboard that helps you reevaluate your big deal’s value and understand your cancellation options.

For the last few years we’ve supported a small set of very large publishers.

One of the most requested features has been support for more publishers.

As of today – right now – we support all publishers.

We heard you, and we’re super excited to get this in your hands. Here are some important details:

  • All publishers are supported. We no longer support specific publishers, but rather we support any publisher.
  • A mix of publishers is supported. This was another oft-requested feature, mostly related to aggregators, and it arose naturally out of our change to support all publishers. Unsub dashboards no longer have logic filtering which titles are in your dashboard by publisher – so it’s just as easy for a dashboard to have titles from one publisher or 20 publishers.
  • Title prices are now required. Supporting all publishers, it’s not feasible for us to collect and update title prices for all of their titles. For existing Unsub packages created before today, we’ve incorporated the public prices we had (for the big 5 we supported: Elsevier, Springer Nature, Wiley, Taylor & Francis, SAGE) into your packages. For new packages moving forward, you’ll have to upload your own title prices. We’ve updated the documentation accordingly.
  • APC report has moved from package to institution level. We have APC data for the big 5 publishers, but now that we’re moving to any publisher, we can no longer provide publisher specific APC reports. However, you can now get an APC report for your institution that includes an estimate of your APC spend for the big 5 publishers (Elsevier, Springer Nature, Wiley, Taylor & Francis, SAGE). See the APC Report documentation page for more.

But we didn’t stop there. Here are some additional features you can use today that we think you’ll enjoy:

  • Packages now have Descriptions. When you log in to Unsub you’ll see evidence of this change straight away. You can use this attribute to record important details about your package, reminding your future self and others. See the docs for more information.
  • Package views now have an Edit Details tab. In this tab you can change the package name and description. See the docs for more information.
  • Packages have an optional filter setup step. This could be used for a variety of use cases, but first and foremost can be used to get back to the state of your package before today’s changes. That is, we no longer filter by publisher. If you had a Wiley package before today you should have only seen titles published by Wiley in your dashboard. However, moving forward, we do not filter by publisher, so that same Wiley package may include some titles from other publishers that were in your COUNTER reports. You can use this new feature to limit the set of titles that appear in your dashboard. See the Upload journal filter documentation page to learn more.

Notes:

  • During testing, we heard that aggregators may not provide a COUNTER 5 TR_J2 file. As we require a TR_J2 file if you choose COUNTER 5 in Unsub, we provide a fake TR_J2 file. Let us know if you run into any issues with this! See the docs page for more info.
  • As we support more publishers, we’ll run into more edge cases. We’ve heard that some publishers only provide a COUNTER 5 TR_J1 file – and do not provide TR_J2, TR_J3, and TR_J4 files. We don’t currently support the COUNTER 5 TR_J1 file. Get in touch if this is something you need.
  • There may be “growing pains” moving from support for 5 publishers to all publishers. For example, journal metadata that’s crucial to Unsub may not be complete for some journals. Please do get in touch if you run into any issues. We’ll be keeping an eye on things and will address problems as they come up.

If you are not a current Unsub subscriber and you’re interested in learning more, schedule a demo or go ahead and purchase.

If you are a current Unsub subscriber, log in, kick the tires, and let us know what you think.

To learn more about all the new features head over to our documentation.

In an upcoming webinar (date to be announced soon) I’ll dive into all the new features and answer any questions.

Unsub Webinar Series

We’re starting an Unsub (https://unsub.org/) webinar series next week!

Why would you want to attend? These webinars should help you get better value from Unsub regardless of whether you want to just understand your options, get a better deal on your big deal, or cancel your big deal. 

Every two weeks we’ll cover a new topic, with two time slots for each topic to serve a wider array of time zones: one in the morning and one in the afternoon, Pacific Standard Time (PST).

If our webinar times don’t work for you, we are planning to record webinars and upload them for anyone to watch on Vimeo (https://vimeo.com/unsub).

Here are the first three topics we’ll cover:

  • Feb 8 & 10: Unsub demo – an overview of the product
  • Feb 22 & 24: Eric Schares demoing Unsub Extender
  • Mar 8 & 10: Deep dive on Unsub scenarios

Other topics are in the works – we’ll announce them soon. Let us know here, elsewhere, or email me (scott@ourresearch.org) if there’s any topics you’d like covered in our webinar series.

The webinar series is free. However, we will require registration so we know how many people are coming and to make it easier for you to remember to attend (i.e., Zoom email confirmation, add to your calendar, etc). 

Our first webinar, Unsub Demo – An Overview of the Product, will take place on Feb 8 and 10.

We’ll put out registration links soon for subsequent webinar topics.

Joining OurResearch to work on Unsub

Why OurResearch?

I’m thrilled to have landed a job with OurResearch working full-time on Unsub. When I was looking for a job this summer I wanted a new experience; I wanted to be challenged and to learn new skills – Unsub was the perfect match. With respect to coding, I moved from 100% R programming to 100% Python. In addition, the domain (tools for librarians) is very different from my previous job (open source software for researchers) – just the big change I wanted. 

Academic Libraries

Despite coming into this job without experience working as a librarian, I’ve always deeply appreciated libraries and the work librarians do. During my time in academia (bachelor’s through postdoc) I benefited a lot from various university libraries (Rice University and Simon Fraser University, among others), and experienced the technological change from print to electronic as ILL requests first came in hardbound, printed form and then transitioned to electronic delivery. I’m excited to be able to help librarians after benefiting from their work for so many years.

What I’ll work on

As the Unsub product owner I’ll make decisions about features, implement those features, fix bugs, do demos for librarians, and of course do lots of support. I’m excited to make Unsub the best tool for librarians to reevaluate big deals and understand their cancellation options. 

Challenges and opportunities

The biggest challenge I see in maintaining Unsub is making sure our forecasts are as accurate as possible. I’ve learned already that it can be difficult to keep track of what publishers are doing with respect to big deals, title by title prices, etc. 

There’s a big, nay huge, opportunity here to push the scholarly literature much further towards open access – while at the same time freeing up library budgets to support more collaborative players in the scholarly publishing community.