behind the scenes: cleaning dirty data

Dirty Data.  It’s everywhere!  And that’s expected and ok and even frankly good imho — it happens when people are doing complicated things, in the real world, with lots of edge cases, and moving fast.  Perfect is the enemy of good.

Thanks http://www.navigo.com.au/2015/05/cleaning-out-the-closet-how-to-deal-with-dirty-data/ for the image

Alas it’s definitely behind-the-scenes work to find and fix dirty data problems, which means none of us learn from each other in the process.  So — here’s a quick post about a dirty data issue we recently dealt with 🙂  Hopefully it’ll help you feel comradery, and maybe help some people using the BASE data.

We traced some oaDOI bugs to dirty records from PMC in the BASE open access aggregation database.

Most PMC records in BASE are really helpful — they include the title, author, and link to the full text resource in PMC.  For example, this record lists valid PMC and PubMed urls:

and this one lists the PMC and DOI urls:

The vast majority of PMC records in BASE look like this.  So until last week, to find PMC article links for oaDOI we looked up article titles in BASE and used the URL listed there to point to the free resource.

But!  We learned!  There is sometimes a bug!  This record has a broken PMC url: it lists http://www.ncbi.nlm.nih.gov/pmc/articles/PMC with no PMC id in it (see, look at the URL, there’s nothing about it that points to a specific article, right?).  To get the PMC link you’d have to follow the PubMed link and then click through to PMC from there.  (The PMC page does exist; here’s the one we wish the BASE record had pointed to.)
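
If you’re pulling PMC urls out of BASE yourself, a check along these lines can flag the broken ones.  This is only a rough sketch, not the actual oaDOI code: the function name and the regex are ours for illustration, and it assumes a good PMC url ends in "PMC" followed by digits (the id in the usage note is made up).

```python
import re

# Assumption: a usable PMC article url ends in "PMC" plus digits,
# optionally followed by a trailing slash.
PMC_ARTICLE_URL = re.compile(r"/pmc/articles/PMC(\d+)/?$", re.IGNORECASE)


def extract_pmcid(url):
    """Return 'PMC<digits>' if the url points at a specific article, else None."""
    match = PMC_ARTICLE_URL.search(url.strip())
    if match is None:
        return None
    return "PMC" + match.group(1)
```

With this, the broken url above (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC) comes back as None, while a url with a made-up id like .../pmc/articles/PMC1234567/ comes back as "PMC1234567".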

That’s some dirty data.  And it gets worse.  Sometimes there is no PubMed link at all, like this one (correct PMC link exists):

and sometimes there is no valid URL, so there’s really no way to get there from here:

(pretty cool PMC lists this article from 1899, eh?  Edge cases for papers published more than 100 years ago seem fair, I’ve gotta admit 🙂 )

Anyway.  We found this dirty PMC data in BASE is infrequent but common enough to cause more bugs than we’re comfortable with.  To work around the dirty data we’ve added a step: oaDOI now uses the DOI->PMCID lookup file offered by PMC to find PMC articles we might otherwise miss.  Adds a bit more complexity, but worth it in this case.
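
For the curious, the lookup step could look roughly like this.  It’s a minimal sketch rather than the real oaDOI implementation: it assumes the lookup file is a CSV with "DOI" and "PMCID" columns, and the function names are hypothetical.

```python
import csv
import io


def load_doi_to_pmcid(csv_file):
    """Build a dict of lowercased DOI -> PMCID from an open CSV file.

    Assumption: the file has header columns named "DOI" and "PMCID".
    Rows missing either value are skipped.
    """
    lookup = {}
    for row in csv.DictReader(csv_file):
        doi = (row.get("DOI") or "").strip().lower()
        pmcid = (row.get("PMCID") or "").strip()
        if doi and pmcid:
            lookup[doi] = pmcid
    return lookup


def pmc_url_for_doi(doi, lookup):
    """Return a PMC article url for the DOI, or None if it isn't in the lookup."""
    pmcid = lookup.get(doi.strip().lower())
    if pmcid is None:
        return None
    return "http://www.ncbi.nlm.nih.gov/pmc/articles/" + pmcid
```

DOIs are lowercased on both sides because they are case-insensitive, so "10.1234/ABC" and "10.1234/abc" should hit the same row.  To try it without downloading anything, feed it a small made-up file such as io.StringIO("DOI,PMCID\n10.1234/example.1,PMC1234567\n").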

So, that’s This Week In Dirty Data from oaDOI!  🙂  Tune in next week for, um, something else 🙂

And don’t forget Open Data Day is Saturday March 4, 2017.   Perfect is the enemy of the good — make it open.
