A brief introduction to SageRx.
Open drug data is not easy to work with.
By "open drug data" we mean data about drugs that is openly available to the public - typically (but not only) from US government sources like the Food and Drug Administration (FDA), National Library of Medicine (NLM), and Centers for Medicare and Medicaid Services (CMS). These organizations do a reasonably good job of presenting and sharing their own siloed data; however, they all seem to use different data formats, structures, and update frequencies.
When we say open drug data is not easy to work with, here are a few examples of what we mean.
NLM DailyMed is a major source of structured product label (SPL) information - you know, those 6-point font paper printouts that sometimes accompany a prescription you get from the pharmacy. This data is very valuable and contains structured information about inactive ingredients, package contents, label images, and more. The problem is that all of this data is stored in XML, using a specific XML template format used by FDA for label submissions, and each SPL’s XML and images are contained within a zip file. So there are 40,000 zip files (one per SPL) inside of four or more other zip files (because putting all of that in one zip file would be unwieldy) and, at the bottom of all that, you still have to parse the XML documents.
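To make the nesting concrete, here is a minimal Python sketch of walking one outer zip file, opening each inner per-SPL zip in memory, and parsing any XML it contains. The function name `iter_spl_documents` is our own for illustration, and the sketch assumes each inner zip holds one or more `.xml` files next to the label images. It only gets you to a parsed XML tree; pulling specific fields out of the SPL template is a separate job.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def iter_spl_documents(outer_zip):
    """Yield (inner_zip_name, xml_name, root_element) for each XML file found.

    `outer_zip` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(outer_zip) as outer:
        for inner_name in outer.namelist():
            # Skip anything in the outer archive that is not itself a zip.
            if not inner_name.lower().endswith(".zip"):
                continue
            # Read the inner zip into memory instead of extracting it to disk.
            inner_bytes = outer.read(inner_name)
            with zipfile.ZipFile(io.BytesIO(inner_bytes)) as inner:
                for member in inner.namelist():
                    # Label images and other files are ignored here.
                    if member.lower().endswith(".xml"):
                        root = ET.fromstring(inner.read(member))
                        yield inner_name, member, root
```

Reading each inner zip into memory avoids writing tens of thousands of small files to disk, at the cost of holding one SPL at a time in RAM.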
FDA National Drug Code (NDC) Directory is our go-to source for NDC-level information, but there are a few problems with how the data is presented. First, in order for the data to be shared as a flat text file, information about things like drug classes, substances, and active ingredient strengths is concatenated into single columns and separated by delimiters (i.e. comma or semicolon delimited). That information would be much more valuable in a normalized relational database format. Second, in order to join FDA NDC data to other data sources from NLM and CMS, you first need to normalize the NDCs to NDC11 format; however, the FDA does not provide a column containing the NDC11-formatted NDC, so you need to normalize it yourself every time you work with these combined sources.
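As an illustration of both problems, here is a small Python sketch. It assumes the commonly described convention that a hyphenated 10-digit NDC comes in a 4-4-2, 5-3-2, or 5-4-1 layout and that NDC11 is the 5-4-2 layout with a leading zero added to whichever segment is short. The function names `to_ndc11` and `split_multi_value` are hypothetical helpers, not part of any FDA tooling.

```python
def to_ndc11(ndc: str) -> str:
    """Convert a hyphenated 10-digit NDC (4-4-2, 5-3-2, or 5-4-1) to 11 digits."""
    parts = ndc.strip().split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected three hyphen-separated digit groups: {ndc!r}")
    labeler, product, package = parts
    if len(labeler) > 5 or len(product) > 4 or len(package) > 2:
        raise ValueError(f"segment too long for the 5-4-2 layout: {ndc!r}")
    # Left-pad each segment to the 5-4-2 layout and drop the hyphens.
    return labeler.zfill(5) + product.zfill(4) + package.zfill(2)


def split_multi_value(value: str, delimiter: str) -> list[str]:
    """Split a concatenated column (e.g. substances) into trimmed items."""
    return [item.strip() for item in value.split(delimiter) if item.strip()]
```

Note that the hyphens matter: once they are stripped from a 10-digit NDC, you can no longer tell which segment needs the padding, which is why the function rejects unhyphenated input instead of guessing.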
CMS National Average Drug Acquisition Cost (NADAC) is arguably the best openly available source of drug pricing information for drugs covered by Medicaid. CMS outsources the collection of pharmacy pricing surveys and then hosts the aggregated data on its website. We have found that this data is often not de-duplicated week to week, which requires an extra cleanup step before working with it. Also, the file name of the CSV file uploaded every week contains the date the file was uploaded. Usually this is a Wednesday, but sometimes it is a Tuesday or a Thursday. This makes automating the weekly download a challenge: not only does the file name change every week, but the date in it does not always fall on the same day of the week.
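One way to cope with both quirks, sketched in Python under some loud assumptions: the file name template `nadac-{:%Y-%m-%d}.csv` and the column names `ndc`, `nadac_per_unit`, and `effective_date` are placeholders we made up for illustration, not the real CMS naming. The idea is to try Wednesday first, then fall back to Tuesday and Thursday, and to keep only the first row seen for each combination of key fields.

```python
from datetime import date, timedelta


def candidate_upload_dates(any_day_in_week: date) -> list[date]:
    """Return likely upload dates for that week: Wednesday, then Tuesday, then Thursday."""
    monday = any_day_in_week - timedelta(days=any_day_in_week.weekday())
    wednesday = monday + timedelta(days=2)
    return [wednesday, wednesday - timedelta(days=1), wednesday + timedelta(days=1)]


def candidate_file_names(any_day_in_week: date,
                         template: str = "nadac-{:%Y-%m-%d}.csv") -> list[str]:
    """Build file names to try, in order. The template is a hypothetical placeholder."""
    return [template.format(d) for d in candidate_upload_dates(any_day_in_week)]


def drop_repeated_rows(rows: list[dict], key_fields: list[str]) -> list[dict]:
    """Keep the first row for each distinct combination of key_fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

A downloader would loop over `candidate_file_names(...)` and stop at the first name that exists. Which fields make a row a "duplicate" is a judgment call that depends on how you intend to use the pricing history.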
NLM RxNorm is the widely agreed-upon standard for open medication terminology. The NLM hosts both an application programming interface (API) and all of the files needed to load RxNorm into a database. Unfortunately, it only provides a guide for loading this data into a MySQL or Oracle database. So if you want to use a different database (PostgreSQL, for instance), you will need to figure out how to work with the RRF-format files yourself. Also, if you are not familiar with APIs or databases and just want to work with this information in a flat file format, you are out of luck. Even if you are familiar with databases, the bulk of the data is contained in three very abstract-sounding tables that don’t follow relational database normalization rules. This makes the barrier to entry high even for someone with a technical background.
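For readers who have not seen an RRF file, here is a minimal Python sketch of reading one. It assumes the usual RRF conventions: no header row, pipe-delimited fields, and a trailing pipe at the end of every line. The `RXNCONSO_COLUMNS` list reflects our understanding of the RXNCONSO.RRF layout and should be checked against the RxNorm technical documentation for the release you download. The function name `parse_rrf` is our own.

```python
# Assumed column order for RXNCONSO.RRF; verify against the RxNorm documentation.
RXNCONSO_COLUMNS = [
    "RXCUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "RXAUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def parse_rrf(lines, columns):
    """Yield one dict per RRF line, mapping column names to string values."""
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Each RRF line ends with a delimiter; drop it so we don't get a phantom column.
        if line.endswith("|"):
            line = line[:-1]
        values = line.split("|")
        if len(values) != len(columns):
            raise ValueError(
                f"line {line_number}: expected {len(columns)} fields, got {len(values)}"
            )
        yield dict(zip(columns, values))
```

Because `parse_rrf` accepts any iterable of lines, it works with an open file handle (`open(path, encoding="utf-8")`), and the resulting dicts can be passed to `csv.DictWriter` for a flat file or to a bulk insert for the database of your choice.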
We could go on, but you probably get the idea. Open drug data holds a lot of value, but reliably accessing this value on an ongoing basis requires either a well thought-out data pipeline infrastructure or a lot of error-prone manual work.
So what are we doing about it?
We built a platform of one-click data pipelines that can automatically extract open drug data that is current up to the day. The pipelines not only load it all into a common database so it’s easier to work with, but also transform it into curated marts containing the polished end result of a complex series of novel combinations and reorganizations of the original drug data.
Oh - and we open sourced it all.
More details about how SageRx works behind the scenes are available in the aptly named
So what’s the big deal?
This is different from the commercial drug database you might be familiar with.
- For one, not only are the code and SQL that do the data transformations completely open source, the documentation is also open and written by pharmacist / developer hybrids who know how to translate pharmacy domain knowledge into developer-friendly concepts.
- Second, it is fairly lightweight, easy to spin up (using Docker), and pretty much runs itself. Even not-super-technical people can add their own custom data transformations just by writing some SQL. And - if you think your work could benefit others - you could even contribute a pull request to the overall open-source SageRx project.
- Lastly, at its core, SageRx is based on open common standards that promote interoperability - instead of licensed, proprietary coding systems that make it difficult to share data between organizations.
To be clear, we’re not a huge organization of people scrubbing the source data and phoning manufacturers to fill in gaps… and it’s not our intention to be one. We want to build something sustainable with very little overhead that might make drug data more accessible and understandable for people who need to work with it.
Who needs it?
Our hope is that SageRx can benefit (at the very least):
- Startup founders
- Data analysts
- Maybe you?
If any of this interests you, please star the repo, join our Slack, and/or shoot us an email. Oh - and please be patient as we try to get our documentation in order. If you have questions or need help getting started in the meantime, the #proj-sagerx channel of our Slack is an excellent resource for support.