
Getting started

Access to open drug data is valuable. Open drug data is not easy to work with.

Background

By "open drug data" we mean data about drugs that is openly available to the public - typically (but not only) from US government sources like the Food and Drug Administration (FDA), National Library of Medicine (NLM), and Centers for Medicare and Medicaid Services (CMS). These organizations do a reasonably good job of presenting and sharing their own siloed data; however, they all seem to use different data formats, structures, and update frequencies.

[Image: an actual image of me trying to explain how to manually combine open drug data from different sources.]

When we say open drug data is valuable but not easy to work with, here are a few examples of what we mean.

NLM DailyMed uses XML for everything and contains tens of thousands of zip files
FDA NDC Directory crams valuable information into a single cell and doesn’t use NDC11 format
CMS NADAC has duplicate rows and is not consistently uploaded on the same day of the week
NLM RxNorm uses RRF-format files and an arcane table and column naming convention
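To make the NDC11 point concrete, here is a minimal Python sketch of the usual padding convention. It assumes the input is a hyphenated 10-digit package NDC in one of the three standard segment layouts (4-4-2, 5-3-2, or 5-4-1) and pads each segment with a leading zero to reach the 11-digit 5-4-2 layout. The function name `ndc10_to_ndc11` is hypothetical and is not taken from the SageRx codebase.

```python
def ndc10_to_ndc11(ndc10: str) -> str:
    """Pad a hyphenated 10-digit NDC to the 11-digit 5-4-2 layout (no hyphens).

    Illustrative helper only; the name and behavior are assumptions,
    not part of SageRx.
    """
    parts = ndc10.strip().split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        raise ValueError(f"expected a hyphenated NDC such as 1234-5678-90, got {ndc10!r}")

    labeler, product, package = parts
    layout = (len(labeler), len(product), len(package))
    if layout not in {(4, 4, 2), (5, 3, 2), (5, 4, 1)}:
        raise ValueError(f"unsupported NDC segment layout {layout} in {ndc10!r}")

    # Exactly one segment is short by one digit; zero-fill each segment
    # to its NDC11 width (labeler 5, product 4, package 2).
    return labeler.zfill(5) + product.zfill(4) + package.zfill(2)


if __name__ == "__main__":
    for example in ("1234-5678-90", "12345-678-90", "12345-6789-0"):
        print(example, "->", ndc10_to_ndc11(example))
```

The hyphens matter here: without them, a bare 10-digit string does not say which segment is short, so the sketch rejects unhyphenated input rather than guess.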

We could go on, but you probably get the idea. Open drug data holds a lot of value, but reliably accessing this value on an ongoing basis requires either a well-thought-out data pipeline infrastructure or a lot of error-prone manual work.

So what are we doing about it?

Introduction

We built a platform of one-click data pipelines that automatically extract up-to-the-day open drug data, load it all into a common database so it’s easier to work with, and transform it into curated marts: the polished end result of a complex series of novel combinations and reorganizations of the original drug data.

Oh - and we open sourced it all.

coderxio/sagerx

SageRx uses Airflow to schedule jobs to extract, load, and transform (using dbt) open drug data.

Data ends up in a PostgreSQL database and can be queried using pgAdmin (included with SageRx) or via any SQL editor of your choice.
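As a rough illustration of the load-then-transform idea, the Python sketch below uses the standard-library `sqlite3` module as a stand-in for PostgreSQL, and plain SQL strings in place of Airflow and dbt. Everything in it is an assumption made for the example: the table names `stg_prices` and `mart_latest_price`, the column names, and the made-up rows (including one exact duplicate, like the duplicate rows mentioned for NADAC). It is not how SageRx names or builds its actual tables.

```python
import sqlite3

# Hypothetical raw rows: (ndc11, effective_date, unit_price).
# The NDCs and prices are invented placeholders, not real drug data.
RAW_ROWS = [
    ("00000000001", "2024-01-03", 0.10),
    ("00000000001", "2024-01-03", 0.10),  # exact duplicate of the row above
    ("00000000001", "2024-01-10", 0.12),
    ("00000000002", "2024-01-03", 1.50),
]


def build_mart(conn: sqlite3.Connection) -> list:
    """Load raw rows into a staging table, then derive a small 'mart' table."""
    # Load: raw rows land unchanged in a staging table.
    conn.execute(
        "CREATE TABLE stg_prices (ndc11 TEXT, effective_date TEXT, unit_price REAL)"
    )
    conn.executemany("INSERT INTO stg_prices VALUES (?, ?, ?)", RAW_ROWS)

    # Transform: drop exact duplicates and keep the most recent price per NDC.
    conn.execute(
        """
        CREATE TABLE mart_latest_price AS
        SELECT d.ndc11, d.effective_date, d.unit_price
        FROM (SELECT DISTINCT ndc11, effective_date, unit_price FROM stg_prices) AS d
        WHERE d.effective_date = (
            SELECT MAX(s.effective_date) FROM stg_prices AS s WHERE s.ndc11 = d.ndc11
        )
        """
    )
    return conn.execute(
        "SELECT ndc11, effective_date, unit_price FROM mart_latest_price ORDER BY ndc11"
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    for row in build_mart(connection):
        print(row)
    connection.close()
```

The transform step is only SQL, which is the part a contributor would write; in this sketch the surrounding Python exists just to make the example self-contained.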

More details about how SageRx works behind the scenes are available in the aptly named 💡How SageRx works.

So what’s the big deal?

SageRx is different from the commercial drug databases you might be familiar with.

For one, not only are the code and SQL for the data transformations completely open source, but the documentation is also open and written by pharmacist / developer hybrids who know how to translate pharmacy domain knowledge into developer-friendly concepts.
Second, it is fairly lightweight, easy to spin up (using Docker), and pretty much runs itself. Even not-super-technical people can contribute by adding their own custom data transformations requiring only SQL. And - if you think your work could benefit others - you could even contribute a pull request to the overall open-source SageRx project.
Lastly, at its core, SageRx is based on open common standards that promote interoperability - instead of licensed, proprietary coding systems that make it difficult to share data between organizations.

To be clear, we’re not a huge organization of people scrubbing the source data and phoning manufacturers to fill in gaps… and that’s not what we intend to be. We want to build something sustainable with very little overhead that might make drug data more accessible and understandable for people who need to work with it.

Who needs it?

Our hope is that SageRx can benefit (at the very least):

Startup founders
Researchers
Data analysts
Maybe you?

There has to be a better way! 🌿

If any of this interests you, please star the repo, join our Slack, and/or shoot us an email. Oh - and please be patient as we try to get our documentation in order. If you have questions or need help getting started in the meantime, the #proj-sagerx channel of our Slack is an excellent resource for support.
