How SageRx works
Airflow, dbt, and a lot of pharmacy domain knowledge.
Overall architecture
Docker is used to containerize the tools below. Spinning up SageRx is as simple as a few initial setup steps followed by a couple of docker-compose up statements.
See detailed installation instructions in Installation.
Flow of data
1. Airflow
⬇️
2. PostgreSQL / pgAdmin
⬇️
3. dbt
This simplified diagram is not entirely accurate, since dbt also writes the tables created by its transformations back into PostgreSQL, but hopefully it conveys the general idea.
I will try to explain some highlights specific to SageRx from each of these steps in the sections that follow.
1. Airflow
Airflow is the data pipeline tool we chose for this project because it is open source, supported by Apache, and appears to be widely used across a variety of industries. All of the business logic relating to where the raw open drug data lives (NLM / CMS / FDA / etc.), what format it is stored in, how often it is updated, how to unpack and manipulate it, and so on is stored within Airflow "DAGs" (directed acyclic graphs), which are essentially Python scripts.
Airflow is used for the Extraction and Loading (EL) of the overall Extract-Load-Transform (ELT) data engineering process.
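As a rough illustration (not the actual SageRx code), a DAG for a single data source might look something like the sketch below; the dag_id, schedule, and task logic are placeholders.

# A minimal, hypothetical DAG sketch; names and logic are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_and_load():
    # download the raw file, unzip it, and COPY it into the sagerx_lake schema
    ...


with DAG(
    dag_id="example_source",
    schedule_interval="@monthly",  # matches how often the source publishes updates
    start_date=datetime(2020, 1, 1),
    catchup=False,  # flip to True to backfill historic releases
) as dag:
    load = PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)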
Timing of data source updates
The update timing of data sources is maintained here. Some data sources are updated weekly, others monthly, others yearly, etc. With the flip of a switch, you can automate recurring updates of any data source.
Authentication
Certain data sources (such as RxNorm) require a free license key even though the portions of the data SageRx uses are open. Airflow handles the authentication steps required in these scenarios.
Changing file names
Some raw data file names change with every release, for example by including a date in the name. Other files aren't updated on the same day of the month every time and require an initial API call to determine the most recent file before downloading.
Back-loading of data
Some data sources have data going back several years. In situations where loading all of this data is beneficial (NADAC, for instance), Airflow can automate the backfill process of historic data.
Unzipping of files
Some data sources use zipped files one or more layers deep. Airflow, combined with custom functions, handles unzipping these files and only loading the necessary ones into the database.
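For example, a helper along these lines could pull only the needed files out of an archive (a sketch only; the function name and file filter are hypothetical, not the actual SageRx helpers).

# Hypothetical helper for extracting only the needed files from a zip archive.
import zipfile
from pathlib import Path


def extract_needed_files(archive_path, dest_dir, suffix=".csv"):
    extracted = []
    with zipfile.ZipFile(archive_path) as zf:
        for name in zf.namelist():
            if name.endswith(suffix):  # skip files we don't intend to load
                zf.extract(name, dest_dir)
                extracted.append(name)
    return extracted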
Loading into the database
SageRx utilizes the Postgres COPY command to load data directly from a file into the database. We think this is the most efficient way to load data into a database, and it handles the vast majority of file types (CSV, TSV, RRF, etc.).
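For instance, loading a raw CSV might look roughly like this; the table and file path are placeholders, not actual SageRx objects.

-- Illustrative COPY statement; table and file names are placeholders.
COPY sagerx_lake.example_source
FROM '/opt/airflow/data/example_source.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',');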
In scenarios where COPY can't handle a file type (Excel files, for instance), we use Python / pandas to transform the data into a format that can be loaded into the database.
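A rough sketch of that fallback path, assuming a SQLAlchemy engine pointed at the SageRx database; the file, table, and database names are placeholders, not the actual SageRx code.

# Hypothetical pandas fallback for file types COPY can't read directly.
import pandas as pd
from sqlalchemy import create_engine

# the database name ("sagerx") is an assumption; adjust for your deployment
engine = create_engine("postgresql://sagerx:sagerx@localhost:5432/sagerx")

df = pd.read_excel("example_source.xlsx")
df.to_sql("example_source", engine, schema="sagerx_lake",
          if_exists="replace", index=False)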
2. PostgreSQL / pgAdmin
Currently, all data ends up in a PostgreSQL database. We chose PostgreSQL because it is open source, supported by a large community, and has excellent documentation.
We suspect that there will be a need for connectors to other types of databases (cloud, data lake, etc), but initially we wanted to keep it simple.
Schemas
The two main schemas in the PostgreSQL database are:
- sagerx_lake
- sagerx_dev
All of the raw data from Airflow ends up in the sagerx_lake schema in as close to its raw format as possible. There may be slight differences solely due to column types, but the intent is to leave it in its raw format.
From there, the data is transformed by dbt (as described in step "3. dbt" below) and the resulting tables end up in the sagerx_dev schema.
Connecting to the database
Connect using pgAdmin
It is not necessary to use pgAdmin to connect to the database, but it makes it easy to run some quick queries and has (in our opinion) a great user interface.
By default, pgAdmin is hosted on port 8002.
Connect directly
Since we use Postgres as the database, connecting to a hosted version of SageRx is just like connecting to any other Postgres database.
By default, the connection information is configured as listed below. Replace <<server_address>> with your server address.
server: <<server_address>>:5432
username: sagerx
password: sagerx
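For example, connecting with psycopg2 might look like the sketch below; the database name is an assumption, so substitute whatever your deployment uses.

# Hypothetical direct connection; host and database name are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="<<server_address>>",
    port=5432,
    user="sagerx",
    password="sagerx",
    dbname="sagerx",  # assumed database name
)
with conn, conn.cursor() as cur:
    cur.execute("select schema_name from information_schema.schemata")
    print(cur.fetchall())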
3. dbt
Data build tool (commonly known as dbt) is where the Transformation (T) of the Extract-Load-Transform (ELT) data engineering process happens.
Staging models
The staging step is where normalization of data generally happens (i.e. converting to NDC11, formatting date columns as DATE data types, etc).
It is also where basic data modeling begins to happen. If the sagerx_lake schema contains raw energy and quarks, the sagerx_dev schema condenses and refines this material into individual atoms we will use to build more intricate and useful structures. Some examples (a sketch of a staging model follows the list):
- FDA NDC Directory - Class tables in this schema have one row per drug class per product. In the raw data, this information is crammed into a single cell, delimited by semicolons.
- NLM RxNorm - Major transformation of RxNorm data happens at this step. Raw RxNorm tables are converted into data models representing NDCs, clinical products, brand products, ingredients, dose forms, etc.
- CMS NADAC - De-duplication of data happens at this step. Additionally, the percentage and dollar differences between each price change reported in the NADAC data are calculated.
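A hypothetical staging model might look something like this; the source and column names are illustrative, not the actual SageRx models.

-- Hypothetical staging model; source and column names are placeholders.
select
    {{ ndc_to_11('ndc') }}        as ndc11,           -- normalize to NDC11
    cast(effective_date as date)  as effective_date,   -- enforce DATE types
    product_name
from {{ source('sagerx_lake', 'example_source') }}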
Intermediate models
Intermediate models use the building blocks of the staging models to create more complex, and sometimes more useful, data models. Continuing the raw energy → atoms analogy, intermediate models are where we use the atoms from staging models to create molecules.
Some examples of intermediate tables (a sketch follows the list):
- NLM RxNorm - The intermediate schema of RxNorm data contains extremely useful tables that let you go from NDC to clinical product information (ingredients, dose forms, etc) with a single select statement.
- CMS NADAC - This table contains data from the NADAC staging table, but filtered to only current prices based on the current flag defined in the staging table.
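As a rough illustration, an intermediate model might join several staging models together; the model and column names below are hypothetical, not the actual SageRx models.

-- Hypothetical intermediate model joining staging "atoms" into a "molecule".
select
    ndc.ndc11,
    product.product_name,
    product.dose_form,
    ingredient.ingredient_name
from {{ ref('stg_rxnorm_ndcs') }} as ndc
inner join {{ ref('stg_rxnorm_clinical_products') }} as product
    on ndc.product_rxcui = product.rxcui
left join {{ ref('stg_rxnorm_ingredients') }} as ingredient
    on product.rxcui = ingredient.product_rxcui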
Documentation
Documentation is a first class citizen in dbt (and SageRx), and helps support the understanding of the transformations that happen so that (in addition to the underlying queries being open source) end users can feel confident when using the resulting data.
Documentation lives in YAML files inside the folders of the models/ directory, but dbt also generates a user-friendly website (dbt docs generate) that can be hosted (dbt docs serve) to make documentation easily accessible to consumers of the data.
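For example, documentation for a model might be written along these lines; the model and column names are placeholders.

# Illustrative model documentation in a YAML file under models/.
version: 2

models:
  - name: stg_nadac
    description: "One row per NDC per effective date, with price change deltas."
    columns:
      - name: ndc11
        description: "11-digit NDC derived from the hyphenated NDC in the raw data."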
Seeds
Seeds are a dbt feature, and are meant to be small CSV files that can be loaded as tables in a separate seeds schema.
SageRx doesn't currently utilize seeds for much, but one potential use could be a repository of "fixes" or "flags" for incorrect or out of date open data (i.e. take the raw source data and “patch” it with corrections or modifications stored in seeds).
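Purely as an illustration of that idea, a hypothetical corrections seed might look like this (no such seed currently exists in SageRx):

ndc11,field,corrected_value
01234567890,product_name,Example Product 10 MG Tablet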
Macros
Macros are a dbt feature that uses Jinja templating to let you create SQL "functions": reusable SQL code that can be updated in one place and instantly update all the models where it is used.
NDC formatter
Due to the need for common transformations like taking a hyphenated NDC and converting it to NDC11 format, we created custom dbt macros to handle this.
select {{ ndc_to_11('1234-5678-90') }}
/* returns 01234567890 */
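A minimal sketch of what such a macro could look like (the actual SageRx implementation may differ); this version assumes a hyphenated NDC is passed as a literal, as in the example above, and pads the three segments to a 5-4-2 layout.

-- Hypothetical ndc_to_11 sketch: pad the hyphenated segments to 5-4-2 digits.
{% macro ndc_to_11(ndc) %}
    lpad(split_part('{{ ndc }}', '-', 1), 5, '0')
    || lpad(split_part('{{ ndc }}', '-', 2), 4, '0')
    || lpad(split_part('{{ ndc }}', '-', 3), 2, '0')
{% endmacro %}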