How SageRx works
Airflow, dbt, and a lot of pharmacy domain knowledge.
Overall architecture
Docker is used to containerize the tools below. Spinning up SageRx is as simple as a few initial setup steps and then a couple of docker-compose up statements.
See detailed installation instructions in Installation.
Flow of data
1. Airflow
⬇️
2. PostgreSQL / pgAdmin
⬇️
3. dbt
This simplistic diagram is not entirely correct because dbt also updates PostgreSQL with tables created from the transformations… but hopefully you get the general idea.
The sections that follow explain some SageRx-specific highlights for each of these steps.
1. Airflow
Airflow is the data pipeline tool we chose for this project because it is open source, supported by Apache, and widely used across a variety of industries. All of the business logic relating to where the raw open drug data lives (NLM / CMS / FDA / etc.), what format it is stored in, how often it is updated, how to unpack and manipulate it, and so on is stored within Airflow "DAGs" (directed acyclic graphs), which are basically just Python scripts.
Airflow is used for the Extraction and Loading (EL) of the overall Extract-Load-Transform (ELT) data engineering process.
Timing of data source updates
The update timing of data sources is maintained here. Some data sources are updated weekly, others monthly, others yearly, etc. With the flip of a switch, you can automate recurring updates of any data source.
Authentication
Select data sources (RxNorm, for instance) require a free license key even though the parts of the data used from the source are open. Airflow handles the authentication steps required in these scenarios.
Changing file names
Some raw data file names change with every release, often by including a date. Some files aren't updated on the same day of each month, and require an initial API call to determine the most recent file before downloading.
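As an illustration, the "find the most recent file" step can be sketched in Python. The file-name pattern below is made up for the example and is not from any actual SageRx source:

```python
import re
from datetime import datetime

def most_recent_file(filenames):
    """Return the file whose embedded YYYYMMDD date is the latest.

    Assumes names look like 'nadac_20240115.csv' (an illustrative
    pattern, not an actual source's naming scheme).
    """
    def file_date(name):
        match = re.search(r"(\d{8})", name)
        return datetime.strptime(match.group(1), "%Y%m%d")

    return max(filenames, key=file_date)

files = ["nadac_20231201.csv", "nadac_20240115.csv", "nadac_20230601.csv"]
print(most_recent_file(files))  # nadac_20240115.csv
```

In practice this listing would come from an API call to the data source rather than a hard-coded list.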
Back-loading of data
Some data sources have data going back several years. In situations where loading all of this data is beneficial (NADAC, for instance), Airflow can automate the backfill process for historical data.
Unzipping of files
Some data sources use zipped files one or more layers deep. Airflow, combined with custom functions, handles unzipping these files and only loading the necessary ones into the database.
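The nested-unzip idea can be sketched with Python's standard zipfile module. This is a simplified illustration of the approach, not SageRx's actual custom functions:

```python
import io
import zipfile

def extract_needed_files(zip_bytes, wanted_suffix=".csv"):
    """Recursively walk a (possibly nested) zip archive and return only
    the members we care about, as {name: bytes}. Illustrative sketch of
    unzipping one or more layers deep and keeping just the needed files."""
    results = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            data = zf.read(name)
            if name.endswith(".zip"):           # nested archive: recurse
                results.update(extract_needed_files(data, wanted_suffix))
            elif name.endswith(wanted_suffix):  # keep only needed files
                results[name] = data
    return results

# Build a nested zip in memory to demonstrate (contents are made up)
inner = io.BytesIO()
with zipfile.ZipFile(inner, "w") as zf:
    zf.writestr("prices.csv", "ndc,price\n01234567890,1.23\n")
    zf.writestr("readme.txt", "ignore me")
outer = io.BytesIO()
with zipfile.ZipFile(outer, "w") as zf:
    zf.writestr("release.zip", inner.getvalue())

print(list(extract_needed_files(outer.getvalue())))  # ['prices.csv']
```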
Loading into the database
SageRx utilizes the Postgres COPY command to load data directly from a file into the database. We think this is the most efficient way to load data into a database, and it handles the vast majority of file types (CSV, TSV, RRF, etc.).
In scenarios where COPY can't handle a file type (Excel files, for instance), we use Python / pandas to transform the data into a format that can be loaded into the database.
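As a sketch, a COPY-based load might look like the following; the table, columns, and file path here are illustrative, not SageRx's actual names:

```sql
-- Illustrative only: table, columns, and path are hypothetical
COPY datasource.nadac (ndc, ndc_description, nadac_per_unit, effective_date)
FROM '/opt/airflow/data/nadac.csv'
WITH (FORMAT csv, HEADER true, DELIMITER ',');
```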
2. PostgreSQL / pgAdmin
Currently, all data ends up in a PostgreSQL database. We chose PostgreSQL because it is open source, supported by a large community, and has excellent documentation.
We suspect that there will be a need for connectors to other types of databases (cloud, data lake, etc), but initially we wanted to keep it simple.
Schemas
The three main schemas in the PostgreSQL database are:
datasource
staging
intermediate
All of the raw data from Airflow ends up in the datasource schema in as close to its raw format as possible. There may be slight differences solely due to column types, but the intent is to leave it in raw form.
From there, it is transformed by dbt (as described in step "3. dbt" below) and the resulting data ends up in the staging and intermediate schemas. More information about this is in the dbt section below.
Eventually, we may have some data mart data in a mart schema and possibly a separate flatfile schema for data meant to be loaded into flat files for easy download.
Custom functions
NDC formatter
Due to the need for common transformations like taking a hyphenated NDC and converting it to NDC11 format, we created custom Postgres functions to handle this.
We made this decision prior to implementing dbt, so a future optimization may be converting these functions from Postgres to dbt. However, they come in handy when writing queries directly in pgAdmin.
select ndc_to_11('1234-5678-90');
/* returns 01234567890 */
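For illustration, the same idea can be sketched in Python. This is a hedged reimplementation of the concept, not the actual SageRx Postgres function; hyphenated NDCs come in 4-4-2, 5-3-2, and 5-4-1 segment layouts, each zero-padded to the 5-4-2 NDC11 layout:

```python
def ndc_to_11(ndc: str) -> str:
    """Convert a hyphenated NDC (4-4-2, 5-3-2, or 5-4-1) to the
    11-digit NDC11 format by zero-padding each segment to 5-4-2.

    Illustrative sketch, not the actual SageRx Postgres function.
    """
    labeler, product, package = ndc.split("-")
    return labeler.zfill(5) + product.zfill(4) + package.zfill(2)

print(ndc_to_11("1234-5678-90"))   # 01234567890 (4-4-2)
print(ndc_to_11("12345-678-90"))   # 12345067890 (5-3-2)
print(ndc_to_11("12345-6789-0"))   # 12345678900 (5-4-1)
```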
Connecting to the database
Using pgAdmin
It is not necessary to use pgAdmin to connect to the database, but it makes it easy to run some quick queries and has (in our opinion) a great user interface.
By default, pgAdmin is hosted on port 8002.
Connecting directly
Since we use Postgres as the database, connecting to a hosted version of SageRx is just like connecting to any other Postgres database.
By default, the connection information is configured as listed below.
server: <<server_address>>:5432
username: sagerx
password: sagerx
3. dbt
Data build tool (commonly known as dbt) is where the Transformation (T) of the Extract-Load-Transform (ELT) data engineering process happens.
The steps of the dbt process line up nicely with schemas in Postgres. The two main schemas populated by dbt are staging and intermediate.
Staging
The staging step is where normalization of data generally happens (i.e. converting to NDC11, formatting date columns as DATE data types, etc).
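A toy sketch of this kind of normalization in Python; the field names and input formats below are made up for illustration, and in SageRx this logic actually lives in dbt SQL:

```python
from datetime import datetime

def normalize_row(row):
    """Normalize a raw row: hyphenated NDC -> NDC11, US-style date -> ISO.
    Toy illustration of staging-style cleanup; field names are made up."""
    labeler, product, package = row["ndc"].split("-")
    return {
        "ndc11": labeler.zfill(5) + product.zfill(4) + package.zfill(2),
        "effective_date": datetime.strptime(
            row["effective_date"], "%m/%d/%Y"
        ).date().isoformat(),
    }

print(normalize_row({"ndc": "1234-5678-90", "effective_date": "01/15/2024"}))
# {'ndc11': '01234567890', 'effective_date': '2024-01-15'}
```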
It is also where basic data modeling begins. If the datasource schema contains raw energy and quarks, the staging schema condenses and refines this material into individual atoms we will use to build more intricate and useful structures. Some examples:
- FDA NDC Directory - Class tables in this schema have one row per drug class per product. In the raw data, this information is crammed into a single cell, delimited by semicolons.
- NLM RxNorm - Major transformation of RxNorm data happens at this step. Raw RxNorm tables are converted into data models representing NDCs, clinical products, brand products, ingredients, dose forms, etc.
- CMS NADAC - De-duplication of data happens at this step. Additionally, the differences in percentage and dollars are calculated between each price change reported in the NADAC data.
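The NADAC price-change calculation can be sketched as follows. This is pure Python for illustration; in SageRx the equivalent logic lives in dbt SQL:

```python
def price_changes(prices):
    """Given a chronological list of per-unit prices, compute the dollar
    and percent difference at each price change. Illustrative sketch of
    the staging logic, which in SageRx is implemented in dbt SQL."""
    changes = []
    for prev, curr in zip(prices, prices[1:]):
        changes.append({
            "dollar_change": round(curr - prev, 4),
            "percent_change": round((curr - prev) / prev * 100, 2),
        })
    return changes

print(price_changes([1.00, 1.25, 1.10]))
# [{'dollar_change': 0.25, 'percent_change': 25.0},
#  {'dollar_change': -0.15, 'percent_change': -12.0}]
```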
Intermediate
The intermediate schema uses the building blocks of the staging schema to create more complex, and sometimes more useful, data models. Continuing the raw energy → atoms analogy, the intermediate schema is where we use the atoms from the staging schema to create molecules.
Some examples of intermediate tables:
- NLM RxNorm - The intermediate schema of RxNorm data contains extremely useful tables that let you go from NDC to clinical product information (ingredients, dose forms, etc.) with a single select statement.
- CMS NADAC - This table contains data from the NADAC staging table, but filtered to only current prices based on the current flag defined in the staging table.
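For instance, a query against such an intermediate model might look like the following; the model and column names here are illustrative, not the actual SageRx names:

```sql
-- Illustrative only: model and column names are hypothetical
SELECT ndc11, clinical_product, ingredient, dose_form
FROM intermediate.rxnorm_ndc_to_product
WHERE ndc11 = '01234567890';
```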
Documentation
Documentation is a first-class citizen in dbt (and SageRx). It supports understanding of the transformations that happen so that, in addition to the underlying queries being open source, end users can feel confident when using the resulting data.
Documentation lives in YAML files inside the folders of the models/ directory, but dbt also generates a user-friendly website (dbt docs generate) that can be hosted (dbt docs serve) to make documentation easily accessible to consumers of the data.
Seeds
Seeds are a dbt feature, and are meant to be small CSV files that can be loaded as tables in a separate seeds schema.
SageRx doesn't currently utilize seeds for much, but one potential use could be a repository of "fixes" or "flags" for incorrect or out-of-date open data (i.e. take the raw source data and "patch" it with corrections or modifications stored in seeds).
Macros
Macros are a dbt feature that uses Jinja 2 templating to allow you to create SQL "functions": reusable SQL code that can be updated in one place and instantly update all the models where it is used.
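A minimal sketch of what such a macro could look like, reusing the NDC11 idea from earlier; the macro name and logic are illustrative, not an actual SageRx macro:

```sql
-- macros/ndc_to_11.sql (illustrative; not an actual SageRx macro)
{% macro ndc_to_11(ndc_column) %}
    lpad(split_part({{ ndc_column }}, '-', 1), 5, '0')
    || lpad(split_part({{ ndc_column }}, '-', 2), 4, '0')
    || lpad(split_part({{ ndc_column }}, '-', 3), 2, '0')
{% endmacro %}

-- Hypothetical usage in a model:
-- select {{ ndc_to_11('raw_ndc') }} as ndc11 from some_staging_model
```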