Community Medicine: Sankey flow diagram

Many clinical trials collect prospective categorical data from participants to chart changes in the study population over time. Common examples are quality-of-life questionnaires and risk scales, which provide a quick, standardized assessment of participant outcomes at a given time point.

A popular method for reporting prospective categorical data is to show results in a stacked bar chart. Consider the stacked bar chart below, which reports the number of risk factors participants exhibited at each of a series of visits.

sankey bar chart

This stacked bar chart is useful for quickly identifying trends in the overall study population - in this case, we can observe an increase in risk factors reported over time - but it does not provide much information about subgroups in the study. In the era of personalized and precision medicine, subgroup analysis is increasingly important for identifying which groups of people are most likely (or least likely) to respond to a particular treatment.

In our example above, we can see a sizable increase in participants reporting 3 risk factors (dark green bar) from the 30-month visit to the 60-month visit. Where did these high-risk participants come from? We might assume they came from the group that had previously reported 2 or more risk factors, but the bar chart alone does not answer this question.

One solution is to overlay a Sankey flow diagram onto the chart to shed some light on this mystery. Sankey diagrams were popularized by Matthew Henry Phineas Riall Sankey, a 19th-century Irish engineer, who created flow diagrams where the width of the arrow between two nodes is proportional to the magnitude of the flow.
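To make "width proportional to flow" concrete, here is a minimal sketch of how the flows between two visits could be tallied. The participant IDs, the visit data, and the `flow_counts` helper are all hypothetical and are not taken from the study shown here; each resulting count would set the width of one arrow.

```python
from collections import Counter

# Hypothetical data: number of risk factors per participant at two
# consecutive visits (IDs and values are invented for illustration).
visit_a = {"p1": 0, "p2": 1, "p3": 1, "p4": 2, "p5": 3, "p6": 0}
visit_b = {"p1": 1, "p2": 1, "p3": 3, "p4": 3, "p5": 3, "p6": 0}


def flow_counts(earlier, later):
    """Count participants moving from each earlier category to each later one."""
    # Only participants observed at both visits contribute to a flow.
    shared = earlier.keys() & later.keys()
    return Counter((earlier[p], later[p]) for p in shared)


flows = flow_counts(visit_a, visit_b)
for (src, dst), n in sorted(flows.items()):
    # In a Sankey overlay, n would determine the width of the src -> dst arrow.
    print(f"{src} risk factors -> {dst} risk factors: {n} participant(s)")
```

Participants missing one of the two visits are simply dropped in this sketch; a real analysis would need an explicit rule for missing data.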

With a Sankey bar chart, we can get the following visualization of our data:

sankey bar chart

Now we can see how our data flow between each time point, which helps us identify patterns in our data.

Let's revisit our question from earlier. Where did the high-risk participants, who make up 29% of the study population at 60 months, come from? According to the diagram, some came from the groups reporting 2 and 3 risk factors at 12 months, but more than half came from the groups previously reporting 0 or 1 risk factors - not what we might have expected from just looking at the bar chart.
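The same question can be asked of the underlying data directly. The sketch below assumes a hypothetical list of (earlier, later) risk-factor pairs, one per participant, and an illustrative `origin_shares` helper; the numbers are invented and do not reproduce the study's results.

```python
from collections import Counter

# Hypothetical records, one per participant:
# (risk factors at the earlier visit, risk factors at the later visit)
pairs = [(0, 3), (1, 3), (1, 3), (2, 3), (3, 3), (0, 0), (1, 1), (2, 2)]


def origin_shares(pairs, target):
    """Return the share of participants in `target` at the later visit
    that came from each category at the earlier visit."""
    origins = Counter(src for src, dst in pairs if dst == target)
    total = sum(origins.values())
    if total == 0:
        return {}  # nobody ended up in the target category
    return {src: n / total for src, n in sorted(origins.items())}


shares = origin_shares(pairs, target=3)
# Share of the high-risk group that previously reported 0 or 1 risk factors.
low_risk_share = shares.get(0, 0) + shares.get(1, 0)
print(shares)
print(f"From 0 or 1 risk factors: {low_risk_share:.0%}")
```

A breakdown like this is what the flows into a single bar segment encode visually: each incoming arrow corresponds to one entry in `shares`.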

For those wanting to really dive into their data, we can provide an interactive version allowing users to explore the chart by selecting individual bar sections or flows and isolating the data for those sections.

sankey bar chart

Like all good data visualizations, the Sankey bar chart is designed to communicate the story behind the data. The bar chart alone tells part of the story, but adding a Sankey overlay provides a richer and more detailed understanding of our data.

Sunday, April 12, 2020
