Lesson #3 - Visualization and Exploratory Analysis

Opioid overdose analytics

In this notebook we use publicly available datasets to conduct exploratory data analysis with visualization. We will explore:

  1. Standard Visualization - Standard visualization techniques such as bar, line, scatter, geo-spatial plots.
  2. Advanced Visualization - Advanced data visualization techniques such as drawing attention, histograms, bubble plots, etc.
  3. Dynamic Visualization - Incorporating dynamic nature to visualizations with time.

The notebook is viewable in any browser.

Software used (open source):

Developed for CSTE Grantees Webinar Series

Datasets

1. Drug Overdose Dataset

The overdose death/cause dataset was obtained from CDC WONDER (https://wonder.cdc.gov/ucd-icd10.html). It comes from the Underlying Cause of Death database, which contains mortality and population counts for all U.S. counties. Data are based on death certificates for U.S. residents. Each death certificate identifies a single underlying cause of death and demographic data.

2. Local Area Unemployment Statistics

The Local Area Unemployment Statistics (LAUS) program produces monthly and annual employment, unemployment, and labor force data for Census regions and divisions, States, counties, metropolitan areas, and many cities, by place of residence.

3. Guilford County Emergency Medical Services (EMS) call data

The third dataset in this notebook is the EMS call dataset for the Guilford County metro area. It contains the various types of 911 calls made from the Greensboro, High Point, and Jamestown areas of North Carolina.

Exploration of Drug Overdose Dataset

1. Standard Visualization

Let's explore the data. We will use 2018 data to see which states and counties have high opioid overdose death rates per 100,000 people (Norm_Deaths).

1. Let's look at the 2018 data aggregated by state for per capita opioid deaths. We will use the mean of the normalized opioid deaths (Norm_Deaths) for each state.

A simple bar chart
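The aggregation and bar chart described here could be sketched as follows. This is a minimal sketch, not the notebook's original cell: the rows are placeholders rather than real CDC figures, and every column name except Norm_Deaths is an assumption.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder rows, NOT real CDC figures. "State", "County" and "Year"
# are assumed column names; Norm_Deaths is deaths per 100,000 people.
df = pd.DataFrame({
    "State": ["KY", "KY", "NC", "NC", "OH"],
    "County": ["A", "B", "C", "D", "E"],
    "Year": [2018] * 5,
    "Norm_Deaths": [30.0, 20.0, 12.0, 18.0, 22.0],
})


def mean_deaths_by_state(data, year):
    """Mean county-level Norm_Deaths per state for one year, highest first."""
    subset = data[data["Year"] == year]
    return subset.groupby("State")["Norm_Deaths"].mean().sort_values(ascending=False)


state_means = mean_deaths_by_state(df, 2018)

# One bar per state, already sorted so the highest rate appears first.
fig, ax = plt.subplots(figsize=(8, 4))
state_means.plot.bar(ax=ax, color="steelblue")
ax.set_xlabel("State")
ax.set_ylabel("Mean deaths per 100,000 (Norm_Deaths)")
ax.set_title("Mean per capita opioid deaths by state, 2018")
fig.tight_layout()
```

In a notebook the figure renders inline; in a script, call plt.show() afterwards. Sorting before plotting makes the highest-rate state easy to spot.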

It looks like Kentucky (KY) has the highest mean number of normalized opioid deaths in the US for 2018.

2. Another simple way to explore the data is to add a dimension. Let's compare per capita deaths with state population to see if there are any anomalies.

We can use a scatter plot to visualize.
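A scatter plot of this kind might look like the sketch below. The data are placeholders, the column names "State" and "Population" are assumptions, and the log-scaled x-axis is a design choice of this sketch rather than something the notebook states.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder county rows, NOT real CDC figures. "State" and "Population"
# are assumed column names; Norm_Deaths is deaths per 100,000 people.
counties = pd.DataFrame({
    "State": ["KY", "KY", "NC", "NC", "OH"],
    "Population": [100_000, 300_000, 500_000, 250_000, 800_000],
    "Norm_Deaths": [30.0, 20.0, 12.0, 18.0, 22.0],
})

# One point per state: total population against mean per capita deaths.
state_stats = counties.groupby("State").agg(
    Population=("Population", "sum"),
    Norm_Deaths=("Norm_Deaths", "mean"),
)

fig, ax = plt.subplots(figsize=(7, 5))
ax.scatter(state_stats["Population"], state_stats["Norm_Deaths"])
for state, row in state_stats.iterrows():
    # Label each point so unusual states can be identified by name.
    ax.annotate(state, (row["Population"], row["Norm_Deaths"]),
                xytext=(4, 4), textcoords="offset points")
ax.set_xscale("log")  # state populations span orders of magnitude
ax.set_xlabel("State population (log scale)")
ax.set_ylabel("Mean deaths per 100,000 (Norm_Deaths)")
fig.tight_layout()
```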

The previous graph is hard to read because it contains data for all 50 states. A better approach is to reduce the number of elements in the graph: we can select two or three specific states and compare the data between them.

3. Let's compare KY and NC in a scatter plot.
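Restricting the plot to two states could be done along these lines. This is a sketch with placeholder county rows; the column names other than Norm_Deaths are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder county rows, NOT real CDC figures; column names are assumed.
counties = pd.DataFrame({
    "State": ["KY", "KY", "KY", "NC", "NC", "OH"],
    "Population": [40_000, 120_000, 300_000, 90_000, 500_000, 800_000],
    "Norm_Deaths": [35.0, 28.0, 20.0, 15.0, 12.0, 22.0],
})

# Keep only the states we want to compare.
selected = counties[counties["State"].isin(["KY", "NC"])]

fig, ax = plt.subplots(figsize=(7, 5))
for state, group in selected.groupby("State"):
    # One scatter call per state gives each state its own color and legend entry.
    ax.scatter(group["Population"], group["Norm_Deaths"], label=state, alpha=0.7)
ax.set_xlabel("County population")
ax.set_ylabel("Deaths per 100,000 (Norm_Deaths)")
ax.legend(title="State")
fig.tight_layout()
```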

Now let's look at the causes of death within the data.

4. We can use a pie chart to see the share of each cause.
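A pie chart of cause shares might be built as in this sketch. The counts are placeholders, and the "Cause" and "Deaths" column names are assumptions; the category labels follow the ones named in this notebook.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder counts, NOT real CDC figures. "Cause" and "Deaths" are
# assumed column names.
deaths = pd.DataFrame({
    "Cause": ["Unintentional", "Suicide", "Undetermined", "Unintentional"],
    "Deaths": [60, 10, 5, 25],
})

# Total deaths per cause across all rows.
cause_totals = deaths.groupby("Cause")["Deaths"].sum()

fig, ax = plt.subplots(figsize=(5, 5))
wedges, texts, autotexts = ax.pie(
    cause_totals.to_numpy(),
    labels=list(cause_totals.index),
    autopct="%1.1f%%",  # print each wedge's percentage share
    startangle=90,
)
ax.set_title("Share of deaths by cause")
```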

It looks like a large portion of the cases are unintentional drug poisonings that lead to death.

5. As we saw, KY has the highest mean number of normalized opioid deaths in the US for 2018. How do the different types of cases compare against NC? We can use a stacked bar chart for that.

Stacked charts can be used to compare categories within variables. Here we observe that while the number of Suicide cases is similar between NC and KY, Unintentional and Undetermined cases are much higher in KY.
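A stacked bar chart like the one discussed could be produced as sketched here. The counts are placeholders chosen only to illustrate the layout, and the column names are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder counts, NOT real CDC figures; column names are assumed.
cases = pd.DataFrame({
    "State": ["KY", "KY", "KY", "NC", "NC", "NC"],
    "Cause": ["Unintentional", "Suicide", "Undetermined"] * 2,
    "Deaths": [50, 5, 8, 30, 5, 2],
})

# Rows become bars (states), columns become the stacked segments (causes).
by_cause = cases.pivot_table(
    index="State", columns="Cause", values="Deaths",
    aggfunc="sum", fill_value=0,
)

ax = by_cause.plot.bar(stacked=True, figsize=(6, 4))
ax.set_ylabel("Deaths")
ax.set_title("Deaths by cause, KY vs NC")
ax.figure.tight_layout()
```

Pivoting first keeps the plotting call short: pandas draws one bar per index value and stacks one segment per column.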

6. Now let's look at overall yearly trends across states. A line plot is a good fit for this.

But before we get started, we need to merge a few datasets. *CDC WONDER was a bit picky about downloading large data.*
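One way to combine several smaller downloads and plot yearly trends is sketched below. It assumes each export covers one year and shares the same columns, so the pieces can simply be stacked with pd.concat; the notebook does not say how its files were split, and the frames here are in-memory placeholders rather than real exports.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Each frame stands in for one separate export (assumed: one year per
# download). Placeholder values, NOT real CDC figures.
export_2017 = pd.DataFrame({
    "State": ["KY", "NC"], "Year": [2017, 2017], "Norm_Deaths": [24.0, 14.0],
})
export_2018 = pd.DataFrame({
    "State": ["KY", "NC"], "Year": [2018, 2018], "Norm_Deaths": [25.0, 15.0],
})

# Stack the per-year pieces into one long table.
combined = pd.concat([export_2017, export_2018], ignore_index=True)

# Rows = years, columns = states, values = mean Norm_Deaths.
yearly = combined.groupby(["Year", "State"])["Norm_Deaths"].mean().unstack("State")

ax = yearly.plot(marker="o", figsize=(7, 4))
ax.set_xticks(list(yearly.index))  # show whole years only
ax.set_ylabel("Mean deaths per 100,000 (Norm_Deaths)")
ax.set_title("Yearly trend by state")
```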

2. Advanced Visualizations

Here we will see some advanced visualizations that draw viewers' attention to specific elements of the data or demonstrate other aspects of the variables.

In the previous line plot there was a lot of information to visualize. This can be distracting to viewers.

To avoid this we can select states and compare.

6. We are going to highlight the line plots for the chosen states and compare them to the others.
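A common way to draw attention to chosen lines is to plot every state in a muted grey and redraw the chosen ones in color, as in this sketch. The values are placeholders, and the colors and line widths are arbitrary choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder yearly means per state, NOT real CDC figures.
yearly = pd.DataFrame(
    {
        "KY": [22.0, 24.0, 25.0],
        "NC": [12.0, 14.0, 15.0],
        "OH": [20.0, 23.0, 22.0],
        "TX": [5.0, 5.5, 6.0],
    },
    index=pd.Index([2016, 2017, 2018], name="Year"),
)

highlight = {"KY": "crimson", "NC": "navy"}  # states to emphasize

fig, ax = plt.subplots(figsize=(7, 4))
for state in yearly.columns:
    if state in highlight:
        # Bold, colored and drawn on top so the eye goes here first.
        ax.plot(yearly.index, yearly[state], color=highlight[state],
                linewidth=2.5, label=state, zorder=3)
    else:
        # Context lines: thin, grey and left out of the legend.
        ax.plot(yearly.index, yearly[state], color="lightgrey",
                linewidth=1, zorder=1)
ax.set_xticks(list(yearly.index))
ax.set_ylabel("Mean deaths per 100,000 (Norm_Deaths)")
ax.legend(title="Highlighted states")
fig.tight_layout()
```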

Here we are able to achieve the following:

Within data exploration we can also analyze aspects of variables which are not shown in simple plots.

7. We can compare two states (NC and KY) to see what the distribution of opioid cases looks like across their counties. We will use a histogram to visualize this.
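Overlaid histograms for the two states could be sketched as follows. The county values are placeholders, and the bin range of 0 to 50 is an assumption made for this toy data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder county rates, NOT real CDC figures; column names are assumed.
counties = pd.DataFrame({
    "State": ["KY"] * 6 + ["NC"] * 6,
    "Norm_Deaths": [12.0, 18.0, 22.0, 27.0, 33.0, 41.0,
                    6.0, 9.0, 11.0, 14.0, 16.0, 24.0],
})

# Shared bin edges so the two histograms are directly comparable.
bins = np.linspace(0, 50, 11)

fig, ax = plt.subplots(figsize=(7, 4))
for state in ["KY", "NC"]:
    values = counties.loc[counties["State"] == state, "Norm_Deaths"]
    # Transparency lets overlapping bars from both states stay visible.
    ax.hist(values, bins=bins, alpha=0.5, label=state)
ax.set_xlabel("Deaths per 100,000 (Norm_Deaths)")
ax.set_ylabel("Number of counties")
ax.legend(title="State")
fig.tight_layout()
```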

The plot gives us information about:

8. Another way to visualize the same data is to use boxplots. They are also great for observing the spread and detecting individual outliers in the data.
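The boxplot comparison might be sketched as below. The values are placeholders. The helper iqr_outliers is a hypothetical name introduced here; it applies the 1.5 x IQR rule that matplotlib's default whiskers use, so it lists the same points the plot draws as individual fliers.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder county rates, NOT real CDC figures.
ky = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 60.0])
nc = np.array([8.0, 9.0, 10.0, 11.0, 12.0])
states = ["KY", "NC"]


def iqr_outliers(values, whis=1.5):
    """Values beyond whis * IQR from the quartiles (hypothetical helper)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    mask = (values < q1 - whis * iqr) | (values > q3 + whis * iqr)
    return values[mask]


fig, ax = plt.subplots(figsize=(5, 4))
result = ax.boxplot([ky, nc])  # default whiskers use whis=1.5
ax.set_xticklabels(states)
ax.set_ylabel("Deaths per 100,000 (Norm_Deaths)")
fig.tight_layout()

ky_outliers = iqr_outliers(ky)
nc_outliers = iqr_outliers(nc)
```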

We can also analyze the same data by adding another dimension into the mix.

9. Here we will use a bubble plot, which visualizes per capita opioid mortality, population, and unemployment rate in counties across NC and KY.

To do this, we will use the LAUS employment dataset mentioned earlier, specifically the unemployment rate for each county.

We chose the unemployment rate as the third variable in our analysis based on intuition.
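Joining the two datasets and drawing the bubble plot could look like this sketch. The values are placeholders, the join keys ("State", "County") and the column name "Unemployment_Rate" are assumptions, and the bubble scale factor is arbitrary.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder rows, NOT real CDC or LAUS figures; column names are assumed.
deaths = pd.DataFrame({
    "State": ["KY", "KY", "NC", "NC"],
    "County": ["A", "B", "C", "D"],
    "Population": [40_000, 300_000, 90_000, 500_000],
    "Norm_Deaths": [35.0, 20.0, 15.0, 12.0],
})
unemployment = pd.DataFrame({
    "State": ["KY", "KY", "NC", "NC"],
    "County": ["A", "B", "C", "D"],
    "Unemployment_Rate": [6.5, 4.2, 4.8, 3.6],
})

# Attach each county's unemployment rate to its mortality record.
merged = deaths.merge(unemployment, on=["State", "County"], how="inner")

fig, ax = plt.subplots(figsize=(7, 5))
for state, group in merged.groupby("State"):
    ax.scatter(
        group["Population"], group["Norm_Deaths"],
        s=group["Unemployment_Rate"] * 40,  # bubble area tracks the rate
        alpha=0.5, label=state,
    )
ax.set_xscale("log")
ax.set_xlabel("County population (log scale)")
ax.set_ylabel("Deaths per 100,000 (Norm_Deaths)")
ax.legend(title="State")
fig.tight_layout()
```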

But a better way to check whether a variable contributes is to evaluate its correlation with the target variable.

10. A quick way to check this is to use a heatmap to visualize the correlations.
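A correlation heatmap can be drawn with pandas and matplotlib alone, as in this sketch. The rows are placeholders, the column names other than Norm_Deaths are assumptions, and the correlation is pandas' default Pearson coefficient.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder county rows, NOT real CDC or LAUS figures.
data = pd.DataFrame({
    "Norm_Deaths": [30.0, 20.0, 12.0, 18.0, 25.0],
    "Population": [100_000, 300_000, 500_000, 250_000, 150_000],
    "Unemployment_Rate": [6.1, 5.0, 3.9, 4.4, 5.6],
})

corr = data.corr()  # pairwise Pearson correlations

fig, ax = plt.subplots(figsize=(5, 4))
# Fix the color range to [-1, 1] so zero correlation sits at the midpoint.
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        # Print each coefficient inside its cell.
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")
fig.colorbar(im, ax=ax, label="Pearson correlation")
fig.tight_layout()
```

Reading along the Norm_Deaths row shows how strongly each candidate variable moves with the target.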

Here we observe:

Another way to visualize data is to use geographical maps. We can use the county-level data to visualize concentrations of opioid mortality cases in 2018.

11. We can use a geospatial choropleth map to visualize that.