- Statistical Modeling for Count Data
Statistical modeling was the second main approach used for the assessments. Statistical modeling identifies the specific factors associated with the outcome of interest; with this approach, analysts can calculate the estimated rate of the outcome and better understand the factors relevant in the community. Count (or rate) outcomes were principally modeled using Poisson or Negative Binomial regression. To determine which candidate variables to include in a multivariable model, two steps were generally taken. First, each independent variable was fit in a bivariable model, and variables whose outcome associations had p-values less than 0.1 (i.e., marginal significance) were retained for multivariable consideration. Second, the remaining independent variables were checked for multicollinearity. Strongly correlated pairs (generally >0.7) were flagged, and the variable in the pair with the stronger outcome association was retained while the other was dropped. Thus, independent variables with at least a marginal bivariable association with the outcome and low collinearity with other characteristics were considered for multivariable modeling. Additional characteristics that may epidemiologically confound or bias outcome associations were also included, when appropriate. Multivariable regression modeling used forward, backward, or stepwise selection, and final models were chosen after evaluating fit statistics, adjustment for confounding, and error residuals. Predicted outcome rates were generated at the unit of analysis (e.g., county/ZIP code) and ranked from highest to lowest. A task flow diagram for regression modeling best practices is provided.
- Regression Model Options for Count Data
When assessing count or rate data as an outcome, Poisson or Negative Binomial are the preferred regression methods. Both appropriately weight the count of events and the exposure, whether in time or population. For example, in 2017, the crude death rates in Ada County and Clark County, Idaho were 7.1 and 8.0 per 1,000 population, respectively. Ada County had 3,229 deaths with a population of 456,849; Clark County had seven deaths with a population of 873. Poisson and Negative Binomial regression both assign Ada County a weight hundreds of times larger than Clark County's (proportional to the population difference), whereas ordinary least squares regression would inappropriately weight the counties equally despite the massive population difference.
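The weighting intuition can be checked with simple arithmetic, using the county figures cited above (a quick sketch, not part of the original analysis):

```python
# Crude death rates per 1,000 population for two Idaho counties (2017),
# using the counts and populations cited in the text.
ada_deaths, ada_pop = 3229, 456849
clark_deaths, clark_pop = 7, 873

ada_rate = ada_deaths / ada_pop * 1000      # ~7.1 per 1,000
clark_rate = clark_deaths / clark_pop * 1000  # ~8.0 per 1,000

# In a count model with a population offset, a county's information is
# roughly proportional to its exposure, so Ada County carries on the
# order of pop_ratio times the weight of Clark County.
pop_ratio = ada_pop / clark_pop  # ~523
```

Despite Clark County's higher crude rate, its seven deaths carry far less statistical information than Ada County's 3,229, which is exactly what the count-based weighting captures.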
In a Poisson model, the analyst assumes the outcome variable to follow a Poisson distribution, and that the natural-log of the expected value (i.e., the mean) of the outcome can be estimated by a linear combination of the independent variables. A critical assumption of the Poisson distribution is that the raw mean and variance of the outcome are approximately equal. In practice, this is rarely true, with variances either far exceeding the mean (overdispersion, most common) or being below the mean (underdispersion, less common). Use of Poisson regression in the presence of overdispersion or underdispersion can lead to biased standard errors and incorrect conclusions regarding statistical significance.
Negative Binomial regression is a generalization of Poisson regression that uses a mixture of Poisson and Gamma distributions to model the outcome, relaxing the assumption of an equal raw mean and variance. In practice, Negative Binomial regression is often more appropriate than Poisson regression, but the analyst should not simply default to Negative Binomial regression in all cases. When the outcome data truly follow a Poisson distribution, Poisson regression provides less biased standard errors and p-values than Negative Binomial regression. A chi-square goodness of fit (GOF) test for overdispersion exists and should be considered when equality of the mean and variance is in question. This question and others regarding statistical modeling are addressed in the FAQ document. For SAS users, code and a discussion on GOF testing and regression modeling are shown in the previous webinar presentation.
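As an informal first check on the equal mean-variance assumption, an analyst can compare the raw mean and sample variance of the outcome before fitting; the formal chi-square GOF test mentioned above remains the referenced approach. A minimal Python sketch with made-up counts:

```python
import statistics

# Hypothetical county-level event counts (illustrative data only,
# not taken from the assessments).
counts = [2, 0, 5, 1, 9, 3, 14, 2, 7, 22, 1, 4]

mean = statistics.mean(counts)
variance = statistics.variance(counts)  # sample variance

# Under a Poisson assumption, mean ≈ variance. A ratio well above 1
# suggests overdispersion and favors Negative Binomial regression;
# a ratio well below 1 suggests underdispersion.
dispersion_ratio = variance / mean
```

For these illustrative counts the ratio is far above 1, so a Negative Binomial model would be the more defensible choice.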
The outcome variable for count-based regression models should always be a raw count, with the natural-log of the population included as an offset term. It is not advisable to transform the count outcome or to include the population as a numeric independent variable. Likewise, as seen in the Idaho example, using rates as the outcome weights units improperly. In count-based regression, the offset variable represents exposure and is specified alongside the outcome variable, not as an independent variable.
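A small sketch of why the log-population offset works (hypothetical coefficients, purely illustrative): because the offset enters the linear predictor with a coefficient fixed at 1, the modeled per-person rate does not depend on population size.

```python
import math

# Hypothetical coefficients from a count model with a log(population)
# offset (illustrative values only, not from the assessments).
b0, b1 = -8.2, 0.49
x = 1.5  # value of an independent variable

rates = []
for pop in (873, 456849):
    # Offset enters the linear predictor with coefficient fixed at 1.
    log_mu = b0 + b1 * x + math.log(pop)
    predicted_count = math.exp(log_mu)
    rates.append(predicted_count / pop)  # per-person rate

# Both populations yield the same per-person rate, exp(b0 + b1*x):
# the offset converts a count model into a rate model.
```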
- Independent Variables
When determining independent variables for an analysis, jurisdictions should consider a variety of data sources a priori. See the Data Collection and Exploration sections for common sources. Common independent variables used in the assessments were: drug overdose deaths, prescription opioid sales, mental health services, insurance coverage, urgent care access, vehicle access, interstate highway access, buprenorphine prescribing potential, education, income, poverty, race/ethnicity, unemployment, population density, urban/rural status, and drug-related arrest data. Additional characteristics germane to specific states and/or geographic regions should also be considered and included.
For count-based regression analysis, it is not recommended to transform the outcome data. Continuous independent variables, however, can be transformed to meet model assumptions and assist with interpretability. Log10 and natural-log transformations were most prominently used in the assessments. Log10 transformations are useful when an independent variable is on a scale where order-of-magnitude differences are important, for example, per capita income. Natural-log transformations are useful for data that are right-skewed, for example, miles to the nearest healthcare facility. Other common transformations include: arcsine (for values between 0 and 1), square root (for count data), and inverse hyperbolic sine (for right-skewed data that include zero and/or negative values).
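The transformations listed above map directly to standard library functions. A brief sketch with illustrative values (the arcsine transform is shown in its common arcsine-square-root form):

```python
import math

income = 54000.0   # per capita income, dollars (illustrative)
miles = 12.3       # miles to nearest facility (right-skewed, illustrative)
proportion = 0.25  # value between 0 and 1 (illustrative)
count = 16         # count-valued variable (illustrative)
net_change = -3.0  # can be zero or negative (illustrative)

log10_income = math.log10(income)                # order-of-magnitude scale
ln_miles = math.log(miles)                       # natural log for right skew
arcsine_prop = math.asin(math.sqrt(proportion))  # arcsine-square-root transform
sqrt_count = math.sqrt(count)                    # square root for counts
asinh_change = math.asinh(net_change)            # defined for 0 and negatives
```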
- Variable Selection
Given a moderate to large set of potential independent variables, an analyst must choose a limited number for regression modeling, contingent on sample size and multicollinearity (i.e., high correlation among two or more independent variables). The process most frequently used for assessments involved three steps: 1) bivariable regression modeling, 2) correlation of independent variables, and 3) selection procedures for the final multivariable model. In Step 1, each independent variable is separately regressed against the count outcome using either Poisson or Negative Binomial regression, as appropriate. Variables that are statistically significant at p<0.1 and/or have epidemiological significance are retained. The remaining variables are then correlated with each other using Pearson or Spearman correlations (Step 2). Independent variable pairs that correlate >0.7 are flagged for multicollinearity, and the variable in the pair with the stronger outcome association is kept. Finally, in Step 3, the remaining independent variables are considered for multivariable inclusion and modeled using either stepwise, forward, or backward selection. For advanced models (e.g., those modeling longitudinal data), stepwise procedures are often not available, and the analyst must choose between forward and backward selection. As noted above, the analysis should not be guided by statistical significance alone, but also informed by industry standards and subject matter expertise, while building the final multivariable models. For analysts modeling correlated (or longitudinal) data, issues with model convergence sometimes arise. In these circumstances, please see the Convergence Questions resource that provides tips for such situations.
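Steps 1-3 cannot be fully reproduced without the assessment data, but Step 2's correlation screen can be sketched in a few lines (made-up county values; the variable names here are hypothetical):

```python
import statistics

def pearson(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Illustrative county-level variables (made-up values).
variables = {
    "opioid_rx": [55, 80, 62, 90, 65],
    "mme90":     [50, 78, 60, 88, 69],  # strongly tracks opioid_rx
    "poverty":   [12, 9, 20, 8, 15],
}

# Step 2: flag pairs with |r| > 0.7 as collinear; the analyst would then
# keep only the member of each pair with the stronger outcome association.
names = list(variables)
flagged = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(pearson(variables[a], variables[b])) > 0.7
]
```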
In addition to correlating variables for checks on multicollinearity, a more sophisticated approach employs methods from principal components and factor analysis. Principal components analysis is a type of dimension (data) reduction that clusters correlated variables into groups (i.e., components). The first step is generating a covariance matrix between variables, then calculating the components and eigenvalues (quantities that represent the variance explained by each component). These eigenvalues are then visualized using a scree plot. The user is tasked with picking the number of components that explain the most variance without overfitting, a technique called the "elbow method". In Figure 5, a 24-item (variable) psychological measure has most of the variance explained by the first three or four components (at the "elbow" of the curve). As such, a researcher would consider three or four principal components as candidates to include in a follow-up factor analysis.
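For intuition, the variance explained by each component can be computed directly from the eigenvalues of the covariance matrix. A minimal two-variable sketch with an illustrative matrix (real assessments used statistical software on many variables):

```python
import math

# Covariance matrix of two standardized variables (illustrative values):
# [[1.0, 0.8],
#  [0.8, 1.0]]
a, b, c, d = 1.0, 0.8, 0.8, 1.0

# Eigenvalues of a 2x2 matrix via the trace and determinant.
trace, det = a + d, a * d - b * c
disc = math.sqrt(trace ** 2 - 4 * det)
eig1, eig2 = (trace + disc) / 2, (trace - disc) / 2  # 1.8 and 0.2

# Proportion of total variance explained by the first component;
# a scree plot graphs these eigenvalues in descending order.
explained_first = eig1 / (eig1 + eig2)  # 0.9
```

Here the first component explains 90% of the variance, so a scree plot would show a sharp elbow after one component.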
A factor analysis rotates principal components in geometric space to create orthogonal (i.e., uncorrelated) constructs on which variables load. The advantage of factor analysis is that it separates variables into independent groupings, with higher absolute loadings representing greater associations with the factor constructs. Consider the results below from a state involved in the assessment (Table 3). Based on this table, one of YPLL (Years of Potential Life Lost), MME90 (90 Milligrams Morphine Equivalent per day), or opioid prescribing could be selected to represent Factor 1 (the three largest absolute loadings); one of rural, poor health, or HIV could be selected to represent Factor 2; and one of percent without high school diploma, teen birth rate, or percent vacant housing could be selected to represent Factor 3. A user may choose more than one variable to represent each factor (e.g., both YPLL and MME90 for Factor 1), but would need to again check for correlation between these characteristics.
Table 3. Factor loadings for candidate variables.

| Variable | Factor 1 | Factor 2 | Factor 3 |
| --- | --- | --- | --- |
| Gini Index | 12 | 53 | 8 |
| HIV Rate | 2 | 57 | 14 |
| MME90 | 79 | 9 | 0 |
| Opioid Prescribing | 82 | -23 | 4 |
| Percent Unemployed | 4 | 53 | 56 |
| Percent Vacant Housing | 40 | -4 | 57 |
| Percent Without High School Diploma | 11 | 38 | 60 |
| Poor Health | 48 | 66 | 5 |
| Rural | 25 | -67 | 6 |
| Syphilis Rate | -26 | 4 | 48 |
| Teen Birth Rate | 2 | 1 | 57 |
| YPLL | 85 | 35 | 2 |
Once candidate variables have been identified, appropriate statistical software (e.g., SAS, R, Stata) can be used to create the final multivariable model using the previously described procedures. Coefficients for Poisson and Negative Binomial regression are measured on a logarithmic scale, and as such, are interpreted as the difference in logs of expected counts per unit change in the variable, holding the other independent variables in the model constant.
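Exponentiating a coefficient converts the log-scale estimate into a rate (risk) ratio. For instance, the multivariable coefficient of 0.49 for log MME in Table 4 corresponds to exp(0.49) ≈ 1.63, matching the table's risk ratio:

```python
import math

# Exponentiating a Poisson/Negative Binomial coefficient converts the
# log-scale estimate into a rate (risk) ratio.
coef_mme = 0.49  # multivariable coefficient for log MME (Table 4)
risk_ratio = math.exp(coef_mme)  # ≈ 1.63, matching the table
```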
Table 4 is an example of how to present results from a regression analysis. In addition to descriptive statistics (mean, median, interquartile range), it has the risk ratio and p-values for the bivariable and multivariable models, as well as the regression coefficients and standard errors from the multivariable model.
Table 4. Measures of central tendency, regression coefficients, standard error, and significance level for variables used in vulnerability assessment.
| Variable | Mean | Median | First Quartile | Third Quartile | Bivariable Risk Ratio | Bivariable P-Value | Multivariable Coefficient | Multivariable SE | Multivariable Risk Ratio | Multivariable P-Value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Drug overdoses for all drugs, 2017 to 2018, per 100,000 population | 414.61 | 414.13 | 359.10 | 468.93 | 1.09 | 0.004 | | | | |
| Log of MME for all prescribed drugs | 17.66 | 17.81 | 10.24 | 24.39 | 1.70 | <0.001 | 0.49 | 0.04 | 1.63 | <0.001 |
| Percent of population that is non-Hispanic white | 65.40 | 70.19 | 54.29 | 93.72 | 1.49 | 0.001 | 0.39 | 0.15 | 1.48 | 0.014 |
| Rate of sexually transmitted infections, per 100,000 population | 717.23 | 555.07 | 430.29 | 803.30 | 1.12 | <0.001 | 0.09 | 0.03 | 1.09 | 0.005 |
For ease of interpretation, and in addition to the multivariable regression coefficients, many states chose to report ranked vulnerability scores. Vulnerability scores are simply the predicted rates for study outcomes, calculated from the multivariable Poisson/Negative Binomial regression models. To get these rates, a user calculates predicted values directly from the multivariable regression equation. These raw predicted values are log-counts per person, and can be exponentiated to obtain predicted counts per person (i.e., person-rates). This rate is not population-specific and is comparable across sampling units (e.g., counties). It was often reasonable to multiply this person-rate by a constant to obtain an epidemiologic rate (e.g., rate per 100,000). These epidemiologic rates were used as the final vulnerability scores and ranked for reporting.
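The score calculation described above can be sketched as follows (hypothetical predicted log person-rates; the county names are placeholders):

```python
import math

# Hypothetical predicted log person-rates from a multivariable model
# (illustrative values; one per county).
predicted_log_rate = {"County A": -8.1, "County B": -7.4, "County C": -8.9}

# Exponentiate to a per-person rate, scale to a rate per 100,000,
# then rank from highest to lowest vulnerability score.
scores = {
    county: math.exp(log_rate) * 100_000
    for county, log_rate in predicted_log_rate.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)  # highest rate first
```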