This section includes tools useful for beginning work with ICD-10-CM coded data, including the Standardized Validation Datasets.


ICD-10-CM Standardized Validation Datasets


This toolkit includes several ICD-10-CM validation datasets:



Developing statistical programs that are 100% accurate can be challenging. Programmers can often identify errors by examining the output, but when the output seems reasonable, errors in the program may go undetected. A validation dataset can be used to confirm that your statistical program produces consistent and accurate results. A validation dataset is a “fake” dataset for which answers to specific questions are either known a priori or agreed upon by a “key of three”. This toolkit contains a “key of three” for the general injury validation dataset only. See the Glossary of Terms/Abbreviations and Concepts for further explanation of this concept.


Analysts working with ICD-10-CM coded injury and overdose data can run their statistical analysis programs on the validation datasets and compare their results to an “answer key” to make sure their programs function as expected. Using a validation dataset can help analysts find and correct programming errors, and can also help ensure that people using different statistical packages or different programming approaches obtain accurate results. The validation datasets included in this toolkit contain fictitious data; therefore, they should not be used for any purpose other than validating statistical programming.
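The comparison against an answer key can be sketched in a few lines of Python. This is only an illustration of the workflow, not the toolkit's own method: the records, the category names, the expected counts, and the helper functions `categorize` and `compare_to_key` are all hypothetical, and the first-letter grouping is a toy rule rather than an official injury or overdose case definition.

```python
from collections import Counter

# Hypothetical validation records: (record_id, first-listed ICD-10-CM code).
# These rows are invented for illustration and are not from the toolkit.
records = [
    (1, "S72.001A"),
    (2, "T40.1X1A"),
    (3, "S06.0X0A"),
    (4, "T40.1X1A"),
    (5, "J45.909"),
]

# Hypothetical answer key: expected number of records per category.
answer_key = {"injury_S": 2, "poisoning_T": 2, "other": 1}


def categorize(code: str) -> str:
    """Toy grouping by first letter only; NOT an official case definition."""
    if code.startswith("S"):
        return "injury_S"
    if code.startswith("T"):
        return "poisoning_T"
    return "other"


def compare_to_key(results, key):
    """Return {category: (observed, expected)} for every count that differs."""
    categories = set(results) | set(key)
    return {
        c: (results.get(c, 0), key.get(c, 0))
        for c in categories
        if results.get(c, 0) != key.get(c, 0)
    }


# Run the "program under test" and check its output against the key.
observed = Counter(categorize(code) for _, code in records)
mismatches = compare_to_key(observed, answer_key)
print("All counts match the answer key" if not mismatches else mismatches)
```

Reporting every mismatched category with its observed and expected count, rather than a single pass/fail flag, makes it easier to see which part of a program's logic disagrees with the key.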



Jurisdictions are frequently asked to collaborate and share data with organizations such as CSTE and CDC for a wide range of projects. With the advent of ICD-10-CM, statistical programs must be updated to reflect the new coding schema with all its added complexities. Not all jurisdictions have the capacity or time to internally validate statistical programs. Currently, there are very few resources publicly available to help practitioners validate statistical programs for analyzing ICD-10-CM coded data.


The intent behind the ICD-10-CM Standardized Validation Dataset project is to provide a simple way for analysts to check the accuracy of their statistical analysis programs, giving programmers confidence in their results. The tools also give jurisdictions an opportunity to work with other jurisdictions on a common dataset and to share programming ideas and techniques. These datasets could also serve as training tools for students and new staff.


This concept was initially developed by Scott Proescholdbell (NC), Dan Dao (TX), and Thomas Largo (MI). Since its inception, the utility of this approach has been demonstrated through various projects. The general injury validation datasets were developed by two CSTE consultants, Kristen Allen and Kristina Lai, with direction and editing from CSTE staff and members. The drug overdose validation dataset was developed by two CSTE workgroup members, Thomas Largo (MI) and Hannah Yang (MT), with direction and editing from CSTE staff and members. If you would like additional information on how the datasets were developed, please contact Mia Israel at


Standardized General Injury Validation Dataset Materials:


Supplemental Drug Overdose Validation Dataset Materials:

Page last updated: November 19, 2019