Wednesday, January 31, 2024

This blog post is part of our Research Article of the Month series. For this month, we highlight “Evaluating the Criterion Validity and Classification Accuracy of Universal Screening Measures in Reading,” an article published in the journal Assessment for Effective Intervention in 2021. Important words related to research are bolded, and definitions of these terms are included at the end of the article in the “Terms to Know” section.

Why Did We Pick This Paper?

In an educational context, a universal screener is a tool educators use to assess student skills and abilities in specific areas, such as reading or math, to identify which students are below, at, or above benchmark—an expected or target score for a particular skill based on a student's grade level and the time of year. Screening helps teachers identify students who could benefit from additional support or accelerated instruction in reading. 

Measures of Academic Progress (MAP) and Strategic Teaching and Evaluation of Progress (STEP) are two reading assessments that have been used as universal screeners. The Iowa Department of Education has approved MAP for universal screening, making this assessment of particular interest to Iowa educators. 

Educators may administer STEP in conjunction with MAP in what is known as a “gated approach” to screening. With this approach, students who score below the grade-level cut score on the first screener are assessed again with a different screener. The second screener essentially serves as a “second opinion” as to whether the student could benefit from additional support. For example, with a gated approach, educators might choose to administer MAP first and then administer STEP only to those students who scored below the MAP grade-level cut score. Then, only students who scored below the grade-level cut score on both screeners would be identified as at risk of not meeting grade-level expectations in reading. Thus, the goal of the gated screening approach is to minimize the number of students who are inaccurately identified as at risk, potentially fine-tuning educators’ ability to identify struggling students.

This article examines the predictive validity, incremental validity, and classification accuracy of MAP and STEP, providing useful information for the individuals responsible for selecting the screeners that will be used in their district. Additionally, this article sheds light on the effectiveness of the gated approach to screening. Identifying valid and reliable screeners and screening approaches is an important step in ensuring students receive appropriate instruction that allows them to become proficient readers.

What Are the Research Questions or Purpose?

The researchers investigated several screening approaches with two universal screeners, MAP and STEP, by addressing the following questions:

  1. What is the predictive validity of MAP and STEP with the end-of-the-year state assessment?
  2. What is the incremental validity, if any, of administering STEP with MAP to predict students’ performance on the end-of-the-year state assessment?
  3. What is the classification accuracy of MAP and/or STEP scores in relation to students’ performance on the end-of-the-year state assessment?

What Methodology Do the Authors Employ?

The researchers obtained assessment data (MAP, STEP, and state assessment) from two cohorts of second-grade students (225 in cohort 1, 122 in cohort 2) in an urban district in the southeastern United States. MAP and STEP scores were collected in the spring of second grade and again in the fall of third grade. State assessment scores were then obtained in the spring of third grade.

Hierarchical linear regression was used to assess the predictive validity of MAP and STEP—the ability of these screeners to predict student performance on the state assessment. 

Next, the researchers assessed the classification accuracy of MAP in relation to students’ performance on the state assessment—in other words, whether MAP correctly identified students who were at risk or not at risk of failing the state assessment. The study also examined the classification accuracy of STEP when combined with MAP by simulating a gated approach to screening. To simulate the gated approach, the researchers identified students as at risk only if they scored in the at-risk range on both screening measures. Students with one at-risk score were not considered at risk.
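
The gated decision rule described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the researchers' code: the function name, the cut scores, and the student scores are all hypothetical.

```python
def gated_at_risk(map_score, step_score, map_cut, step_cut):
    """Return True only if the student scores below the cut on BOTH screeners."""
    if map_score >= map_cut:
        # First gate passed: the second screener is not consulted.
        return False
    return step_score < step_cut


# Hypothetical cut scores and student scores (not values from the study).
MAP_CUT = 190
STEP_CUT = 8

students = {
    "A": (185, 6),   # below both cut scores
    "B": (185, 9),   # below the MAP cut score only
    "C": (195, 6),   # below the STEP cut score only
    "D": (200, 10),  # at or above both cut scores
}

flags = {name: gated_at_risk(m, s, MAP_CUT, STEP_CUT)
         for name, (m, s) in students.items()}
print(flags)  # {'A': True, 'B': False, 'C': False, 'D': False}
```

Only student A is flagged, because the rule requires an at-risk score on both measures.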

Based on screening scores obtained from MAP or STEP, students may be accurately identified as at risk (true positives) or not at risk (true negatives) of failing the state assessment. In contrast, students may be inaccurately identified as at risk (false positives) or not at risk (false negatives). Accurate screening is a necessary first step in supporting at-risk students; without it, these students may not receive the support that they need (Klingbeil et al., 2015). 
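
To make these four outcomes concrete, here is a minimal Python sketch that counts them for a small group of students. The data are invented for illustration, and the function and variable names are our own assumptions rather than anything from the study.

```python
def confusion_counts(flagged, failed):
    """Count true/false positives and negatives from parallel lists of booleans.

    flagged[i]: the screener identified student i as at risk.
    failed[i]:  student i did not pass the later state assessment.
    """
    pairs = list(zip(flagged, failed))
    tp = sum(f and y for f, y in pairs)              # flagged and failed
    fp = sum(f and not y for f, y in pairs)          # flagged but passed
    tn = sum(not f and not y for f, y in pairs)      # not flagged and passed
    fn = sum(not f and y for f, y in pairs)          # not flagged but failed
    return tp, fp, tn, fn


# Hypothetical results for eight students (not data from the study).
flagged = [True, True, True, False, False, False, False, False]
failed = [True, True, False, True, False, False, False, False]

tp, fp, tn, fn = confusion_counts(flagged, failed)
sensitivity = tp / (tp + fn)  # share of students who failed that the screener flagged
specificity = tn / (tn + fp)  # share of students who passed that the screener did not flag

print(tp, fp, tn, fn)  # 2 1 4 1
```

In this made-up example, the screener catches two of the three students who went on to fail, and correctly clears four of the five students who passed.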

The researchers calculated statistically optimized cut scores for MAP and STEP by analyzing their sensitivity (the ability to accurately identify at-risk students) and specificity (the ability to accurately identify students who are not at risk). These optimized cut scores were then compared with the benchmarks recommended by the publishers of these tests. This comparison aimed to assess how well the statistically optimized cut scores and the publisher benchmarks identify at-risk students.
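
This summary does not specify how the researchers chose their optimized cut scores. As one possible illustration, the Python sketch below picks the candidate cut score that maximizes Youden's J (sensitivity + specificity - 1), a common criterion that we are assuming here, not one confirmed by the article. The scores, outcomes, and candidate cut scores are hypothetical.

```python
def sens_spec(scores, failed, cut):
    """Sensitivity and specificity when students scoring below `cut` are flagged.

    Assumes at least one student failed and at least one passed.
    """
    flagged = [s < cut for s in scores]
    pairs = list(zip(flagged, failed))
    tp = sum(f and y for f, y in pairs)
    fn = sum(not f and y for f, y in pairs)
    tn = sum(not f and not y for f, y in pairs)
    fp = sum(f and not y for f, y in pairs)
    return tp / (tp + fn), tn / (tn + fp)


def best_cut(scores, failed, candidates):
    """Return the candidate cut score with the largest Youden's J."""
    return max(candidates, key=lambda c: sum(sens_spec(scores, failed, c)) - 1)


# Hypothetical screener scores and state assessment outcomes.
scores = [170, 175, 180, 185, 190, 195, 200, 205]
failed = [True, True, True, True, False, True, False, False]

best = best_cut(scores, failed, candidates=[180, 185, 190, 195, 200])
print(best, sens_spec(scores, failed, best))  # 190 (0.8, 1.0)
```

A cut score chosen this way is tuned to one sample, which is one reason to compare it against publisher-recommended benchmarks as the researchers did.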

What Are the Key Findings?

Research Question 1: What is the predictive validity of MAP and STEP with the state assessment?

  • Scores from both screening assessments had statistically significant associations with scores on the state assessment. This indicates that students’ performance on MAP and STEP is linked to how they might perform on the later state assessment. 
  • Overall, MAP scores had a slightly larger association with state assessment scores compared to STEP scores across time. This implies the relationship between MAP scores and state assessment scores was marginally stronger than the relationship between STEP scores and the state assessment scores, indicating that MAP scores may be a better predictor of student performance on the state assessments than STEP scores. 

Research Question 2: What is the incremental validity, if any, of administering STEP with MAP to predict performance on the state assessment?

STEP scores added statistically significant predictive ability to MAP when predicting students’ performance on the state assessment. This suggests that STEP scores, when used in conjunction with MAP scores, provided additional predictive power for forecasting students’ performance on the state assessment.
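
In a hierarchical regression, incremental validity is conventionally expressed as the change in explained variance when the second predictor is added. The notation below is a general textbook formulation, not the specific statistics reported in the article; the symbols n (number of students), k (number of predictors added), and p (number of predictors in the full model) are our own labels.

```latex
\Delta R^2 = R^2_{\text{MAP+STEP}} - R^2_{\text{MAP}},
\qquad
F = \frac{\Delta R^2 / k}{\left(1 - R^2_{\text{MAP+STEP}}\right) / (n - p - 1)}
```

A statistically significant F for the change indicates that the added screener explains variance in state assessment scores beyond what the first screener already explains.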

Research Question 3: What is the classification accuracy of MAP and/or STEP in relation to students’ performance on the state assessment?

  • Classification accuracy fell in the good to acceptable range when using statistically optimized cut scores for MAP, indicating that the accuracy of identifying at-risk students based on their MAP scores is adequate when using the cut scores calculated by the researchers. Nevertheless, publisher-recommended cut scores demonstrated greater accuracy.
  • However, classification accuracy for STEP and for the gated screening approach was below adequate. These approaches showed high specificity (they correctly identified students who were not at risk) but low sensitivity (they missed many students who were at risk). This finding suggests that, although adding STEP scores to MAP scores can help predict a student’s performance on the state assessment, it does not significantly enhance the ability to detect students who are at risk and who might benefit from additional support.

What Are the Limitations of This Paper?

The study aimed to evaluate the predictive validity and classification accuracy of two screening measures, MAP and STEP, for predicting students’ performance on an end-of-year state reading assessment. Although the research provides insight into the predictive validity of MAP and STEP assessments, it may have limitations in examining their predictive ability over time. The study’s design, focusing on specific grade levels and assessment periods, did not track students’ long-term performance on these assessments; thus, it may not fully capture how student performance on early assessments changes, predicts, or correlates with future academic achievement and growth in the following years.

Another limitation is that the study did not take into account the potential impact of reading interventions that may have been implemented between the fall and spring screening periods. These interventions are usually assigned based on initial screening results, and they could affect the predictive validity or classification accuracy of later screening assessments. Including information about which students received such interventions, and possibly the type and dosage of these interventions, would allow for a more comprehensive evaluation of how accurately the screeners predict students’ performance on a later state assessment.

In addition, the participants in this study shared a similar background in terms of race, ethnicity, and socioeconomic status, which limits the generalizability of the findings. Because the sample lacked diversity, the findings may not apply to more diverse student populations in other regions or to schools with different demographic characteristics. Future research with a more diverse sample would show how well these screeners perform across a wider range of educational settings.

Terms to Know

  • Cut score: Cut scores are selected points on a score scale for a test that categorize test takers into groups like “proficient” and “not proficient.” A cut score is used to determine which students may need additional support to reach the end-of-year benchmark. 

  • Predictive validity: Predictive validity refers to the extent to which a test taker’s score on one assessment predicts another outcome, such as their score on a different assessment (e.g., whether their score on a screener predicts their score on a standardized state assessment like the Iowa Test of Basic Skills). Thus, screeners with high predictive validity are important for identifying students who are at risk for not meeting proficiency on state assessments.

  • Incremental validity: Incremental validity is the degree to which an additional assessment can improve measurement accuracy beyond the current assessment methods. Understanding incremental validity can help educators determine whether administering two screening assessments would help them identify at-risk students more accurately than administering a single screener. 

  • Classification accuracy: Classification accuracy refers to the extent to which one measure (e.g., a universal screener) accurately identifies students as “at risk” or “not at risk” based on their performance on another measure (e.g., a standardized state assessment). An assessment with high classification accuracy minimizes false positives (i.e., proficient readers who are incorrectly identified as at risk) and false negatives (at-risk students who are incorrectly identified as proficient readers). Using screeners with high classification accuracy is important to ensure that time and resources are allocated efficiently and that students receive the appropriate level of support in reading. 

  • Cohort: In research, a cohort is a group of research subjects who share some characteristic, such as grade level or language background. A cohort study follows a group of people over time, tracking how certain factors affect them.

  • Hierarchical linear regression: Hierarchical linear regression is a way to show if variables of interest (e.g., screening scores) explain a statistically significant amount of variance in a dependent variable (e.g., student performance on a standardized test). For example, this model could be used to assess the ability of a student’s score on a fluency assessment to predict their performance on an assessment of overall reading comprehension.

    • Dependent variable: A dependent variable is a factor that may change in response to another variable. For example, a student’s composite reading score (dependent variable) may change in response to the total number of minutes of reading intervention they receive.

    • Statistically significant: If a study’s findings are statistically significant, it means they are unlikely to be explained by chance alone. 

  • Sensitivity: Sensitivity is the ability of an assessment tool to correctly identify students who are at risk of not reading proficiently.

  • Specificity: Specificity is the ability of an assessment tool to correctly identify students who are not at risk, that is, students with proficient reading skills.

  • Generalizability: Generalizability refers to the extent to which the findings of one study can be extended to other people, settings, or past/future situations. 


Thomas, A. S., & January, S.-A. A. (2021). Evaluating the criterion validity and classification accuracy of universal screening measures in reading. Assessment for Effective Intervention, 46(2), 110-120. 

Klingbeil, D. A., McComas, J. J., Burns, M. K., & Helman, L. (2015). Comparison of predictive validity and diagnostic accuracy of screening measures of reading skills. Psychology in the Schools, 52(5), 500-514.