Setting the bar high: PoP’s proficiency benchmarking project

Emil Hafeez
Data Analyst
May 9, 2019

As a literacy organization working hard to support quality education for students in Ghana, Guatemala and Laos, it’s important that we carefully define and evaluate our efforts by a rigorous standard.

Most often in the literacy space, reading proficiency is defined as reading 1) fluently and 2) with comprehension [1]; therefore, oral reading fluency (ORF) and reading comprehension are both used as key performance indicators (KPIs) at Pencils of Promise (PoP).

Photo credit: Timmy Shivers

Recently, we’ve been thinking about how to tune our standards to the dynamic and complex contexts in which we work. For example, we just thoroughly evaluated and modified our student literacy assessment in Guatemala. How well do our other existing tests and standards apply? Do test scores tell us what we need to know about the student learning process and its outputs? How do we quantify student literacy goals by grade and/or by country? What does it look like to uphold high standards, and yet also make standards with an understanding of nuance and students’ life experience? PoP’s literacy and teacher support expertise, program evaluation efforts and scientific inference came together to answer these questions.

Pre-existing standards

We started with two existing benchmarks, one for ORF and one for reading comprehension. ORF is simply defined as how fast someone can correctly read words aloud. During testing conducted by the Learning & Evaluation (L&E) team, we ask students to read an age-appropriate and culturally-relevant story, and record how many words a student is able to correctly read aloud per minute. Then, we ask students to answer 10 questions about the passage, which yields a score for reading comprehension. This procedure has already been part of our study design using EGRA-based tools [3] and fits right in with L&E efforts like Baseline and Endline assessments.

The following images are an example of what this looks like for students in Ghana (we use Ghana as an example here because the Guatemalan passage is in Spanish). On the left is the passage, and on the right are the comprehension questions about it.

Previously, our proficiency standard for the reading task on the left was 45 correct words per minute (CWPM) in Ghana and 60 CWPM in Guatemala. Separately, we used 80% comprehension as the proficiency standard for the task on the right. Students meeting these goals are considered to be reading at a proficient level.

These proficiency benchmarks are sourced from globally recognized standards and research focused on literacy in high-income countries. They have been effective standards, and they align well with academic research on the topic (e.g., Spanish is what's called a more transparent language than English and therefore warrants a higher CWPM standard); however, we decided to revisit this topic and challenge ourselves to innovate.

Photo credit: Nick Onken

New benchmark, who this?

While our Impact Manager of Research and Development in the PoP NY office, Julia Carvalho, was conducting research as part of one of our internal evidence-based decision-making processes, she discovered resources from Room to Read (RTR) [1, 2] that provide guidelines for calculating and establishing proficiency benchmarks. We then used this guide to produce new benchmarks tailored to the student population in PoP schools, which gives us a better understanding of the effect of PoP Teacher Support programming on student literacy outcomes.

The analysis found that the benchmark for students in Ghana is 60 CWPM; in PoP Guatemala, the benchmark is 63 CWPM. This means that in Ghana, students reading at 60 CWPM or above are reading at a proficient level; in Guatemala, students reading at 63 CWPM or above are reading at a proficient level. In both cases, this reading rate predicts substantial comprehension of the passage.

The benchmarking process offers a grassroots standard by creating our benchmark with data from students in PoP-built schools, in the specific communities where we work. We grow our understanding of literacy benchmarks from the children we support. In this case, we can use our student test data to quantify the relationship between reading and comprehension: we use the speed at which a student reads a passage (i.e., ORF) to predict how much they comprehend based on this speed. Here’s our PoP Ghana example based on the RTR guidelines:

This graph shows ORF in CWPM on the x-axis (horizontal), and the associated probability of a student scoring 80%+ on comprehension on the y-axis (vertical). The red lines are the confidence intervals for this predicted probability. As you can see, a higher CWPM predicts greater comprehension. For example, a student reading at 100 CWPM is about 65% likely to achieve 80% comprehension, while a student reading at 150 CWPM is about 90% likely. However, expecting such a high CWPM from first- through sixth-grade students in PoP schools would be unfair, and it wouldn't tell us much about the majority of students who aren't scoring so highly.
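To make the shape of that curve concrete, here is a minimal sketch of fitting a logistic regression of comprehension on reading speed. This is illustrative only: the data are randomly generated, not real PoP scores, and the helper names (`fit_logistic`, `p_comprehension`) are hypothetical, not part of the RTR guidance.

```python
import numpy as np

# Synthetic example data (NOT real PoP scores): each student's oral reading
# fluency in correct words per minute, and whether they scored 80%+ on the
# comprehension questions.
rng = np.random.default_rng(0)
cwpm = rng.uniform(10, 160, size=400)
# Simulate a plausible relationship: faster readers comprehend more often.
true_prob = 1 / (1 + np.exp(-(0.05 * cwpm - 4.0)))
comprehends = (rng.uniform(size=400) < true_prob).astype(float)

def fit_logistic(x, y, lr=0.5, n_iter=5000):
    """Fit P(y=1) = sigmoid(a + b*x) by gradient ascent on the log-likelihood."""
    # Standardize x for stable optimization, then map coefficients back.
    mu, sd = x.mean(), x.std()
    z = (x - mu) / sd
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(a + b * z)))
        a += lr * (y - p).mean()
        b += lr * ((y - p) * z).mean()
    return a - b * mu / sd, b / sd  # intercept and slope on the original scale

a, b = fit_logistic(cwpm, comprehends)

def p_comprehension(wpm):
    """Predicted probability of scoring 80%+ comprehension at a given CWPM."""
    return 1 / (1 + np.exp(-(a + b * wpm)))

print(round(p_comprehension(100), 2))  # predicted probability at 100 CWPM
```

Plotting `p_comprehension` over a range of CWPM values produces an S-shaped curve like the one in the graph above.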

So how do we decide on the benchmark from this? The process RTR uses sets the standard at a 50% likelihood cutoff in the logistic regression: it finds the CWPM value on the horizontal axis that corresponds to the 50% likelihood on the vertical axis (halfway up). PoP, instead, modified this process. To solve for the best CWPM benchmark on this curve, we evaluate every possible CWPM value on its ability to correctly classify students' comprehension in our sample data. We then select the CWPM value that classifies students most accurately, which improves the fit of the benchmark to our partner communities' students.
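That cutoff search can be sketched as follows. Again, the data are synthetic (not real PoP scores) and `best_cutoff` is a hypothetical helper: for each candidate CWPM threshold, it checks how often "reads at or above the threshold" agrees with "scored 80%+ on comprehension," and keeps the threshold with the highest agreement.

```python
import numpy as np

# Synthetic scores (NOT real PoP data): ORF in CWPM, plus a flag for whether
# the student reached 80%+ comprehension on the passage questions.
rng = np.random.default_rng(1)
cwpm = rng.uniform(10, 160, size=400)
comprehends = rng.uniform(size=400) < 1 / (1 + np.exp(-(0.05 * cwpm - 4.0)))

def best_cutoff(cwpm, comprehends):
    """Return the CWPM threshold that most accurately classifies students
    as reaching / not reaching 80% comprehension."""
    best_t, best_acc = None, -1.0
    for t in np.arange(cwpm.min(), cwpm.max() + 1):
        predicted = cwpm >= t                     # "proficient" by this cutoff
        acc = (predicted == comprehends).mean()   # share classified correctly
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

threshold, accuracy = best_cutoff(cwpm, comprehends)
print(f"benchmark = {threshold:.0f} CWPM, accuracy = {accuracy:.2f}")
```

The 50%-likelihood rule and this accuracy-maximizing rule often land near each other, but the latter is driven directly by how well the cutoff sorts the students in our own sample.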

These are, by themselves, effective and reasonable standards for how fast a student can read. They can also predict student comprehension very well. The chart below visualizes how we quantify our ability to correctly identify students’ comprehension based on their reading speed, again using students in Ghana as an example. Here’s a straightforward way to think about sensitivity and specificity if you’re curious.

Students on the right half of the chart are scoring at or above 80% comprehension, and the proportion on the right above the red line are correctly classified as such. The bottom left (left half, under the red line) shows students correctly classified as scoring less than 80% comprehension. This gives us an idea of how strong our classification is, and it turns out we have something that works really well.
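In this setting, sensitivity is the share of students with 80%+ comprehension whom the benchmark correctly flags as proficient, and specificity is the share of students below 80% comprehension whom it correctly flags as not yet proficient. A small sketch, again on synthetic data rather than real PoP scores:

```python
import numpy as np

# Synthetic data (NOT real PoP scores), as in the earlier sketches.
rng = np.random.default_rng(2)
cwpm = rng.uniform(10, 160, size=400)
comprehends = rng.uniform(size=400) < 1 / (1 + np.exp(-(0.05 * cwpm - 4.0)))

def sensitivity_specificity(cwpm, comprehends, benchmark):
    """Sensitivity and specificity of a CWPM benchmark for 80%+ comprehension."""
    flagged = cwpm >= benchmark           # classified "proficient" by reading speed
    tp = ( flagged &  comprehends).sum()  # comprehenders we correctly flag
    fn = (~flagged &  comprehends).sum()  # comprehenders we miss
    tn = (~flagged & ~comprehends).sum()  # non-comprehenders correctly below cutoff
    fp = ( flagged & ~comprehends).sum()  # non-comprehenders wrongly flagged
    sensitivity = tp / (tp + fn)  # of students at 80%+, share above the benchmark
    specificity = tn / (tn + fp)  # of students below 80%, share below the benchmark
    return sensitivity, specificity

sens, spec = sensitivity_specificity(cwpm, comprehends, benchmark=60)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

The four counts inside the function correspond to the four quadrants of the chart described above: the red line is the comprehension cutoff, and the benchmark splits students left and right.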

Jump to the conclusion

By using this repeatable and rigorous process, we can define literacy proficiency based on existing relationships in students’ abilities (60 CWPM in Ghana, 63 CWPM in Guatemala). With these benchmarks, we can hold ourselves to a high standard that is informed by the performance of students in PoP-supported schools.

By setting benchmarks, we can quantify our fidelity to our goals. For example, we can (and will) now set targets (i.e., goals) for student proficiency. Do we expect 10% of third graders and 80% of sixth graders to meet this benchmark? Are we meeting that goal, and why or why not? Currently, PoP's Impact team is establishing, as part of our target-setting work, what percentage of each grade we expect to reach this benchmark. By using our existing data and a benchmark born of our students' performance, we can create realistic but high standards for student achievement.

Photo credit: Nick Onken

A useful feature of this is that it allows our L&E team to disaggregate student achievement by different demographic and structural factors; then, we can examine which of the characteristics we measure are relevant to the achievement we see. For example, are equal proportions of boys and girls meeting this benchmark? How about students who have Ewe or Kotokoli as a mother tongue? Are equal proportions of students who report reading often with others outside of school reaching this benchmark, and what about students who report never reading outside of school?

We can take this information, interrogate our hypotheses and efforts, and understand how to adapt them to be more effective and equitable. For example, depending on the results of that last breakdown, reinforcing teachers' propensity to promote reading outside of school might be helpful. Findings are presented to our Programs teams, and we work together to understand whether what we're seeing in the data makes sense, how it can be useful, and where it can be applied at a granular level.

Given that we now collect student test scores with the same students over several years, we can track our progress towards this predefined standard within specific cohorts, and examine how the fidelity to our goals (across all and within certain groups of students) changes over time.


  1. Room to Read. (2018). Guidance Note: Setting Data-Driven Oral Reading Fluency Benchmarks. Retrieved December 14, 2018.
  2. Room to Read. (2018). Data-Driven Methods for Setting Reading Proficiency Benchmarks. Retrieved December 14, 2018.
  3. RTI International. (2017). Early Grade Reading Assessment (EGRA) Toolkit: Second Edition. Global Reading Network. Retrieved December 14, 2018.