Is a lighthouse just a lighthouse?

The contribution of background or content knowledge to reading comprehension is becoming better understood (see for example, here, here, here, and here), and there are long-standing concerns about persistent differences in literacy levels between rural and metro students, and between Indigenous and non-Indigenous students.

Put these things together and you have the basis for an important experimental question: Would students in a rural area with a relatively high Indigenous population perform better on reading comprehension assessments if they were likely to be familiar with the content?

Late last year, the results of such a study by researchers from the University of NSW Business School, under the auspices of the Economics of Education Knowledge Hub, were reported in an article by Jordan Baker in The Sydney Morning Herald. The results reported were dramatic: changing the content of the reading assessment to be more culturally relevant was estimated to close the rural-urban reading gap by 33% and the Indigenous-non-Indigenous reading gap by 50%.

The working report is now available (hereafter cited as Dobrescu et al., 2021). A lead researcher and corresponding author on the study, Professor Richard Holden, kindly provided me with the paper in December and talked with me about it. The following is my analysis of the paper, outlining some concerns that I have — not about the quality of the research or its intent, or about the way in which it has been reported, but about how the findings might be misinterpreted.

The aim of the study was to “provide causal evidence of the extent of cultural bias in standardized tests” (Dobrescu et al., 2021, p. 4). To do this, two versions of NAPLAN-style reading and numeracy assessments were given to groups of Year 6 and Year 8 students in Dubbo — a large regional town in the mid-West of NSW — and surrounding areas. Around half of the students in the study completed the standard assessments, and the other half completed a modified version of the assessment that contained items that were ‘contextualised’ specifically for the study.

According to what is described as a full psychometric report for the reading assessment, the only substantial difference between the two tests was one item: about the Cape Lighthouse in the standard version and the Parkes ‘Dish’ in the contextualised version. A table in the psychometric report says all other items are “identical”; however, another page in the same report describes a number of ways in which the two tests differed. One such difference was that an item about looking after a guinea pig (standard version) was changed to looking after a dog (contextualised version).

The researchers reported the following findings in terms of average treatment effects (ATE):

  • The ATE of the contextualised text for all students was 0.27 standard deviations.
  • The ATE of the contextualised text for Indigenous students was 0.3 standard deviations.
  • The ATE of the contextualised text for non-Indigenous students was 0.24 standard deviations.
  • Interestingly, there were no differences in the numeracy results, even though the test items are word-based and require good reading comprehension skills.

Effect sizes of 0.24 to 0.30 standard deviations are what can be considered ‘large’ for an education randomised control trial, according to the benchmarks proposed by Kraft (2020). (Note: These differ from the usual conventions for Cohen’s d for reasons set out in the article by Kraft.)
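As a quick illustration (my own sketch, not part of the paper), Kraft’s proposed benchmarks can be applied mechanically to the reported ATEs. The thresholds used here are Kraft’s (2020) proposal for education RCTs: below 0.05 SD is ‘small’, 0.05 to below 0.20 SD is ‘medium’, and 0.20 SD or above is ‘large’.

```python
# Sketch: classify reported average treatment effects (in SD units)
# against Kraft's (2020) proposed benchmarks for education RCTs.
# Thresholds: < 0.05 small, 0.05 to < 0.20 medium, >= 0.20 large.

def kraft_benchmark(effect_sd: float) -> str:
    """Classify an effect size (standard deviations) per Kraft (2020)."""
    if effect_sd < 0.05:
        return "small"
    if effect_sd < 0.20:
        return "medium"
    return "large"

# The ATEs reported in the Dubbo study:
ates = {"all students": 0.27, "Indigenous": 0.30, "non-Indigenous": 0.24}
for group, ate in ates.items():
    print(f"{group}: {ate} SD -> {kraft_benchmark(ate)}")
```

On these benchmarks, all three reported effects land in the ‘large’ category.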

However, to the extent that there are measurable differences, they are place-based rather than culture-based. According to the report, there was no significant difference in performance on the contextualised test between Indigenous and non-Indigenous students: “The ATE [average treatment effect] is statistically equivalent across groups” (Dobrescu et al., 2021, p. 16).

I must state here that in our conversation Professor Holden strongly argued that the Indigenous/non-Indigenous difference is educationally meaningful, even if there is insufficient statistical power to reach the conventional level of significance. The conversion of standard deviations to NAPLAN standard scores does suggest a practically meaningful difference, even if the statistical evidence for it is weak.
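To give a rough sense of what such a conversion looks like (my own illustration, not the paper’s calculation), an effect in standard-deviation units can be multiplied by the standard deviation of the score scale. The NAPLAN scale SD used below is an assumed ballpark figure, not a value taken from the study.

```python
# Illustration only: convert an effect size in SD units into NAPLAN
# scale points. ASSUMED_NAPLAN_SD is an assumption for illustration
# (a within-year-level SD of roughly 70 scale points); the paper's own
# conversion may use a different figure.

ASSUMED_NAPLAN_SD = 70.0  # scale points; illustrative assumption

def sd_to_naplan_points(effect_sd: float,
                        scale_sd: float = ASSUMED_NAPLAN_SD) -> float:
    """Convert an effect in SD units to (assumed) NAPLAN scale points."""
    return effect_sd * scale_sd

print(sd_to_naplan_points(0.30))  # Indigenous ATE
print(sd_to_naplan_points(0.24))  # non-Indigenous ATE
```

Under that assumption, effects of 0.24 to 0.30 SD correspond to somewhere in the region of 15 to 20 scale points, which is why the difference can look meaningful in score terms even when it is not statistically distinguishable between groups.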

The reading results are described as showing that all students in the Dubbo study, but particularly Indigenous students, performed significantly better on the NAPLAN reading assessments when the content was tailored to their context. It seems to provide a neat confirmation of the hypothesis that reading comprehension assessments have an ‘item bias’ which leads to underperformance of certain ‘cultural groups’. The standard deviations are extrapolated to estimate the effect that a culturally contextualised assessment would have on NAPLAN performance of students in this area.

I don’t think it does confirm this hypothesis. I think it provides some evidence that students perform better on reading comprehension questions when they are based on content that they are more likely to have background knowledge about. The ‘cultural’ element is an unfortunate red herring in this particular study. There is nothing particularly ‘cultural’ or Indigenous about the Parkes Dish. Familiarity with the Dish would be more accurately classified as local knowledge rather than cultural knowledge. This is borne out in the results.

There was also an important methodological issue that places constraints on even this more cautious conclusion. While the researchers say the tests are of “equal difficulty”, this was not directly tested. The full psychometric report amounts to an item analysis and comparison. To establish that the tests were of equal difficulty, both should also have been given to a group of students outside the Dubbo region, to check whether they performed equally well on the contextualised and standard versions. Similarly, it can reasonably be speculated that the item that changed ‘guinea pig’ to ‘dog’ might have been easier for all students: the word ‘dog’ presents a much more straightforward decoding task than ‘guinea pig’.

Furthermore, the results are reported only in terms of the average score for each version of the test. It would be useful to know whether there were group differences in performance on the specific test items that were contextualised, especially the lighthouse/Dish and guinea pig/dog items. How much did these particular items contribute to the averages?

Accepting that the overall performance differences between the standard and contextualised assessments are meaningful and demonstrate a knowledge-based bias in the tests, there are a number of ways that the results might be interpreted:

  1. Students in Dubbo know different things and standardised reading comprehension tests are not tapping into this knowledge, which puts them at a disadvantage.
  2. All knowledge is of equal value so it doesn’t matter if students from Dubbo don’t have as much background and general knowledge beyond the boundaries of their local knowledge.
  3. If reading comprehension tests cannot be purified from apparent knowledge-based biases then we have to construct tests that equally favour the knowledge of different groups.

Or, there is an alternative interpretation:

  1. High levels of background knowledge can mask low levels of reading skill, therefore better performance on assessments with familiar content is not a truer measure of general reading ability. There is research evidence that this is often the case — high background knowledge compensates for low reading ability in performance on comprehension tasks (see here and here).

These various interpretations raise the following critical and complex questions:

  1. Should reading comprehension assessments justifiably assume some knowledge and vocabulary?

For example, is it culturally/contextually irrelevant for students who don’t live near the coast to know what a lighthouse is? Furthermore, is a lighthouse just a lighthouse, or does it represent a broader knowledge of the world that will facilitate reading comprehension of a wide range of texts, not just texts that are locally contextualised or culturally relevant? Professor Holden agreed this is a key question, and added that it could even represent other omitted variables like parent education.

  2. Is it actually realistic or possible to create contextually relevant assessments that level the background-knowledge playing field for all students, even if it were justifiable?

For example, different tests for students from all immigrant backgrounds, for students from different socioeconomic groups, and for students from different parts of the country? What happens when these categories intersect? And what sort of dubious assumptions would have to be made about what they might and might not know?

Fortunately, the idea of customised assessments was not put forward by the researchers as the solution to item bias in reading assessments. Professor Holden agreed that different tests for different groups of students would be a bad interpretation of the research. Instead, the researchers seem to lean toward changes such as “adapting educational materials such as textbooks, slides, and multi-media content, to students’ cultural context” (Dobrescu et al., 2021, p. 21). They write: “In New South Wales, the Department of Education, in cooperation with the Aboriginal Education Consultative Group (AECG), has focused specifically on improving the cultural relevance of curriculum for students of different backgrounds.” (Dobrescu et al., 2021, p. 8)

Even so, it is not clear what this means. Does it mean a different curriculum for Indigenous students and/or country students, or that all students will learn a curriculum that has more elements drawn from Indigenous and country contexts? Different curricula for different groups of students can perpetuate and exacerbate disparities in post-school outcomes. Will a Year 12 student from Dubbo be able to succeed at Sydney University, or be able to get employment in Adelaide, if they have studied the country curriculum and not the city curriculum? Will a student from Sydney make a successful transition to living and working in a rural community if they have only had the city curriculum? And once the curriculum has changed, would the assessments have to change to reflect it?

These concerns, about the potential for the findings to be overstated and to influence policy in misguided directions, are important to address. Given the methodological limitations, the results are not conclusive. It is important to acknowledge that the strongest findings related to geographic location and local knowledge, rather than to Indigenous students or cultural knowledge. Even so, it is far from clear how they ought to be interpreted.

Having said all this, the ‘Dubbo study’ is a significant piece of research and has generated some vigorous and intelligent discussion. The study design is clever (even if not perfect, for the reasons described above). Few experimental studies are conducted with students in public schools, and one of the secondary objectives of this project was to try to encourage more research of this type. The outcomes of research are more than just the data; there are lessons about what to do to make the experiment better next time. Professor Holden said the research team went to great lengths to make sure that the experience was as enjoyable and as valuable for the participating schools as possible, so that the Department of Education and public schools will be more likely to participate in research in future. This is highly commendable, and those of us who conduct experimental research will be grateful if they have succeeded.

Thanks to Professor Richard Holden for his willingness to share and discuss the research with me, and for providing feedback on a draft version of this post. Thanks also to Professor Kevin Wheldall for his expertise and significant input. The views expressed and any remaining errors are my responsibility.