Building a Fair Measure of English Writing Skills: An Interview With Larry Davis
Below is a conversation between ETS’ Director of Research, Larry Davis, who has played a leading role in TOEFL® research for more than a decade, and John Clark, Director of Strategic Initiatives. You can read more of Larry’s research here.
Larry, I wanted to start with a question about your academic background. Is it true that you first earned a bachelor’s degree in fisheries science?
Yes, I had a bachelor of science in animal science with an emphasis in aquaculture and then did a master's in fisheries science.
Wild! This may be an unfair question, but are there any ties between these fields and language assessment, the career you've chosen?
They’re very different fields of study, to be sure. But there are some commonalities. And those have to do with needing to figure out how to measure things and then analyze what you measure.
In my work in fisheries, we studied salmon physiology and migration behavior. And there weren't always settled ways of measuring phenomena related to these things.
So a big part of that job is figuring out, first of all, how do you measure something in a way that's going to tell us anything of interest? And then once you have those data, how do you evaluate or analyze them to inform decision-making?
In language testing, it's the same problem. What sort of evidence do we collect of someone's ability to communicate in English? How do we collect this data? And how do we evaluate it in a way that can be useful for informing decision-making?
So they’re very different fields, but they both face a similar type of problem.
That’s a very helpful comparison. By the way, I've been to the salmon ladders on the Willamette River in Oregon, where they've built structures that allow salmon to swim around dams to spawn upstream. That's the extent of my salmon knowledge.
I've been deep into places like that, including inside big hydroelectric dams that you probably can't even get access to anymore because of security concerns.
Ah! You seem to have selected a less dangerous field. But you’ve laid the groundwork for the topic I wanted to discuss.
One of the thorniest challenges in English assessment is figuring out how to collect meaningful indicators of a student's English writing ability. How do you think about the challenges inherent to testing writing skills on a standardized exam?
I think a fundamental challenge is that, just as you have implied, we can only feasibly collect a very brief sample of what someone can do in writing.
And that sample, whether it's ten minutes or an hour or even a couple of hours, is just a small portion of all the writing that someone might do, both in terms of the number of words they write over their academic career and the different kinds of writing they might do in their academic study.
So the game is really about prediction. We're collecting a sample of what they can do. And then on the basis of that sample, we're making some extrapolations to what we think this person is likely to be able to do in the real world. So that's the fundamental challenge.
There are different reasonable approaches to addressing that challenge. At one end of the spectrum, you can take a relatively brief sample and combine it with other data to get a sense of someone's general ability. And this is the approach that's typical of language proficiency tests.
At the other end of the spectrum, you can have someone do tasks that are very specific to a given situation, and that would inform more direct inferences about what somebody can do in that situation.
This type of ‘specific purposes’ test might be something like a bar exam, which is probably a little bit closer to the writing that a lawyer would be expected to do as opposed to the kind of very general writing that we tend to assess on language proficiency tests.
Specific to TOEFL, you and our colleague, John Norris, led our efforts to research the impact of a new question type called Write for an Academic Discussion. Why did ETS see fit to revisit how we tested writing on TOEFL?
Well, there's a variety of reasons that motivated the development of that task. One is that since the time that the TOEFL iBT was originally developed, starting in the mid-1990s and through the early 2000s, the writing that happens in university environments has arguably changed.
But the test hadn’t changed. And so we felt, in this case, there was some justification to consider recently developed types of writing. And these genres tend to be shorter. They also often tend to be more conversational.
We wanted to develop a task that captured some of this. So that was one motivation. Another added benefit is that it would ideally help reduce testing time. In the earlier version of the test, the writing section of TOEFL iBT basically took an hour and had two items.
From a psychometric standpoint, that's not giving you a lot of information for the amount of time that people are spending on that section of the test. So that economy in testing time was another added advantage in terms of designing the task.
Aside from making this section more time-efficient, what were other motivations behind the development of the Write for an Academic Discussion task?
Another goal was to provide additional context for writing. The task that Write for an Academic Discussion replaced was a very traditional essay task. You get an opinion question, you know – which do you prefer, dogs or cats? And that's all the input you get.
This is a very traditional and long-used type of test item. But it doesn't provide context. And it doesn't tell you who the audience is. It also doesn't tell you anything about the broader situation. This lack of context has been criticized in the writing community, and as a practical matter, it also creates issues in deciding whether or not a response is appropriate.
For example, you might have one student who writes in an academic style and another student who writes in a colloquial style. Raters will tend to want to give the student with the more academic style a higher score, but there's not really any principled reason for privileging that type of writing over the slangy one because we didn't tell them who the audience is.
So that's another important issue as well. Clearly defining purpose and audience helps us score these responses in a more rational way.
For those who haven't taken TOEFL recently, the Write for an Academic Discussion task has a prompt from a professor, as well as two responses from students. And the test taker is expected to engage with those prompts just like they would in a modern academic forum.
Yes, that’s correct.
How do we gain confidence that a task type like this is suitable for the exam?
That's a really great question. And test validity – which is what this question gets at – is something that graduate students in language assessment spend a lot of time studying. This is an issue that the field has really given a lot of attention to over many decades now. And as a result, we have some very well-established procedures for thinking about how you justify a test task.
This usually takes the form of what's called a validity argument that should consider certain kinds of evidence. This kind of evidence might be the relationship of the task to real-world tasks. So how close is it or what does it tell us about what someone can do in the real world?
It would also include evidence about how the task is scored and whether that scoring is consistent and fair. And does scoring actually capture the important parts of what people need to do on that task?
It would also involve collecting evidence about how this measure relates to other similar measures of the same type of ability. For example, if we have a writing task, it should have some positive relationship with other assessments of writing.
Then there’s the question of how the test relates to performance in the real world. So if people get a high score on the test, does that mean that they're going to perform well in real-world situations, like in their writing coursework? And then finally, what's the washback?
And by washback, I mean, if people are going to prepare for this task, does this actually benefit their language ability? Does that preparation actually help them to improve their skills? Or are they just kind of learning to jump through hoops? And people will prepare if it's a high-stakes test.
So there's this whole framework and chain of reasoning that goes into justifying these tasks. And this framework provides a basis for thinking about how we decide whether a test or a test task is suitable for use.
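One strand of the validity argument described above, that a writing task "should have some positive relationship with other assessments of writing," is often examined with a correlation coefficient. The following is only an illustrative sketch, not the analysis used in any TOEFL study: the function name `pearson_r` and both score lists are hypothetical and made up for demonstration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list has no variation")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five test takers (invented for illustration only).
new_task_scores = [2.0, 3.0, 3.5, 4.0, 5.0]
other_writing_scores = [2.5, 3.0, 3.0, 4.5, 4.5]

r = pearson_r(new_task_scores, other_writing_scores)
print(f"correlation between the two measures: {r:.2f}")
```

A clearly positive value would count as one piece of supporting evidence; on its own it says nothing about scoring fairness, real-world performance, or washback, which the framework treats separately.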
In the paper where you compared the Write for an Academic Discussion task to the independent essay, you found “similarities in the quality of text produced by test takers in terms of the syntactic complexity, grammatical accuracy, lexical variety, discourse, cohesion, and elaboration, and fluency of their writing.”
And these terms are important because they're part of how we score students' performance. But what in the world do you mean when you say “syntactic complexity”?
Syntactic complexity has to do with the grammatical structures that are used in the writing. Some listeners may have diagrammed sentences back in their school days and will know what I’m talking about here, but a more syntactically complex sentence will have a longer and more complex diagram. And it will tend to include various things such as multiple clauses.
To use a metaphor: If a simple sentence is like a stick of bamboo that just goes straight up, a complex sentence is more like a tree that has many branches that ideally all contribute to a coherent meaning.
Thanks for clarifying that term – bamboo, I understand! Tell me a little more about the study on the newly refined writing task.
The basic issue here in the study that we did was that when the Write for an Academic Discussion task was introduced, we didn't want to change the interpretation of test scores. So the idea is we're changing the task, but it still should support the same kinds of inferences about someone's ability.
And in that case, it's important to look at the kind of evidence we get from the then-existing task versus this new task. So we took data from people who had done both tasks and then we analyzed the various features of the writing.
Syntactic complexity was one point of comparison along with others that you mentioned: grammatical accuracy, use of vocabulary, cohesion, discourse markers, these kinds of things.
Can I ask about one more phrase? Lexical variety. What does that mean?
It's vocabulary. Or range of vocabulary, specifically. And the reason we look at that is that it's not just about using lots of different words or using big words. It's about precision. If you have more words in your word bag, that allows you to be more precise in communicating your meanings.
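As a rough illustration of how "range of vocabulary" can be turned into a number, one common and very simple index is the type-token ratio: distinct words divided by total words. This is a minimal sketch under stated assumptions, not the measure used in the study discussed here. The function name `type_token_ratio` and the tokenization rule are hypothetical choices made for this example.

```python
import re


def type_token_ratio(text):
    """Share of distinct words (types) among all words (tokens) in a text."""
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


print(type_token_ratio("the cat saw the dog"))     # 4 distinct words out of 5
print(type_token_ratio("good good good good"))     # 1 distinct word out of 4
```

The ratio tends to fall as texts get longer, so it is only meaningful for responses of similar length, and it counts variety alone; it does not capture the precision of word choice that the answer above emphasizes.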
Understood! Larry, thank you for the behind-the-scenes look at how we design a portion of our test. I have a child who's learning how to write and, to me, it's a miracle that I don't understand. But Larry, you've helped demystify the process of measuring English writing. Very grateful for your time.
It’s been a real pleasure chatting, John – and always happy to talk about how the sausage gets made, as it were.
Yes, well, and speaking of dinner, let's talk salmon soon, too.
Sounds great.