He explained that the Oregon Research Institute had completed a study of doctors. They had found a gaggle of radiologists at the University of Oregon and asked them: How do you decide from a stomach X-ray if a person has cancer? The doctors said that there were seven major signs that they looked for: the size of the ulcer, the shape of its borders, the width of the crater it made, and so on. The “cues,” Goldberg called them, as Hoffman had before him. There were obviously many plausible combinations of these seven cues, and the doctors had to grapple with what each combination meant. The size of an ulcer might mean one thing if its contours were smooth, for instance, and another if its contours were rough. Goldberg pointed out that, indeed, experts tended to describe their thought processes as subtle and complicated and difficult to model.

The Oregon researchers began by creating a very simple algorithm, in which the likelihood that an ulcer was malignant depended on the seven factors the doctors had mentioned, equally weighted. The researchers then asked the doctors to judge the probability of cancer in ninety-six different stomach ulcers, on a seven-point scale from “definitely malignant” to “definitely benign.” Without telling the doctors what they were up to, they showed them each ulcer twice, mixing the duplicates randomly into the pile so the doctors wouldn’t notice they were being asked to diagnose an ulcer they had already diagnosed. The researchers didn’t have a computer; they transferred all of their data onto punch cards, which they mailed to UCLA, where the data was analyzed by the university’s big computer. The researchers’ goal was to see if they could create an algorithm that would mimic the decision making of doctors.
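
For readers who want to see the idea concretely, here is a minimal Python sketch of what such an equally weighted model might look like. It is not the Oregon group’s actual code: only the first three cue names come from the text, the other four are hypothetical placeholders, and the 0-to-1 coding of each cue is an assumption.

```python
# A minimal sketch (not the Oregon group's actual code) of an equally
# weighted seven-cue model. Assumption: each cue is coded on a 0-1 scale,
# with higher values pointing toward malignancy. Only the first three cue
# names appear in the text; the other four are hypothetical stand-ins.

CUES = [
    "ulcer_size",     # named in the text
    "border_shape",   # named in the text
    "crater_width",   # named in the text
    "cue_4", "cue_5", "cue_6", "cue_7",  # hypothetical placeholders
]

def malignancy_rating(case):
    """Map an equal-weight average of the seven cues onto the doctors'
    seven-point scale (here 1 = definitely benign, 7 = definitely malignant)."""
    average = sum(case[cue] for cue in CUES) / len(CUES)  # each weight is 1/7
    return 1 + 6 * average
```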

This simple first attempt, Goldberg assumed, was just a starting point. The algorithm would need to become more complex; it would require more advanced mathematics. It would need to account for the subtleties of the doctors’ thinking about the cues. For instance, if an ulcer was particularly big, it might lead them to reconsider the meaning of the other six cues.

But then UCLA sent back the analyzed data, and the story became unsettling. (Goldberg described the results as “generally terrifying.”) In the first place, the simple model that the researchers had created as their starting point for understanding how doctors rendered their diagnoses proved to be extremely good at predicting the doctors’ diagnoses. The doctors might want to believe that their thought processes were subtle and complicated, but a simple model captured them perfectly well. That did not mean that their thinking was necessarily simple, only that it could be captured by a simple model. More surprisingly, the doctors’ diagnoses were all over the map: The experts didn’t agree with each other. Even more surprisingly, when presented with duplicates of the same ulcer, every doctor had contradicted himself and rendered more than one diagnosis: These doctors apparently could not even agree with themselves. “These findings suggest that diagnostic agreement in clinical medicine may not be much greater than that found in clinical psychology—some food for thought during your next visit to the family doctor,” wrote Goldberg. If the doctors disagreed among themselves, they of course couldn’t all be right—and they weren’t.
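
A rough sense of those two findings can be had in a few lines of code. The sketch below assumes every rating sits on the same seven-point scale and uses plain Pearson correlation as the agreement measure; Goldberg’s actual statistics may well have differed.

```python
# A sketch of the two checks described above, using Pearson correlation
# as the agreement measure (an assumption; the paper's exact statistics
# may differ). Requires Python 3.10+ for statistics.correlation.

from statistics import correlation

def model_fit(model_ratings, doctor_ratings):
    """How closely the simple equal-weight model tracks one doctor's
    ratings across the same set of ulcers."""
    return correlation(model_ratings, doctor_ratings)

def self_agreement(first_pass_ratings, second_pass_ratings):
    """How closely a doctor agrees with himself on the duplicated ulcers.
    A perfectly consistent judge would score 1.0; the Oregon doctors did not."""
    return correlation(first_pass_ratings, second_pass_ratings)
```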

The researchers then repeated the experiment with clinical psychologists and psychiatrists, who gave them the list of factors they considered when deciding whether it was safe to release a patient from a psychiatric hospital. Once again, the experts were all over the map. Even more bizarrely, those with the least training (graduate students) were just as accurate as the fully trained ones (paid pros) in their predictions about what any given psychiatric patient would get up to if you let him out the door. Experience appeared to be of little value in judging, say, whether a person was at risk of committing suicide. Or, as Goldberg put it, “Accuracy on this task was not associated with the amount of professional experience of the judge.”

Still, Goldberg was slow to blame the doctors. Toward the end of his paper, he suggested that the problem might be that doctors and psychiatrists seldom had a fair chance to judge the accuracy of their thinking and, if necessary, change it. What was lacking was “immediate feedback.” And so, with an Oregon Research Institute colleague named Leonard Rorer, he tried to provide it. Goldberg and Rorer gave two groups of psychologists thousands of hypothetical cases to diagnose. One group received immediate feedback on its diagnoses; the other did not—the purpose was to see if the ones who got feedback improved.

The results were not encouraging. “It now appears that our initial formulation of the problem of learning clinical inference was far too simple—that a good deal more than outcome feedback is necessary for judges to learn a task as difficult as this one,” wrote Goldberg. At which point one of Goldberg’s fellow Oregon researchers—Goldberg doesn’t recall which one—made a radical suggestion. “Someone said, ‘One of these models you built [to predict what the doctors were doing] might actually be better than the doctor,’” recalled Goldberg. “I thought, Oh, Christ, you idiot, how could that possibly be true?” How could their simple model be better at, say, diagnosing cancer than a doctor? The model had been created, in effect, by the doctors. The doctors had given the researchers all the information in it.

The Oregon researchers went and tested the hypothesis anyway. It turned out to be true. If you wanted to know whether you had cancer or not, you were better off using the algorithm that the researchers had created than you were asking the radiologist to study the X-ray. The simple algorithm had outperformed not merely the group of doctors; it had outperformed even the single best doctor. You could beat the doctor by replacing him with an equation created by people who knew nothing about medicine and had simply asked a few questions of doctors.
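
As a concrete, and entirely hypothetical, illustration of that comparison: if the true outcome of each case were known, scoring the model against the doctors might look like the sketch below, with correlation against a 0/1 malignancy label standing in for whatever validity measure the Oregon paper actually used.

```python
# A hypothetical scoring of model versus doctors, assuming each case's
# true outcome is known as a 0/1 label. Correlation with outcomes stands
# in for whatever validity measure the Oregon paper actually used.

from statistics import correlation

def validity(ratings, outcomes):
    """Correlate a judge's (or the model's) 1-7 ratings with the actual
    0/1 outcomes; higher means more accurate."""
    return correlation(ratings, [float(o) for o in outcomes])

def model_beats_best_doctor(model_ratings, all_doctor_ratings, outcomes):
    """The Oregon finding: the equal-weight model outscored not merely
    the average doctor but the single best one."""
    best = max(validity(r, outcomes) for r in all_doctor_ratings)
    return validity(model_ratings, outcomes) > best
```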
