In part 1 and part 2 of this series, I examined the story of the Computerized Adaptive Test - Depression Inventory (CAT-DI).
Touted as a revolutionary new way of measuring depression, the CAT-DI is a kind of computerized questionnaire that assesses depressive symptoms by asking a series of questions about how the user is feeling. Unlike a standard questionnaire, however, the CAT-DI is adaptive: it picks which question to ask next based on previous responses.
The CAT-DI's creators have said that the commercial release of the product (and related CATs) is under consideration. They've formed a company, Adaptive Testing Technologies (ATT). This commercial aspect has led to fierce controversy over the past few weeks, with accusations of conflicts of interest against some very senior figures in American psychiatry. It was this aspect of the story that I focused on previously. Now, I'm finally going to delve into the statistics to find out: does it really work?
The CAT-DI was revealed in a 2012 paper by Robert Gibbons and colleagues in the prestigious Archives of General Psychiatry. In this article (which has been previously criticized), the authors, after introducing the theoretical background of the method and describing its development, compared the CAT-DI against three other depression questionnaires: the HAMD, the PHQ9, and the CES-D. These are all widely used, old-fashioned pen-and-paper scales.
Gibbons et al examined the ability of each of these four measures to distinguish between three groups of people: those diagnosed with no depression, with minor depression, or with major depression. An ideal depression scale ought to give, respectively, low, medium and high scores for these three different groups.
The importance of this comparison can hardly be overstated. It asks the question: is the CAT-DI any better than what we already have? What, if anything, does the new kid bring to the party? And this is the only head-to-head comparison of the CAT-DI's performance in the paper. However, remarkably, Gibbons et al give almost no details about these crucial results.
This is all they say about it in the Results section:
In general, the distribution of scores [on the traditional questionnaires] among the diagnostic categories [no depression, minor, major] showed greater overlap (ie, less diagnostic specificity particularly for no depression vs minor depression), greater variability, and greater skewness, for these other scales relative to the CAT-DI.
I did a double-take when I realized that this was all we're given. 'In general'? No p-values? No confidence intervals? No numbers of any kind (except for some descriptive stats for the CAT-DI group only)? 'In general', one would expect those things in a scientific paper.
The data from the four measures are presented purely in the form of some graphs (their Figure 2, reproduced below). An ideal depression test would have a tight spread within each category (small blue bars) and clearly higher scores with higher severity (right bar higher than middle bar, middle bar higher than left bar).
My 'general' impression from eyeballing the graphs is that the CAT-DI is only slightly better than the other questionnaires, if at all. In particular, the humble CES-D (bottom right), which dates to 1977, seems to me to have performed just as well as the fancy new contender - 'in general'.
But I don't like generalities. So (for want of any better way!) I measured the height of the blue bars (in pixels) on Figure 2, and thus estimated the degree of overlap between the 10th-90th percentile ranges of the distributions for the CAT-DI vs the CES-D (the central 80 percentiles being what the bars indicate).
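For anyone who wants to repeat the exercise, here's a minimal sketch of that overlap calculation in Python. The bar endpoints are made-up pixel values, not my actual measurements from Figure 2, and the function names are purely illustrative. It reports each overlap as a fraction of the two adjacent bars, plus an overall ratio: the sum of the two adjacent overlaps divided by the total length of the three bars.

```python
def interval_overlap(a, b):
    """Length of the overlap between two (low, high) intervals; 0 if disjoint."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def overlap_summary(none, minor, major):
    """Summarise the overlap between adjacent 10th-90th percentile bars.

    Each argument is a (low, high) pair giving one bar's endpoints,
    e.g. in pixels read off a figure.
    """
    len_none = none[1] - none[0]
    len_minor = minor[1] - minor[0]
    len_major = major[1] - major[0]

    none_minor = interval_overlap(none, minor)
    minor_major = interval_overlap(minor, major)

    return {
        # Overlap of 'none' and 'minor', as a share of each bar
        "none_minor_frac_of_none": none_minor / len_none,
        "none_minor_frac_of_minor": none_minor / len_minor,
        # Overlap of 'minor' and 'major', as a share of each bar
        "minor_major_frac_of_minor": minor_major / len_minor,
        "minor_major_frac_of_major": minor_major / len_major,
        # Sum of adjacent overlaps over total bar length
        "overall": (none_minor + minor_major) / (len_none + len_minor + len_major),
    }


# Hypothetical bar endpoints in pixels (NOT the values measured from the paper)
example = overlap_summary(none=(10, 60), minor=(40, 80), major=(50, 100))
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

A lower overall ratio means the three groups are better separated. With only pixel-level precision, small differences between two scales shouldn't be over-interpreted.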
For the CAT-DI, the overlap between the 'none' and 'minor' bars was 47.2% of the 'none' spread and 62.5% of the 'minor'; for the 'minor'-'major' overlap, it was 80.0% of the 'minor' and 64.0% of the 'major'. For the CES-D, the corresponding overlaps were 48.5%, 63.4%, 76.9% and 62.5% - almost identical. Overall proportional overlap - which I defined as the total of the two overlaps between the adjacent bars, divided by the total length of the three bars - was identical to within the margin of error (i.e. 1 pixel), but for what it's worth, the CES-D was marginally better (a ratio of 0.397 vs 0.399).
This is an... unorthodox approach to psychometrics, I'll be the first to admit, but it's the best that I could do given the (lack of) information provided in the paper, and I feel that it's more rigorous than just saying 'in general'.
But there's a deeper issue. Even assuming that the CAT-DI were better than those three others, would that mean everyone would need to start using it? Or might there be an easier way to get the same level of performance? Quite possibly there might.
Back in 2000, Reise and Henson were developing a CAT for personality testing. They found that the CAT performed very well, but they also found that a minimalistic non-computerized questionnaire (made up of the test items that were most highly correlated with the total score in the calibration dataset) did equally well. The fancy adaptive algorithm was actually unnecessary!
In 2010, Reise went on to study a CAT for measuring depression. This time around, the CAT did do slightly better than a best-item questionnaire, but the authors concluded that the CAT provided only "marginally" superior efficiency. Furthermore, these researchers also found that a simple 'branching' procedure - essentially, one of those if-you-answer-yes-please-go-to-question-8 rules - was even better than the best-item questionnaire, and basically just as good as the CAT. Have Gibbons et al read this cautionary tale of the limits of CATs?
They should have, given that one of them, Paul Pilkonis, helped to write it. Yet they omitted to examine a best-item comparison scale in their paper.
In summary, I'm not convinced that the CAT-DI is a more effective and useful way of measuring depression than the available alternatives. That's not to say it isn't better, and I do find the idea of computer adaptive testing a fascinating one. But in my view there's just not enough data in the Gibbons et al 2012 paper to tell us whether their new product offers any added value. Gibbons et al seemed to acknowledge this lacuna, promising that:
We will explore the extent to which [the bifactor model] translates to gains in measurement precision, reliability, and validity in a future statistical article.
That was over two years ago. The 'statistical article' has yet to appear, to my knowledge, and I'm not quite sure why the Archives peer reviewers didn't simply require the authors to include those statistics in the original. Promissory notes are not - 'in general' - worth much in science.
Gibbons RD, Weiss DJ, Pilkonis PA, Frank E, Moore T, Kim JB, & Kupfer DJ (2012). Development of a computerized adaptive test for depression. Archives of General Psychiatry, 69(11), 1104-12. PMID: 23117634