

Important progress has been made in our understanding of the use of retrospective observer performance studies in the evaluation of diagnostic imaging technologies and clinical practices, as well as in the methods needed for the analysis of such studies (1–8). A frequently used approach is a receiver operating characteristic (ROC)–type study, which provides information about how sensitivity varies as specificity changes while accounting for reader and case variability (9–12). The most relevant question of interest in all of these studies is not whether the results can be generalized to cases, readers, abnormalities, and modalities under the study conditions, but rather whether the results of a given study lead to valid inferences about the potential effect of different technologies or practices in the actual clinical environment. Experimental conditions that are required in the vast majority of observer performance studies could affect human behavior in a manner that limits the clinical relevance of the inferences made (13). Data have been collected in an attempt to assess the possibility of a “laboratory effect” in observer performance studies and how it could affect the generalizability of results (14).

Because large observer variability has been reported in many studies, in particular during the interpretation of mammograms (15–19), we performed a comprehensive, large observer study designed to compare radiologists’ performance during the interpretation of screening mammograms in the clinic with their performance when reading the same images in a retrospective laboratory study.
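As a rough illustration of the ROC paradigm referred to above, the sketch below computes an empirical ROC curve (sensitivity against 1 - specificity across rating thresholds) from 0–100 abnormality-presence ratings. The ratings, truth labels, and function name are hypothetical assumptions for illustration; this is not the analysis software used in the study.

```python
# Minimal sketch: an empirical ROC curve from hypothetical 0-100
# probability-of-abnormality ratings (1 = abnormal in `truth`).

def empirical_roc(ratings, truth):
    """Return (1 - specificity, sensitivity) pairs, one per threshold."""
    positives = sum(truth)
    negatives = len(truth) - positives
    points = [(0.0, 0.0)]
    for t in sorted(set(ratings), reverse=True):
        # A case is called positive when its rating meets the threshold.
        tp = sum(1 for r, y in zip(ratings, truth) if r >= t and y == 1)
        fp = sum(1 for r, y in zip(ratings, truth) if r >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

ratings = [95, 80, 70, 60, 40, 30, 20, 10]   # hypothetical reader ratings
truth   = [1,  1,  0,  1,  0,  1,  0,  0]    # hypothetical ground truth
for fpr, tpr in empirical_roc(ratings, truth):
    print(f"1 - specificity = {fpr:.2f}, sensitivity = {tpr:.2f}")
```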

Nine board-certified, Mammography Quality Standards Act–qualified radiologists (with 6–32 years of experience in interpreting breast imaging studies, each performing more than 3000 breast examinations per year) were selected to participate in the study on the basis of the number of screening mammograms they read during the period from which we selected the images. Each reader interpreted 276–300 screen-film mammograms, which were obtained under an institutional review board–approved, Health Insurance Portability and Accountability Act–compliant protocol. The need for informed consent was waived. Radiologists read the mammograms three times during a 20-month period (September 2005 to May 2007). Images were read in a mode we termed clinic–Breast Imaging Reporting and Data System (BI-RADS) (20), in a mode rated under the ROC paradigm with an abnormality presence probability rating scale of 0–100, and in a free-response ROC mode (21). The study was “mode balanced” in that, by using a block randomization scheme, three radiologists read each of the modes first, three read each of the modes second, and three read each of the modes third (last).
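A minimal sketch of one way such a mode-balanced assignment could be generated is shown below; the reader labels, the fixed seed, and the Latin-square-style rotation are illustrative assumptions, not the program used in the actual study.

```python
import random

# Minimal sketch of a mode-balanced assignment: each mode is read first,
# second, and third by exactly three of the nine radiologists. Reader
# labels and the rotation scheme are illustrative assumptions.
MODES = ["clinic-BI-RADS", "ROC", "free-response ROC"]

def mode_balanced_orders(readers, seed=0):
    rng = random.Random(seed)
    shuffled = readers[:]
    rng.shuffle(shuffled)  # block randomization: who gets which order
    orders = {}
    for i, reader in enumerate(shuffled):
        k = i % 3  # rows of a 3 x 3 Latin square, each used by 3 readers
        orders[reader] = MODES[k:] + MODES[:k]
    return orders

readers = [f"reader_{n}" for n in range(1, 10)]
for reader, order in mode_balanced_orders(readers).items():
    print(reader, "->", " / ".join(order))
```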

Common and reader-specific mammograms were mixed, and radiologists read all cases in one mode before moving to the next; the mode-balanced, case-randomized study was managed by a comprehensive computer program. The set read by each radiologist included a “common” set of 155 screen-film mammograms originally read in the clinic by other radiologists not participating in the study and a “reader-specific” set of mammograms that the reader had read clinically 2–6 years earlier. Four-view screen-film mammograms (ie, “current” mammograms), as well as the comparison mammogram used during the original clinical interpretation (obtained at least 2 years before the study when available, or 1 year before the study when it was the only available mammogram), were made available to the radiologists during the readings. Radiologists interpreted each mammogram as they would in the clinic and rated the right and left breasts separately. The results of the clinic-BI-RADS mode are the focus of this article because it is the mode most similar to clinical practice. In the future, we plan to report the results of the two other modes (ROC and free-response ROC) and their relationship to the results of the clinic-BI-RADS mode.
