AI datasets are filled with errors, and that's warping what we know about AI


Yes, but: In recent years, studies have found that these datasets can contain serious flaws. ImageNet, for example, contains racist and sexist labels as well as photos of people's faces obtained without consent. The latest study looks at another dimension: the fact that many of the labels are simply flat-out wrong. A mushroom is labeled a spoon, a frog is labeled a cat, and a high note from Ariana Grande is labeled a whistle. The ImageNet test set has an estimated label error rate of 5.8%. Meanwhile, the test set for QuickDraw, a compilation of hand drawings, has an estimated error rate of 10.1%.

How was it measured? Each of the ten datasets used for evaluating models has a corresponding dataset used for training them. The researchers, MIT graduate students Curtis G. Northcutt and Anish Athalye and alum Jonas Mueller, used the training datasets to train a machine-learning model and then used it to predict the labels in the testing data. If the model disagreed with the original label, the data point was flagged for manual review. Five human reviewers on Amazon Mechanical Turk were asked to vote on which label—the model's or the original—they thought was correct. If the majority of the human reviewers agreed with the model, the original label was tallied as an error and then corrected.
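The flagging step described above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the authors' actual pipeline (the dataset, model, and variable names here are all placeholders): train a model on the training split, predict labels for the test split, and flag every disagreement as a candidate for human review.

```python
# Sketch of the model-disagreement flagging step, using synthetic data
# and a simple classifier as stand-ins for the real datasets and models.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one of the ten benchmark datasets.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train on the training split, then predict labels for the test split.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
predicted = model.predict(X_test)

# Wherever the model disagrees with the given test label, flag the
# example; in the study, these flagged items went to human reviewers.
flagged = [i for i, (p, t) in enumerate(zip(predicted, y_test)) if p != t]
print(f"{len(flagged)} of {len(y_test)} test labels flagged for review")
```

In the actual study the human vote, not the model, made the final call: a flagged label only counted as an error if a majority of reviewers sided with the model.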

Does this matter? Yes. The researchers looked at 34 models whose performance had previously been measured against the ImageNet test set. They then re-measured each model against the roughly 1,500 examples whose labels were found to be wrong. Models that had performed relatively poorly against the original, incorrect labels turned out to be some of the best performers once the labels were corrected. In particular, the simpler models seemed to fare better on the corrected data than the more complicated models used by tech giants like Google for image recognition and assumed to be the best in the field. In other words, we may have an inflated sense of how great these complicated models are because of flawed testing data.
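A toy example with made-up numbers shows how a ranking can flip once test labels are corrected. Here two hypothetical models are scored on the same five test items, first against the labels as shipped and then against the labels after human review:

```python
# Toy illustration (invented numbers) of a leaderboard flip after
# label correction: re-score each model against the fixed labels.
def accuracy(preds, labels):
    """Fraction of predictions that match the given labels."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

original  = [0, 1, 1, 0, 1]  # labels as shipped with the test set
corrected = [0, 1, 0, 0, 0]  # labels after review (items 3 and 5 fixed)

model_a = [0, 1, 1, 0, 1]    # matches the noisy labels perfectly
model_b = [0, 1, 0, 0, 0]    # penalized by the two bad labels

print(accuracy(model_a, original), accuracy(model_a, corrected))  # 1.0 0.6
print(accuracy(model_b, original), accuracy(model_b, corrected))  # 0.6 1.0
```

On the original labels model A looks clearly better; on the corrected labels the ranking reverses, which is the pattern the researchers observed between complex and simpler models.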

Now what? Northcutt encourages the AI field to create cleaner datasets for evaluating models and tracking the field's progress. He also recommends that researchers practice better data hygiene when working with their own data. "If you have a noisy dataset and a bunch of models you're trying out, and you're going to deploy them in the real world," he says, you could end up selecting the wrong model unless you clean the testing data first. To this end, he open-sourced the code he used in his study for correcting label errors, which he says is already in use at a few major tech companies.