SPRITE Case Study #5: Sunset for Souper Man.

  • this is the FORTY-SIXTH of Wansink’s papers to come under scrutiny from me and mine… it might even be more;
  • so far, his career count for problematic papers is five retracted and thirteen corrected, which is a truly stunning tally;
  • one of these was seriously unusual in that it was retracted twice: removed once in favour of a correction, after which the corrected version was also judged too flawed to remain in publication… this is unprecedented in my experience;
  • the issues involved have been covered in the Chronicle of Higher Education, Slate, Forbes, the Boston Globe, Ars Technica, Boing Boing, The Guardian… and probably elsewhere;
  • most importantly, all of the above was covered in a long series of articles on Buzzfeed, which included not only full coverage of the issues but some excellent investigative journalism as well, and in a similarly long series on Retraction Watch.

There’s a glass trophy case filled with items Wansink has made famous, like the bottomless soup bowl that he used to prove that people will eat 73-percent more if their bowl is constantly replenished via a hidden tube. A cool study, if there ever was one.


Cool, indeed.

Forcing SPRITE

Consider a question answered on a Likert-type scale, with answers from 1 to 9, which is given to n=20 people.

  • build a basic histogram including 14 values we can shuffle to change the SD, and 6 values we cannot (which, in this case, would all be 5);
  • shuffle just the remaining places until we get the right SD for the whole sample;
  • chuck the solution out if it has the wrong number of 5s (yes, there’s a better way to do it, but it doesn’t matter now);
  • repeat the above until we have a set of viable histograms.
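
The loop above can be sketched in a few lines of Python. One liberty taken: SPRITE proper shuffles values heuristically, but with only 14 free slots the space is small enough to enumerate every possible multiset outright, which is simpler to show. The target mean and SD below are hypothetical, not taken from any paper:

```python
from itertools import combinations_with_replacement

def forced_sprite(target_mean, target_sd, n=20, scale=range(1, 10),
                  fixed=(5,) * 6):
    """Find every n=20 response set on a 1-9 scale whose mean and SD
    round to the reported values, with six responses pinned at 5."""
    n_free = n - len(fixed)
    solutions = []
    for free in combinations_with_replacement(scale, n_free):
        sample = sorted(free + fixed)
        # chuck the solution out if it has the wrong number of 5s
        if sample.count(5) != len(fixed):
            continue
        total = sum(sample)
        if round(total / n, 2) != round(target_mean, 2):
            continue
        # sample SD via the sum-of-squares shortcut
        ss = sum(x * x for x in sample) - total * total / n
        if round((ss / (n - 1)) ** 0.5, 2) == round(target_sd, 2):
            solutions.append(tuple(sample))
    return solutions
```

With targets of mean 5.00 and SD 0.86, for example, the constraints pin down exactly one viable sample: seven 4s, six 5s, seven 6s.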

Wansink et al. (2005)

This study investigated whether visual food cues interfered with the physical experience of being full. Every experimental session ran 4 people: 2 with normal bowls of soup (18oz), and 2 with special doctored ones, which were sneakily connected by a tube beneath the table to a whole separate POT of soup, and were set to refill slowly — so you could eat and eat and eat, and the level of soup in the bowl would only go down SLOWWWWWWLY.

  • these people either don’t know what ounces are (they might be international students, after all),
  • or they didn’t answer the question seriously (undergraduates!),
  • … or the data is incorrectly described.
  • We have not even attempted the unbearably messy task of trying to reproduce the correlations between actual and estimated consumption. However, compare the estimated bottomless figures (2 crazy people who thought they ate a whole Warhol catalogue of soup, plus 29 reasonable values) with the actual bottomless figures (if anything like the pilot, a few low values and a lot of normally hungry people). Stitching these together would require a surprising number of people to be really, really terrible at soup estimation.
  • The paper carefully describes how the diners were laid out in groups of 4, with 2 normal bowls and 2 refillable bowls at each sitting. This is how the table is built! Given that, it is never explained how the cell sizes come to be n=23 and n=31. These are wildly dissimilar, and neither is divisible by 2. And the overall sample size is n=54, i.e. not divisible by 4. Even if exclusions were made, the sample sizes contradict the methods as described. Of all the points here, this is also the only one a reviewer could realistically have spotted.
  • The 10 p-values on the bottom of Table 1 make me suspicious, because they are all quite non-significant. These aren’t reported but are easily calculated — p=0.5404, 1, 0.5371, 0.6647, 0.8884, 0.8907, 0.4678, 0.3667, 0.2383, 0.8371. However, there is nothing I can reliably glean from these using Carlisle’s method, for two reasons. (1) The problem of non-independence arises here: for instance, the question “I carefully paid attention to how much I ate” is very closely related to “I carefully monitored how much soup I ate”, so these p-values should be closely related to each other. (2) Using Stouffer’s Method to combine p-values goes very queer with a p-value of exactly 1, as we have here: the method returns p=0 (which is uninterpretable), and entering the 1 as some version of 0.99999… gives wildly changing answers depending on the precision. This is not a criticism of the paper in question, just a note of interest. The appropriateness of different omnibus p-values is discussed in a great pre-print here.
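
That wild sensitivity is easy to demonstrate. Below is a minimal sketch of Stouffer’s Method using Python’s statistics.NormalDist, under the common convention z = Φ⁻¹(1 − p); the p-values are the ten recalculated above, and how to enter the exact 1 is precisely the problem:

```python
from statistics import NormalDist

def stouffer(pvals):
    """Stouffer's method: Z = sum(z_i) / sqrt(k), with z_i = inverse-normal(1 - p_i)."""
    nd = NormalDist()
    z = [nd.inv_cdf(1 - p) for p in pvals]  # blows up if any p is exactly 0 or 1
    return 1 - nd.cdf(sum(z) / len(z) ** 0.5)

# the nine calculable p-values from Table 1, leaving the exact 1 aside
table1 = [0.5404, 0.5371, 0.6647, 0.8884, 0.8907,
          0.4678, 0.3667, 0.2383, 0.8371]

# entering the 1 as "some version of 0.99999…" swings the combined p-value
print(stouffer(table1 + [0.99999]))       # one answer
print(stouffer(table1 + [0.999999999]))   # a noticeably different one
```

Passing the 1 in unmodified fails outright, since Φ⁻¹(0) is undefined — which is the degenerate behaviour noted above.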


The conclusion is this — in my opinion, this paper should be retracted. A lot of soup-related history should be re-written.



James Heathers
