The GRIM test — a method for evaluating published research.

James Heathers
12 min read · May 23, 2016


HEADER NOTE: My follow-up piece to the below is now **here**.
HEADER NOTE: My new technique SPRITE is **here**.

Ever had one of those ideas where you thought: “No, this is too simple, someone must have already thought of it.”

And then found that no-one had?

And that it was a good idea after all?

Well, that’s what happened to us.

(Who is us? I’m going to use pronouns messily through the following almost-3000 words, but let the record show: ‘us’ and ‘we’ are Nick Brown and myself.)

The pre-print of this paper is HERE — “The GRIM test: A simple technique detects numerous anomalies in the reporting of results in psychology”.

The GRIM test.

The GRIM test is, above all, very simple.

It is a method of evaluating the accuracy of published research. It doesn’t require access to the dataset, it just needs the mean (the average) and the sample size.

It can’t be used on all datasets — or maybe even on most datasets.

It will be of much more use in the social sciences than in other fields, so that’s where we’ve published it. We’ll assume from here that we’re working with samples of people, rather than animals or cells or climatic conditions etc.

Psychology is an area where we often collect data from small samples which are made up of whole numbers.

So for instance, we might ask 10 people how old are you?

Or we might ask 20 people on a scale from 1 to 7, how angry are you? (where 1 means not angry at all and 7 means furious)

Or we might ask 40 people which ethnicity are you? Choose the most appropriate from: Caucasian, Asian, Pacific Islander, Other.

Or we might ask 15 people how tall are you (to the nearest centimeter)?

Small samples, whole numbers.

These sorts of samples are drawn from different data types (for instance, race data is categorical and anger survey data is ordinal and age is continuous) but those distinctions don’t matter here. If we add them up, they all return means or percentages which have a special property: they have unusual granularity. That is, they are composed of individual sub-components (grains) which means they aren’t continuous.

Think of it this way: if everyone was reporting their age to the nearest year (i.e. 33 years old), that is coarser than to the nearest month (i.e. 33 years, 4 months). Age to the nearest day (i.e. 33 years, 4 months, 20 days) is finer than age to the nearest month or year.

(This doesn’t apply so much to categorical data, because it involves the sorting of whole numbers into bins. You’ll see why this is important in a second… )

An Example of Granularity

Let’s make a pretend sample of twelve undergraduates, with ages as follows:

17,19,19,20,20,21,21,21,21,22,24,26

The average age is 20.92 (2dp), and we run the experiment on a Monday.

However, the youngest person in our sample is about to turn 18. At midnight, their age ticks over, and we all run to the pub for a drink.

(If you live in a real country, of course. Sorry Uncle Sam.)

Now, hangovers notwithstanding, we run the experiment again on Tuesday. Now our sample has the following age data:

18,19,19,20,20,21,21,21,21,22,24,26

The average age is 21 exactly.

Now, consider this: the sum of ages just changed by one unit, which is the smallest amount possible. It was 251 (which divided by 12 is 20.92), and with the birthday of the youngest member, became 252 (which divided by 12 is 21 exactly).

Thus, the minimum amount the sum can change by is one, and hence the minimum amount the average can change by is one twelfth.
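To make that spacing concrete, here is a quick illustration (mine, in Python, not from the paper) listing every mean that twelve whole-number ages can actually produce in the neighbourhood of 21:

```python
# The only possible means for n = 12 whole-number ages are multiples of 1/12.
# Listing the candidate sums around 21 shows the gaps between reportable means.
n = 12
for total in range(248, 256):  # candidate sums of the twelve ages
    print(total, "->", round(total / n, 2))

# 248 -> 20.67, 249 -> 20.75, 250 -> 20.83, 251 -> 20.92,
# 252 -> 21.0,  253 -> 21.08, 254 -> 21.17, 255 -> 21.25
# Any reported two-decimal mean that falls between these values cannot have
# come from twelve whole-number ages.
```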

So, what happens when you are reading a paper and you see this?

Participants and Design

Twenty-four female first-year psychology students at the University of Relevant Errors (M age = 20.67, SD = 2.22) were randomly allocated to either the drug (N = 12; M = 20.95) or placebo (N = 12; M = 20.33) condition in return for course credit.

Well, usually, absolutely nothing whatsoever. This looks plausible, and it would be a cold day in hell before anyone ever thought to check it.

But: if you do check it, you find it’s wrong. The ages are impossible.

If you remember from before, when you’re adding up whole numbers, the minimum amount the mean age of a sample of twelve people can change by is one-twelfth. Now, look at the mean of the drug condition…

(N = 12; M = 20.95)

You can’t take the average of twelve whole-number ages and get 20.95. This is inconsistent with the stated cell size (n=12). The paper is wrong. Not ‘probably wrong’ or ‘suspicious’ — it’s wrong.

We formalised the above into a very simple test — the granularity-related inconsistency of means (GRIM) test. It evaluates whether reported averages can be made out of their reported sample sizes.
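If you’d like to play along at home, here is a minimal sketch of the check in Python (our originals were Matlab and R scripts; the function name and the small search window here are just my illustration):

```python
def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """Return True if some total of n whole-number responses rounds to reported_mean."""
    target = round(reported_mean, decimals)
    approx_total = reported_mean * n
    # For the small samples GRIM applies to, it is enough to check the integer
    # totals nearest the implied sum. (A more careful version would also allow
    # for both rounding conventions, round-half-up and round-half-to-even.)
    for total in range(int(approx_total) - 1, int(approx_total) + 2):
        if round(total / n, decimals) == target:
            return True
    return False

# The two n = 12 cells from the example above:
print(grim_consistent(20.95, 12))  # False: no twelve whole-number ages average to 20.95
print(grim_consistent(20.33, 12))  # True: e.g. 244 / 12 = 20.333..., which reports as 20.33
```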

How did this come about?

From analysing data that we were certain was fraudulent.

(Unfortunately, I can’t tell you what the data is. Or how we knew. That’s actually another story, and one that’s not available for the telling. At least, not yet.)

Data like this, where you’re certain there are problems hidden in it, you can kick around forever. With enough poking and prodding from various angles, checking its normal statistical properties, correlations, assortment, and so on, you always run across something which doesn’t fit properly.

Here’s a great quote on the process from Pandit (2012) I found recently:

Those wishing to invent data have a hard task. They must ensure that all the data satisfy several layers of statistical cross-examination. Haldane referred to these as the ‘orders of faking’ [1].

In his words, ‘first-order faking’ is to ensure simply that the mean values match what is expected. For his ‘second-order faking’, things become more difficult since the variances of these means must also be within those expected, and further consistent with several possibly inter-related variables. His ‘third-order faking’ is extremely difficult because the results must also match several established laws of nature or mathematics, described by patterns like central limit theorem, the Hardy-Weinberg Law, the law of conservation of energy or mass, and so on.

It is therefore always so much easier actually to do the experiment than to invent its results.

I laughed like a drain when I found this paper, because of a note I’d left myself through the investigative process, which — with the Australian expressions redacted — said:

My considered opinion, after doing this now for far too long: it’s HARDER to convincingly fake data than it is to run real experiments.

The amount of ***(toil) required to actually create data like this from scratch is *** (very) nightmarish. It’s a task drastically out of reach of the **** ****(foolish people) who’d try such a bush league stunt in the first place.

The only refuge of the ‘scoundrel’ here is just the lack of sunlight. Open data policies are utterly fatal to the ability to distort or manipulate.

But, what do you do when you don’t have the data?

We had a few fake data sets to analyse, so it was easy enough to detect the problems.

(Note: as yet, the story as to why still can’t be told in full. That’s another tale for another day. Watch this space.)

But we also had a lot of other accompanying papers with no data whatsoever. And we were very unlikely to get any more data. If someone does dishonest research, and you start requesting more and more data from them, you’d be unsurprised to find out how often their dog would turn up and ‘eat their homework’.

So, for some of the simpler papers with smaller datasets, we tried to reverse engineer them — I wrote a series of Matlab scripts which searched for numerical combinations that reproduced the reported means, standard deviations, and other statistics. Nick did the same in R.
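The flavour of those scripts was roughly this (a simplified Python sketch, not our actual Matlab or R code; the function name, the crude random search, and the target numbers are all made up for illustration): keep generating whole-number datasets within the scale’s range until one reproduces the reported mean and SD.

```python
import random
import statistics

def find_candidate(n, target_mean, target_sd, lo, hi, decimals=2, tries=100_000):
    """Randomly search for n whole numbers in [lo, hi] whose mean and sample SD
    round to the reported values. Returns one matching dataset, or None."""
    for _ in range(tries):
        data = [random.randint(lo, hi) for _ in range(n)]
        if round(sum(data) / n, decimals) != target_mean:
            continue  # cheap check first: the mean has to match before we bother with the SD
        if round(statistics.stdev(data), decimals) == target_sd:
            return sorted(data)
    return None  # nothing found: maybe unlucky, or maybe the reported statistics are impossible

# Hypothetical reported values: a 1-to-7 scale, n = 12, M = 4.00, SD = 1.86.
# Prints one dataset consistent with those values (there are many), or None.
print(find_candidate(12, 4.00, 1.86, 1, 7))
```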

The problem was, sometimes the scripts would work, and we could find a possible dataset. But sometimes everything went horribly wrong.

Originally, I thought it was my rather slapdash programming. But it wasn’t. The code was fine.

We eventually realised the problem was that we often failed to recreate a dataset from the described summary statistics because the mean was actually impossible.

It was such a simple and brutally straightforward observation that we went scuttering around looking for where this observation had been previously published. It must have been, right?

To our lasting surprise, it hadn’t been. At least, as far as we can tell. We discreetly asked a few people who know about these sorts of things — they hadn’t seen it published either.

And the best piece of evidence I’m aware of: if this had been published somewhere, and it was so easy to do, then why in this era of increased accountability wasn’t everyone already using it?

(It might be published out there somewhere… we just don’t know where.)

So, in summary

  • The GRIM test detects inconsistencies in the published means of small samples.
  • It is embarrassingly easy to understand and to run.

So, we put it to work.

Using the GRIM test — What we did.

We took hundreds of papers recently published in psychology journals and GRIM-tested them.

Specifically, we drew samples from Psychological Science, Journal of Experimental Psychology: General, and Journal of Personality and Social Psychology, using search terms chosen so that almost all of the articles contained scale data, covering the last five years. The final sample was n = 260 papers.

As above, the applicability of the GRIM test changes with:

  1. sample size,
  2. the number of decimal places reported in the mean, and
  3. the number of sub-components in each thing that’s measured (i.e. a scale with 7 items has more sub-components than a scale with 2 items).

Most papers didn’t actually have any numbers that could be checked.

Of the subset of articles that were amenable to testing with the GRIM technique (N = 71), around half (N = 36; 50.7%) appeared to contain at least one reported mean inconsistent with the reported sample sizes and scale characteristics, and over a fifth (N = 16; 22.5%) contained multiple such inconsistencies. We requested the data sets corresponding to N = 21 of these articles, received positive responses in N = 9 cases, and were able to confirm the presence of at least one reporting error in every one of these cases, with N = 2 articles requiring extensive corrections.

I’m going to repeat that, because I think it bears repeating:

1 in 2 papers that we checked reported at least one impossible value.

1 in 5 papers that we checked reported multiple impossible values.

In papers with multiple impossible values, we asked the authors for the datasets they used. This was so we could a) see if the method worked and b) check the numbers up close.

What’s going on?

First of all, the GRIM test works very well, because we found an inconsistency in every dataset we received. These errors had a variety of sources:

1. Us evaluating a mean incorrectly / making our own mistakes

Make no mistake about it, we made some errors. This was a Herculean task, which involved hand-checking all the results from all 260 papers. Nick, whose focus and attention to detail are much better than mine, did most of the work. It was marvelous fun and by that I mean it was dreadful. We found two instances where we misunderstood the paper and checked something that turned out OK.

2. Incorrect reporting of cell sizes

This was very common — a paper would split a group of 40 people into two groups… and not tell you how big the groups were. You’d assume 20 each, right? Well, not so fast. Sometimes the groups were uneven (which meant checking not just one mean, but every possible split), and when we found a split consistent with the published figures, everything appeared to be correct. Other times, the cell sizes were wrong.

3. Bad reporting of composite measures

Sometimes, what we thought was a mistake might be the result of the items we scored having sub-items. For instance, if there was an impossible mean from a sample of n=20, but each person answered four questions to make up the mean, what appears to be a mistake might not be. Some papers left these details out.
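As a rough illustration, reusing the grim_consistent sketch from earlier (the numbers here are hypothetical): a mean that fails the check for twenty people on its own can pass once you remember that each person contributed four item scores, because the effective denominator becomes 20 × 4 = 80.

```python
# Hypothetical composite-measure example, reusing grim_consistent from above.
print(grim_consistent(3.04, 20))       # False: with one score per person, granularity is 1/20
print(grim_consistent(3.04, 20 * 4))   # True: over 20 people x 4 items, 243/80 = 3.0375, i.e. 3.04
```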

4. Typo

Version control between authors, late nights, bad copy/paste job, spreadsheet mistake… it happens. We found Excel formulas that terminated at the wrong row, for instance. This was probably the major source of inconsistency overall.

5. Not accounting for missing data

Sometimes people in your study go missing, or drop out, or your equipment fails, or you spill coffee on your memory stick. Some papers report their overall sample sizes (how many people were enrolled in the study in the first place) but not how many people completed the study. Bit dodgy leaving these figures out — it never makes your paper look good to say “20 people started the study, but only 15 finished it” — but not a crime.

6. Fraud

… Yes, let’s talk about this.

Fraud.

Obviously, this is the big one.

Do we know how much of it we found? Absolutely not.

Do we know who’s guilty or innocent of it? Not at all.

Are we accusing anyone of anything? Not on your life.

But.

Is it likely we found some?

Maybe.

Let me run a scenario by you:

After going to the trouble of running an experiment, an experimenter tallies up the results, and the primary result of interest is ‘almost significant’. That is, a statistical test of some form has been run and the means just aren’t far enough apart for the difference to meet our arbitrary (i.e. ridiculous) criteria for determining meaningful differences.

What would a dishonest person do? Well, change the means around a little. Not by much. Just a tick. (Assuming you couldn’t get there with some kind of cheeky rounding procedure.)

The statistical test which was almost-sort-of-significant is now reporting actually-significant.

The only problem is, of course, that sometimes when you do this, the means will be changed from a real mean to an impossible one.

And, with our technique in place, our dishonest researcher may have made a terribly grave error by publishing that mean in broad daylight, for everyone to see. Now, if the paper is amenable to GRIM testing, someone can come along at any time, and determine that this fictitious mean could never have existed.

In fact, now that this result is out there, they probably will. Every study in the published record is now up for grabs. Don’t believe the paper? Well, check the means with the GRIM calculator and go from there.

Of course, by itself, a single inconsistent mean doesn’t mean much.

But say a paper had multiple inconsistent means…

And the authors’ previous work did also, going back several years…

Questions will be asked.

The Real Problem

And while I’m being ominous here: we are far more concerned with the data we didn’t receive than the data we did receive.

We requested 21 papers’ worth of data.

We received 9.

What’s in the final 12?

Some of the below overlap, but here are some of the issues:

  • 2 authors, even though we confirmed their institutional emails were current, never replied at all to any email
  • 2 authors who were … let’s be charitable and say ‘hostile’ to the process
  • 2 authors who were perfectly happy to talk about the process but gently faded away when it was clear that we wanted to see their data
  • 2 authors who replied with identically worded refusals to share data, even though they seem to have no formal connection otherwise
  • 2 papers where one of the authors or associates is known to have committed research fraud previously…

And, if you have any anxiety that we’re being unreasonable, these papers are almost all published in journals where the authors have explicitly signed a document affirming that they must share their data for precisely this reason — to verify results through reanalysis.

Make no mistake about it, we chose these journals not just for their profile, but because we are unambiguously entitled to check them as a condition of publication.

Which we’ll be doing.

(Should note: JEP:G and JPSP explicitly guarantee this; PS only has a looser, broader commitment to open science. And two out of three ain’t bad. Apologies to music lovers.)

Conclusion

What happens from here will be interesting.

At best, a lot of researchers who weren’t previously much interested in meta-research will now have a simple tool for evaluating the accuracy of (some) published means in research papers. Especially pre-publication — we’re hoping first and foremost that this will make for a useful tool during review.

If there’s any uptake of this, we should start to see questions being asked at a level which allows a greater attention to detail than previously.

Imagine approaching a row of houses, where you want to look inside. Only some of the houses have windows, and only some of the windows you can reach. But even a tiny, smudged, crooked, frosted-glass window is useful — when circumstances line up right, it will let us see inside the house. And that’s better than what we had before.
