Assessing Science: A Perspective From The Cheap Seats

James Heathers
Jul 6, 2016

Everyone and their mum is at present thinking, and thinking very hard, about how we should structure the system we use to assess science.

Here are my opinions, straight from the cheap seats.

Before we start…

None of the below is precious to me. I am open to changing any of it.

I may well be wrong. Feel free to tell me (nicely).

I might even be mad. We shall see.

1. You can’t kill traditional peer review

My strong suspicion is that peer review in the traditional format, a formal process of pre-publication review by selected experts, is necessary and will remain necessary.

Why?

Because of the psychological distinction between critical reading and critical evaluation. I think people perform these tasks differently, as they have different assumptions about what’s involved.

More specifically: I think that when people read papers, the level of skepticism they deploy depends very heavily on what they’re doing at the time.

When you read any given paper less-than-critically, and we sometimes do, you will hop straight to the conclusions listed and use them to bolster your own arguments. This is a quick and superficial process. Say you need to find a quick citation or you’re trying to get an overview of a field. Of course, if you don’t have a good command of the material being discussed, your reading will be at this level by necessity.

If you read any given paper critically, you will make a judgment as to its ‘significance’, methodology, accuracy, and so on. How well it solves a problem. How it fits into the broader literature. Everyone else who reads the paper will also (hopefully) do this. Sometimes they will do it both quickly and expertly because they have a deep knowledge of the relevant literature, good heuristics for what is and is not acceptable, and an excellent radar for arrant bullshit.

But when you read a paper during peer-review, you assume a mantle of responsibility for the paper’s basic accuracy, over and above either of the reading modes above. You check things very carefully. You investigate values that look suspicious. Maybe you even re-do analyses. You ask for more details. You formally investigate whether references are relevant in the manner they are cited. You will raise, pissily and at great length, all issues more important than incorrect margin width. You consider broader issues, like how well the paper fits the journal which has received it. You consider yourself the line of defense against everything which can go wrong or right.

The often drab and boring sorts of tasks involved here — no-one will do them at first glance. There is no sweaty-palmed pedant anywhere willing to plod through the basic nuts and bolts of review in all the work they read. This is a monstrous idea. The skepticism involved would be exhausting. You’d get nothing else done. Basically, there’s also only so much time in the damned day, and at some point your critical eye is tempered with trust.

Once you draw this distinction between different styles of reading, it’s easy to see how a thousand cursory readings of a paper by different parties may produce absolutely nothing like the same kind of insight as one single, very thorough, detail-driven roughing up.

And it’s good that we have an explanation for how these could differ, because it happens ALL THE TIME.

Here’s the story of how Nick, my coauthor from this long streak of misery, originally became interested in the Losada and Fredrickson paper, now infamous for being the only partially retracted paper in scientific history:

“The mysteries of love, happiness, fulfilment, success, disappointment, heartache, failure, experience, random luck, environment, culture, gender, genes, and all the other myriad ingredients that make up a human life could be reduced to the figure of 2.9013.

It seemed incredible to Brown, as though it had been made up. But the number was no invention. Instead it was the product of research that had been published, after peer review, in no less authoritative a journal than American Psychologist — the pre-eminent publication in the world of psychology that is delivered to every member of the American Psychological Association. Co-authored by Barbara Fredrickson and Marcial Losada and entitled Positive Affect and the Complex Dynamics of Human Flourishing, the paper was subsequently cited more than 350 times in other academic journals.

And aside from one partially critical paper, no one had seriously questioned its validity.”

Andrew Anthony, The Guardian

How many people have read a paper if it has 350 citations?

Answer: many, many thousands.

How many seminars and journal clubs and emails and half-replications have been leveled at, to, from, about, or on it?

Answer: a great deal.

This paper is definitely part of the top 1%. Bernie Sanders would happily hold a protest on its lawn.

But what happened on the first critical reading by someone with the slightest knowledge of fluid dynamics? The underlying mathematics, the core conception at the center of the paper, turned out to be a howitzer shell of military-grade bollocks, that’s what.

Was this paper ever initially reviewed by someone who understood the mathematics? Extremely unlikely. Obviously, this represents a failure on the part of the journal to assign correct reviewers.

So, here’s another related question: in this paper’s life as a social phenomenon, widely read and discussed, did someone with a passing knowledge of fluid dynamics ever read it and think “wait on, that’s fishy”?

Well, we don’t know. But I think probably.

And the historical record shows, of course, that they did nothing whatsoever with this piece of insight from the time of publication until Nick came along a few years later. No-one adopted this as a problem. They weren’t ‘the reviewer’.

How many papers have you yourself read where the central result couldn’t be right or accurate, and you had arguments which stood against the presented conclusions?

Haven’t seen one of them since yesterday, have you?

And what did you do about it?

You wrote it off as a bad job, and didn’t get involved.

Post-publication papers have a very strong SEP (‘Somebody Else’s Problem’) field.

In other words, I suspect that if you leave the task of figuring out the basic accuracy of scientific work to the post-publication environment, hoping that the boring technical kinks will just come good if the work runs in front of enough eyeballs, then they won’t. Very few people will ‘casually’ take a great interest in the raw, necessary, crucial details. Welcome to Jack’s 60 hour week and Jill’s grant deadline. Welcome to human nature. We’re here for the conclusions, please. Lay those out for us, in short order.

(Exceptions: the paper is a) very important, b) very high-profile, or c) from the lab of someone who deserves, let’s say ‘heightened scrutiny’. Overall, this is actually very, very few papers.)

Finally, we’re all conditioned by a culture that has some deeply strange ideas. For instance, collegiality often seems to mean “acquiescence in the face of other people’s work you don’t like or trust”. People confuse ‘collegial’, as in ‘a shared sense of responsibility’, with simply ‘agreeable’. They’re not the same thing. Sorry.

2. “But! But! It doesn’t work like that. Peer review is often awful. Not your diligent, detail-oriented fantasy version.”

I agree. So, train better reviewers.

“James! I kind of agree with what you wrote above, but bear in mind that as a quality control mechanism, peer-reviewing kind of sucks. It’s a lottery. It’s capricious. It misses important issues all the time.”

Glib answer: sure, but does it miss more or fewer things than NOT peer-reviewing?

Serious answer: I agree, the process can be utterly sketchy. I also think:

a) This concern (‘review is awful and capricious! rabble rabble! sky falling at 7!’) is somewhat overblown for non-fancy journals, and while I’m not quite as positive as this guy, I’m grudgingly on the same side. Fancy journals have artificially low acceptance rates and are much more capricious.

and

b) The obvious solution here, to me at least, is not to replace the system of peer review but to improve the reviewing process.

In an only mildly circuitous fashion, this is what the GRIM test is all about: quickly take a look behind the veil of the numbers presented to you in a study you’re reviewing, and see whether they could actually exist. Other tools exist for checking the consistency of t, p, and r values. Have you seen statcheck? It’s very cool. These things, and anything like them, should be mandatory.

These very fast and relatively straightforward techniques, and many others besides, could detect a LOT of basic errors in data during peer review. They require no mathematical skill to use — as difficult mathematical techniques go, they’re a gnat’s wing above splitting the cheque 8 ways at a Chinese restaurant.
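To give you a feel for how undemanding these checks are, here’s a rough Python sketch of both ideas: a GRIM-style check on a reported mean, and a statcheck-style recomputation of a p value from a t statistic and its degrees of freedom. It’s illustrative only; the function names and the tolerance are mine, not anything taken from the published tools.

```python
# A rough, illustrative sketch (not the published GRIM or statcheck code) of
# the two checks described above. Function names and the tolerance are mine.
import math

from scipy import stats


def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """GRIM-style check: could `reported_mean`, rounded to `decimals` places,
    have come from `n` integer-valued responses? The true sum of n integers
    must itself be an integer, so we test the integers adjacent to mean * n."""
    target = round(reported_mean, decimals)
    total = reported_mean * n
    return any(
        round(candidate_sum / n, decimals) == target
        for candidate_sum in (math.floor(total), math.ceil(total))
    )


def t_p_consistent(t_value: float, df: int, reported_p: float,
                   tolerance: float = 0.005) -> bool:
    """statcheck-style check: recompute the two-tailed p value implied by a
    reported t statistic and its degrees of freedom, and compare."""
    recomputed_p = 2 * stats.t.sf(abs(t_value), df)
    return abs(recomputed_p - reported_p) <= tolerance


# A mean of 3.55 from n = 28 integer responses is impossible (99/28 is about
# 3.54, 100/28 is about 3.57), so the first check fails; the t/p pair passes.
print(grim_consistent(3.55, 28))       # False
print(t_p_consistent(2.10, 38, 0.04))  # True (recomputed p is roughly 0.04)
```

That really is the entire level of difficulty involved.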

Why don’t people use these simple data-review tools for, well, reviewing data?

Because no-one’s heard of them.

You know we should probably tell them, right?

Personally, I studied at a good university for a very long time, and then went to another one for a postdoc, and now another one again. I’ve reviewed for … let’s say eight or so different journals (I’m a young, young man) and no-one has ever told me a damned thing about how to perform peer review.

Zero formal training, zero informal training, nothing. No books, no pamphlets, no websites.

Nothing.

I am literally making it up as I go along.

I ape the various styles of reasonable and straightforward reviews I have happened to receive myself. And I’ve had some that have been so, so good. There are some amazing (albeit occasionally long-winded) senior people in my area. I rip off their style, and then after that, I more or less busk it.

The only feedback I get — ever — is that the editors I review for generally have an annoying tendency to send me more papers. This could be either a mark of confidence in what I’ve said, or simply the fact that I have revealed myself as the kind of Sufficiently Non-Serious person who returns reviews early, instead of claiming that my weighty genius will take weeks to confer.

Certainly there’s no META-review, where someone tells you how you did:

Dear Dr. Heathers,

Thank you so much for your review of the paper “More Lies About Physiological Measurement: A Milkmaid’s Tale” by Bogart et al.

Everyone at the Journal of Squiggly Heart Signals appreciates your attention to detail, but a few points should be raised:

a) six pages on why the authors should ‘get in the goddamn sea already’ is four pages too many, and

b) your discussion of the frequency analysis used by Bogart et al. set off several of our spam filters. Please excise the seventeen words in the attached document from all future correspondence as they are inappropriate in tone and unprofessional.

(In addition, three of your suggestions of what the authors should do to themselves are unhygienic and in defiance of human physiological limits. Please refrain from all further suggestions in future, we feel they are uncollegial and medically treacherous.)

yours,

McLovin
(Associate Junior Vice-Deputy Sub-Editor)

Yes, I’m being silly.

But even the silly letter above would be more guidance than I’ve ever seen.

And I’ll warrant that’s similar for most everyone else, too.

It seems that for peer review, there IS no set format that needs to be communicated, no formal expectations, no course you can take to learn the trade, nowhere you can go to be told what to do. You start off flying completely blind. Ever received a review that was short, dumb, and negligent? Note that a negligent review can be either positive OR negative. I’ve had plenty of reviews which were four sentences of crappy pseudo-English saying my paper was fine. I resent those as much as four sentences of crappy pseudo-English saying my paper is awful.

And, yes, some journals have little check-boxes and ratings for various attributes of the paper… but I have never seen one of those that I didn’t instantly resent.

What’s the “impact” of this paper? (Rated from 1 to 5)

How the hell should I know? Wait and see!

Now, while we’re at it, here’s a pair of related questions:

  • How many papers did you have to send to journals before you got your first really good review, one which dramatically improved your paper? Have you ever got one?
  • How often, on average, do you receive a really comprehensive detail-oriented clearly-written expert review to something you’ve written?

The first really good, comprehensive, scary-accurate review I ever saw wasn’t for me, it was another reviewer on a PLoS ONE paper I reviewed. I flagged up a bunch of technical issues on a poorly written technical paper on sensors, and called it a day. But Reviewer #3 pulled the paper completely to pieces, and prodded the authors through re-analysis, re-structuring and re-writing. When the process was finished (the poor authors had to do FIVE revisions), the paper was infinitely better.

This is very far from normal. What’s normal is a lot less insight, time and effort.

So, maybe that’s why reviewers miss details, do no data-wrangling, don’t apply their own post-hoc tests, don’t check statistics, and so on. Maybe peer review sucks at least in part because we have invested exactly no time whatsoever in defining how it should be performed, and in teaching people to meet those benchmarks.

And while it’s anonymous, the problem isn’t really anonymity.

It’s anonymity with immunity.

So, speaking of which…

3. Reviewing sucks. But don’t pay for it, rate it.

More research than ever before means more reviewing than ever before. Naturally, the collective response to this situation has been: “Oooooohhhhh what fun. Just what we all need: more unpaid work.”

So it was only a matter of time before we saw both journals which propose to pay reviewers, and services which you can pay to review things for you. The ones I know off the top of my head, although there may be more:

Veruscript.

Rubriq.

Collabra.

In any case, I’m not sure if this model will work.

Paying for reviews changes the incentive structure. Will getting paid change a reviewer’s attitude towards accepting reviews? Completing reviews? Will journals want more positive reviews if they’re paying for them (i.e. reviews that let them collect publication fees to recoup the cost of review)?

Welcome to submission charges. Publishers never knew a cost they didn’t like to pass on. Perhaps journals will start charging money up front so that review can be paid for, instead of or on top of publication fees?

Pass-on costs create two-tier systems. If journals offer expedited review which costs money, you very quickly end up with a two-tier system where submissions from better-funded workgroups get a more streamlined review process. Unfair. And not a hypothetical unfair, because it has already happened.

If there’s an alternative, perhaps it lies in getting credit for the task you’re doing without compromising the anonymity. We rate everything else already. Citation metrics are many and various. We tally up how much grant money people have. We count how many students they’ve graduated, how many years they’ve served on which committees.

But review? You could do one bad review a year or one fantastic review a week, and no-one would have the slightest idea.

Here’s a potential system:

A) Reviewers are assigned a known and public ID which identifies them. If we can figure out PMIDs and ORCIDs and DOIs then this shouldn’t be hard. ORCIDs would even work.

B) Real, non-bullshit, indexed journals (not the International World Journal of Psychology, Psychiatry, Cardiology and Basketweaving) are issued with the ability to create editor accounts.

C) During the review process, an editor has the ability to add to this record only a very few details. These minimal details are all the rating contains. There are no names of papers, no journals, no details about the review. Instead, the following is recorded:

  • The ID of the journal reviewed for.
  • Whether or not the review was completed to the minimum standards of the journal (probably about 10–15% of reviews fall short of this standard; these are your classic Reviewer #2 moments, usually quite obvious to all concerned, and they presently go unpunished and unobserved). Failures here include a) failures of promptness, b) lack of technical engagement or cursory remarks, c) unprofessional behaviour. Note the editor gets access to ALL the remarks left by reviewers, not just the ones they send through to the authors…
  • Whether or not the review was meritorious (also probably about 10%; I’d rank on clarity, fairness, technical sophistication, and utility for improvement; again, usually quite obvious to all concerned, and presently unrewarded and unobserved). I’ve had some reviews which were so, so good — and by good I mean long, critical, and intensely focused on improving the manuscript. The idea that someone did that in secret for nothing is unfair.

And the researcher’s public ID would display:

# different journals reviewed for (i.e. breadth)

# reviews complete (i.e. total contribution)

# reviews completed to minimum standards of journal (i.e. soundness)

# reviews of merit (i.e. excellence)

So, mine would look something like this:

# 8

# about 25

# HOPEFULLY also about 25

# maybe 1–2, I don’t know

Just from these four numbers (or something similar), you get a remarkably complete version of someone’s history of review.

Someone who reviews only for very specialist journals, in great detail? Small number of reviews, high percentage of meritorious ones.

Someone who reviews everything, the department workhorse who will read anything and leave good general comments? High number of reviews, mostly up to minimum standard.

The martinet idiot who’ll punish any transgression in unprofessional language? A lower than average percentage of minimum standard reviews.
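To make it concrete, here’s a rough sketch of what such a record might look like as a data structure. Nothing here is a working system; the class and field names are placeholders of mine, and only the four public numbers come from the proposal above.

```python
# A toy sketch of the proposed reviewer record. Class and field names are
# placeholders; only the four public counts come from the proposal above.
from dataclasses import dataclass, field


@dataclass
class ReviewEntry:
    journal_id: str    # ID of the (real, indexed) journal reviewed for
    met_minimum: bool  # completed to the journal's minimum standards?
    meritorious: bool  # flagged by the editor as an exceptional review?


@dataclass
class ReviewerRecord:
    reviewer_id: str   # known, public ID; an ORCID would do
    entries: list = field(default_factory=list)

    def add_review(self, journal_id: str, met_minimum: bool, meritorious: bool) -> None:
        """Only an editor account at the journal would be allowed to call this."""
        self.entries.append(ReviewEntry(journal_id, met_minimum, meritorious))

    def public_profile(self) -> dict:
        """The four numbers the public ID would display."""
        return {
            "journals_reviewed_for": len({e.journal_id for e in self.entries}),      # breadth
            "reviews_completed": len(self.entries),                                  # total contribution
            "reviews_to_minimum_standard": sum(e.met_minimum for e in self.entries), # soundness
            "reviews_of_merit": sum(e.meritorious for e in self.entries),            # excellence
        }
```

An editor ticks two boxes per review; everything else is just counting.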

Obviously I’m not wedded to the details — instead, what interests me is that in an environment where everything else is a metric, and we argue about metrics all the time, we have a task which is absolutely central to our collective work, and which is a) easy enough to measure, b) measurable with a reasonable degree of protection for all parties, and c) presently unmeasured.

Last point: a system already exists which attempts something similar. It’s called Publons, and as might be expected, I’m very interested in it. It overlaps a great deal with my sketchy system as outlined above. I hope it’s successful.

4. Open review — good, bad or ugly?

That is, should the formal remarks you leave as the academic assessment of someone else’s research be public, and publicly identifiable as yours?

I think yes, on balance. I think you are more thoughtful and considerate if you know other people will read your reviews. I prefer to sign my reviews when possible.

A few things concern me about mandatory open review, though:

Sunshine also grows weeds — open reviewing can expose you to vindictive, difficult people

(Let’s make the very important assumption here that I know what I’m talking about when I review something. Without that, the following is way off.)

Ever reviewed a paper that’s truly dreadful from a workgroup full of people you know, from a workgroup that’s important and influential in your area, or a workgroup who don’t like you personally?

I have. You have, too.

If I leave a very mild-mannered, encouraging but essentially destructive review, and the editor pays close attention to this and spikes the paper, then I might not be your favourite person.

What if I catch you fiddling with your clinical trial endpoints? Or using inconsistent, bad or incorrect statistical methods? What if you’re p-hacking? What if your whole methodological approach is a busted flush, and I can demonstrate that? Even less favourite.

And if you know my name, then I’m trusting you to be an adult about the fact that I think your crap paper is crap.

Will this always happen? Absolutely not.

Have you ever met a vindictive academic? Someone with an inflated level of self-confidence? Someone who did a good line in spite? Someone with a long list of enemies? Someone whose TED talk was writing cheques their research couldn’t cash?

I have. You have, too.

Of course, most people are careful to cloak a disagreement like this in terms other than “I have done bad work”. Instead, it drifts personal: “she lacks collegiality” or “I don’t like his tone” or “this is unnecessarily critical”.

Assuming that criticism of your work is criticism of you personally, and that your critics are bad people, is a marvelous refuge if you are awful at your job. Especially if you don’t know you’re awful.

Everything becomes ‘fine’

The moment you start making your ratings public, a tremendous middling emerges.

What is the ‘significance’ of this paper? 3 out of 5. What is the ‘merit’ of this paper? 6 out of 10. What is the ‘general interest’ of this paper? 2 out of 3.

The vast majority of scores offered are somewhere in the middle of whatever endpoints we’ve decided on. Apparently with open ratings no-one ever does something wrong but brilliant, or solid but potentially usable, or not immediately refutable and potentially very dangerous, or safe and wet and deadly dog-dick boring. Open ratings turn everyone into the British Home Secretary from a period drama.

Things which are awful become ‘concerning’.

Things which are excellent become ‘solid’.

Public scrutiny cuts both ways in the production of drab opinions. In general, I don’t like that. Especially because people don’t allow themselves to get publicly excited. Please be excited. I want to be excited too.

Conclusion:

I can’t pretend I have insights people haven’t had elsewhere. Or even that I know enough about this to change your mind.

Of only one thing am I sure: the next few years will be an interesting time in publishing, and things will change. When that change comes, it will be fast.

Consequently, we should be ready with opinions on how the process should work, because the opportunity to use those opinions will be here soon.

What do YOU think?
