This is the next entry in our series on the work of Dr. Nicolas Guéguen.
At this point, I’m going to skip the majority of the jokes, because I think we’re a bit tired and we need a Bex and a lie-down.* And there’s more to come yet.
Our new article of interest is Guéguen (2015) “Women’s hairstyle and men’s behavior: A field experiment.” Scandinavian Journal of Psychology, 56, 637–640. http://dx.doi.org/10.1111/sjop.12253
If that title leads you to expect off-brand evo psych exploring massive behavioural changes induced through the complicated medium of hairdressing… well, you’d be right.
Little research has examined the effect of women’s hairstyles on people’s behavior. In a field study, male and female passersby, walking alone in the street, were observed while walking behind a female-confederate who dropped a glove and apparently was unaware of her loss. The confederate had long dark hair arranged in three different hairstyles: one with her hair falling naturally on her shoulders and her back, one with her hair tied in a ponytail, and one with her hair twisted in a bun. Results reported that the hairstyle had no effect on female passersby’s helping behavior. However, it was found that the hairstyle influenced male passersby with men helping the confederate more readily when her hair fell naturally on her neck, shoulders and upper back.
Basically, a female confederate with a variety of hairstyles drops a glove in full view of carefully selected passersby (180 of them, actually). Participants are rated for their behaviour: 1 = does nothing, 2 = points out glove, 3 = hands glove back. In other words, the score increases as the participant increases in ‘helpfulness’.
(Glovefulness? Not a word, James. Just say helpfulness.)
The cell sizes are n=30 (men/women vs. hair natural/hair in a ponytail/hair in a bun).
That’s all. Not exactly the Riemann Hypothesis, is it.
Leaving to one side the appropriateness of using the Least Significant Difference test on 6 groups, and the fact that these seem categorically different acts by the observed participants (do nothing vs. do something, intervene physically vs. intervene verbally), there are two things here that are more than a bit whiffy.
1. The EFFECT SIZE
Dear God in heaven, the effect size.
Cohen’s old-school rule of thumb that is casually cited everywhere is easy to remember:
d~=0.20 (SMALL, negligible practical importance)
d~=0.50 (MEDIUM, moderate practical importance)
d~=0.80 (LARGE, crucial practical importance)
Well, we’ve left those bleeding in the dust as we speed into a bright future. The between-subject effect size for the men/women group difference for natural hair is a ‘healthy’ d=2.44.
To put this into perspective, I have graphed it alongside some similar effect sizes from other areas of research.
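For anyone who wants to check an effect size like this from a table of descriptives, here is a minimal Python sketch of Cohen's d for two independent groups, using the pooled standard deviation. The function name and the example numbers are illustrative assumptions, not values taken from the paper.

```python
from math import sqrt


def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Cohen's d for two independent groups, using the pooled SD."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / sqrt(pooled_var)


# Hypothetical inputs (NOT from the paper): two groups of 30,
# means two units apart, both with an SD of 2.
example_d = cohens_d(10.0, 2.0, 30, 8.0, 2.0, 30)
print(round(example_d, 2))
```

Plugging in the two cell means, SDs, and n=30 per cell from the paper's table would give the corresponding d for any pair of cells.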
The first eight points are comparing GAD patients or similar (four studies) to community samples (two studies) on an anxiety subscale of the DASS-21. (GAD stands for Generalized Anxiety Disorder; DASS stands for Depression-Anxiety-Stress Scale.)
That is, we’re comparing clinically anxious people to regular couch-dwellers on a validated measure of anxiety.
The next four points are comparing the heights of men to women in a variety of cultures. Because biology is the same everywhere, men are consistently about 4–6 inches taller. This is straightforward and uncontroversial.
The last two points are from a research area that consistently produces great big whopping effect sizes: anything to do with bodily damage (disgust, injury, blood and needle phobia etc.). Experiments in this area often involve showing people intense, nasty stimuli, and they provoke well-conserved responses, generally somewhere between a very loud EWW and vasovagal syncope, a.k.a. fainting, falling off the chair in the experiment, having me rush to review your ECG, get you a glass of water, and write out ethics forms for an Adverse Experimental Event.
And considering the obvious relevance for medical intervention, populations of people particularly sensitive to bodily damage stimuli are fairly well studied. (A lot of people give blood or get their booster shots. Some of them slide off the chair when they do so.)
Thus: the two effect sizes I’ll throw in are comparing (a) the disgust rating of watching open heart surgery (trust me, it’s FULL. ON.) vs. a boring university promo video (and these are stunningly, deadly, dog-dick dull); and (b) the rating of ‘how much will an injection hurt?’ in people with a needle phobia vs. healthy controls.
The above are the meaty, straightforward effect sizes that people in the social sciences — the messy, inexact, frustrating, conceptually muddy social sciences — usually only dream about. And, atop them all, our small sample of hairdo-appraising men, for whom the sight of a falling glove from a woman who forgot her scrunchie is like a plate of lamb chops to a starving lumberjack.
The other results in this paper are similarly big.
So, perhaps we should look at that data more carefully.
2. Those Digits Are Terminal
Here’s the table.
I’ve highlighted something you might overlook unless you were specifically searching for it:
Six means, six trailing digits. All of them zero.
We might instead expect some 3s and 7s in that second decimal place, because the total scores (all integers) were divided by the number of participants per cell (again, n=30).
At first we thought that this pattern might be due to a numerical formatting problem — for example, perhaps the numbers had been rounded to one decimal place, then expanded to two decimal places for display purposes — but the non-zero final digits of the standard deviations (SDs) and row totals are not consistent with this.
Assuming a uniform distribution of scores (and maybe we can’t, but it’ll be something like that) the chance of all six means ending in zero in this way is (1/3)^6 = .0014.
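The trailing-digit argument can be checked by brute force. The sketch below (names are my own) lists every possible integer total for a cell of 30 scores of 1, 2, or 3, rounds the mean to two decimals using integer arithmetic, and tallies the final digit. Treating the three endings as equally likely is the simplifying assumption made above, not an established fact about the data.

```python
from collections import Counter
from fractions import Fraction

N = 30  # participants per cell


def trailing_digit(total, n=N):
    """Second decimal digit of total/n after rounding to two decimals."""
    hundredths = (200 * total + n) // (2 * n)  # round-half-up of 100*total/n
    return hundredths % 10


# Every total reachable with 30 scores of 1, 2, or 3 (30 through 90).
digits = Counter(trailing_digit(t) for t in range(N, 3 * N + 1))
print(dict(digits))

# If each of the three endings is treated as equally likely (an assumption),
# six independent means all ending in zero has probability (1/3)^6.
p_all_zero = Fraction(1, 3) ** 6
print(p_all_zero, float(p_all_zero))
```

The only final digits that can occur are 0, 3, and 7, which is why six zeros in a row stands out.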
Anyway, here comes the curious part: we discovered that in all five distinct cases (the cells for the “Men — Ponytail” and “Men — Bun” conditions are identical, reducing the number of unique combinations of mean and SD to five), there is only one possible combination of scores of 1, 2, and 3 that can give the means and SDs shown in Table 1.**
In other words, we can recreate the whole dataset just from the summary statistics. That’s how little information we have here: we can gonk the whole dataset out just from the descriptives.***
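A rough sketch of how such a reconstruction can be done: with only three possible scores and n=30, there are just 496 ways to split a cell, so we can try them all and keep those whose rounded mean and SD match the reported values. This assumes the reported SD is the sample SD (n-1 denominator) and that both statistics were rounded to two decimals. The target values in the example are hypothetical, chosen for illustration, and are not copied from the paper's table.

```python
from math import sqrt


def matching_datasets(target_mean, target_sd, n=30):
    """Return every (count_of_1s, count_of_2s, count_of_3s) consistent
    with the reported mean and SD, both rounded to two decimals."""
    matches = []
    for n1 in range(n + 1):
        for n2 in range(n - n1 + 1):
            n3 = n - n1 - n2
            mean = (n1 + 2 * n2 + 3 * n3) / n
            sum_sq = n1 * (1 - mean) ** 2 + n2 * (2 - mean) ** 2 + n3 * (3 - mean) ** 2
            sd = sqrt(sum_sq / (n - 1))  # assumption: sample SD
            if (abs(round(mean, 2) - target_mean) < 1e-9
                    and abs(round(sd, 2) - target_sd) < 1e-9):
                matches.append((n1, n2, n3))
    return matches


# Hypothetical cell (NOT taken from the paper): mean 2.80, SD 0.41.
print(matching_datasets(2.80, 0.41))
```

When the returned list has exactly one entry, the descriptives pin down the raw scores completely, which is the situation described above.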
And when we do that, we find it’s… let’s say “strongly ordered”.
It is not difficult to see that this dataset contains a remarkably regular distribution of scores.
Specifically, in every condition (participant sex–hairstyle), each possible individual score (1, 2, or 3) occurs exactly 0, 6, 12, 18, or 24 times. They’re all in multiples of 6. No other counts of individual scores are present.
We originally bollocksed around with the binomial probability of the above, until becoming deeply uncertain that it was the right way to understand the problem. So, in a fit of not understanding how to address this elegantly (and feeling like my undergraduate degree in economics was some kind of horrible vocational mistake) we have to simulate a few things.
Let’s consider two scenarios:
- Every person in a group is basically an amoeba, i.e. they turn up and slop out any old glove-based behaviour at random. We therefore assign each of their individual actions at random. This means the counts have a central tendency: 10 people choosing each behaviour is much, much more likely than 30 choosing one behaviour and 0 choosing the other two. This is a pretty harsh assumption, because people’s behaviour is directed.
- Every group outcome is set by the loving hand of God. We therefore assign each of the group combinations at random. All 30 participants doing one action is as likely as any other combination. This is a pretty generous assumption (as we treat unlikely combinations as being just as probable as all the others).
If we simulate these little piggies, the probability of every count in a group being divisible by 6 is about 2.5% in the first scenario and about 6.8% in the second. We don’t know which is right, but they both return what we expected: that all-up-6’s are a fairly unlikely outcome.
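The two scenarios are small enough to compute exactly rather than simulate, so here is a hedged Python sketch that swaps the simulation for full enumeration. The function names are my own. The post does not say exactly how the "random group combination" was drawn, so two readings are shown: every split equally likely, and a sequential draw (first count uniform on 0 to 30, second count uniform on whatever is left). Both are assumptions about the procedure.

```python
from fractions import Fraction
from math import factorial

N, STEP = 30, 6


def splits(n=N):
    """All (n1, n2, n3) of non-negative counts summing to n."""
    return [(a, b, n - a - b) for a in range(n + 1) for b in range(n - a + 1)]


def is_regular(counts):
    """True if every count is a multiple of STEP."""
    return all(c % STEP == 0 for c in counts)


def p_individual_random():
    """Scenario 1: each person picks 1, 2, or 3 independently with p = 1/3."""
    ways = sum(factorial(N) // (factorial(a) * factorial(b) * factorial(c))
               for a, b, c in splits() if is_regular((a, b, c)))
    return Fraction(ways, 3 ** N)


def p_uniform_over_splits():
    """Scenario 2, reading A: every split of the 30 is equally likely."""
    all_splits = splits()
    return Fraction(sum(is_regular(s) for s in all_splits), len(all_splits))


def p_sequential_split():
    """Scenario 2, reading B: first count uniform on 0..N, second uniform on the rest."""
    total = Fraction(0)
    for a in range(0, N + 1, STEP):
        remaining = N - a
        hits = remaining // STEP + 1  # multiples of 6 in 0..remaining
        total += Fraction(1, N + 1) * Fraction(hits, remaining + 1)
    return total


for f in (p_individual_random, p_uniform_over_splits, p_sequential_split):
    print(f.__name__, round(float(f()), 4))
```

Under these assumptions the first scenario comes out at roughly 2.5%, the sequential reading at roughly 6.8%, and the strictly uniform reading a little lower, at roughly 4.2%. Whichever reading is used, the all-multiples-of-6 pattern remains an unlikely outcome for a single group.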
But we have six groups which need to have this property.
And those groups are full of independent people making independent glove-mediated decisions. In other words, this unlikely result needs to happen six times on the trot — (6.85%)⁶.
(Or, is it more fair to say that after we observe the first one, then we need to observe that property in the subsequent 5? So maybe it’s (6.85%)⁵… ?)
So, our most favourable assumption is that the likelihood of seeing this pattern is a kingly 1 in 687789. This, as usual, is the Steel Man number — note that changing any of the above probably makes the outcome in the paper less likely.
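The arithmetic behind that number is a single-group probability raised to a power. A small Python sketch, assuming the per-group probability is rounded to 0.068 (the value that reproduces the figure quoted above) and that the groups are independent:

```python
# Assumption: per-group probability of the all-multiples-of-6 pattern,
# rounded to 0.068 as in the more generous scenario.
p_single = 0.068

# "1 in X" odds if the pattern must recur in 5 or in 6 independent groups.
odds = {k: 1 / p_single ** k for k in (5, 6)}

for k, one_in in odds.items():
    print(f"{k} groups: about 1 in {round(one_in):,}")
```

Using the fifth power is the more forgiving choice; the sixth power makes the pattern rarer still.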
(NOTE: please be nice when you write to me to tell me I’ve completely misunderstood the binomial theorem, but also note that your inevitable objections won’t alter the outcome of deeming this extremely unusual.)
It also presents some unusual behaviour. In the Women-Ponytail condition, more than half of the sample watch another woman accidentally drop something, and fail to mention it to her entirely. That’s just cold, man.
But in the Men-Natural condition, almost all of them rush over to pick it up for her. Something about the lack of a hair tie turns every man jack of them, with very few exceptions, into Mr. Darcy. No-one is distracted, or fails to respond to the hypothesized ‘reproductive cues’ due to being tired, angry, in a hurry, gay and hence not subject to unstoppable Pepe Le Pew-style impulses, or simply suspicious about the people surreptitiously watching while holding clipboards.
This paper, with its gargantuan effect size and perfectly regular data, just seems really, deeply, seriously unlikely. I cannot think of a possible veridical explanation for how this hyper-powerful ultra-regulated effect could be drawn out of such a potentially messy field experiment.
For those of you playing along at home, this is the last in what has been a four-paper series from the same author. From here, we will pivot to a broader discussion of general issues within a wider body of work (10–15 papers).
* Non-Australians can look it up, and shame on you for being from elsewhere.
** Historical note: this was one of the first observations which helped catalyse the development of both GRIM and SPRITE.
*** We asked the author for this data and received it. Our recreation was correct.