The insignificance of significance testing.
I mentally gave thanks to Jack Miles (https://www.research-live.com/article/news/significance-testing-is-insignificant-to-modern-marketing/id/5052802) for raising this prickly issue. For me, a quantie-quallie (as opposed to quallie-quantie, or quallie or quantie), the question of significance testing has been a thorny one for years, largely because the statistical testing we do in our commercial market research world is so biased that, strictly, it’s impossible to state that any measure has “statistical significance” unless it’s so big it’s glaringly obvious. The prop significance testing gives researchers and marketers is illusory (and may explain in part why so many new products still fail). More damagingly, it focuses attention on individual measures and takes the eye – and mind – away from the fuller picture, the story that really needs to be told about the matter being researched. It’s the fuller story that enables and informs actions and gives meaningful direction.
I was schooled as a Social Researcher and even now, many years later, can still remember my surprise that Market Researchers (as we were called in those days) wilfully (as far as I could tell) ignored two issues that have a serious impact on any statistical testing. These issues have, if anything, become more serious as time has passed.
There’s no such thing as random. Statistical theory assumes we have drawn probability samples for our research. This means every person in our target universe (parents, chocolate eaters, soft drink buyers, breakfast cereal buyers, whoever) has an equal chance of being selected for interview. This means random selection. And it assumes that, when invited, they will take part. We could almost pretend in the old days when doing hall tests and using brilliant interviewers that we were getting towards this, but nowadays? Face to face intercept interviewers are mistaken for suggers, people cross the road to avoid being interviewed, and it’s increasingly difficult to get people to stop. Telephone research cannot possibly have been random since the dawn of the TPS (Telephone Preference Service) and the rise of the mobile phone. Door knocking – once the closest to random sampling that was possible, and particularly expensive because of the time and trouble that was taken to deliver good samples – cannot deliver the response rates it once did. Online panels – now a quantitative panacea for many (cheap, fast, who wouldn’t use it?) – can’t even pretend. Yet even when we are using a sample that is drawn from a self-selecting group of people who choose to take surveys because they are curious, or get paid for it, or have nothing better to do, or take a genuine interest in research, who then agree to take our survey, there is still a drive to apply statistical significance testing to it.
We’re all biased. Statistical theory also assumes we have introduced no bias at all into our surveys through the order we have asked questions, or the way in which we have asked those questions. Statistical theory was born of clinical research, where people in white coats and protective goggles would explore the effects of chemical, physical or biological interventions. People, with our different feelings, attitudes, educational levels, mental and physical capabilities, didn’t get in the way. Bias was usually found in the research design and could be identified, measured, and accounted for. Market researchers don’t have that luxury. We know there is bias but we usually can’t identify it let alone measure it.
A technical paragraph. Lack of randomness and known bias should be taken into account when applying statistical significance testing, using something called the Design Factor: a multiplier applied to the margin of error. It would be lower for a face to face sample selected in a sensitive way and achieving decent response rates than it would be for a large online sample collected over a couple of days. At the lower end the factor would be more than 2, which in turn would imply that our sample of 400, which gives ±5% reliability at the 95% confidence level, should in fact be treated as ±10%. On the simple test that two confidence intervals must not overlap, that would require a 20-point difference in scores around the 50% point. When do we ever see that? And if we did, would we need a significance test to tell us it’s meaningful?
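For anyone who wants to check the arithmetic in that paragraph, here is a minimal sketch in Python. It assumes the standard normal-approximation formula for a proportion, a score of 50%, and a Design Factor of exactly 2 (the paragraph only says "more than 2"). The function names are mine and purely illustrative, and the final comparison assumes two independent samples of 400 each.

```python
import math

Z_95 = 1.96  # normal critical value for a two-sided 95% confidence level


def margin_of_error(n, p=0.5, design_factor=1.0, z=Z_95):
    """Half-width of the confidence interval for a proportion p from n interviews.

    design_factor multiplies the simple-random-sample standard error;
    1.0 means a true probability sample (an assumption rarely met in practice).
    """
    return z * design_factor * math.sqrt(p * (1 - p) / n)


def effective_sample_size(n, design_factor):
    """Size of the simple random sample that would give the same precision."""
    return n / design_factor ** 2


# Textbook case: 400 interviews, score around 50%, no adjustment.
srs = margin_of_error(400)

# Same sample with an assumed Design Factor of 2.
adjusted = margin_of_error(400, design_factor=2.0)

# Gap needed so that two such confidence intervals do not overlap.
non_overlap_gap = 2 * adjusted

# Gap needed by a formal two-sample comparison (two independent samples of 400).
two_sample_gap = math.sqrt(2) * adjusted

print(f"Unadjusted margin of error: {srs:.1%}")
print(f"Adjusted margin of error:   {adjusted:.1%}")
print(f"Effective sample size:      {effective_sample_size(400, 2.0):.0f}")
print(f"Non-overlap gap:            {non_overlap_gap:.1%}")
print(f"Two-sample test gap:        {two_sample_gap:.1%}")
```

Under these assumptions the unadjusted margin comes out just under 5 points and the adjusted one just under 10, and the 400 interviews carry the precision of about 100 truly random ones. The non-overlap rule gives the roughly 20-point gap quoted above. A formal two-sample comparison is a little less demanding, at around 14 points, but still far bigger than the differences we normally see.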
I’m not arguing against KPIs and benchmarks. These are hugely important in decision making, but action standards should not be linked to a requirement to have “statistically significantly” better scores. We shouldn’t get hung up on individual measures of so-called statistical significance. We live in an age where there have never been so many ways to collect, analyse and present information. Combined methodologies give us unprecedented opportunities to paint wider, fuller pictures, which we should be exploring. The consistency (or not) of response, and the absolute performance in the round, coupled with nuanced interpretation, give us the information we need.