Testing is the core activity of good paid search management. We’re trying to find the keywords, match types, bids, ad copy, and landing pages that produce optimal results. The only way to figure that out is to test.
But testing in paid search is very difficult, for a number of different reasons.
- There are a lot of interacting variables – it’s a complex system.
- Critical data is hidden from us – we can’t see what’s happened.
- Our tools don’t facilitate testing – which is just crazy.
On top of these, we have a statistics problem. Two of them, actually.
- The math we need isn’t trivial.
- The data we usually have is sparse and dirty.
All of these issues come together every time we look at something in our account – be it a keyword or text ad or ad group – and, based on the data we see, decide to make, or not make, some kind of change.
Of course, we’re just acting on the information we have using the tools at our disposal. But the fact is that we’re making a boatload of assumptions and accepting a lot of averages and approximations. And the biggest risk we’re often taking is in assuming that we’ve got enough data to decide.
Very few of our keywords or text ads – out of the thousands upon thousands in our accounts – get hundreds or even dozens of clicks or conversions. And when there are double or triple digit numbers, the truth is the data isn’t pure at all – it’s a roll up of many different queries and geographies and days and times.
If a keyword converts 4.5% of the time based on 100 clicks that turn out to be from 39 different queries, 76 different cities, 13 times of day, and all on Sunday – does that tell you anything about what’s likely to happen with the next 100 clicks that originate from different queries, cities, times, and days? Maybe. Maybe not.
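To put a number on that uncertainty, here’s a minimal sketch using a 95% Wilson score interval for a conversion rate (the function and the rounding are mine, not anything a paid search tool reports):

```python
import math

def wilson_interval(conversions, clicks, z=1.96):
    """95% Wilson score confidence interval for a conversion rate."""
    p = conversions / clicks
    denom = 1 + z ** 2 / clicks
    center = (p + z ** 2 / (2 * clicks)) / denom
    margin = (z / denom) * math.sqrt(
        p * (1 - p) / clicks + z ** 2 / (4 * clicks ** 2)
    )
    return center - margin, center + margin

# A 4.5% conversion rate on 100 clicks (4.5 conversions on average):
low, high = wilson_interval(4.5, 100)
print(f"{low:.1%} to {high:.1%}")  # prints "1.9% to 10.5%"
```

In other words, those 100 clicks only pin the true rate down to somewhere between roughly 2% and 10% – and that’s before we account for the mix of queries, cities, and times described above.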
The risk of turning off a winning text ad because we didn’t wait for significant data on a narrow, known set of variables is real. In many cases, given all the limitations listed above, the changes made in paid search accounts are essentially coin flips masquerading as educated guesses pretending to be informed decisions.
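To see how easily pausing a “loser” can be a coin flip, here’s a sketch of a standard two-proportion z-test; the ad figures are invented for illustration:

```python
import math

def two_proportion_z(conv_a, clicks_a, conv_b, clicks_b):
    """Two-sided two-proportion z-test: are two conversion rates really different?"""
    p_pool = (conv_a + conv_b) / (clicks_a + clicks_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / clicks_a + 1 / clicks_b))
    z = (conv_b / clicks_b - conv_a / clicks_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Ad A: 5 conversions on 100 clicks. Ad B: 9 conversions on 100 clicks.
z, p = two_proportion_z(5, 100, 9, 100)
print(round(p, 2))  # prints 0.27 -- nowhere near significance
```

Ad B looks nearly twice as good, yet with this much data pausing Ad A is exactly the kind of coin flip described above.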
And now it turns out we’re even farther from data we can trust than we knew. Even real scientists running controlled experiments with what is, by comparison, perfect data aren’t getting the right answers, because even their ‘perfect data’ isn’t enough to tell the truth.
This amazing article, The Truth Wears Off, is worth the time to read fully. In it, the author explains how scientists are finding that their carefully controlled, professionally vetted, and published results are, over time, proving to be far less certain than they originally appeared.
Here’s one example:
“In the late nineteen-nineties, John Crabbe, a neuroscientist at the Oregon Health and Science University, conducted an experiment that showed how unknowable chance events can skew tests of replicability. He performed a series of experiments on mouse behavior in three different science labs: in Albany, New York; Edmonton, Alberta; and Portland, Oregon. Before he conducted the experiments, he tried to standardize every variable he could think of. The same strains of mice were used in each lab, shipped on the same day from the same supplier. The animals were raised in the same kind of enclosure, with the same brand of sawdust bedding. They had been exposed to the same amount of incandescent light, were living with the same number of littermates, and were fed the exact same type of chow pellets. When the mice were handled, it was with the same kind of surgical glove, and when they were tested it was on the same equipment, at the same time in the morning.
The premise of this test of replicability, of course, is that each of the labs should have generated the same pattern of results. “If any set of experiments should have passed the test, it should have been ours,” Crabbe says. “But that’s not the way it turned out.” In one experiment, Crabbe injected a particular strain of mouse with cocaine. In Portland the mice given the drug moved, on average, six hundred centimetres more than they normally did; in Albany they moved seven hundred and one additional centimetres. But in the Edmonton lab they moved more than five thousand additional centimetres. Similar deviations were observed in a test of anxiety. Furthermore, these inconsistencies didn’t follow any detectable pattern. In Portland one strain of mouse proved most anxious, while in Albany another strain won that distinction.
The disturbing implication of the Crabbe study is that a lot of extraordinary scientific data are nothing but noise. The hyperactivity of those coked-up Edmonton mice wasn’t an interesting new fact—it was a meaningless outlier, a by-product of invisible variables we don’t understand.”
Neither the article nor the scientific community seems to have a conclusion about why strictly followed scientific procedure is producing inaccurate, or at least inconclusive, results. The options are publishing bias, a world with no stability, or the need to vastly expand the minimum amount of data required for a valid test.
In the meantime, be careful cutting bids or killing keywords based on a few dozen clicks over a couple of days…
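For a sense of scale, here’s a sketch of the usual sample-size formula for comparing two proportions, using Python’s `statistics.NormalDist` for the z-scores; the 4% vs. 5% figures are an invented example, not a benchmark:

```python
import math
from statistics import NormalDist

def clicks_needed(p1, p2, alpha=0.05, power=0.80):
    """Approximate clicks per variant to tell conversion rate p1 from p2."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = nd.inv_cdf(power)           # desired statistical power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# Detecting a lift from 4% to 5% at 95% confidence and 80% power:
print(clicks_needed(0.04, 0.05))  # several thousand clicks per variant
```

A few dozen clicks isn’t even in the same universe as what a credible test of that size of difference requires.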
The issues of testing, getting good data, and doing a better job with the statistics that most of us (and our tools) ignore are ones I hope to cover a lot in the coming year. If we want to genuinely optimize our spend and our revenue, the status quo is clearly insufficient.
Happy New Year