The Limits of A/B Testing · Orbital Brand Science

A/B testing is one of the best tools marketing has invented in fifty years. It is also a quietly destructive instrument when used as the only tool, because it optimises for clicks and erodes brands.

If you work in digital marketing, the gospel of A/B testing has been preached to you many times. Test everything. Trust the data. Iterate constantly. Optimise based on what works. The promise is that, if you measure rigorously and follow the numbers, you will arrive at the best version of any piece of creative or any campaign. The promise is half true. The other half is where the trouble lives.

A/B testing is excellent at finding the version of a thing that wins on the metric you measured, in the time window you measured it. It is not excellent at telling you whether the metric you measured was the right one, whether the time window was long enough to see the real effect, or whether the version that won today is the version that will be winning a year from now. Those questions sit outside what the test was designed to answer, and the discipline of treating them as outside the test is rarely practised.

What the test sees and what it does not

An A/B test on a digital ad measures, typically, click-through rate. Sometimes it goes further and measures conversion rate or revenue per impression. These are all behavioural signals, captured at the moment of behaviour, and they are real. The version that produces more clicks is, definitionally, the version that produced more clicks.

What the test cannot see is what happened in the head of every viewer who did not click. The viewer who scrolled past and forgot the brand. The viewer who scrolled past, noticed the brand, and stored a small, slightly negative impression. The viewer who paused for half a second, registered the brand, and did not click, but added the brand to a mental short list for the next time the category came up. Three viewers. Three different long-term outcomes. The test records all three as identical non-clicks, and then averages them against the click that someone else made.

This is fine if the only thing you care about is the next click. It is bad if you care about whether your brand is being remembered, trusted, or built. The thing about a click is that it happens in the moment of the impression. The things that build a brand happen over months and years, in moments where the data trail is much harder to read.

The optimisation trap

Imagine a brand that runs a continuous A/B testing programme on its digital ads for two years. Each test selects the higher-performing creative based on click rate. The losing creative is dropped. The winning creative is then tested against a new variant, and so on, generation after generation.

What this process produces, almost mechanically, is a steady drift toward whatever cues maximise the probability of an immediate click. Those tend to be loud colours, clickbait headlines, urgency framing, novelty stimuli, and faces with intense expressions. Each individual step in the optimisation looks rational. The result over two years can be creative that bears little resemblance to the brand the marketer thought they were building, and that may be actively damaging the brand's long-term equity.

The brand has been hill-climbing, and the hill it has climbed is a short-term click maximisation hill, not a long-term brand health hill. The two hills sometimes point in similar directions, and sometimes in directly opposite ones. The data could not tell the difference, because the data was never asked.
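The drift described in this section can be illustrated with a toy simulation. This is a minimal sketch, not a model of any real campaign: the single "loudness" trait, the click_rate and brand_fit functions, and every number in them are hypothetical assumptions chosen only to show the mechanism. Measured click-through rate uses a normal approximation to sampling noise.

```python
import random
from math import sqrt

# Hypothetical assumption: louder creative earns more immediate clicks.
def click_rate(loudness):
    return 0.01 + 0.02 * loudness

# Hypothetical assumption: louder creative fits the brand less well.
def brand_fit(loudness):
    return 1.0 - loudness

def measured_ctr(rate, impressions, rng):
    """Noisy observed CTR (normal approximation to binomial sampling)."""
    standard_error = sqrt(rate * (1 - rate) / impressions)
    return rng.gauss(rate, standard_error)

def run_programme(generations=300, impressions=20_000,
                  start_loudness=0.2, seed=1):
    """Champion/challenger testing that keeps whichever arm shows more clicks."""
    rng = random.Random(seed)
    champion = start_loudness
    history = [champion]
    for _ in range(generations):
        # The challenger is a small random variation on the current champion.
        challenger = min(1.0, max(0.0, champion + rng.gauss(0, 0.05)))
        champion_ctr = measured_ctr(click_rate(champion), impressions, rng)
        challenger_ctr = measured_ctr(click_rate(challenger), impressions, rng)
        if challenger_ctr > champion_ctr:
            champion = challenger  # the only criterion is the click metric
        history.append(champion)
    return history

history = run_programme()
print(f"loudness:  {history[0]:.2f} -> {history[-1]:.2f}")
print(f"brand fit: {brand_fit(history[0]):.2f} -> {brand_fit(history[-1]):.2f}")
```

No single generation makes a large move, and brand_fit never enters the selection rule. The decline in brand fit comes entirely from the assumed trade-off, which is the point: the programme cannot see a quantity it was never asked to measure.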

What is missing from the click

A click is one of the lowest-fidelity emotional signals a brand can collect. It tells you that someone, at one moment, was sufficiently activated to tap a finger or move a mouse. It tells you nothing about whether they felt good about doing so, whether they regretted it within ten seconds, whether they will recognise the brand the next time they see it, or whether they will tell anyone about it.

Neural and behavioural research consistently finds that the strongest predictors of long-term brand equity are emotional engagement, memory salience, and the consistency of distinctive brand assets over time. None of these are measured by a click. All of them can be measured by other instruments, and most click-optimised creative tests poorly on all of them when you actually look.

There is a now-classic study from Les Binet and Peter Field, published by the IPA as The Long and the Short of It, that examined hundreds of award-winning marketing campaigns and found that campaigns oriented toward short-term sales activation outperformed brand-building campaigns in the short term, and underperformed them substantially over multi-year horizons. The optimal mix sat somewhere around 60 percent brand-building and 40 percent activation, with significant variation by category. Pure-activation strategies, the kind that an aggressive A/B testing programme tends to converge on, performed worst over time.

The version that won today is not necessarily the version that will be winning a year from now.

What neural measurement adds

The contribution of neuromarketing measurement here is not to replace A/B testing. It is to add the dimensions A/B testing cannot see. The same two creative variants that show a 10 percent click difference might show very different emotional traces, attention curves, and brand-asset linkage when measured neurally. The variant that wins on clicks might be the variant that loses on memory consolidation. The marketer who only sees the click result is making a confident decision on incomplete information.

A combined approach uses A/B testing for what it is good at (rapid behavioural feedback on specific tactical variants) and neural measurement for what A/B testing cannot do (depth of engagement, emotional valence, brand linkage over time). The two methods together produce a picture neither can produce alone. The marketer who uses both has a more honest view of what the work is actually doing.

The selection bias problem

There is a more subtle problem with A/B testing that even experienced practitioners often miss. The audience that sees the test is, on most platforms, skewed toward the people the platform's algorithm thinks will engage. The test result tells you which variant won among that biased sample. It does not tell you which variant would have won among the audience you actually want to reach.

For brands trying to grow, the relevant audience is usually the audience that is not yet engaging with them. By definition, that audience is under-represented in any test that runs on the brand's existing channels. The test, in other words, is most accurate for the audience the brand already has and least accurate for the audience the brand most needs to reach.

Neural pre-testing, conducted on a sample designed to represent the full target audience including current non-buyers, sidesteps this bias. The sample is recruited for its match to the brand's strategic target, not for its predisposition to engage with the brand's existing content. The signal is therefore a better guide to growth than any in-flight optimisation test can be.
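The selection effect described in this section comes down to simple arithmetic. The sketch below uses entirely hypothetical numbers: two audience segments, invented click rates for two variants, and two assumed audience mixes (the one the platform delivers and the one the brand wants to reach). It shows only that the same segment-level rates can produce opposite winners under different mixes.

```python
# Hypothetical click rates per variant and audience segment.
SEGMENT_CTR = {
    "A": {"existing_fans": 0.050, "non_buyers": 0.010},
    "B": {"existing_fans": 0.040, "non_buyers": 0.015},
}

# Assumed audience mixes (shares sum to 1 in each).
delivered_mix = {"existing_fans": 0.8, "non_buyers": 0.2}  # what the platform serves
target_mix = {"existing_fans": 0.2, "non_buyers": 0.8}     # who the brand wants

def blended_ctr(variant, mix):
    """Overall CTR for a variant, weighted by the share of each segment."""
    return sum(share * SEGMENT_CTR[variant][segment]
               for segment, share in mix.items())

def winner(mix):
    """Variant with the higher blended CTR under the given audience mix."""
    return max(SEGMENT_CTR, key=lambda variant: blended_ctr(variant, mix))

print("Winner in the delivered sample:", winner(delivered_mix))
print("Winner in the target audience: ", winner(target_mix))
```

With these assumed figures, variant A wins the test as delivered while variant B would win among the audience the brand is trying to grow into. Nothing in the in-flight result signals the reversal.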

The role A/B testing plays well

None of this is an argument against A/B testing as a tool. A/B testing is the right method for many questions. Which subject line gets more opens. Which landing page converts more sign-ups. Which CTA copy increases form completion. These are tactical, short-term, behaviourally defined questions. A/B testing is built for them, and it answers them quickly and accurately.

The trouble starts when A/B testing is treated as the right method for all questions, including the ones it was not designed for. Brand positioning. Creative strategy. Long-term equity. Distinctive asset choice. None of these are A/B testable in any meaningful way. Each requires longer feedback loops, broader measurement, and more interpretation than the test framework provides.

The mature marketing operation uses A/B testing where it fits and uses other instruments where it does not. The immature one applies A/B testing to everything and ends up with a fragmented, click-optimised, brand-eroded mess that the data dashboards declare a success even as the brand fades.

The Caribbean angle

One regional observation worth making. The marketing teams we work with across the Caribbean tend to have less access to high-volume A/B testing infrastructure than their counterparts in larger markets, because the audience sizes do not support the same statistical power per test. This is sometimes treated as a disadvantage. It can be turned into an advantage.

Brands that cannot run hundreds of A/B tests per year are forced to make creative bets with smaller numbers of variants, longer cycle times, and more reliance on qualitative and neural pre-testing. The result, when done well, is creative that is more considered, more brand-consistent, and less prone to the optimisation drift described above. The constraint produces a discipline that some larger markets have lost.

This is not a recommendation to give up the testing tools that are available. It is a recommendation to treat the test as one input rather than the verdict, and to invest the saved test cycles in deeper research on the work that is going to run.
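The point about audience size and statistical power can be made concrete with a standard sample-size approximation for comparing two proportions. This is a rough sketch under stated assumptions: a two-sided z-test, equal traffic to each variant, and hypothetical inputs (a 1 percent baseline click rate and a 10 percent relative lift). The function name impressions_per_arm is ours, not part of any library.

```python
from math import ceil, sqrt
from statistics import NormalDist

def impressions_per_arm(baseline_ctr, relative_lift, alpha=0.05, power=0.80):
    """Approximate impressions needed per variant for a two-proportion z-test."""
    p1 = baseline_ctr
    p2 = baseline_ctr * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Hypothetical inputs: 1% baseline CTR, looking for a 10% relative improvement.
needed = impressions_per_arm(baseline_ctr=0.01, relative_lift=0.10)
print(f"Impressions needed per variant: {needed:,}")
```

Under these assumed inputs the requirement comes out in the region of 160,000 impressions per variant. A market that can deliver only a fraction of that per test can reliably detect only much larger differences, which is one reason smaller markets end up running fewer, bolder tests.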

The honest middle

A/B testing is a tool. Like any tool, it is excellent for the jobs it was designed for and harmful when used for jobs it was not. The temptation to declare it the universal answer, because it produces clean numbers and clear winners, is one of the things that has quietly hurt brand-building in the last decade.

The honest middle is to use A/B testing for tactical optimisation, neural pre-testing for creative decisions, brand tracking for long-term equity, and judgement for everything in between. None of these methods on their own answers the full question of whether a marketing programme is working. Together, they get closer than any one of them alone.

The brands that will look back in ten years and find they made good decisions will not be the brands with the most A/B tests run. They will be the brands that knew when to test, when to measure differently, and when to commit to creative bets that the test framework could not validate. Optimisation is useful. Direction matters more. Direction is set by judgement informed by the right measurements, not by clicks alone.
