
The research that can take your A/B testing and experimentation program to the next level

Experimentation is never easy. Practitioners consistently face new, significant challenges: lower uplifts, downsized teams, and a misguided focus on win rates (which are low).

So how can practitioners do better? What makes tests successful? How are business outcomes achieved?  

We were eager to provide deeper answers to these questions and more, so we looked at multiple years of experimentation programs across different industries and program maturity levels. Teams were running more experiments, improving processes, and reducing bottlenecks, yet their experiments still weren’t seeing higher uplifts.

Of course, myriad factors made it difficult to determine why uplifts weren’t changing, but based on what we saw throughout our study, there's room for improvement in what companies choose to experiment on in order to drive measurable impact.

We talk more about this in the findings below.

The insights are here. But from where?

  • Data from 1,100 companies
  • Analysis of over 127,000 experiments
  • Excerpted Optimizely analysis 
  • Customer interviews, case studies, and surveys 

A LOOK INSIDE THE EVOLUTION OF EXPERIMENTATION

Experimentation has changed in the last 5 years 

It is more focused on variations, velocity, and uplifts, not just win rate 

  • 88% of tests do not win.
  • Only a third of experiments test more than one variation.
  • The top 3% of companies run over 500 experiments per year.
  • Personalized experiments drive 41% more expected impact.

Here, a pattern emerged: there is a significant disconnect between how teams understand experimentation practices and the outcomes those practices actually deliver.

88% of tests do not win. This matters, but we're seeing too much focus on whether an experiment 'won' or 'lost'. Win rate alone is arguably a vanity metric: you could make many tiny, very safe changes, or you could spend half a year researching a single experiment that's likely to win. Yes, you might get a winning result, but you still won't see significant uplifts or outsized returns. We found that focus should remain on overall impact, where win rate is measured alongside average uplift per winner, velocity of launch, and time-to-results.
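To make that concrete, here is a toy model that treats annual program impact as win rate × average uplift per winner × test velocity. The formula and every number in it are our illustrative assumptions, not figures from the report:

```python
# Toy model only: annual impact = velocity x win rate x avg uplift per winner.
# All numbers below are hypothetical, chosen to illustrate the tradeoff.

def annual_program_impact(tests_per_year, win_rate, avg_uplift_per_winner):
    """Expected cumulative uplift shipped per year (naively additive)."""
    return tests_per_year * win_rate * avg_uplift_per_winner

# Program A plays it safe: high win rate, tiny uplifts per winner.
a = annual_program_impact(tests_per_year=40, win_rate=0.30, avg_uplift_per_winner=0.002)

# Program B runs bolder tests: a "worse" win rate, but bigger winners.
b = annual_program_impact(tests_per_year=40, win_rate=0.12, avg_uplift_per_winner=0.015)

print(f"safe program: {a:.1%} cumulative uplift, bold program: {b:.1%}")
# safe program: 2.4% cumulative uplift, bold program: 7.2%
```

Under these made-up numbers, the program with the lower win rate ships three times the impact, which is why win rate alone misleads.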

Less than 10% of experiments test 4 or more variations, yet those experiments are more than twice as impactful as A/B tests. That highlights another disconnect: people talk about A/B testing all the time, and it's about two variations, the A version and the B version. Yet the data reveals it's one of the lowest-performing methods of experimentation. Simple fix: add more variations to your tests.

Let's learn more about the state of experimentation in the next part. 

CHAPTER 1: THE STATE OF EXPERIMENTATION 

Win rate isn't the only thing that matters

Around 12% of experiments win on the primary metric, meaning they reach statistical significance in the variant's favor. The remaining 88% combine both non-winners and inconclusive tests. That might sound negative, but it is not.
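For context on what "reach statistical significance" means mechanically, here is a minimal sketch using a textbook two-sided, two-proportion z-test at 95% confidence. Production platforms (Optimizely included) use their own, more sophisticated statistical models, so treat this purely as an illustration with hypothetical counts:

```python
# Minimal significance check for an A/B test: two-proportion z-test.
from math import sqrt

def two_proportion_z(conversions_a, visitors_a, conversions_b, visitors_b):
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    return p_b - p_a, (p_b - p_a) / std_err

lift, z = two_proportion_z(500, 10_000, 570, 10_000)
print(f"absolute lift {lift:.2%}, z = {z:.2f}, winner: {abs(z) > 1.96}")
# absolute lift 0.70%, z = 2.20, winner: True  (|z| > 1.96 ~ 95% confidence)
```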

You now know what worked and what you only assumed was working. In a world without experimentation, you would’ve rolled out that feature or functionality. But, here, you get to eliminate features harming your business. Plus, you can identify areas your customers don't care about to minimize additional time and resources spent on those areas.

The inconclusive tests, neither winners nor losers, are still valuable too. You haven't identified something you can roll out for immediate value, and you haven't mitigated a risk, but you've still learned that customers don't care about some part of your functionality, and that's useful to feed into future hypotheses.

So, an experiment that doesn’t win is not a loss. These numbers are in line with statistics published by companies like Google, Airbnb, Microsoft, Netflix, and others, which report 10 to 20 percent win rates as well.

The science of testing velocity

The median company runs 34 experiments per year. The top 3% of companies run over 500. To be in the top 10%, you need to be running 200 experiments annually. 

Companies ramp up testing quickly from launch and, on average, grow velocity by 20% year over year.

The number of companies testing, their experimentation velocity, and the share of feature experimentation have all grown consistently since 2018.

The ‘top 5 metrics’ conundrum 

Over 90% of experiments target 5 common metrics:

  • CTA Clicks 
  • Revenue 
  • Checkout 
  • Registration 
  • Add-to-cart 

However, the data shows that 3 of those top 5 metrics have relatively low expected impact. There's a greater impact opportunity if these metrics are reprioritized: for example, replace "Revenue" with "Menu/Navigation", or replace "Checkout" with "Scroll/Engage".

Yet those higher-impact metrics remain under-prioritized. Could you be accidentally ignoring metrics that make a difference? Your website visitors decide based on the improvements each metric brings to their buying journey.

So, start focusing on:  

  • Finding decision points that lead visitors to the buying moment
  • Choosing metrics that affect each decision point
  • Delivering high impact with each metric

Impact → more uplifts → higher sales

Revenue matters 

There appear to be competitive advantages for companies with over $1B in revenue. Traffic and the ability to create high-quality tests (a testing pipeline) are the primary drivers of velocity.

  • Large companies use their resource advantage to run high-velocity programs and generate more revenue.
  • More visitors mean more chances for tests to reach statistical significance, faster.
  • With more revenue, win rates improve as well. 

Testing is about outsized returns 

Running an experiment is a chance for improvement. However, folks can get disheartened when not every test wins. The value of a testing program comes in two parts: finding the changes that genuinely work, and filtering out the ones that don't.

It's tempting to think all the losing tests drag down your successes. But if your company released 100 features over a year without testing, only 10 or 12 would actually be improvements. Testing works because it separates those two groups.

So even though only around 1 in 8 experiments tend to be a winner for most companies, the tests that do win have a substantial return on the metrics people care about.

The top 5% of experiments that companies run are responsible for around 50% of the impact. It might seem like a lot of effort to get to that one successful test, but we've seen experiments generate millions of dollars in incremental revenue from a single site change, tweak, or modification to an app.

CHAPTER 2: GREAT EXPERIMENTS 

The ‘what’ behind great experiments, revealed

The performance of teams is stable over a three-year timeframe. So how good you are today is often a good indicator of how good you will be in 3 years.

Improving performance requires continually changing the system by which you research, ideate, and develop experiments. 

Don’t be stuck in your comfort zone. Follow these steps: 

Do ABCD instead of only AB. Experiments that test multiple treatments are 3x more successful than A/B tests.  

Conduct complex experiments. Tests that make major changes to the user experience (pricing, discounts, checkout flow, data collection, etc.) are more likely to win and with higher uplifts.  

Choose the right algorithms. Experiments leveraging bandit algorithms are more successful.
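To show what a bandit changes in practice, here is a minimal Thompson-sampling sketch over four variants. The algorithm choice, traffic volume, and conversion rates are all hypothetical; the report doesn't say which bandit methods its data covers:

```python
# Thompson sampling over four variants: traffic shifts toward variants whose
# observed results look strongest, instead of a fixed 25/25/25/25 split.
import random

variants = ["A", "B", "C", "D"]
successes = {v: 0 for v in variants}   # conversions per variant
failures = {v: 0 for v in variants}    # non-conversions per variant
true_rates = {"A": 0.050, "B": 0.052, "C": 0.061, "D": 0.048}  # unknown in real life

for _ in range(20_000):
    # Draw a plausible rate for each variant from its Beta posterior,
    # then serve the variant with the highest draw.
    draws = {v: random.betavariate(successes[v] + 1, failures[v] + 1)
             for v in variants}
    served = max(draws, key=draws.get)
    if random.random() < true_rates[served]:  # simulate the visitor's outcome
        successes[served] += 1
    else:
        failures[served] += 1

for v in variants:
    n = successes[v] + failures[v]
    print(f"{v}: served {n:>6} times, observed rate {successes[v] / max(n, 1):.3f}")
```

Because each visitor is routed by sampling from the current posteriors, exposure concentrates on the strongest variant during the test rather than after it, which is one plausible reason bandit-driven experiments deliver more.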

Two elements

The highest uplift experiments around the world have two things in common:

  • They make larger code changes with more effect on the user experience (>99.9% significance) 
  • They test a higher number of variations simultaneously (>99.9% significance) 

Great experiments need to try large leaps in the user experience balanced with an openness to multiple paths.

However, less than 10% of experiments test 4 or more variants, even though those experiments are twice as impactful as A/B tests.

Here’s how you change as you test more variants: 

  • You take more risks. With a single variant, teams often play it safe. But when teams test 4+ variations, the safe options are covered. You can test increasingly risky but novel ideas without worry.
  • You take greater ownership. Teams that only test 2 variants often choose them through a hierarchy. With more variants, there is a better chance for more ideas to be validated. Everyone's job is now to contribute to the likelihood of a change succeeding.
  • Your program becomes more open-minded. Usually, teams can only test one path. Now, you can test multiple approaches in one go and change direction based on results. 

The standard experiment run around the world is an incremental A versus B test. While these tests are easy to run, they are rarely associated with performance breakthroughs. Our data shows that the largest breakthroughs come from tests that follow a very different model. Tests designed to evaluate complex, interdependent changes, both within a single variant and across multiple variants, are more likely to be among the top 5% of performing experiments in our sample. Rather than shying away from complexity, firms can potentially harness it to deliver high performance in testing. The key is to pair complex tests with a theory for how the multiple elements work together to deliver returns. Theory and testing together can help unlock breakthrough performance.

Dr. Sourobh Ghosh
Economist at Amazon/Audible

It takes more than a change 

Only a third of experiments make more than one change, yet those that do show much better returns. While counting the number of change types per test is not a perfect measure of complexity, it points to a pattern we've long observed: complex tests outperform.

Why experiment complexity matters: 

Forget the low-hanging fruit. You'll only invest time and effort in a complex experiment if you're confident about the value it can deliver. And it's no surprise: you can only change the color of a button so many times. Gaining access to engineering resources to make bigger changes is critical.

Move beyond cosmetic changes. Minute tweaks have minute effects and uplifts. To really impact user behavior and change how visitors interact with your website/app, redesign customer journeys in a way that takes them to the buying moment. 

Reflect ownership and responsibility. Programs focused on minute optimization have limited freedom and resources. As your program gathers more resources and earns trust, you'll gain the latitude to test more meaningful changes.

It’s not just revenue 

Digital commerce overwhelmingly prioritizes revenue, and we agree it is the most valuable business metric. However, early-funnel optimizations like search and add-to-cart are hugely underexplored.

Businesses tend to experience greater test impact by focusing experiments on improving micro conversions, such as getting more users to search, add to cart, and register accounts.    

The search rate is the most undervalued experiment goal. Even though it is used as a goal only 1% of the time, it has the highest expected impact at 2.3%. Notably, users who search typically convert at 2x-3x the rate of all other users.
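A back-of-the-envelope sketch, with hypothetical rates consistent with that 2x-3x claim, shows why search rate is worth targeting. Note the optimistic assumption baked in: visitors nudged into searching are assumed to convert like existing searchers, even though searchers may simply be higher-intent to begin with:

```python
# Hypothetical funnel: 10% of visitors search; searchers convert at 3x others.
conv_searchers = 0.06   # assumed searcher conversion rate
conv_others = 0.02      # assumed non-searcher conversion rate

def overall_conversion(search_rate):
    return search_rate * conv_searchers + (1 - search_rate) * conv_others

before = overall_conversion(0.10)   # baseline: 10% of visitors search
after = overall_conversion(0.13)    # an experiment lifts search rate to 13%
print(f"{before:.2%} -> {after:.2%} sitewide (+{after / before - 1:.1%} relative)")
# 2.40% -> 2.52% sitewide (+5.0% relative)
```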

Measuring every experiment on revenue is like measuring every player on points scored. Someone also needs to pass.

Hazjier Pourkhalkhali

Personalization theory 

Personalized experiments drive 41% more expected impact on specific audiences than general experiences.  

Experiments that include targeting are 16% more likely to win when compared to untargeted experiments. 

Personalized experiences generate 22% higher uplifts on average. 

The 41% higher expected impact is moderated by the reach of the audience: a larger uplift on a narrow segment can still move the overall business less than a smaller uplift applied to everyone.
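A simple weighting model, our assumption rather than the report's math, makes the point: sitewide impact is roughly the uplift multiplied by the share of traffic that sees the experience.

```python
# Expected sitewide impact ~= uplift x audience reach (illustrative model).
def sitewide_impact(uplift, audience_share):
    return uplift * audience_share

general = sitewide_impact(uplift=0.020, audience_share=1.00)       # everyone
personalized = sitewide_impact(uplift=0.035, audience_share=0.40)  # 40% segment

print(f"general: {general:.2%} sitewide, personalized: {personalized:.2%} sitewide")
# general: 2.00% sitewide, personalized: 1.40% sitewide. The bigger per-audience
# uplift can still move the business less when the targeted segment is small.
```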

The need  

When companies switch to testing 3, 4, or 5 variations, they start to take bigger risks. The changes are no longer simple tweaks that every other website has already tried; they can be new ideas the company is the first to try.

The whole point of experimentation is that you don't know what's going to work. You're not guessing anymore. You're validating it by testing multiple approaches to find out which is the most impactful. 

CHAPTER 3: CULTURE OF EXPERIMENTATION 

Why you need one, and how to create it at your company

Great companies build their experimentation culture differently. Their experimentation program isn't run in a vacuum; it is backed by sufficient resources and a culture that promotes risk-taking. Data and analytics are key to formulating great hypotheses, and the right people execute the experiment variations.

Great data makes the difference 

Great experimentation is based on effectively diagnosing and prioritizing user problems. If you don't sufficiently leverage data, you're likely to rely on assumptions and guesses. And having data is not enough; you need to use it to make decisions that add value to the business.

Companies that use advanced analytics are far more successful at experimentation. Teams with analytics outperform teams without by 32% per test. Teams that added heatmapping were an additional 16% more successful. 

The integration with the analytics tool allows KLM to automatically import experiment data for further analysis within a wider business context. Heatmaps can be automatically tagged with information about the A/B test variation a particular user has seen. This way, analysts can differentiate between experiences during their analysis.

KLM customer story

Customer Data Platform (CDP) works 

Companies with an integrated CDP appear to be much more successful with experimentation, seeing up to 80% more expected impact. A CDP gives your experimentation platform access to a single source of data from your entire ecosystem.

Yes, there are likely confounding factors here. More digitally mature companies are more likely to have a CDP, but this data helps highlight the need for a CDP as part of a digital maturity journey.

More tests ≠ more impact  

To scale your experimentation program, you need to carefully invest in your developer resources. Conducting more tests is not the answer. Here's why: 

  • The highest experiment quality occurs at 1-10 annual tests per engineer. 
  • Once a developer moves to 11-30 tests per year, the expected impact drops by 40% per test.  
  • If you move beyond 30 tests, the expected impact drops by a whopping 87%.  

Testing velocity pays off only when you have sufficient developer resources. Without scaling engineering, experiment velocity becomes a vanity metric that worsens program outcomes.
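Applying the per-test drops above to a hypothetical baseline of 1.0 impact unit per test (the baseline, and the assumption that impact adds up linearly, are ours) shows where volume stops paying:

```python
# Report figures: per-test impact falls 40% at 11-30 tests/engineer/year,
# and 87% beyond 30. Baseline of 1.0 unit per test is hypothetical.
def total_impact(tests_per_engineer):
    if tests_per_engineer <= 10:
        per_test = 1.00
    elif tests_per_engineer <= 30:
        per_test = 0.60
    else:
        per_test = 0.13
    return tests_per_engineer * per_test

for n in (10, 30, 50):
    print(f"{n:>2} tests/engineer -> {total_impact(n):.1f} impact units")
# 10 -> 10.0, 30 -> 18.0, 50 -> 6.5: past ~30 tests per engineer,
# the collapse in per-test quality more than erases the extra volume.
```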

Senior leaders have the experience, but it may close them off to newer methods that can produce larger breakthroughs. Junior teams appear to take more risks, with fewer wins but bigger uplifts. Great leaders should therefore encourage teams to take risks and explore alternatives.

Common risks of seniority:  

  • Senior leaders can often overestimate their ability to influence the future, which closes them off to outside advice or feedback.
  • Senior leaders are likely to use out-of-date practices, which causes them to focus on smaller improvement opportunities.
  • Senior leaders are less likely to revise their opinions when presented with new data that conflicts with their beliefs.

Advantages of seniority:  

  • Senior leaders can accelerate the adoption of new strategies and techniques through investment, strategy, and guidance.
  • Senior leaders can improve their employees' psychological safety and freedom to take risks.
  • Senior leaders can balance exploitation and exploration, allowing teams to take the right risks when opportunities occur.

Centralized or decentralized: the choice is yours

There is no one-size-fits-all governance model as companies report success with varying approaches. Large programs appear evenly split between Centralized and Decentralized teams, with limited performance differences observed. Companies must select the right model based on their team and business needs. 

Factors to consider when determining the right governance model for your business:

Control. Ensure other teams have learned the fundamentals of what makes a good experiment. That will help you decide who can run an experiment, who reviews results, and who ultimately determines whether a winner gets implemented.

Capabilities. Having enough resources is the first step towards running complex experiments. 

Connection. Having a close relationship with the changing priorities of the wider business is essential for the prioritization of your tests and the growth of your team. Avoid being siloed. 

What are you building next year?

This is just a sneak peek into some of the great insights we've gathered as part of the full report. These are the insights that have helped us deliver exceptional digital experiences for the world's leading digital brands.   

And we understand: it's easy to get caught up in others' best practices and the winning experiments they ran in the past, while missing the work it took to get there.

If you're trying to scale your experimentation program in the next few months, think about the quality of your experiments and the developer resources you have. That will give you the velocity you want and help communicate the value of experimentation across the organization.

At Optimizely we can help you get started with just that.

Get in touch