Get the full research report
Download the full Evolution of Experimentation research report (60 pages) and start learning from 127,000 experiments today!
The way you run your A/B testing and experimentation program has a huge effect on what you get — both in terms of optimizing experiences and ROI.
This report is packed with data from 127k experiments — revealing insights, techniques, and examples for turning practitioners into champions.
Dive in to start reading The Evolution of Experimentation research.
Experimentation is never easy. Practitioners consistently face new, significant challenges: lower uplifts, downsized teams, and a misguided focus on win rates (which are low).
So how can practitioners do better? What makes tests successful? How are business outcomes achieved?
We were eager to provide deeper answers to these questions and more, so we looked at multiple years of experimentation programs across different industries and program maturity levels. Teams were running more experiments, improving processes, and reducing bottlenecks, yet their experiments still weren’t seeing higher uplifts.
Of course, myriad factors made it difficult to determine why uplifts weren't changing, but based on what we saw throughout our study, there's clear room for improvement in what companies choose to experiment on in order to drive measurable impact.
We talk more about this in the findings, below.
So where do these insights come from?
A LOOK INSIDE EVOLUTION OF EXPERIMENTATION
It is more focused on variations, velocity, and uplifts, not just win rate:
88% of tests do not win.
of experiments test more than one variation.
3% of companies run over 500 experiments per year.
41% more expected impact driven by personalized experiments.
Here, a pattern emerged: There is a significant disconnect between the understanding of experimentation practices and the reality of their outcomes.
88% of tests do not win. This matters, but we're seeing too much focus on whether an experiment 'won' or 'lost'. Win rate alone is arguably a vanity metric. You could make many tiny, very safe changes, or you could spend half a year researching to run one experiment that's likely to win. Yes, you might get a winning result, but you still won't see significant uplifts or outsized returns. We found that focus should remain on overall impact, where win rate is measured alongside average uplift per winner, velocity of launch, and time-to-results.
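To make the "win rate alone is a vanity metric" point concrete, here is a minimal arithmetic sketch. All numbers are hypothetical (they are not from the report), and the simple product of velocity, win rate, and average uplift is only a rough model of program impact:

```python
# Illustrative sketch with hypothetical numbers: why win rate alone
# is a vanity metric. Program impact depends on win rate, average
# uplift per winner, AND how many experiments you launch per year.

def expected_annual_impact(experiments_per_year, win_rate, avg_uplift_per_winner):
    """Rough expected cumulative uplift contributed by winning experiments."""
    return experiments_per_year * win_rate * avg_uplift_per_winner

# Program A: few, very safe tests with a high win rate but tiny uplifts.
safe = expected_annual_impact(experiments_per_year=10, win_rate=0.6,
                              avg_uplift_per_winner=0.005)

# Program B: many bolder tests with a lower win rate but larger uplifts.
bold = expected_annual_impact(experiments_per_year=50, win_rate=0.12,
                              avg_uplift_per_winner=0.04)

print(f"safe program: {safe:.3f}")  # 0.030 -> ~3% cumulative uplift
print(f"bold program: {bold:.3f}")  # 0.240 -> ~24% cumulative uplift
```

Under these assumed numbers, the program with the far lower win rate delivers roughly eight times the impact, which is the disconnect the data points to.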
Less than 10% of experiments test 4 or more variations, yet those experiments are more than twice as impactful as standard A/B tests. This highlights another disconnect: people talk about A/B testing all the time, and by definition it's about two variations, the A version and the B version. Yet the data reveals it's one of the lowest-performing methods of experimentation. A simple fix: add more variations to your tests.
Let's learn more about the state of experimentation in the next part.
CHAPTER 1: THE STATE OF EXPERIMENTATION
Around 12% of experiments win on the primary metric, meaning they reach statistical significance. The remaining 88% combines non-winners and inconclusive tests. That might sound negative, but it isn't.
You now know what worked and what you only assumed was working. In a world without experimentation, you would’ve rolled out that feature or functionality. But, here, you get to eliminate features harming your business. Plus, you can identify areas your customers don't care about to minimize additional time and resources spent on those areas.
The inconclusive tests, the ones that were neither winners nor losers, are still valuable too. You know you haven't identified something you can roll out for immediate value, and you haven't mitigated a risk, but you have learned that customers don't care about that part of your functionality, and that's still useful input for future hypotheses.
So you're not losing with every experiment that doesn't win. These numbers are in line with statistics published by companies like Google, Airbnb, Microsoft, and Netflix, which report win rates of 10 to 20 percent on average as well.
The median company runs 34 experiments per year. The top 3% of companies run over 500. To be in the top 10%, you need to be running 200 experiments annually.
Companies ramp up testing quickly from launch and grow velocity by 20% year on year on average.
The number of companies testing, their experimentation velocity, and the share of feature experimentation have all grown consistently since 2018.
Over 90% of experiments target 5 common metrics.
However, the data shows that 3 of those top 5 metrics have relatively low expected impact. There's a greater impact opportunity if metrics were reprioritized. For example, replace "Revenue" with "Menu/Navigation". Or replace "Checkout" with "Scroll/Engage".
Yet those higher-impact metrics are still under-prioritized. Could you be accidentally ignoring metrics that can make a difference? Your website visitors decide based on the improvements each metric brings to their buying journey.
So, start focusing on:
Impact -> More uplifts -> Higher sales.
There appear to be competitive advantages for companies with over $1B in revenue. Traffic and the ability to create high-quality tests (a testing pipeline) are the primary drivers of velocity.
Running an experiment is a chance for improvement. However, folks can get disheartened when not every test wins. The value of a testing program is in 2 parts.
It's tempting to think all of the losing tests drag down your successes. But if your company released 100 features over a year, only 10 or 12 would be an improvement anyway. Testing works because it helps separate the two.
So even though only around 1 in 8 experiments tend to be a winner for most companies, the tests that do win have a substantial return on the metrics people care about.
The top 5% of experiments that companies run are responsible for around 50% of the impact. It might seem like a lot of effort to get to that one successful test, but we've seen experiments generate millions of dollars in incremental revenue from a simple site change, tweak, or modification to an app or feature.
CHAPTER 2: GREAT EXPERIMENTS
The performance of teams is stable over a three-year timeframe. So how good you are today is often a good indicator of how good you will be in 3 years.
Improving performance requires continually changing the system by which you research, ideate, and develop experiments.
Do ABCD instead of only AB. Experiments that test multiple treatments are 3x more successful than A/B tests.
Conduct complex experiments. Tests that make major changes to the user experience (pricing, discounts, checkout flow, data collection, etc.) are more likely to win and with higher uplifts.
Choose the right metrics and methods. For example, experiments leveraging bandit algorithms are more successful.
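The report doesn't describe which bandit algorithm platforms use, so as an illustration only, here is a minimal epsilon-greedy sketch (the function, conversion rates, and traffic numbers are all hypothetical). The idea: instead of a fixed even split, traffic gradually shifts toward the variation that is converting best, which lowers the cost of testing many variations at once:

```python
import random

# Minimal epsilon-greedy bandit sketch (illustrative only; real
# experimentation platforms use more sophisticated allocation).
# With probability epsilon we explore a random variation; otherwise
# we exploit the variation with the best observed conversion rate.

def epsilon_greedy(conversion_counts, visitor_counts, epsilon=0.1):
    """Pick a variation index: explore at random, else exploit the best."""
    n = len(visitor_counts)
    if random.random() < epsilon:
        return random.randrange(n)  # explore
    rates = [c / v if v else 0.0 for c, v in zip(conversion_counts, visitor_counts)]
    return max(range(n), key=rates.__getitem__)  # exploit current best

# Simulate 4 variations with hypothetical true conversion rates.
true_rates = [0.020, 0.022, 0.035, 0.025]
conversions = [0, 0, 0, 0]
visitors = [0, 0, 0, 0]
random.seed(7)
for _ in range(20_000):
    arm = epsilon_greedy(conversions, visitors)
    visitors[arm] += 1
    conversions[arm] += random.random() < true_rates[arm]

# Typically, most traffic ends up flowing to the best-performing variation.
print(visitors)
```

This is one reason multi-variation tests can be run without linearly multiplying the traffic cost: under-performing variations receive less and less exposure as evidence accumulates.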
The highest uplift experiments around the world have two things in common:
Great experiments need to try large leaps in the user experience balanced with an openness to multiple paths.
However, less than 10% of experiments test 4 or more variants. Yet those experiments are twice as impactful compared to A/B.
The standard experiment run around the world is an incremental A versus B test. While these tests are easy to run, they are rarely associated with performance breakthroughs. Our data shows that the largest breakthroughs come from tests that follow a very different model. Tests designed to test complex, interdependent changes, both within a single variant and across multiple variants, are more likely to be among the top 5% of performing experiments in our sample. Rather than shying away from complexity, firms can potentially harness it to deliver high performance in testing. The key is to pair complex tests with a theory for how the multiple elements work together to deliver returns. Theory and testing together can help unlock breakthrough performance.
Dr. Sourobh Ghosh, Economist at Amazon/Audible
Only a third of experiments make more than one change, yet those that do show much better returns. While counting the number of change types per test is not a perfect measure of complexity, it points to a long-observed pattern: complex tests outperform.
Forget the low-hanging fruit. You'll only invest time and effort in a complex experiment if you're confident about the value it can deliver. And it's no surprise: you can only change the color of a button so many times. Gaining access to engineering resources to make bigger changes is critical.
Move beyond cosmetic changes. Minute tweaks have minute effects and uplifts. To really impact user behavior and change how visitors interact with your website/app, redesign customer journeys in a way that takes them to the buying moment.
Reflect ownership and responsibility. Programs focused on minute optimization have limited freedom and resources. As your program gathers more resources and gains trust, you'll receive the power to test more meaningful changes.
Digital commerce overwhelmingly prioritizes revenue. We agree it is the most valuable business metric. However, huge early funnel optimizations like search and add-to-cart are underexplored.
Businesses tend to experience greater test impact by focusing experiments on improving micro conversions, such as getting more users to search, add to cart, and register accounts.
Search rate is the most undervalued experiment goal. Even though it is used as a goal only 1% of the time, it has the highest expected impact at 2.3%. Notably, users who search typically convert at 2x-3x the rate of all other users.
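A quick back-of-the-envelope calculation shows why nudging more users into search moves the top-line number. The conversion rates below are hypothetical; the only input from the report is the assumption that searchers convert at roughly 3x the rate of non-searchers:

```python
# Hypothetical numbers: why lifting search rate lifts site-wide conversion.
# Assumes searchers convert at ~3x the rate of non-searchers (the report
# cites a 2x-3x range); the segment rates themselves are made up.

def blended_conversion(search_share, cr_search, cr_other):
    """Site-wide conversion rate as a traffic-weighted mix of two segments."""
    return search_share * cr_search + (1 - search_share) * cr_other

baseline = blended_conversion(0.15, 0.06, 0.02)  # 15% of visitors search
improved = blended_conversion(0.20, 0.06, 0.02)  # an experiment lifts that to 20%

lift = improved / baseline - 1
print(f"{lift:.1%} site-wide conversion lift")  # ~7.7%
```

A five-point shift in a "micro" metric produces a meaningful lift in the blended rate, without touching checkout or pricing at all.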
Measuring every experiment on revenue is like measuring every player on points scored. Someone also needs to pass.
Hazjier Pourkhalkhali
Personalized experiments drive 41% more expected impact on specific audiences than general experiences.
Experiments that include targeting are 16% more likely to win when compared to untargeted experiments.
Personalized experiences generate 22% higher uplifts on average.
The 41% higher expected impact is tempered by the reach of the audience: a personalized experience applies only to the segment it targets.
When companies switch to testing 3, 4, or 5 variations, they start to take bigger risks. When changing user experience, it's not a simple tweak that every other website might have tried by now. It could be a new idea that they're the first to try.
The whole point of experimentation is that you don't know what's going to work. You're not guessing anymore. You're validating it by testing multiple approaches to find out which is the most impactful.
CHAPTER 3: CULTURE OF EXPERIMENTATION
Great companies build their experimentation culture differently. Their experimentation program isn't run in a vacuum; it's backed by sufficient resources and a culture that promotes risk-taking. Data and analytics are key to formulating great hypotheses, and the right people execute the experiment variations.
Great experimentation is based on effectively diagnosing and prioritizing user problems. If you insufficiently leverage data, you're likely to rely on assumptions and guesses. Having data is not enough, you need to use it to make decisions that add value to the business.
Companies that use advanced analytics are far more successful at experimentation. Teams with analytics outperform teams without by 32% per test. Teams that added heatmapping were an additional 16% more successful.
The integration with the analytics tool allows KLM to automatically import experiment data for further analysis within a wider business context. Heatmaps can be automatically tagged with information about the A/B test variation a particular user has seen. This way, analysts can differentiate between experiences during their analysis.
KLM customer story
Companies with an integrated CDP (customer data platform) appear to be much more successful with experimentation and see up to 80% more expected impact. CDPs enable experimentation platforms to access a single source of experimentation data from your entire ecosystem.
Yes, there are likely confounding factors here. More digitally mature customers are more likely to have a CDP, but this data helps highlight the need for a CDP as part of a digital maturity journey.
To scale your experimentation program, you need to carefully invest in your developer resources. Conducting more tests is not the answer. Here's why:
Testing velocity only happens when you have sufficient developer resources. Without scaling engineering, experiment velocity becomes a vanity metric that worsens program outcomes.
Senior leaders have the experience. However, it may close them off to newer methods that can produce larger breakthroughs. Junior teams appear to take more risks, with fewer wins but higher uplifts. So, great leaders should encourage teams to take risks and explore alternatives.
There is no one-size-fits-all governance model as companies report success with varying approaches. Large programs appear evenly split between Centralized and Decentralized teams, with limited performance differences observed. Companies must select the right model based on their team and business needs.
Factors to consider when determining the right governance model for your business:
Control. Ensure other teams have learned the fundamentals of what makes a good experiment. This will help you decide who can run an experiment, review results, and ultimately determine whether a winner gets implemented.
Capabilities. Having enough resources is the first step towards running complex experiments.
Connection. Having a close relationship with the changing priorities of the wider business is essential for the prioritization of your tests and the growth of your team. Avoid being siloed.
This is just a sneak peek into some of the great insights we've gathered as part of the full report. These are the insights that have helped us deliver exceptional digital experiences for the world's leading digital brands.
And we understand. It's easy to get caught up in others' best practices and the winning experiments people ran in the past, while missing the work it took to get there.
If you're trying to scale your experimentation program in the next few months, think about the quality of your experiments and the developer resources you have. That will give you the velocity you want and help you communicate experimentation's value across the organization.
At Optimizely we can help you get started with just that.