What is Statistical Power


Date: 2024-01-25 | Time of reading: 9 minutes (1668 words)

Determining the statistical power, or "sensitivity," of a test is an important part of planning an A/B test before launch. A well-powered test is less likely to miss real improvements, so you can introduce more positive changes to a website and increase your revenue.

Definition of statistical power

Statistical power is the ability of a statistical test to detect a difference where one actually exists. More precisely, it is the probability that the test produces a significant result and correctly rejects a false null hypothesis.

Before delving into the elements of statistical power, it's essential to understand the types of errors that can occur in tests and how to prevent them.

Two types of errors

Type I error

Type I error is a false positive result: rejecting the null hypothesis that is actually true.

The null hypothesis states that there is no difference between a pair of events or phenomena.

A false positive test shows that there is a difference between the compared variants, even though there isn't one in reality. Such a situation can occur due to an error in the test or if randomness creeps in.

The probability of a type I error is denoted by the Greek letter alpha (α) and is set by the confidence level chosen for the A/B test. If a test is run at a 95% confidence level, the remaining 5% is the probability of a type I error.

If a business considers a 5% chance of error too high, the probability of false positive results can be reduced by raising the test's confidence level. If the confidence level is raised to 99%, the probability of a type I error decreases from 5% to 1%. However, there are some "pitfalls" in this calculation.

With the sample size and effect size held fixed, decreasing the probability of type I errors increases the probability of type II errors: the two move in opposite directions. Therefore, reducing the alpha error level (for example, from 5% to 1%) ultimately reduces the statistical power of the test.

Type I error and its proportional connection with type II error

Sometimes it's necessary to consciously increase the risk of type I errors (for example, from 5% to 10%) to achieve high test sensitivity.

Type II error

Type II error is a false negative result, meaning the inability to reject a false null hypothesis. Such a test will not show the advantage of any of the options, even though there are options that are more effective than others.

The probability of a type II error (β) is the complement of statistical power (1 - β): the higher one is, the lower the other. If the risk of a type II error is 20%, the sensitivity of the test is 80%.

Since alpha and beta errors trade off against each other, a test with an extremely low alpha error value (for example, 0.001%) carries a very high risk of a type II error unless the sample is very large.
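The trade-off between alpha and beta can be illustrated numerically. The following is a minimal sketch rather than the method of any particular calculator: it assumes a one-sided two-proportion z-test with a normal approximation, and the function name, the conversion rates (14% and 19%), and the 500 users per variant are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p1, p2, n_per_variant, alpha):
    """Approximate power of a one-sided two-proportion z-test."""
    z = NormalDist()
    p_bar = (p1 + p2) / 2
    # Standard error of the difference if the null hypothesis is true
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_variant)
    # Standard error of the difference if the variant really converts at p2
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant)
    # Smallest observed difference that would be declared significant
    critical_diff = z.inv_cdf(1 - alpha) * se_null
    return 1 - z.cdf((critical_diff - (p2 - p1)) / se_alt)


# Same sample and same true effect, three different alpha levels
for alpha in (0.10, 0.05, 0.01):
    power = power_two_proportions(0.14, 0.19, 500, alpha)
    print(f"alpha={alpha:.2f}  power={power:.2f}  beta={1 - power:.2f}")
```

With the sample size and the true effect unchanged, each stricter alpha in the loop yields a lower power and therefore a higher beta.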

Finding a balance is not an easy task. Through A/B testing, a business aims to determine the best combination of elements and identify the most effective variant among many. If the tests lack sufficient sensitivity, there's a great risk of not noticing a genuinely good option.

Variables affecting test statistical power

Next, we'll examine four interrelated variables that determine the statistical power of a test. While studying these variables, remember that the primary task of a tester is to control the probability of errors. This can be achieved through:

sample size;
minimum detectable effect (MDE);
significance level (α);
desired level of power (1 - β, which implies the type II error rate β).

Sample size

The sample needs to be sufficiently large; otherwise, conducting a good split test is impossible. It's crucial to calculate the sample size to ensure it's of the necessary and adequate size. A small sample won't provide enough test power, while an excessively large one will extend the test duration (long tests are more expensive and time-consuming).

Each analyzed variant or segment should have a sufficient number of users. For this reason, the sample size should be planned in advance so that the test has good statistical power. Otherwise, you may discover at the end of the test that the traffic was split across too many variants or segments for any of them to reach that size. If this realization comes too late, the test will have been conducted in vain.

Aim to reach statistically significant results within a reasonable period, but run the test for no less than a week or one full business cycle. In most cases, it's recommended to conduct testing for 2-4 weeks. Longer tests risk sample contamination, for example when users delete their cookies.

Therefore, determining the sample size and testing duration beforehand is crucial. This helps avoid the common mistake of conducting A/B testing "blindly," ending before obtaining statistically significant results.
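Once a per-variant sample size is known, the test duration follows from simple arithmetic. The sketch below assumes that every visitor enters the test and that traffic is split evenly between variants; the function name and all input numbers are hypothetical.

```python
from math import ceil


def estimate_duration_days(n_per_variant, n_variants, daily_visitors):
    """Days needed for every variant to reach its planned sample size."""
    return ceil(n_per_variant * n_variants / daily_visitors)


# Hypothetical inputs: 1000 users per variant, 2 variants, 120 visitors a day
days = estimate_duration_days(n_per_variant=1000, n_variants=2, daily_visitors=120)
full_weeks = ceil(days / 7)  # round up so the test covers whole weekly cycles
print(f"{days} days of traffic, so plan for {full_weeks} full weeks")
```

Rounding up to whole weeks keeps the test aligned with the weekly business cycle mentioned above.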

Minimum detectable effect (MDE)

MDE is the smallest difference between variants that the test is designed to detect. Small differences are hard to detect and require very large samples. Substantial differences are noticeable even in small samples.

However, small samples are also less reliable and can lead to erroneous conclusions. There's always a risk of error, and results can be considered reliable only within a narrow range. Since there's no single rule for determining sample size, no nominal level guarantees a 100% reliable result.

Significance level

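Test results are considered statistically significant when the observed difference would be unlikely to occur if the null hypothesis were true, that is, when the p-value falls below the chosen significance level (α).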

In simple terms, if a split test declares one landing page better than another at a 95% confidence level, then a difference this large would show up by chance no more than 5% of the time if the two pages actually performed the same. This is not the same as a 95% probability that the improvement is real.

The 5% significance level is the most commonly used initial level in online testing, representing a 5% chance of a type I error, as mentioned earlier. An alpha of 5% means there's a 5% chance of incorrectly rejecting the null hypothesis.
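As an illustration of how a significance check works, here is a minimal sketch of a one-sided, pooled two-proportion z-test. The choice of this particular test, the function name, and the visitor and conversion counts are all assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(conv_a, n_a, conv_b, n_b):
    """P-value for H0 'B is no better than A' (pooled two-proportion z-test)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z_score = (rate_b - rate_a) / se
    # Probability of a z-score at least this large if H0 were true
    return 1 - NormalDist().cdf(z_score)


alpha = 0.05
# Hypothetical data: 140 of 1000 control users and 175 of 1000 variant users converted
p = one_sided_p_value(conv_a=140, n_a=1000, conv_b=175, n_b=1000)
print(f"p-value = {p:.4f}; significant at alpha={alpha}: {p < alpha}")
```

The result is declared significant only when the p-value is below alpha, which is exactly the 5% type I error budget described above.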

All else being equal, reducing alpha from 5% to 1% increases the likelihood of a type II error. An increased risk of a type II error reduces the power of the test.

Desired level of power

If the statistical power of a test is 80%, there's a 20% chance that the real difference between variants won't be detected. If a 20% chance is too risky, it can be reduced to 10%, 5%, or 1% to achieve powers of 90%, 95%, or 99%, respectively.

Before concluding that 95% or 99% power will solve all your problems, note that each increase in power significantly increases the required sample size and the duration of the test.
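To get a feel for the cost of extra power, the sketch below uses the common simplification that the required sample size is proportional to (z_alpha + z_beta) squared. This is an approximation that assumes a one-sided test and a fixed effect size; the function name and the 80% reference point are assumptions for illustration.

```python
from statistics import NormalDist


def relative_sample_size(power, alpha=0.05, reference_power=0.80):
    """How many times more users are needed compared with the reference power."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha)  # one-sided critical value
    return ((z_alpha + z(power)) / (z_alpha + z(reference_power))) ** 2


for power in (0.80, 0.90, 0.95, 0.99):
    factor = relative_sample_size(power)
    print(f"power={power:.0%}  sample size x{factor:.2f}")
```

Under these assumptions, moving from 80% to 99% power requires roughly two and a half times as many users.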

So, what level of test sensitivity do you really need? The acceptable level of false negative risk for conversion optimization is often considered to be 20%, corresponding to an analysis power level of 80%.

80% power is a convention rather than a universally recognized standard, but it represents a reasonable balance between alpha and beta errors. Consider the following:

What level of risk of missing a genuine improvement is acceptable to you?
What is the minimum sample size required for the test to achieve the desired sensitivity?

Calculating the statistical power of a test

You can use an A/B test calculator. Enter known variable values to determine, for example, the required sample size to achieve sufficient test sensitivity.

For instance, suppose the calculator shows that each variant requires a sample of 681 users. The inputs were: test power of 80% and alpha of 5% (a 95% confidence level). You also knew the conversion rate for the control group is 14%, and you assumed an expected conversion rate of 19% for the better variant.
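The article does not say which formula the calculator uses. As a sketch, the standard normal-approximation formula for comparing two proportions, with a one-sided test and a pooled variance under the null hypothesis, arrives at the same figure of 681. Both of those choices and the function name are assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(p_control, p_variant, alpha=0.05, power=0.80,
                            one_sided=True):
    """Users needed per variant for a two-proportion z-test."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha) if one_sided else z(1 - alpha / 2)
    z_beta = z(power)
    p_bar = (p_control + p_variant) / 2
    # Spread of the difference under the null (pooled) and under the alternative
    sd_null = sqrt(2 * p_bar * (1 - p_bar))
    sd_alt = sqrt(p_control * (1 - p_control) + p_variant * (1 - p_variant))
    n = ((z_alpha * sd_null + z_beta * sd_alt) / (p_variant - p_control)) ** 2
    return ceil(n)


print(sample_size_per_variant(0.14, 0.19))  # 681
```

Calling the same function with one_sided=False gives a noticeably larger number (about 864), which is one reason different calculators can disagree on the same inputs.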

How to calculate the statistical power of a test

Using the same initial data, knowing the sample size, alpha value, and the desired level of statistical power (for example, 80%), you can determine the minimum detectable effect (MDE).

Determination of MDE

As we can see, an A/B test calculator is quite convenient when you know the values of three variables and aim to find the fourth one.
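Solving for the fourth variable can be sketched in code as well. The example below finds the MDE for a given sample size by bisection over the sample size formula. It assumes a one-sided two-proportion z-test with a normal approximation; the function names are hypothetical and this is not necessarily how any calculator does it.

```python
from math import sqrt
from statistics import NormalDist


def required_n(p1, p2, alpha=0.05, power=0.80):
    """Per-variant sample size for a one-sided two-proportion z-test."""
    z = NormalDist().inv_cdf
    p_bar = (p1 + p2) / 2
    numerator = (z(1 - alpha) * sqrt(2 * p_bar * (1 - p_bar))
                 + z(power) * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return (numerator / (p2 - p1)) ** 2


def minimum_detectable_effect(p1, n, alpha=0.05, power=0.80):
    """Smallest absolute lift detectable with n users per variant."""
    low, high = 1e-6, 1 - p1 - 1e-6
    for _ in range(60):  # bisection: halve the search interval each step
        mid = (low + high) / 2
        if required_n(p1, p1 + mid, alpha, power) > n:
            low = mid   # lift too small for this sample, try larger lifts
        else:
            high = mid  # this lift is detectable, try smaller lifts
    return high


mde = minimum_detectable_effect(0.14, 681)
print(f"absolute MDE = {mde:.4f}, relative MDE = {mde / 0.14:.1%}")
```

With a 14% baseline and 681 users per variant, the absolute MDE comes out at about 5 percentage points, consistent with the 14% versus 19% example.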

What to do if the sample size cannot be increased?

In some cases, a higher-powered test is needed, but increasing the sample size is not possible, for example when the page traffic is too low.

Let's consider a scenario: you've inputted your data into the A/B test calculator, and it shows that a sample size of 8000 or more is required.

How to use a sample size calculator

But you don't have the ability to gather such a sample; it would take several months or even longer. In this case, you need to increase the minimum detectable effect (MDE). In the example below, increasing the MDE from 10% to 25% reduces the sample size to 1356.

Sample size depends on increasing the MDE

Is it really always possible to increase the MDE to 25%? After all, this makes the test less sensitive: real improvements smaller than 25% are likely to go undetected.

The second option comes into play if you're willing to accept a 10% risk of type I error. In that case, you can lower the test's confidence level to 90% (an alpha of 10%).

Balance between errors within sample size calculator
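Both workarounds can be compared side by side. The sketch below assumes a one-sided two-proportion z-test with a normal approximation and a hypothetical 10% baseline conversion rate. Because the article does not state the baseline behind its own figures, the output is not expected to match 8000 or 1356.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size(baseline, relative_mde, alpha, power=0.80):
    """Users per variant for a one-sided two-proportion z-test."""
    z = NormalDist().inv_cdf
    p1, p2 = baseline, baseline * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    numerator = (z(1 - alpha) * sqrt(2 * p_bar * (1 - p_bar))
                 + z(power) * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))
    return ceil((numerator / (p2 - p1)) ** 2)


BASELINE = 0.10  # hypothetical control conversion rate
for relative_mde in (0.10, 0.25):
    for alpha in (0.05, 0.10):
        n = sample_size(BASELINE, relative_mde, alpha)
        print(f"MDE={relative_mde:.0%}  alpha={alpha:.0%}  n per variant={n}")
```

In this setup, raising the MDE shrinks the required sample far more than relaxing alpha does, but each option gives up something: sensitivity to small lifts in the first case, protection against false positives in the second.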

Iterating through inputs in the hope of obtaining satisfactory figures isn't the most reliable strategy. Analysts recommend starting from the sample size you can realistically collect and then adjusting the other values deliberately, before the test starts, until the trade-off between errors is acceptable.

Conclusion

Planning for statistical power reduces the risk of errors during testing and helps you identify the changes that genuinely affect performance on your platforms.

Adhere to these simple guidelines:

Choose the test duration wisely: ideally 2-4 weeks
Use a test calculator to determine an acceptable test power
Adhere to the requirements for the minimum sample size
Aggregate segments and test them if needed
Set test sensitivity requirements only after confirming that the minimum sample size can be met.

Source: CXL