Think about the largest and smallest values you see today, in [dollars]. Write those min and max amounts below.
_____________
Title of Card
What is the average [online purchase value] you see today, in [dollars]? Drag the dot there.
Now, think about the middle 70% of the [online purchase values] you see today, in [dollars] – the ones that aren't extreme in any direction. Drag the barbells so they shade over that 70%.
Finally, think about how you are hoping to shift those [online purchase values] with this experiment. What is the smallest shift you'd need to see in the experiment to feel confident rolling out the idea you're experimenting with more broadly? Drag the shaded area – the average and the 70% – to represent that minimum shift.
Because an experiment is sampling from a broader population, there is always a chance (even with the best random sampling methods) that we accidentally draw from some extreme set of people and see results that are different from the truth of that population. We could see a false positive – an effect in the sample that really isn't there in the population – or a false negative – no effect in the sample, even though there really is one in the population.

We won't know which results are false, of course. Which means with a false positive, we may move forward with an idea that really doesn't work. Or with a false negative, we may withhold an idea that really does work.

Think about this in the context of your hypothesis: [hypothesis]. Which would be worse for you, financially, PR-wise, politically, operationally...? A false positive, i.e., moving forward with an idea that really doesn't work? Or a false negative, i.e., withholding one that really does work? Drag the ball to indicate their relative risk.
You run the experiment, and you see an effect! You go tell your boss. But you need to remind your boss there is some % chance that the effect is just a false positive... What is the largest % chance of a false positive that you – and your boss – are comfortable with? Drag the bar to indicate it.
What is the absolute maximum amount of [units] you could use for your sample in this experiment? Consider how more [units] can often mean greater financial cost, greater time to implement, and/or greater risk.

Wizard: Power calculation for comparing proportions or averages

Is the number of (s) you can experiment with extremely large or unlimited? Or is it limited by budget, access, or other constraints?
Sensitivity power calculation for comparing proportions
Categorical Sensitivity Analysis: Current Scenario
alpha (risk of false positives or type I error rate): The likelihood of measuring at least the observed result, when in fact there is no effect at all
Assigning an alpha level asks you to live in a world where you've run the experiment and found a seemingly statistically significant effect size. If you select an alpha of .1, you've decided to live in a world where 1 out of every 10 experiments of this design would produce an effect size at least as large as the one you've measured even though there is truly no effect at all.
Power (1-risk of false negatives or 1-type II error rate): The likelihood that, when a treatment has an effect, you will be able to distinguish the effect from zero
Determining power asks you to live in a world where you've run the experiment and found a null result. If you select power of .95, you've decided to live in a world where 1 out of every 20 experiments of this design would fail to detect an effect even though there truly is one.
Total # of (s) available for experiment
Expected control group proportion that NEW QUESTION TODO (% success in control group)
If we collected 100 samples of the DV from your control group, how many would meet the criteria of success as you define it? (E.g., if your DV is the click-through rate on an email, how many emails out of 100 do you expect to receive a click in your control group?)
Expected treatment group proportion that NEW QUESTION TODO (% success in treatment group)
This value establishes the minimum proportion that your treatment group must achieve in order to produce a statistically significant result. E.g., expected treatment group proportions of 13% and 15% mean your intervention must produce a proportion of more than 15%, or less than 13%, to yield a statistically significant result.
-
Minimum Detectable Effect (MDE)
This value establishes the minimum effect size between your control and treatment comparison group(s) that your test must measure in order to achieve statistical significance. If you're not very confident in your rationale for why your treatment would produce an incremental improvement of this size over the control, that is a sign you may want to tighten your expected treatment group estimate.
-
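As a rough illustration of the sensitivity calculation above, the minimum detectable treatment proportion for a fixed sample size can be approximated with a normal-approximation bisection. This is a sketch with made-up inputs (control proportion 10%, 1,000 units per group, alpha .05, power .80), not the app's exact calculator:

```python
from statistics import NormalDist
from math import sqrt

def detectable_treatment_proportion(p_control, n_per_group, alpha=0.05, power=0.80):
    """Smallest treatment-group proportion above the control proportion that the
    design can flag as significant (two-sided test, normal approximation)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)
    z_power = z(power)
    lo, hi = p_control, 1.0
    for _ in range(60):  # bisect on the treatment proportion
        p_t = (lo + hi) / 2
        se = sqrt(p_control * (1 - p_control) / n_per_group
                  + p_t * (1 - p_t) / n_per_group)
        achieved = (p_t - p_control) / se - z_alpha  # z-score of achieved power
        if achieved >= z_power:
            hi = p_t
        else:
            lo = p_t
    return hi

p_t = detectable_treatment_proportion(0.10, n_per_group=1000)
print(round(p_t - 0.10, 3))  # the MDE, as an absolute difference in proportions
```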
Sensitivity power calculation for comparing averages
Continuous Sensitivity Analysis: Current Scenario
alpha (risk of false positives or type I error rate): The likelihood of measuring at least the observed result, when in fact there is no effect at all
Assigning an alpha level asks you to live in a world where you've run the experiment and found a seemingly statistically significant effect size. If you select an alpha of .1, you've decided to live in a world where 1 out of every 10 experiments of this design would produce an effect size at least as large as the one you've measured even though there is truly no effect at all.
Power (1-risk of false negatives or 1-type II error rate): The likelihood that, when a treatment has an effect, you will be able to distinguish the effect from zero
Determining power asks you to live in a world where you've run the experiment and found a null result. If you select power of .95, you've decided to live in a world where 1 out of every 20 experiments of this design would fail to detect an effect even though there truly is one.
Total # of (s) available for experiment
Pooled standard deviation of DV: How variable or 'spread' is the outcome (dependent) variable in both test & control group(s)?
If you have no sample data to establish the SD, you can estimate the standard deviation with the following thought experiment: excluding outliers, subtract the minimum DV value you'd expect from the maximum DV value you'd expect, then divide by 6.
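The range rule of thumb in that tooltip can be sketched in a couple of lines; the min and max below are hypothetical values for illustration:

```python
# Range rule of thumb: for roughly bell-shaped data, nearly all values fall
# within +/- 3 SD of the mean, so SD ~= (max - min) / 6.
expected_min = 5.0    # smallest typical DV value you'd expect (no outliers)
expected_max = 95.0   # largest typical DV value you'd expect (no outliers)

estimated_sd = (expected_max - expected_min) / 6
print(estimated_sd)  # 15.0
```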
Expected control group NEW QUESTION TODO average
If we collected 100 samples of the DV from your control group and averaged them, what number do you estimate?
Expected treatment group NEW QUESTION TODO average
This value establishes the minimum average that your treatment group must achieve in order to produce a statistically significant result. E.g., expected treatment group averages of 27 and 33 mean your intervention must produce an average of more than 33, or less than 27, to yield a statistically significant result.
-
Minimum Detectable Effect (MDE)
This value establishes the minimum effect size between your control and treatment comparison group(s) that your test must measure in order to achieve statistical significance. If you're not very confident in your rationale for why your treatment would produce an incremental improvement of this size over the control, that is a sign you may want to tighten your expected treatment group estimate.
-
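For the continuous case, the minimum detectable difference in averages has a simple closed form under the normal approximation. A sketch with hypothetical inputs (SD of 15, 500 units per group, alpha .05, power .80):

```python
from statistics import NormalDist
from math import sqrt

def mde_for_means(sd, n_per_group, alpha=0.05, power=0.80):
    """Minimum detectable difference in means for a two-sided, two-sample
    comparison (normal approximation, equal group sizes)."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * sd * sqrt(2 / n_per_group)

print(round(mde_for_means(sd=15.0, n_per_group=500), 2))  # 2.66
```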

Please note:

Test group expected proportion or average & MDE: Our power calculators assume you will run a two-sided test, meaning you're willing to consider the chance that the treatment may increase OR decrease the value of your primary outcome. The two values indicate a statistically significant decrease and increase in the primary outcome relative to the expected control outcome.

A priori power calculation for comparing proportions
Categorical A-Priori Analysis: Current Scenario
alpha (risk of false positives or type I error rate): The likelihood of measuring at least the observed result, when in fact there is no effect at all
Assigning an alpha level asks you to live in a world where you've run the experiment and found a seemingly statistically significant effect size. If you select an alpha of .1, you've decided to live in a world where 1 out of every 10 experiments of this design would produce an effect size at least as large as the one you've measured even though there is truly no effect at all.
Power (1-risk of false negatives or 1-type II error rate): The likelihood that, when a treatment has an effect, you will be able to distinguish the effect from zero
Determining power asks you to live in a world where you've run the experiment and found a null result. If you select power of .95, you've decided to live in a world where 1 out of every 20 experiments of this design would fail to detect an effect even though there truly is one.
Expected control group proportion that NEW QUESTION TODO (% success in control group)
If we collected 100 samples of the DV from your control group, how many would meet the criteria of success as you define it? (E.g., if your DV is the click-through rate on an email, how many emails out of 100 do you expect to receive a click in your control group?)
Expected treatment group proportion that NEW QUESTION TODO (% success in treatment group)
This value establishes the minimum proportion that your treatment group must achieve in order to produce a statistically significant result. E.g., expected treatment group proportions of 13% and 15% mean your intervention must produce a proportion of more than 15%, or less than 13%, to yield a statistically significant result.
Minimum Detectable Effect (MDE)
This value establishes the minimum effect size between your control and treatment comparison group(s) that your test must measure in order to achieve statistical significance. If you're not very confident in your rationale for why your treatment would produce an incremental improvement of this size over the control, that is a sign you may want to tighten your expected treatment group estimate.
-
Sample size per comparison group -
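The a priori calculation for proportions can be approximated with the classic normal-approximation sample-size formula. A sketch with hypothetical inputs (control proportion 10%, treatment proportion 15%, alpha .05, power .80), not necessarily the app's exact method:

```python
from statistics import NormalDist
from math import sqrt, ceil

def n_per_group_proportions(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-proportion test
    (normal approximation with pooled variance under the null)."""
    z = NormalDist().inv_cdf
    z_a, z_b = z(1 - alpha / 2), z(power)
    p_bar = (p_control + p_treatment) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p_control * (1 - p_control)
                        + p_treatment * (1 - p_treatment))) ** 2
    return ceil(num / (p_treatment - p_control) ** 2)

print(n_per_group_proportions(0.10, 0.15))  # 686
```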
A priori power calculation for comparing averages
Continuous A-Priori Analysis: Current Scenario
alpha (risk of false positives or type I error rate): The likelihood of measuring at least the observed result, when in fact there is no effect at all
Assigning an alpha level asks you to live in a world where you've run the experiment and found a seemingly statistically significant effect size. If you select an alpha of .1, you've decided to live in a world where 1 out of every 10 experiments of this design would produce an effect size at least as large as the one you've measured even though there is truly no effect at all.
Power (1-risk of false negatives or 1-type II error rate): The likelihood that, when a treatment has an effect, you will be able to distinguish the effect from zero
Determining power asks you to live in a world where you've run the experiment and found a null result. If you select power of .95, you've decided to live in a world where 1 out of every 20 experiments of this design would fail to detect an effect even though there truly is one.
Pooled standard deviation of DV: How variable or 'spread' is the outcome (dependent) variable in both test & control group(s)?
If you have no sample data to establish the SD, you can estimate the standard deviation with the following thought experiment: excluding outliers, subtract the minimum DV value you'd expect from the maximum DV value you'd expect, then divide by 6.
Expected control group NEW QUESTION TODO average
If we collected 100 samples of the DV from your control group and averaged them, what number do you estimate?
Expected treatment group NEW QUESTION TODO average
This value establishes the minimum average that your treatment group must achieve in order to produce a statistically significant result. E.g., expected treatment group averages of 27 and 33 mean your intervention must produce an average of more than 33, or less than 27, to yield a statistically significant result.
Minimum Detectable Effect (MDE)
This value establishes the minimum effect size between your control and treatment comparison group(s) that your test must measure in order to achieve statistical significance. If you're not very confident in your rationale for why your treatment would produce an incremental improvement of this size over the control, that is a sign you may want to tighten your expected treatment group estimate.
-
Sample size per comparison group -
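The a priori calculation for averages also has a standard closed form. A sketch with hypothetical inputs (SD of 15, minimum detectable difference of 3, alpha .05, power .80), not necessarily the app's exact method:

```python
from statistics import NormalDist
from math import ceil

def n_per_group_means(sd, mde, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided, two-sample comparison
    of means (normal approximation, equal group sizes and variances)."""
    z = NormalDist().inv_cdf
    z_total = z(1 - alpha / 2) + z(power)
    return ceil(2 * (z_total * sd / mde) ** 2)

print(n_per_group_means(sd=15.0, mde=3.0))  # 393
```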
Based on your inputs above, your hypothesis should say something like, "If we change , then we will see a difference in average ". Does your written hypothesis generally agree with that statement?
Based on your inputs above, your hypothesis should say something like, "If we change , then we will see a difference in the proportion that ". Does your written hypothesis generally agree with that statement?
The items you list as variables should match your hypothesis; if they do not, sync things before you move on! Copy to Blueprint or click X to return.
Business Experiment Launchpad powered by BehavioralSight (beta)
Blueprint #5 Last Edited 4 days ago

Part I: Why are you conducting research? (goals)
What is the strategic importance and/or impact to the organization of rigorously studying this topic?
What do you want to learn in this particular study?

don’t forget to SAVE your work

Part II: What are you measuring? (variables)
What change(s) are you going to make, that you think will in turn change the outcome in the world that you care about? E.g., the words in an email subject line, the color of a website button, the greeting a salesperson says when a customer walks in the door, etc.
What outcome in the world do you care about, can measure, and want to try to influence with your idea? E.g., customers open our emails more frequently, customers purchase higher value products, employees stay with the company longer, etc.

Categorical: a frequency, probability, or percentage describing an action that is or isn't taken, a threshold that is or isn't met, or a label that is or isn't given, e.g., the % of customers who open a savings account in 2020, the % of customers who save over $5k a year, the % of customers who give a satisfaction rating of 5 out of 5

Continuous: numerical data with a wide, continuous range of values like money, weight, or count, e.g., the average amount saved by customers, the average number of days an account stays open


Select a category of test subjects that you want to compare, for instance “reported the highest satisfaction rating”, or “clicked on the promotion”
What other outcomes in the world do you care about, can measure, and want to try to influence with your idea? These will be used in exploratory analyses, while the “primary outcome” will be the focus of this experiment blueprint
Through what medium are you running the experiment? E.g., email, call center, website, posters in a physical space, etc.
Whose behavior are you trying to change? E.g., all customers, customers in a certain region, customers of a certain product, all employees, employees on a certain team, etc.

don’t forget to SAVE your work

Part III: What do you predict and why? (hypothesis)
Your hypothesis should be a statement with an "if-then" logic (e.g., "if I change X then Y will happen"); a counterfactual or control (e.g., "versus if I do NOT change X"); specific (describe X, Y, and the direction of the changes so that a colleague as well as a stranger on the street could understand); testable (you can make the changes to X that you desire, and you can measure the changes to Y that you expect); and falsifiable (there is at least some possibility that you could see different results than you predict)

What is the supporting rationale for your prediction? Your rationale could come from prior experiments (e.g., psychology studies in academia, other experiments you have run), prior non-experimental research (e.g., focus groups, interviews, surveys), user feedback (e.g., comments, complaints, anecdotes), external examples (e.g., competitor initiatives, ex-industry initiatives), or other sources. In any case, the source of your hypothesis should be clearly stated and go beyond mere gut feeling.

don’t forget to SAVE your work

Part IV: What is your experiment setup? (design)
How many versions of the independent variable will you have? Will there be a "control" group that doesn't get any version, or gets a "plain" version? (E.g., 1 control + 2 'new' versions to test = 3 groups.)
What version of the independent variable will the control group experience? Nothing? Something ‘vanilla’? The status quo?
What version of the independent variable will test group 1 experience?
For each additional test group, what version of the independent variable will they experience?
This level could be customers, retail locations, training centers, accounts, etc. The lowest level (i.e., individuals or sessions) is best, but higher level groups may be necessary for operational reasons.

don’t forget to SAVE your work

Part V: Who is participating in the experiment? (sampling plan and treatment assignation)
Often we cannot run an experiment on the entire population we care about; we must pick a sub-set. Sometimes we cannot even run an experiment on anyone in the population we care about; we must pick a proxy. From where will you source your experiment participants (e.g., a sample of clients attending our annual conference, Amazon Mechanical Turk workers, customer emails)?
It may be important to filter out, or select for, different qualities in order to ensure you zero-in on the 'right' (representative) population.
It may be important to filter out, or select for, different qualities in order to ensure you zero-in on the 'right' (representative) population.
Participants can be assigned to groups based on a variety of methods, but the assignment must be random
Ideally participants will not know they are part of an experiment to avoid confounding the interpretation of results. If there is a risk they will, plan ahead to avoid or mitigate.
When will you begin and end your experiment? Sometimes this timing is based on calendars; other times this timing is based on the expected length of time to sample enough participants. (“Enough” will be determined in the next part of the blueprint.)

don’t forget to SAVE your work

Part VI: How are you powering your experiment? (statistical & business risk)

How many subjects do you need to complete the experiment for it to be adequately powered?

don’t forget to SAVE your work