By: Gagik Chakhoyan, Data Science Team Lead
Every app tries to improve its key performance metrics (recall that in the Zoom example from the previous post, the metric they were trying to improve was the number of meetings hosted). There are numerous ways you can iterate on your product to improve your key metric. But how do you decide what to change? Even once you have generated and prioritized several ideas, the moment for implementation arrives and you need a methodology for assessing each change’s effect on your metrics.
Let’s take a hypothetical example from SoloLearn. Here is an app screenshot from the Android version, where new users are prompted to choose one of the many courses to start learning:
In this screenshot we see a list of the top seven most popular courses on the platform. It’s easy to guess what the key metric would be -- the number of users who start to learn a course. Specifically, we want to increase the percentage of users who successfully open a course from this page and go to its first lesson.
For the sake of simplicity let’s assume 100 users visit this page each day. We know these 100 users are new, so we can assume that this is their first encounter with the app and that their intention is to discover what SoloLearn can offer them in terms of learning content.
It is the goal of the Product and Design teams to have a UI that best serves the majority of the users that land on this page. Obviously, it would be undesirable for SoloLearn if a user who has the intention to learn to code does not manage to do so because of poor UI -- such as if they arrive on the page and think that SoloLearn only offers seven courses and none of those are what they want to learn. Let’s say that 90% of the people that visit this page go on to open a course. Our goal would be to increase that percentage from 90%.
The team comes up with two ideas around the hypothesis that the reason that 10% of users don’t open a course from this page is that they don’t discover all of the other courses that SoloLearn offers. Idea #1 is to change the page design and list only the names of the courses without any visuals, which would free up space on the screen and allow more items to be listed in the visible area of the page. Idea #2 is to have a square grid of courses instead of a list, which would allow more courses to be listed in the visible area.
One next step we could take is to just make the changes and watch for any lift in the metrics. For example, we could implement idea #1 and wait a month; suppose that at the end of the month the percentage of users who start a course from that screen is 93%. Of course, 93% is a higher conversion rate than the 90% we saw with the previous design, but how confident can we be in the result? How can we be sure that the uptick is due only to the screen design change and that no other factors were in play during that month? For example, maybe the test month coincided with the start of the school year in September, when students naturally turn to SoloLearn for help with their classes. Someone could even argue that, without our change, the rate that month would have been 95% rather than 93%, meaning the change actually hurt conversion.
One way to overcome this bias is to segment the users into similar groups and measure the conversion rates for them separately -- for example, casual learners, students, etc. This doesn’t entirely solve the bias problem, as there is no guarantee that there are no other factors that could impact the conversion rate in the experiment month.
The same would occur if we implemented idea #2 and rolled it out to all users. Even if we see no changes to the metrics, we can never be completely confident that our ideas and implementations do not negatively impact user engagement. So how do we proceed?
The field of clinical trials has the answer to this question in the gold standard of evidence: Randomized Controlled Trials (RCT).
The key idea in RCTs is that the groups that are being affected by your changes should be completely similar; in other words, they must differ only by the change that you make and the effect of all other factors must be equal.
How do you do this? You split the users into several groups during the experimentation period and expose the users of each group to only one variant of the change you are making. So for our example above, we would have three groups. The experience of the users in the first group (Group A) will not change -- we call this group “control”, a term that again comes from the field of clinical trials. The users in the second group (Group B) will see the variant with the list of course names only (which is called “Treatment 1”). The third group (Group C) will be exposed to the variant with the square grid of courses (which is called “Treatment 2”). If, after splitting the users in this way, we see the following conversion rates:
Group A: 91%
Group B: 95%
Group C: 89%
then we can safely assume that the difference in the conversion rates was caused only by our changes as any other factors should have affected all three groups equally.
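To make the comparison concrete, here is a minimal sketch of how the three conversion rates and each treatment's difference from control could be computed. The visitor and course-start counts are hypothetical, chosen only so that they reproduce the 91%, 95% and 89% figures above; they are not real SoloLearn data.

```python
# Hypothetical counts per group (not real data): users who saw the courses
# page, and users who went on to open a course's first lesson.
visitors = {"A": 1000, "B": 1000, "C": 1000}
course_starts = {"A": 910, "B": 950, "C": 890}

# Conversion rate = course starts / visitors, computed per group.
rates = {group: course_starts[group] / visitors[group] for group in visitors}

# Difference of each treatment from control (Group A), in percentage points.
diff_vs_control = {
    group: (rates[group] - rates["A"]) * 100 for group in ("B", "C")
}

for group in ("A", "B", "C"):
    print(f"Group {group}: {rates[group]:.0%}")
for group, diff in diff_vs_control.items():
    print(f"Group {group} vs control: {diff:+.1f} percentage points")
```

Under these assumed counts, Treatment 1 sits above control and Treatment 2 below it. Whether gaps of this size can be trusted is a separate question from how they are computed.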
Consider the September example again. Suppose the users who joined at the start of the school year make up 30% of the total user base. They have higher conversion rates, but because random assignment puts them at roughly 30% of the users in each of the three groups, the conversion rate in all three groups gets lifted equally.
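The arithmetic behind this can be sketched in a few lines. Only the 30% share comes from the example; the per-group rates for regular users and the size of the September lift are assumptions picked for illustration.

```python
SEPTEMBER_SHARE = 0.30  # from the example: the cohort's share in every group

# Assumed conversion rates for regular (non-September) users in each group.
regular_rate = {"A": 0.895, "B": 0.935, "C": 0.875}

# Assumed extra conversion for September users, the same in every group.
SEPTEMBER_LIFT = 0.05


def blended_rate(regular: float, share: float, lift: float) -> float:
    """Weighted average of regular users and the higher-converting cohort."""
    return (1 - share) * regular + share * (regular + lift)


blended = {
    group: blended_rate(rate, SEPTEMBER_SHARE, SEPTEMBER_LIFT)
    for group, rate in regular_rate.items()
}

for group in ("A", "B", "C"):
    print(f"Group {group}: regular {regular_rate[group]:.1%}, "
          f"blended {blended[group]:.1%}")

# The gap between a treatment and control is the same with or without
# the cohort, because the cohort raises every group by share * lift.
gap_regular = regular_rate["B"] - regular_rate["A"]
gap_blended = blended["B"] - blended["A"]
print(f"B minus A: regular {gap_regular:.3f}, blended {gap_blended:.3f}")
```

Every group's rate rises by the same 0.30 × 0.05 = 1.5 percentage points, so the differences between groups are untouched. If the cohort were concentrated in one group instead, that group alone would be lifted and the comparison would be biased.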
There are several key things to be pointed out here.
Randomization should be done properly. Each new user has a one-third, or 33.33%, probability of joining each group, and there is no explicit logic assigning them to a specific group. If we were to assign more active students to a particular group, the results and conclusions we draw after the experiment would be biased, and therefore invalid.
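As a minimal sketch of what such an assignment could look like, the snippet below gives every new user an equal chance of landing in each group and looks at nothing about the user. The function name, the fixed seed and the simulated 3,000 users are illustrative assumptions, not a description of SoloLearn's actual implementation.

```python
import random
from collections import Counter

# A = control, B = Treatment 1 (names only), C = Treatment 2 (square grid)
GROUPS = ["A", "B", "C"]


def assign_group(rng: random.Random) -> str:
    """Pick one of the three groups uniformly at random.

    No user attribute (activity, country, signup date) is consulted,
    so no explicit logic can steer a user into a particular group.
    """
    return rng.choice(GROUPS)


# Fixed seed only so that this illustration is reproducible.
rng = random.Random(42)

# Simulate assigning 3,000 hypothetical new users.
assignments = [assign_group(rng) for _ in range(3000)]
counts = Counter(assignments)

for group in GROUPS:
    print(f"Group {group}: {counts[group]} users "
          f"({counts[group] / len(assignments):.1%})")
```

The group sizes will not be exactly equal, but each should land close to one third. In a real product the assignment would also need to be stored per user so that the same person always sees the same variant.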
One Experiment, One Metric:
Each experiment should be designed around a single metric. There are always other metrics that we want to measure, but a change aimed at one metric can move others as well, either positively or negatively. The example we walked through is straightforward in that the only metric we expect to be impacted is conversion to starting a lesson from the courses page, but there are cases where a change that improves the target metric has a deleterious effect on another metric. Hypothetically, there could be a case where users don’t find what they want on the courses page, and then wander around the app and discover other sections such as the Code or Q&A section. If the new design increases the percentage of users who start a course from the courses page, we may see a corresponding drop in engagement in other areas of the app.
The second method we discussed for conducting the experiment, with RCTs, is known more commonly as A/B testing, where we refer to groups or changes as A and B (and in our case, we have A/B/C testing as the number of groups is three).
While A/B testing is the cleanest way to test your hypothesis, the data generated by rolling out the changed version to all users can still be very useful. A/B testing is quite costly, as you need to keep two or more versions of the page/section/application and serve them simultaneously. If you are very confident in your hypothesis, and the cost of a specific A/B test is very high, then it makes sense to roll out the change to all users, conduct an observational study, and derive the results of the change from it. You can control for measurement artifacts by making some adjustments -- such as comparing the data to previous time periods to adjust for seasonality effects, etc.
In this post we gave a high-level overview of the experimentation framework that a SaaS company can implement to iterate on its product and drive user growth. Next, we will dig deeper into the technical details of A/B testing. Some food for thought prior to our next post in this series: can we be confident that the three numbers we measured after one month of testing (91%, 95%, 89%) are not just arbitrary numbers and that they measure the actual conversion rates? What if we run the same experiment again or keep it going for a longer period of time? In that case, can we be sure that we will see the same or very similar numbers? Are there any methodologies we can use to quantify our confidence in the results? It turns out that statistics gives us the tools to make our testing more scientific (statistical hypothesis testing). More to come!
Gagik holds a Ph.D. in Mathematical Economics and serves as SoloLearn’s Data Science Team Lead and currently teaches Data Science at Yerevan State University. His areas of interest are Statistics, Probabilities, Data Visualization and Deep Learning. He began his career in financial risk management before shifting to software at SoloLearn.