Best Practices for pLTV Experiments

May 21, 2024 · 5 min

We explore best practices for evaluating predictive LTV (pLTV) models in advertising. First, we explain three problems: inflated reporting, seasonality, and deciding when to end experiments. For each, we show solutions such as causal inference and modern marketing measurement.

A data-driven head of User Acquisition is often tasked with testing new strategies, such as LTV modeling. In our experience, three challenges crop up during impact evaluation: inflated reporting, seasonality, and when to end experiments.

Inflated reporting

The problem: A pLTV model helps companies identify high-pLTV users, but irrelevant correlations must first be disentangled from causation. Relying on attributed results for ad measurement is unreliable at best, for three reasons:

Marketing measurement solutions have changed radically because of privacy changes, which place legal and technical constraints on data and alter how identity is processed and used.
Conversion data is missing, partial, delayed, or modelled at various reporting levels. Ad platform reporting tries to tie clicks and impressions to conversions, but its results depend on the data it can access and on how the model is configured.
The role of clicks and impressions in conversions is still being determined, and most tools marketers use lack causality-based analysis.

The solution: We suggest the following three steps.

Bring causal inference experiments into the reporting.
Well-designed experiments incorporate cause-and-effect relationships into reporting and give gaming firms a clearer read on the impact of their advertising spend. Gaming companies are beginning to embrace triangulated measurement, which combines multi-touch attribution (MTA), marketing mix modeling (MMM), and experiments, and in particular uses experiments to adjust modeling or attribution. Essentially, modeling becomes more rigorous while retaining a cross-channel view for long-range planning, and attribution becomes more rigorous while remaining fast enough for tactical changes.
Use pLTV as the success metric in experiments.
Many testing methods (on-platform or open-source) allow you to use pLTV as the success metric for deciding the winning strategy. This means two important things. First, you’re learning the value of the pLTV strategy in terms of lifetime value rather than short-term metrics like ROAS (return on ad spend) or cost per acquisition, so you avoid optimizing for short-term value. Second, you’re learning whether the strategy drives incremental, causal value above doing nothing, which is exactly what you need to adjust your reporting and avoid the attribution trap.
Validate LTV predictions with actual day-N revenue.
Finally, run a test that validates LTV predictions against actual day-N revenue, to check whether your advertising reporting tracks internal revenue figures. If you’ve been experimenting with pLTV as a metric, there is less room for nasty surprises.
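
As a rough illustration of the second and third steps, the sketch below compares mean pLTV per user between a treatment group and a control group, then checks how predictions track realized day-N revenue in aggregate. The function names, the sample numbers, and the normal-approximation interval are illustrative assumptions, not a specific platform's method.

```python
import math
from statistics import mean, variance


def incremental_lift(treatment, control, z=1.96):
    """Return (lift, (low, high)): the difference in mean pLTV per user and an
    approximate 95% interval (normal approximation, unequal variances)."""
    lift = mean(treatment) - mean(control)
    se = math.sqrt(variance(treatment) / len(treatment)
                   + variance(control) / len(control))
    return lift, (lift - z * se, lift + z * se)


def calibration_ratio(predicted_ltv, actual_day_n_revenue):
    """Total predicted LTV divided by total realized day-N revenue.
    A value near 1.0 means predictions track revenue in aggregate."""
    total_actual = sum(actual_day_n_revenue)
    if total_actual == 0:
        raise ValueError("no realized revenue to compare against")
    return sum(predicted_ltv) / total_actual


# Hypothetical per-user values, for illustration only.
treatment_pltv = [12.0, 0.0, 30.0, 8.0, 5.0, 20.0]
control_pltv = [10.0, 0.0, 15.0, 4.0, 6.0, 7.0]
lift, (low, high) = incremental_lift(treatment_pltv, control_pltv)
print(f"lift per user: {lift:.2f} (approx. 95% interval {low:.2f} to {high:.2f})")

ratio = calibration_ratio([10.0, 20.0, 30.0], [12.0, 18.0, 20.0])
print(f"predicted / actual: {ratio:.2f}")
```

An interval that includes zero means the experiment has not yet shown an incremental effect, and a ratio well above 1.0 suggests the model is overpredicting relative to realized revenue.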

Seasonality

The problem: External factors like seasonality can play a big role in the revenue and conversions of gaming companies. Seasonality refers to the ebb and flow of demand and sales throughout the year, tied to repeatable trends and cycles in user behavior, e.g., summer vs. winter months, weekdays vs. weekends, payday timing within the month, and holidays or events. It's common to see mobile app usage fall over the summer months. In the case of an NFL game, interest varies between the point when the football season culminates in the Super Bowl and the point when the draft for the new season begins. When it comes to testing, how representative will any one test be if the baselines of user behavior fluctuate so much?

The solution: When it comes to experimentation, multiple experiments are better than a single one. A single experiment can be noisy and can be affected by many external factors (e.g., timing or media weight in other channels).

For this reason, it makes the most sense to test regularly over different periods. Combining learnings from multiple experiments will make your findings more robust against one-off volatility and more representative of “true” incrementality.
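
One common way to combine results from several experiments is an inverse-variance weighted average, which gives more weight to the more precise tests. The sketch below is a minimal version of that idea, not a method the article prescribes. It assumes each experiment reports a lift estimate with a positive standard error and that all tests measure the same underlying effect, which strong seasonality may violate. The function name and the numbers are hypothetical.

```python
import math


def pool_lift_estimates(estimates):
    """Inverse-variance weighted average of (lift, standard_error) pairs.
    Returns (pooled_lift, pooled_standard_error)."""
    if not estimates:
        raise ValueError("need at least one experiment")
    # More precise experiments (smaller standard error) get larger weights.
    weights = [1.0 / (se ** 2) for _, se in estimates]
    total_weight = sum(weights)
    pooled = sum(w * lift for w, (lift, _) in zip(weights, estimates)) / total_weight
    return pooled, math.sqrt(1.0 / total_weight)


# Hypothetical relative lifts from tests run in different seasons.
experiments = [(0.10, 0.05), (0.04, 0.05), (0.07, 0.10)]
pooled_lift, pooled_se = pool_lift_estimates(experiments)
print(f"pooled lift: {pooled_lift:.3f} +/- {1.96 * pooled_se:.3f}")
```

If the individual lifts differ far more than their standard errors would suggest, that is a sign the effect itself varies by period, and a single pooled number should be read with caution.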

Ending experiments early

The problem: Ideally, an experiment's flight time covers the duration of the ad campaign(s) and at least one conversion cycle, and collects enough conversion outcomes for valid results. But there are a few reasons why this doesn't happen:

We discover a problem with the experimental design that would contaminate results and confuse interpretation.
A/B tests on ad platforms can declare a winner early, as soon as results pass the platform's threshold for the chance of one variant winning. This can be an issue if your campaigns have not yet exited the learning phase and their performance hasn't stabilized.
UA managers might terminate the experiment if early results don't meet expectations.

The solution: There is no ‘one size fits all’ answer to each scenario. However, we can use predefined stopping criteria for experiments.

If a test is set up incorrectly, we stop it and re-randomize the audience for another test. Otherwise, we’re undermining the purpose of the test.
If we don’t like the results, we should recognize that we’re in the process of learning a vital insight that could change how markets operate in the future. Ending the test early could mean we lose that opportunity.
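
Predefined stopping criteria can be written down as a small rule that is fixed before launch. The sketch below is one hypothetical way to encode the guidance above: the class, the field names, the thresholds, and the choice to require the campaign length plus one conversion cycle are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoppingCriteria:
    campaign_days: int            # planned length of the ad campaign(s)
    conversion_cycle_days: int    # one full conversion cycle
    min_conversions_per_arm: int  # agreed before launch, e.g. from a power calculation


def stop_decision(days_run, conversions_per_arm, criteria, design_flaw_found=False):
    """Return 'stop_and_rerandomize', 'continue', or 'ready_to_conclude'."""
    if design_flaw_found:
        # A broken setup contaminates results, so restart with a fresh audience.
        return "stop_and_rerandomize"
    # Assumption: the flight must cover the campaign plus one conversion cycle.
    required_days = criteria.campaign_days + criteria.conversion_cycle_days
    enough_time = days_run >= required_days
    enough_data = all(c >= criteria.min_conversions_per_arm
                      for c in conversions_per_arm)
    # Early results, good or bad, are deliberately not an input here.
    return "ready_to_conclude" if enough_time and enough_data else "continue"


criteria = StoppingCriteria(campaign_days=28, conversion_cycle_days=7,
                            min_conversions_per_arm=200)
print(stop_decision(10, [50, 60], criteria))
print(stop_decision(35, [250, 240], criteria))
```

Because the rule never looks at which variant is ahead, disappointing early numbers cannot trigger a stop on their own.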

Conclusion

Measuring the success of a pLTV campaign is more nuanced than first appears. It involves understanding causal relationships, continuous and adaptive experimentation, and human judgment.

Overcoming inflated reporting requires using causal methods and pLTV as a success metric. Only then can you be confident you're measuring true long-term value rather than short-term gains.
Seasonality introduces variability into our experimental results, which makes decision-making more complicated. However, conducting regular experiments over different periods can mitigate this impact.
Finally, using predefined stopping criteria for experiments can balance business constraints with deeper insights and help us avoid premature conclusions.