Early LTV predictions - 5 pitfalls you want to avoid

Christian Hansen | September 23, 2022 · 8 min

This article covers some of the pitfalls of early lifetime value (LTV) prediction that you should avoid in order to get the most out of value-based event optimization or value-based bidding strategies offered by platforms such as Meta, Google, or TikTok.

The task of early LTV prediction can be phrased as:

Get the best estimate of a user’s LTV, using as few hours of user activity as possible.

The first part is easily understood: the better the estimates, the better your ad platform can focus the ads on high-LTV customers. But for the second part, why is it essential to use as few hours as possible? By reducing the time between a user first appearing and you sending an optimization event back to the ad platform, the platform can learn to identify and target valuable customers earlier. In addition, if the event comes too late, the ad platform may be unable to use it at all because of restrictions in its optimization algorithms and limitations on its ability to match the user correctly. Thus, it is essential not only to provide accurate estimates of a given user's LTV, but to do so quickly. This in turn means you need to model the certainty of your predictions: on ad platforms that don't allow you to send updated predictions, you'll want to hold off on sending a prediction while it is still uncertain. If you send your predictions when your certainty is low, you risk optimizing towards low-LTV users!
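One simple way to get such an uncertainty estimate (the article does not prescribe a specific method, so this is just one option) is to train quantile regressors alongside your point estimate and use the width of the resulting prediction interval as a certainty signal. The sketch below uses scikit-learn's gradient boosting with a quantile loss; the features and target are synthetic placeholders.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Placeholder data: early-activity features (e.g. events in the first N hours)
# and a realized (or proxy) lifetime value target.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))
y_train = np.maximum(0, X_train[:, 0] * 10 + rng.normal(size=1000))

# Point estimate plus lower/upper quantiles of the LTV distribution.
models = {
    "median": GradientBoostingRegressor(loss="quantile", alpha=0.5),
    "lower": GradientBoostingRegressor(loss="quantile", alpha=0.1),
    "upper": GradientBoostingRegressor(loss="quantile", alpha=0.9),
}
for model in models.values():
    model.fit(X_train, y_train)

X_new = rng.normal(size=(1, 5))
pred = {name: float(m.predict(X_new)[0]) for name, m in models.items()}

# A wide 10-90% interval relative to the median signals low certainty.
interval_width = pred["upper"] - pred["lower"]
print(pred, "interval width:", interval_width)
```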

At its core, LTV prediction is a classical time series regression task. At a given prediction time, the user has performed a number of interactions on your platform or service (ideally, they have also completed an onboarding workflow that provides descriptive user information), and you want to predict all future value the user may generate.

This seems relatively straightforward for most data scientists. You simply construct a time-based dataset, pick your favorite classical time series model such as ARIMA (or get fancy and employ a recurrent neural network or transformer-based architecture), feed it the data, and try to predict the user's future value. What could go wrong? Below we will go over some of the problems that can crop up.

Sending the predictions to the ad platform too slowly

So you have trained and validated your favorite model and it performs well. The loss on the test set is low, and you even took the time to look at some ranking metrics to make sure you're actually able to differentiate between the best and the worst users. This sanity check can be necessary because the vast majority of users will often have a lifetime value of 0, and naively applying only regression metrics can hide the fact that you've simply trained a model that predicts every user to have a value of 0. But now you face a big problem: for some users you may be able to confidently predict their LTV after a few hours, while others may require multiple days before the model is sufficiently certain. How do you control the trade-off between sending the event after 4 or 24 hours of data? There is no simple answer to this dilemma; the optimal choice depends both on the ad platform, since some tolerate delays better than others, and on the exact improvement in prediction quality gained from an additional delay. If you don't have past experience with the ad platforms to fall back on, rigorous experimentation is a necessity. One way to operationalize the trade-off is sketched below.
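As an illustration only (the article does not prescribe a specific rule), a simple policy is to send the event as soon as the prediction's relative uncertainty drops below a threshold, and to force a send at a maximum delay regardless. The threshold, the maximum delay, and the uncertainty measure (for example the quantile interval from the previous sketch) are all assumptions you would tune per ad platform.

```python
def should_send_event(
    predicted_ltv: float,
    interval_width: float,
    hours_since_signup: float,
    max_relative_uncertainty: float = 0.5,  # assumed threshold, tune per platform
    max_delay_hours: float = 24.0,          # assumed hard deadline, tune per platform
) -> bool:
    """Decide whether to send the value event now or wait for more data."""
    if hours_since_signup >= max_delay_hours:
        # Past the deadline: send the best estimate we have.
        return True
    if predicted_ltv <= 0:
        # Nothing to optimize towards yet; keep waiting.
        return False
    # Send early only if the prediction interval is narrow relative to the estimate.
    return interval_width / predicted_ltv <= max_relative_uncertainty


# Example: a confident prediction after 6 hours gets sent immediately.
print(should_send_event(predicted_ltv=40.0, interval_width=10.0, hours_since_signup=6))
```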

Deployed model performance differs from your offline evaluation

You have deployed a model trained on users over a large variety of time horizons, and offline evaluation convinced you that the model was ready to be deployed. Still, at inference time after deployment, the predictions differ significantly from those you saw in your offline evaluation. One of the following reasons is usually the cause:

  1. At training time, you correctly used the event timestamps to filter events so that you only have the first 4 hours of data for each user, but you forgot that some of the underlying tables are updated on a schedule - for example every 1 or 2 hours, because your ETL pipeline runs as batch jobs. The distribution of the data fed to your deployed model therefore differs from the one seen during training, and the deployed model's predictions may be quite wrong as a result.

  2. You allowed leakage from future data during training. For example, you used your company's user tables for the predictions, as they have a set of nice user features that you observed to be highly predictive. What you may have forgotten is that these features are updated continuously, which means they can leak future information: data that only becomes available later sneaks into training because the features do not carry explicit timestamps. A point-in-time correct way of building the training data is sketched after this list.
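As a minimal sketch (the table and column names are assumptions, not the article's), one way to avoid both problems is to build training rows strictly from data as it would have looked at prediction time: events are cut off at signup time plus the prediction horizon, and user attributes are taken from a timestamped snapshot table rather than a continuously updated one.

```python
import pandas as pd

PREDICTION_HORIZON = pd.Timedelta(hours=4)

# Hypothetical tables; in practice these would come from your warehouse.
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2022-01-01 01:00", "2022-01-01 09:00", "2022-01-02 02:00"]),
    "value": [5.0, 20.0, 3.0],
})
users = pd.DataFrame({
    "user_id": [1, 2],
    "signup_time": pd.to_datetime(["2022-01-01 00:00", "2022-01-02 00:00"]),
})
# Snapshots of user attributes, each tagged with the time it was taken.
user_snapshots = pd.DataFrame({
    "user_id": [1, 1, 2],
    "snapshot_time": pd.to_datetime(["2022-01-01 02:00", "2022-01-05 00:00", "2022-01-02 03:00"]),
    "segment": ["trial", "paying", "trial"],
})

# 1. Keep only events inside the prediction window for each user.
windowed = events.merge(users, on="user_id")
windowed = windowed[windowed["event_time"] <= windowed["signup_time"] + PREDICTION_HORIZON]
features = windowed.groupby("user_id")["value"].sum().rename("value_first_4h").reset_index()

# 2. Join only attribute snapshots taken before the prediction time.
cutoff = users.assign(cutoff=users["signup_time"] + PREDICTION_HORIZON)[["user_id", "cutoff"]]
snaps = user_snapshots.merge(cutoff, on="user_id")
snaps = snaps[snaps["snapshot_time"] <= snaps["cutoff"]]
latest = snaps.sort_values("snapshot_time").groupby("user_id").tail(1)[["user_id", "segment"]]

training_rows = users.merge(features, on="user_id", how="left").merge(latest, on="user_id", how="left")
print(training_rows)
```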

Incomplete follow-up

When you trained the model, you did the straightforward thing of looking at a bunch of users from, e.g., January to June, found their events in the first 12 hours, and computed the total value of purchases each user made afterwards. On the surface this looks correct, but assuming it is currently September, some of the users have had over twice as long to accumulate purchases compared to the users at the end of the period! What we see here is the classic problem of right-censored data: not all users have been observed for the same amount of time. Of course, we could naively fix this by only using users from January, so each user would have had eight months to accumulate value, but then we throw away a lot of useful information. A preferred training strategy is to use techniques from survival modeling, which allow us to utilize users with varying observation periods and get the most accurate predictions.
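A hedged sketch of the first step such a strategy needs (the exact modeling technique is out of scope here, and the horizon and column names are assumptions): for each user, record the observed value, how long they have been observed, and whether the follow-up window is complete, so that recently acquired users are treated as censored rather than as genuinely low-value.

```python
import pandas as pd

FOLLOW_UP = pd.Timedelta(days=180)  # assumed target horizon for "lifetime" value
NOW = pd.Timestamp("2022-09-01")

users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "signup_time": pd.to_datetime(["2022-01-15", "2022-05-20", "2022-08-10"]),
    "value_so_far": [120.0, 35.0, 5.0],
})

# How long each user has actually been observed, capped at the target horizon.
elapsed = NOW - users["signup_time"]
users["observed_days"] = elapsed.where(elapsed < FOLLOW_UP, FOLLOW_UP).dt.days
# True only if the full follow-up window has elapsed; otherwise the value is right-censored.
users["fully_observed"] = elapsed >= FOLLOW_UP
print(users)
```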

Data drift

You have made sure you have a model that works at several time horizons from hours to days, you have verified that your data looks the same at inference time as during training, and your training strategy takes into account each user's individual observation period. What can now go wrong? The distribution of new users suddenly differs due to a change in marketing strategy, or a major update goes live and changes the behavior of the users greatly, and suddenly the trained model is just plain wrong in its predictions. It is important to detect these data drifts so we don't send out wrong predictions, but we also need to handle them. As the data has just drifted, we will only have a few users to train a new model on, and none of these will have a long observation period for accumulating value. Here it is paramount to be able to assess whether the existing model can be updated to reflect the new users (e.g. by removing the part of the feature space that is no longer relevant due to changes on the platform, and hoping the rest stays the same), or whether a new model has to be trained. If a new model has to be trained, it is important to be able to relate short-term metrics (e.g., initial activity or general product interest) to LTV, so the model can switch targets: instead of predicting LTV directly, it predicts valid proxies that correlate strongly with LTV and can be used for a period while new data is collected. A simple way to detect such a drift is sketched below.
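As a minimal sketch of the detection side (the feature, window sizes, and significance threshold are assumptions), a two-sample Kolmogorov-Smirnov test comparing a feature's distribution in a recent window against a training reference window can flag when new users no longer look like the users the model was trained on.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Placeholder data: a single early-activity feature (e.g. sessions in the first 4 hours)
# for users in the training reference window and for users from the last few days.
reference_window = rng.poisson(lam=3.0, size=5000)
recent_window = rng.poisson(lam=5.0, size=800)  # behavior shifted after a product update

statistic, p_value = ks_2samp(reference_window, recent_window)

ALPHA = 0.01  # assumed significance threshold
if p_value < ALPHA:
    print(f"Possible data drift detected (KS statistic={statistic:.3f}, p={p_value:.2g})")
else:
    print("No significant drift detected")
```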

Adversarial attacks (less susceptible than standard conversion events)

We have a great model, and we have evaluated it using all the best regression and ranking metrics we can come up with. In addition, we have tested it in a lot of scenarios, from suddenly losing access to certain data sources to changes in distribution. We now feel very confident that we can reliably send events to the ad platforms. One test we may have forgotten is how susceptible our model is to adversarial attacks from competitors, which aim to corrupt the type of new users an ad campaign finds based on the events we send to the ad platform. The risk arises if the model learns to pick up on behavior that can be reproduced by an adversarial bot. For example, a TV streaming service offers a free trial, and a user's initial consumption is strongly correlated with their future LTV. In this case, a bot could simply sign up for a free trial and watch a lot of TV shows, which the model would mistakenly interpret as a high-value user, and an event would be sent to the ad platform to find more of this type of user. Here we have two aspects to consider, one from a data point of view and one from a modeling point of view:

  1. If the data distribution over new users changes significantly with no explanation, it may be due to an adversarial attack, and care should be taken.

  2. If the model can predict a user to be of high value without them making any actual payments, it is at risk of being abused. A simple first-pass check combining both aspects is sketched after this list.
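As an illustration only (the threshold and column names are assumptions, and this is far from the robust outlier detection mentioned below), a crude first-pass guard is to hold back events for users whose predicted LTV is high even though they have made no payments, and to review them before anything is sent to the ad platform.

```python
import pandas as pd

HIGH_VALUE_THRESHOLD = 100.0  # assumed cutoff for a "suspiciously high" predicted LTV

scored_users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "predicted_ltv": [150.0, 20.0, 300.0],
    "total_payments": [49.0, 0.0, 0.0],
})

# Flag users with a high predicted value but no actual payments for manual review
# instead of sending their events straight to the ad platform.
suspicious = scored_users[
    (scored_users["predicted_ltv"] >= HIGH_VALUE_THRESHOLD)
    & (scored_users["total_payments"] == 0)
]
safe_to_send = scored_users.drop(suspicious.index)

print("Hold for review:\n", suspicious)
print("Send events for:\n", safe_to_send)
```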

Note that even though predicted LTV is subject to adversarial attacks, it is significantly more robust than using a standard conversion event, such as a single free trial event. To prevent and detect such attacks, robust outlier detection methods are necessary, which we will dive into in a future post.
