Essential Data Guide for Predictive LTV Modeling
Predictive LTV modeling requires specific data inputs to optimize user acquisition strategies. We frequently address questions about our data requirements. This guide explains which data is essential, important, and nice-to-have for our machine learning models, while also addressing common concerns about data sharing and privacy.
When discussing data requirements with our clients, we focus on three areas:
What data to share - identifying the essential, important, and nice-to-have data types for our predictive models.
How much data we need and at what frequency - outlining historical data requirements and update intervals for optimal model performance.
Privacy aspects - addressing concerns about data security, anonymization, and compliance with regulations like GDPR to ensure responsible data handling.
❗ This article provides a general overview of the data Churney requires to deliver its value. For detailed technical data requirements, please refer to the comprehensive guide shared with you during the onboarding process.❗
Sharing is caring
Our causal prediction models thrive on log-level data: the more granular, the better. They are designed to process vast amounts of detailed user interactions, transactions, and behaviors. However, not all data is created equal; certain types of information matter more than others when predicting customer lifetime value. But before we dive into what data to share, let’s first align on some basics:
Preparing data
Data location: Due to data privacy concerns, it's essential to confirm the geographic region where your data is stored before onboarding begins. Changing regions later requires a complete do-over by our engineers, so please validate your region before communicating it to us.
PII data: We wish to avoid pulling any human-readable PII from our customers. Any PII required downstream that is shared with us must be hashed first. There are standard hashing requirements you should adhere to so that signal activation (or pLTV event sharing) with the networks functions properly.
Raw data instead of transformed data: We prefer data in its rawest form, coming straight from the sources, without heavy downstream transformations. This way we can ensure there are no row mutations and no information leakage from the future, both of which may negatively impact our models.
Timely Event Recording: Key events cannot be used if they are not added to your data warehouse (DWH) right after they occur. Historical backfilling of events is not sufficient.
Proper Timestamp Management: Every record must carry a timestamp so that we can tell what was known at any point in time and prevent information leaking from the future.
Clear Payment Tracking: Payment/Purchase events must be clearly identifiable and we need to know exactly where to find them in your data. Having an accurate revenue understanding is key for pLTV.
Firebase logs: Firebase is key for capturing user behavior and matching IDs. When taking the bulk (daily batch) export route, there is a hard limit of 1,000,000 events per day for free replication to BigQuery (BQ). Exceeding this limit requires Google Analytics 360, which is costly. To avoid this cost, note that the event limit does not apply to streaming export.
User ID Mapping: To create a unified view of each user’s journey, we need to map user IDs across all data sources. This requires consistent user identification throughout all sources, with internal user IDs stored in Firebase.
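The timely-recording and timestamp requirements above can be illustrated with a minimal point-in-time filter. This is a sketch under stated assumptions, not a description of Churney's actual pipeline: the field names `event_time` and `ingested_at` are hypothetical, and it assumes every row records both when the event happened and when it landed in the warehouse.

```python
from datetime import datetime

def visible_at(events, cutoff):
    """Return only the events that were actually known at `cutoff`.

    A row counts only if it happened AND was recorded by the cutoff,
    so late backfills cannot leak future information into training data.
    """
    return [
        e for e in events
        if e["event_time"] <= cutoff and e["ingested_at"] <= cutoff
    ]

# Hypothetical sample rows.
events = [
    # Recorded shortly after it occurred: usable.
    {"name": "purchase",
     "event_time": datetime(2024, 1, 1, 10),
     "ingested_at": datetime(2024, 1, 1, 11)},
    # Backfilled a month later: not visible at the January cutoff.
    {"name": "trial_start",
     "event_time": datetime(2024, 1, 1, 9),
     "ingested_at": datetime(2024, 2, 1, 0)},
]

cutoff = datetime(2024, 1, 2)
known = visible_at(events, cutoff)
print([e["name"] for e in known])
```

The backfilled `trial_start` row is excluded even though the event itself predates the cutoff, which is why backfilled history cannot stand in for events recorded as they occur.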
How much? How frequent?
Model creation data: A minimum of 3 months of historical data is required, but we strongly recommend providing more, ideally a full year. For prediction horizons of 90 days or longer, the general rule of thumb is that we require 4 times the prediction horizon in historical raw data. For example, to predict 90-day future value, roughly 1 year of historical data (4 × 90 = 360 days) is necessary to build an accurate model.
Daily Data Refresh Required: Data must be updated in your warehouse at least once daily. Ideally, updates should occur multiple times per day (3-12). While real-time updates are not mandatory, updates less frequent than daily are not sufficient.
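The history and refresh rules above reduce to simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes the 4x rule applies from a 90-day horizon upward (matching the 90-day example), with the 3-month minimum used for shorter horizons.

```python
from datetime import datetime, timedelta

MIN_HISTORY_DAYS = 90  # the 3-month minimum stated above

def required_history_days(horizon_days):
    """Days of raw history suggested for a given prediction horizon."""
    if horizon_days >= 90:
        # Rule of thumb: 4 times the prediction horizon.
        return 4 * horizon_days
    return MIN_HISTORY_DAYS

def is_fresh(last_update, now, max_age=timedelta(days=1)):
    """True if the warehouse was updated within the last day."""
    return now - last_update <= max_age

print(required_history_days(90))   # 360 days, roughly one year
print(required_history_days(180))  # 720 days
print(is_fresh(datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 23)))
```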
Data types - the good, the bad & the ugly
Must have: User identifiers
User ID bindings between tables must be accurate. Before we detail the potential flavors of device, user, and customer IDs, note that whichever leading ID is used in each table, its bindings to the other tables must be accurate for Churney to connect the dots at the customer or user level.
Email address, phone number: Hash any column whose contents could reveal the identity of a user, including email address and phone number.
Device IDs and any other user identifiers (customer UID, ad platform IDs or otherwise MMP defined IDs)
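As a rough illustration of hashing identifier columns before sharing, the sketch below normalizes and hashes an email address and a phone number. The choice of SHA-256, the normalization rules (trim and lowercase for emails, digits with a leading `+` for phones), and the function names are assumptions for illustration; the exact requirements are in the onboarding guide and the relevant network's documentation.

```python
import hashlib
import re

def hash_email(email):
    """Trim, lowercase, then SHA-256 the email (assumed normalization)."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def hash_phone(phone):
    """Keep digits only, prefix '+', then SHA-256 (assumed normalization).

    Assumes the number already includes its country code.
    """
    digits = re.sub(r"\D", "", phone)
    return hashlib.sha256(("+" + digits).encode("utf-8")).hexdigest()

# Differently formatted inputs map to the same hash after normalization.
print(hash_email("  User@Example.com ") == hash_email("user@example.com"))
print(hash_phone("+1 (555) 010-9999") == hash_phone("15550109999"))
```

Normalizing before hashing matters because a hash of `User@Example.com` and a hash of `user@example.com` would otherwise never match, which breaks ID matching across sources.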
Must have: Revenue-bearing transactions
Payments: Whether from product purchases, in-app purchases (IAPs), subscriptions, subscription renewals, etc. Essentially, we’re looking for any revenue-bearing transaction event. Free trials do not fall under this classification.
Refunds: Refund events are just as key to prediction success as revenue events. Without them, our models may overestimate users who were refunded and are actually worth less.
IAA Revenue: Revenue generated from in-app advertising (ad monetization), primarily relevant for mobile businesses. In order of preference: impression-level data updated multiple times per day, impression-level data updated daily, or user-level data updated daily.
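To see why refunds matter as much as payments, consider a minimal net-revenue aggregation. The record layout (`user_id`, `type`, `amount`) and the type labels are hypothetical; the point is only that a refunded user's value should net out rather than count as revenue.

```python
from collections import defaultdict

def net_revenue_by_user(transactions):
    """Sum revenue per user, subtracting refunds."""
    totals = defaultdict(float)
    for t in transactions:
        sign = -1.0 if t["type"] == "refund" else 1.0
        totals[t["user_id"]] += sign * t["amount"]
    return dict(totals)

# Hypothetical sample rows.
transactions = [
    {"user_id": "u1", "type": "payment", "amount": 10.0},
    {"user_id": "u1", "type": "refund", "amount": 10.0},
    {"user_id": "u2", "type": "payment", "amount": 5.0},
    {"user_id": "u2", "type": "ad_revenue", "amount": 0.25},
]

print(net_revenue_by_user(transactions))
```

Without the refund row, user `u1` would look like a 10.0 customer when their net value is zero.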
Event level user interactions & telemetry
Key business interactions, including non-revenue events such as trial sign-ups, trial churns, support interactions, and membership enrollments.
Event-level actions performed by users, such as adding to cart, completing a game level, signing up for a newsletter, and similar activities.
Event-level telemetry data, including sessions, app opens, page views, and similar interactions.
Zero- and first-party data
Device information including details about the user's device, such as the device type (e.g., smartphone, tablet, desktop), operating system, browser, screen resolution, model, etc.
User Demographics data including basic information such as age, gender, income level, education, occupation and similar.
Know Your Customer (KYC) or any other type of information provided by the user about themselves.
Acquisition and Attribution Data
Information about user acquisition sources, such as campaign, channel, or platform.
Attribution data that links users to specific marketing efforts, including MMP (Mobile Measurement Partner) tags, UTM parameters, or ad network identifiers.
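As one small example of attribution data, UTM parameters can be pulled out of a landing-page URL with the Python standard library. This is an illustrative sketch only; the function name and the sample URL are hypothetical, and it keeps just the first value of each `utm_` parameter.

```python
from urllib.parse import urlparse, parse_qs

def extract_utm(url):
    """Return the utm_* query parameters of a URL as a flat dict."""
    query = parse_qs(urlparse(url).query)
    return {k: v[0] for k, v in query.items() if k.startswith("utm_")}

url = "https://example.com/landing?utm_source=google&utm_medium=cpc&utm_campaign=spring&ref=1"
print(extract_utm(url))
```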
Providing the right data in the right format is critical to unlocking the full potential of Churney's predictive LTV models. For any questions or support during the onboarding process, our team is here to assist and ensure a smooth and compliant data integration experience.