What is leakage (data)?

Why it matters

Growth teams greenlight pLTV pilots on backtests that look excellent. After launch, platform CPA rises, cohort quality softens, or calibration drifts because the model learned shortcuts unavailable in production. Leakage is a leading cause of that "great offline, weak online" gap.

Leakage is subtle in marketing data because outcomes are delayed. A feature built from "total revenue to date" without a strict as-of timestamp can smuggle future repeat purchases into an early score. Joining attribution tables that carry post-conversion campaign edits, or training on net revenue before returns finalize, creates the same problem.

For performance marketers, leakage is not only a data science issue. It determines whether signal optimization is trustworthy. If training saw the future, calibration against realized LTV will look good until customer mix shifts and the shortcut breaks.

Leakage (data)

Preventing leakage is a prerequisite across the pLTV stack:

Data warehouse (input): Build point-in-time feature tables: only events and revenue known at or before the anchor event timestamp for each user.
Model (Churney): Train user-level pLTV with explicit prediction horizons (D7, D30, D90) and holdout cohorts that respect time ordering.
Signal design: Apply signal transformation, caps, and conservative early values so production scores stay defensible even when early proxies are noisy.
Activation (output): Send predicted values directly to ad networks via Meta Conversions API (CAPI) or Google Ads conversion uploads, scored using only features available at conversion time.
Readout: Monitor calibration, model drift, and incremental ROAS vs business as usual (BAU); leakage often appears as a sudden calibration breakdown after a mix shift.

The data warehouse must support as-of joins, not just current-state snapshots. Storage is only the input to modeling; whether activation signals generalize to live traffic depends on the leakage controls applied to that input.
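A minimal sketch of the difference between a current-state snapshot and a point-in-time feature. The event rows, user IDs, and the `revenue_as_of` helper are hypothetical and only illustrate the as-of filter; a real warehouse would do the same thing with an as-of join.

```python
from datetime import datetime

# Hypothetical event rows: (user_id, event_time, revenue).
EVENTS = [
    ("u1", datetime(2024, 1, 1, 10), 20.0),  # first purchase (the anchor event)
    ("u1", datetime(2024, 1, 5, 9), 35.0),   # repeat order after the anchor
    ("u2", datetime(2024, 1, 2, 12), 15.0),
]

def revenue_as_of(events, user_id, as_of):
    """Sum revenue for one user using only events at or before `as_of`."""
    return sum(
        revenue
        for uid, event_time, revenue in events
        if uid == user_id and event_time <= as_of
    )

anchor = datetime(2024, 1, 1, 10)

# Leaky: "revenue to date" read from today's snapshot includes the repeat order.
leaky = sum(revenue for uid, _, revenue in EVENTS if uid == "u1")

# Point-in-time: only what was known at the anchor timestamp.
safe = revenue_as_of(EVENTS, "u1", anchor)

print(leaky, safe)  # 55.0 20.0
```

The leaky value would hand the model the user's day-4 repeat order when scoring day 0, which is exactly the shortcut that disappears in production.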


Category variants

Leakage pattern	Where it shows up	Fix direction
Future revenue in labels	Ecommerce repeat orders after day 7 scored as day 0	Horizon-specific labels with maturity cutoffs
Refund information	Net revenue feature includes returns filed weeks later	Separate refund models or delayed label refresh
Campaign metadata	Ad set budget or bid strategy after conversion	Freeze campaign fields at click timestamp
Subscription tenure	Full churn status at score time for trial-start anchor	Censored survival features at anchor only
Aggregate segment stats	Category LTV averages keyed to user before enough history	User-level features only
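The "horizon-specific labels with maturity cutoffs" fix in the first row could be sketched as follows. The function name, the example dates, and the choice to count the anchor purchase inside the window are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

def horizon_label(anchor, purchases, horizon_days, data_cutoff):
    """Return revenue within `horizon_days` of the anchor, or None when the
    window has not fully elapsed by `data_cutoff` (the label is immature)."""
    window_end = anchor + timedelta(days=horizon_days)
    if window_end > data_cutoff:
        return None  # exclude from training instead of undercounting revenue
    return sum(amount for ts, amount in purchases if anchor <= ts < window_end)

data_cutoff = datetime(2024, 3, 1)  # last day with complete data (assumed)
anchor = datetime(2024, 2, 10)
purchases = [
    (datetime(2024, 2, 10), 20.0),
    (datetime(2024, 2, 14), 30.0),
    (datetime(2024, 2, 25), 40.0),
]

d7 = horizon_label(anchor, purchases, 7, data_cutoff)    # window ends Feb 17
d30 = horizon_label(anchor, purchases, 30, data_cutoff)  # window ends Mar 11

print(d7, d30)  # 50.0 None
```

Returning `None` for the immature D30 window keeps half-observed users out of the training set, so the label means the same thing for every row.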

Common mistakes

Random train/test split by row. Splits must respect time and user, not shuffle events.
Using "revenue to date" without as-of logic. Future purchases leak into early scores.
Training on mature cohorts, scoring immature users without adjustment. Label horizon mismatch.
Including post-anchor support or refund flags. Operations data arrives too late for honest early features.
Tuning on platform-attributed ROAS. Optimization feedback can leak campaign outcomes into features.
Skipping leakage review in data readiness. Pilots start on inflated backtests.
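As a rough illustration of the first mistake, a split that respects both time and user might look like the sketch below. The row layout and the `temporal_user_split` helper are hypothetical; dropping repeat users from the test set is one possible policy, not the only one.

```python
from datetime import datetime

def temporal_user_split(rows, split_time):
    """Train on anchors before `split_time`, test on anchors at or after it,
    and drop test rows whose user already appears in the training set."""
    train = [r for r in rows if r["anchor_time"] < split_time]
    train_users = {r["user_id"] for r in train}
    test = [
        r for r in rows
        if r["anchor_time"] >= split_time and r["user_id"] not in train_users
    ]
    return train, test

rows = [
    {"user_id": "u1", "anchor_time": datetime(2024, 1, 5)},
    {"user_id": "u2", "anchor_time": datetime(2024, 1, 20)},
    {"user_id": "u1", "anchor_time": datetime(2024, 2, 3)},  # u1 seen in train
    {"user_id": "u3", "anchor_time": datetime(2024, 2, 10)},
]

train, test = temporal_user_split(rows, datetime(2024, 2, 1))
print([r["user_id"] for r in train])  # ['u1', 'u2']
print([r["user_id"] for r in test])   # ['u3']
```

A random row-level shuffle would have let the model train on later rows and be evaluated on earlier ones, and could place the same user on both sides of the split.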

Advertiser lens

Role	Cares about
Data science	Point-in-time features, temporal validation, leakage audits
Marketing analytics	Whether backtest windows match live anchor timing
UA / performance	Why live CPA diverges from pilot promises
Data engineering	As-of tables, event ordering, and label refresh SLAs

FAQ

What is data leakage in pLTV?

Using training information that would not be available when you score a user in production and send a value event to an ad platform.

How is leakage different from overfitting?

Overfitting memorizes noise in valid data. Leakage uses invalid future data, often producing unrealistically strong offline metrics.

What is a point-in-time feature?

A variable computed using only data known at or before a defined timestamp, usually the anchor event.

Can leakage come from attribution data?

Yes, if campaign or bid fields reflect post-conversion changes rather than state at click or install.
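One way to read campaign state as of the click, sketched with the standard library. The change log, the strategy names, and `state_as_of` are hypothetical and assume the warehouse keeps a history of campaign edits rather than only the latest value.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical change log for one ad set, sorted by effective time:
# (effective_from, bid_strategy).
HISTORY = [
    (datetime(2024, 1, 1), "lowest_cost"),
    (datetime(2024, 1, 20), "cost_cap"),
]

def state_as_of(history, ts):
    """Return the most recent value effective at or before `ts`, else None."""
    times = [effective_from for effective_from, _ in history]
    i = bisect_right(times, ts)
    return history[i - 1][1] if i > 0 else None

click_time = datetime(2024, 1, 10)

current_state = HISTORY[-1][1]                   # what a snapshot join returns
state_at_click = state_as_of(HISTORY, click_time)  # what the user actually saw

print(current_state, state_at_click)  # cost_cap lowest_cost
```

Joining on `current_state` would attach a bid strategy that did not exist when the user clicked.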

How do teams detect leakage?

Temporal backtests, feature audits, and comparing early-score calibration to mature cohort outcomes; sudden offline/online gaps are a red flag.
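The calibration comparison can be sketched as a ratio of predicted to realized LTV per cohort. The cohort keys and values below are made-up illustrative numbers, not results, and `calibration_by_cohort` is a hypothetical helper.

```python
from collections import defaultdict

def calibration_by_cohort(rows):
    """rows: iterable of (cohort, predicted_ltv, realized_ltv).
    Returns {cohort: sum(predicted) / sum(realized)}; 1.0 means calibrated."""
    predicted = defaultdict(float)
    realized = defaultdict(float)
    for cohort, p, r in rows:
        predicted[cohort] += p
        realized[cohort] += r
    # Skip cohorts with no realized revenue to avoid dividing by zero.
    return {c: predicted[c] / realized[c] for c in predicted if realized[c] > 0}

rows = [
    ("2024-01", 50.0, 50.0),
    ("2024-01", 30.0, 30.0),
    ("2024-02", 60.0, 40.0),  # scores overshoot once the cohort matures
    ("2024-02", 60.0, 40.0),
]

ratios = calibration_by_cohort(rows)
print(ratios)  # {'2024-01': 1.0, '2024-02': 1.5}
```

A ratio that sits near 1.0 in backtest cohorts and jumps in the first live cohorts is the kind of sudden offline/online gap worth auditing for leaked features.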

Does leakage affect server-side signals?

Yes. Over-optimistic pLTV scores sent via Meta Conversions API (CAPI) or Google Ads conversion uploads can destabilize platform learning when real users underperform the leaked training labels.

Not the same as

Term	Difference
Conversion signal loss	Events never reach the platform, not invalid training features
Model drift	Performance degrades over time on valid features; leakage is a training defect
Feedback loop (pLTV)	Live bidding changes acquisition mix; leakage is future data in historical training
Proxy metric	Intentional short-window stand-in, not accidental future information