ML Data Leakage
Introduction
Data leakage (or simply leakage) happens when your training data contains information about the target that will not be available when the model is used for prediction. This leads to high performance on the training set (and possibly even on the validation data), but the model will perform poorly in production.
There are two types of leakage: target leakage and train-test contamination.
Target leakage
Target leakage occurs when your predictors include data that will not be available at the time you make predictions. The key is timing: the question is not whether a column helps predict the target, but whether its value is determined before or after the target becomes known.
To understand this type of data leakage better, let's break it down with an example. Imagine you want to predict who will get sick with COVID-19 (a very common example). The data has the following format:
| got_covid | age | weight | gender | took_medicines |
|---|---|---|---|---|
| False | 35 | 80 | 0 | False |
| False | 23 | 75 | 1 | False |
| True | 79 | 100 | 0 | True |
People take medicines after getting COVID-19 in order to recover. The raw data therefore shows a strong relationship between those two columns, but took_medicines is typically set after the value of got_covid is determined. At prediction time, before anyone has been diagnosed, we won't know whether a person will take medicines, so a model that relies on this column will score well in validation but fail in production.
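A quick way to expose this kind of leak is to compare cross-validated scores with and without the suspicious column. Below is a minimal sketch, assuming a tiny made-up pandas DataFrame shaped like the table above; the absolute numbers mean nothing, the point is the pattern, where the leaky column produces suspiciously high scores.

```python
# Minimal sketch (hypothetical data): how a leaky column inflates
# validation scores. Column names match the table above.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.DataFrame({
    "got_covid":      [False, False, True, True, False, True, False, True],
    "age":            [35, 23, 79, 64, 41, 58, 30, 70],
    "weight":         [80, 75, 100, 90, 68, 85, 72, 95],
    "gender":         [0, 1, 0, 1, 1, 0, 0, 1],
    "took_medicines": [False, False, True, True, False, True, False, True],
})

y = df["got_covid"]

# Leaky features: took_medicines is recorded AFTER the diagnosis.
X_leaky = df.drop(columns=["got_covid"])
# Honest features: only what is known at prediction time.
X_clean = X_leaky.drop(columns=["took_medicines"])

model = RandomForestClassifier(n_estimators=100, random_state=0)
print("With leak:   ", cross_val_score(model, X_leaky, y, cv=4).mean())
print("Without leak:", cross_val_score(model, X_clean, y, cv=4).mean())
```

When a single feature drives validation accuracy to near-perfect levels like this, it is worth asking when that feature's value is actually recorded.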
Train-Test Contamination
Another type of leak occurs when you are not careful to distinguish training data from validation data. Most of the time, data is split into train and validation portions (a 70-30 split is common). Recall that validation is meant to measure how the model does on data it hasn't considered before. You can corrupt this process in subtle ways if the validation data affects the preprocessing behavior, for example by fitting a scaler or imputer on the full dataset before splitting. This is sometimes called train-test contamination.
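The most common instance is fitting a preprocessing step on all of the data before the split. Here is a minimal sketch on made-up data contrasting the contaminated approach with the safe one, where preprocessing lives inside a scikit-learn Pipeline so it is fit only on the training split.

```python
# Minimal sketch (made-up data): how preprocessing before the split
# contaminates validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# WRONG: the scaler is fit on ALL rows, so validation statistics
# leak into the features the model trains on.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_val, y_tr, y_val = train_test_split(X_scaled, y, random_state=0)
leaky = LogisticRegression().fit(X_tr, y_tr)
print("contaminated:", leaky.score(X_val, y_val))

# RIGHT: split first; the Pipeline fits the scaler on the training
# split only and merely applies it to the validation split.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
safe = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_tr, y_tr)
print("clean:       ", safe.score(X_val, y_val))
```

With plain standardization the effect is usually mild, but for steps such as imputation, feature selection, or target encoding, the contaminated version can look dramatically better in validation than it will ever be in production.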