Sensors fail, users skip fields, records get lost. Missing data shrinks your sample and can bias every statistic. Mean and median shift, variance is underestimated, and models trained on incomplete data learn wrong patterns.
A single extreme value can drag the mean far from the true centre and inflate the standard deviation. In regression, one outlier can tilt the entire fit line, destroying R² and inflating MSE dramatically.
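A quick sketch of the pull on the mean (NumPy assumed; the readings are made up):

```python
import numpy as np

# Five well-behaved readings plus one extreme value
clean = np.array([10.0, 11.0, 9.0, 10.5, 9.5])
dirty = np.append(clean, 100.0)

print(clean.mean())       # 10.0  -- the true centre
print(dirty.mean())       # 25.0  -- dragged by a single point
print(np.median(dirty))   # 10.25 -- the median barely moves
print(clean.std(), dirty.std())  # the std inflates by ~50x
```

The median's resistance here is exactly why robust statistics are preferred for dirty data.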
Copy-paste errors or merge bugs create duplicate rows. This over-represents certain points, biasing averages toward repeated values and giving the model false confidence in specific regions.
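A minimal pandas sketch of the bias (the column names and values are illustrative):

```python
import pandas as pd

# "b" was pasted in twice; every other row is unique
df = pd.DataFrame({"user": ["a", "b", "b", "c"],
                   "spend": [10.0, 50.0, 50.0, 30.0]})

biased = df["spend"].mean()                    # 35.0 -- pulled toward the repeated value
clean = df.drop_duplicates()["spend"].mean()   # 30.0 -- after dropping the duplicate row
```

`drop_duplicates` keeps the first occurrence of each fully identical row by default; for fuzzier duplicates you would deduplicate on a key subset instead.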
When features exist on wildly different scales (e.g. salary in thousands vs. age in tens), distance-based algorithms are dominated by the larger-scale feature. Random noise spikes add variance and weaken the signal-to-noise ratio.
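A toy example of scale dominance, using made-up [age, salary] vectors:

```python
import numpy as np

# [age, salary] -- salary is ~1000x larger in magnitude
a = np.array([30.0, 50_000.0])
b = np.array([60.0, 50_100.0])   # 30 years older, near-identical salary
c = np.array([31.0, 90_000.0])   # 1 year older, very different salary

# Raw Euclidean distance is dominated by the salary axis:
print(np.linalg.norm(a - b))   # ~104.4
print(np.linalg.norm(a - c))   # ~40000.0
```

By raw distance, `a` looks far closer to `b` (a 30-year age gap) than to `c` (a 1-year gap), purely because salary swamps age. Scaling both features first fixes this.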
Mean imputation preserves the overall average but reduces variance: every filled value sits at the centre. Median imputation is more robust to outliers. Both are better than dropping rows when data is scarce, but both bias results when missingness is systematic (e.g. when the highest values are the ones most likely to be missing).
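Both effects can be seen in a few lines (pandas assumed; the series is made up, with one outlier and one gap):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 100.0])

mean_filled = s.fillna(s.mean())       # observed mean = 26.75, pulled up by the 100
median_filled = s.fillna(s.median())   # observed median = 3.0, ignores the 100

print(s.mean(), mean_filled.mean())    # average preserved
print(s.var() > mean_filled.var())     # True -- variance shrinks after imputation
```

The variance drops because the filled point sits exactly at the mean, contributing zero deviation while still raising the count.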
Z-Score: removes points more than k standard deviations from the mean (default k=3). Works well for roughly normal data. IQR: uses the interquartile range (Q3 - Q1) and flags anything outside [Q1 - 1.5·IQR, Q3 + 1.5·IQR]. More robust to skewed distributions.
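Both filters can be sketched directly in NumPy (the data is synthetic: twenty well-behaved points plus one outlier):

```python
import numpy as np

x = np.array([10.0, 11.0, 12.0, 13.0] * 5 + [200.0])

# Z-score filter: keep points within k = 3 standard deviations of the mean
z = (x - x.mean()) / x.std()
kept_z = x[np.abs(z) < 3]

# IQR filter: keep points inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
kept_iqr = x[(x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)]

# Both drop only the 200.0 here
```

One caveat worth knowing: on very small samples the outlier inflates the mean and std it is judged against, so a z-score filter can fail to flag it; the IQR rule does not suffer from this as badly.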
Normalize (min-max): rescales to [0, 1], preserving relative distances. Best when you know the data bounds. Standardize (z-score): centres at mean=0, std=1. Preferred for algorithms that assume Gaussian inputs (PCA, SVM, neural nets).
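Both transforms are one-liners in NumPy (the array is arbitrary example data):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-max normalization: rescale to [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())   # [0.0, 0.25, 0.5, 0.75, 1.0]

# Z-score standardization: centre at mean 0, std 1
x_std = (x - x.mean()) / x.std()
```

In practice you would fit the min/max or mean/std on the training set only (e.g. with scikit-learn's `MinMaxScaler` / `StandardScaler`) and reuse those statistics on the test set, to avoid leaking information.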