πŸ“Š Data Science Lab
πŸ“š Quick Guide β€” Why Data Cleaning Matters
🚫 Missing Values

Sensors fail, users skip fields, records get lost. Missing data shrinks your sample and can bias every statistic. Mean and median shift, variance is underestimated, and models trained on incomplete data learn wrong patterns.
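A tiny sketch of that bias, using invented sensor readings where the failure pattern is systematic (the sensor drops the hottest readings), so the observed mean underestimates the truth:

```python
from statistics import mean

# Hypothetical temperature readings; suppose the sensor
# fails on hot days, so the two highest values are lost.
full = [12.0, 14.5, 18.9, 13.2, 19.4, 15.1, 14.0]
observed = [12.0, 14.5, 13.2, 15.1, 14.0]

print(f"true mean:     {mean(full):.2f}")      # 15.30
print(f"observed mean: {mean(observed):.2f}")  # 13.76
```

If values were missing completely at random instead, the observed mean would stay unbiased β€” only the sample would shrink.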

πŸ“ˆ Outliers

A single extreme value can drag the mean far from the true centre and inflate the standard deviation. In regression, one high-leverage outlier can tilt the entire fitted line, collapsing RΒ² and inflating MSE.
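A quick illustration with made-up numbers: one bad reading roughly doubles the mean and inflates the sample standard deviation many times over.

```python
from statistics import mean, stdev

clean = [10, 11, 9, 10, 12, 10, 9, 11]   # made-up sample near 10
with_outlier = clean + [100]             # one bad reading

print(f"mean:  {mean(clean):.2f} -> {mean(with_outlier):.2f}")   # 10.25 -> 20.22
print(f"stdev: {stdev(clean):.2f} -> {stdev(with_outlier):.2f}")
```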

πŸ“‹ Duplicates

Copy-paste errors or merge bugs create duplicate rows. This over-represents certain points, biasing averages toward repeated values and giving the model false confidence in specific regions.
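A minimal sketch (the rows are invented): a merge bug repeats one row three times, pulling the average toward the duplicated value; `dict.fromkeys` gives an order-preserving dedup.

```python
from statistics import mean

# Hypothetical order totals; a merge bug repeated row "b" three times.
rows = [("a", 10), ("b", 50), ("b", 50), ("b", 50), ("c", 30)]
deduped = list(dict.fromkeys(rows))   # keeps first occurrence, preserves order

print(f"mean with duplicates: {mean(v for _, v in rows):.1f}")     # 38.0
print(f"mean after dedup:     {mean(v for _, v in deduped):.1f}")  # 30.0
```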

βš–οΈ Scaling & ⚑ Noise

When features live on wildly different scales (e.g. salary in thousands vs. age in tens), distance-based algorithms such as k-NN and k-means are dominated by the larger-scale feature. Random noise spikes add variance and weaken the signal-to-noise ratio.
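To see the dominance effect, compare Euclidean distances on hypothetical (salary, age) points: salary differences swamp age differences entirely.

```python
import math

# Hypothetical (salary_in_dollars, age_in_years) points.
a = (50_000, 25)
b = (50_500, 65)   # similar salary, very different age
c = (60_000, 26)   # similar age, very different salary

def euclid(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Salary dominates: a looks "closer" to b despite the 40-year age gap.
print(euclid(a, b))   # ~501.6
print(euclid(a, c))   # ~10000.0
```

After scaling both features to comparable ranges, the age gap would dominate the a–b distance instead.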

🧹 Imputation (Mean / Median)

Mean imputation preserves the overall average but shrinks variance, since every filled value sits at the centre. Median imputation is more robust to outliers. Both beat dropping rows when data is scarce, but neither is safe when missingness is systematic.
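Both claims are easy to check on a small invented sample with gaps: the mean survives mean-filling exactly, while the population spread shrinks.

```python
from statistics import mean, median, pstdev

data = [4.0, None, 6.0, None, 5.0, 9.0]   # made-up sample with gaps
observed = [x for x in data if x is not None]

mean_filled = [x if x is not None else mean(observed) for x in data]
median_filled = [x if x is not None else median(observed) for x in data]

# The average survives, but the spread shrinks:
print(mean(mean_filled) == mean(observed))     # True
print(pstdev(mean_filled) < pstdev(observed))  # True
```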

πŸ“‰ Z-Score & IQR Filtering

Z-Score: removes points more than k standard deviations from the mean (default k = 3). Works well for roughly normal data. IQR: computes the interquartile range (IQR = Q3 βˆ’ Q1) and flags anything outside [Q1 βˆ’ 1.5Β·IQR, Q3 + 1.5Β·IQR]. More robust to skewed distributions.
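Both rules fit in a few lines (the sample is invented). Note what happens on this tiny sample: the outlier inflates the mean and standard deviation enough to mask itself from the k = 3 z-score rule, while the IQR rule, built on quartiles, still catches it.

```python
from statistics import mean, pstdev, quantiles

def zscore_filter(xs, k=3.0):
    mu, sigma = mean(xs), pstdev(xs)
    return [x for x in xs if abs(x - mu) <= k * sigma]

def iqr_filter(xs, k=1.5):
    q1, _, q3 = quantiles(xs, n=4)   # Q1, median, Q3
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [x for x in xs if lo <= x <= hi]

data = [10, 12, 11, 13, 9, 11, 10, 12, 120]
print(iqr_filter(data))     # 120 is flagged and removed
print(zscore_filter(data))  # 120 survives k=3: it inflated sigma itself
```

That masking effect is one reason the IQR rule is preferred for small or heavily contaminated samples.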

πŸ“ Normalize & Standardize

Normalize (min-max): rescales to [0, 1], preserving relative distances. Best when you know the data bounds, but sensitive to outliers, since the min and max set the range. Standardize (z-score): centres at mean = 0, std = 1. Preferred for algorithms that are sensitive to feature scale or assume roughly centred inputs (PCA, SVM, neural nets).
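Both transforms are one-liners over a list (the ages are a made-up feature); min-max pins the extremes to 0 and 1, standardization leaves mean 0 and population std 1.

```python
from statistics import mean, pstdev

def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

ages = [20, 30, 40, 50]   # made-up feature
print(min_max(ages))      # [0.0, 0.333..., 0.666..., 1.0]
print(standardize(ages))  # mean 0, population std 1
```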