Sensors fail, users skip fields, records get lost. Missing data shrinks your sample and can bias every statistic. Mean and median shift, variance is underestimated, and models trained on incomplete data learn wrong patterns.
A single extreme value can drag the mean far from the true centre and inflate the standard deviation. In regression, one outlier can tilt the entire fit line, destroying R² and inflating MSE dramatically.
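A quick sketch of the pull on the mean (NumPy assumed; the readings are made up):

```python
import numpy as np

# Five well-behaved readings plus one extreme value
clean = np.array([10.0, 11.0, 9.0, 10.5, 9.5])
dirty = np.append(clean, 100.0)

print(clean.mean())       # 10.0  -- the true centre
print(dirty.mean())       # 25.0  -- dragged by a single point
print(np.median(dirty))   # 10.25 -- the median barely moves
print(clean.std(), dirty.std())  # the std inflates by ~50x
```

The median's resistance here is exactly why robust statistics are preferred for dirty data.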
Copy-paste errors or merge bugs create duplicate rows. This over-represents certain points, biasing averages toward repeated values and giving the model false confidence in specific regions.
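A minimal pandas sketch of the bias (the column names and values are illustrative):

```python
import pandas as pd

# "b" was pasted in twice; every other row is unique
df = pd.DataFrame({"user": ["a", "b", "b", "c"],
                   "spend": [10.0, 50.0, 50.0, 30.0]})

biased = df["spend"].mean()                    # 35.0 -- pulled toward the repeated value
clean = df.drop_duplicates()["spend"].mean()   # 30.0 -- after dropping the duplicate row
```

`drop_duplicates` keeps the first occurrence of each fully identical row by default; for fuzzier duplicates you would deduplicate on a key subset instead.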
When features exist on wildly different scales (e.g. salary in thousands vs. age in tens), distance-based algorithms are dominated by the larger-scale feature. Random noise spikes add variance and weaken the signal-to-noise ratio.
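A toy example of scale dominance, using made-up [age, salary] vectors:

```python
import numpy as np

# [age, salary] -- salary is ~1000x larger in magnitude
a = np.array([30.0, 50_000.0])
b = np.array([60.0, 50_100.0])   # 30 years older, near-identical salary
c = np.array([31.0, 90_000.0])   # 1 year older, very different salary

# Raw Euclidean distance is dominated by the salary axis:
print(np.linalg.norm(a - b))   # ~104.4
print(np.linalg.norm(a - c))   # ~40000.0
```

By raw distance, `a` looks far closer to `b` (a 30-year age gap) than to `c` (a 1-year gap), purely because salary swamps age. Scaling both features first fixes this.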
Mean imputation preserves the overall average but reduces variance: every filled value sits at the centre. Median imputation is more robust to outliers. Both are better than dropping rows when data is scarce, but both bias results when missingness is systematic (e.g. when the highest values are the ones most likely to be missing).
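Both effects can be seen in a few lines (pandas assumed; the series is made up, with one outlier and one gap):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 100.0])

mean_filled = s.fillna(s.mean())       # observed mean = 26.75, pulled up by the 100
median_filled = s.fillna(s.median())   # observed median = 3.0, ignores the 100

print(s.mean(), mean_filled.mean())    # average preserved
print(s.var() > mean_filled.var())     # True -- variance shrinks after imputation
```

The variance drops because the filled point sits exactly at the mean, contributing zero deviation while still raising the count.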
Z-Score: removes points more than k standard deviations from the mean (default k=3). Works well for roughly normal data. IQR: uses the interquartile range (Q3 - Q1) and flags anything outside [Q1 - 1.5·IQR, Q3 + 1.5·IQR]. More robust to skewed distributions.
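Both filters can be sketched directly in NumPy (the data is synthetic: twenty well-behaved points plus one outlier):

```python
import numpy as np

x = np.array([10.0, 11.0, 12.0, 13.0] * 5 + [200.0])

# Z-score filter: keep points within k = 3 standard deviations of the mean
z = (x - x.mean()) / x.std()
kept_z = x[np.abs(z) < 3]

# IQR filter: keep points inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
kept_iqr = x[(x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)]

# Both drop only the 200.0 here
```

One caveat worth knowing: on very small samples the outlier inflates the mean and std it is judged against, so a z-score filter can fail to flag it; the IQR rule does not suffer from this as badly.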
Normalize (min-max): rescales to [0, 1], preserving relative distances. Best when you know the data bounds. Standardize (z-score): centres at mean=0, std=1. Preferred for algorithms that assume Gaussian inputs (PCA, SVM, neural nets).
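Both transforms are one-liners in NumPy (the array is arbitrary example data):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-max normalization: rescale to [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())   # [0.0, 0.25, 0.5, 0.75, 1.0]

# Z-score standardization: centre at mean 0, std 1
x_std = (x - x.mean()) / x.std()
```

In practice you would fit the min/max or mean/std on the training set only (e.g. with scikit-learn's `MinMaxScaler` / `StandardScaler`) and reuse those statistics on the test set, to avoid leaking information.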