Dataset Preparation & Cleaning


Dataset preparation and cleaning is a critical step in any AI/ML project, often taking 60–70% of total project time.

1. Importance of Dataset Preparation

  • ML models are only as good as the data they are trained on.
  • Poor-quality data → biased, inaccurate, or unstable models.
  • Clean, balanced, and well-structured data → higher performance, better generalization, ethical AI.

2. Steps in Dataset Preparation

Step 1: Data Collection

  • Sources:
    • Public datasets (Kaggle, Hugging Face Datasets, UCI Repository).
    • APIs (Twitter API, OpenAI API logs, etc.).
    • Enterprise databases (CRM, ERP, medical records).
    • Web scraping.
  • Ensure data privacy & compliance (GDPR, HIPAA).
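
For example, a public dataset can be pulled directly in Python. A minimal sketch, assuming pandas and the Hugging Face datasets library are installed (the CSV path is hypothetical):

import pandas as pd
from datasets import load_dataset

# Public dataset from the Hugging Face Hub (IMDB movie reviews)
imdb = load_dataset("imdb", split="train")
df_hub = imdb.to_pandas()

# Locally downloaded file, e.g., from Kaggle or the UCI Repository
df_csv = pd.read_csv("data/raw/customers.csv")   # hypothetical path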

Step 2: Data Integration

  • Combine data from multiple sources → single dataset.
  • Handle issues:
    • Schema mismatch (different column names, formats).
    • Data redundancy (duplicate entries).
    • Unit consistency (e.g., kg vs. lbs, INR vs. USD).
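
A minimal pandas sketch of these fixes; the file and column names are hypothetical:

import pandas as pd

sales_eu = pd.read_csv("sales_eu.csv")   # weight recorded in kg
sales_us = pd.read_csv("sales_us.csv")   # weight recorded in lbs

# Schema mismatch: align column names before combining
sales_us = sales_us.rename(columns={"wt_lbs": "weight_kg"})

# Unit consistency: convert lbs to kg
sales_us["weight_kg"] = sales_us["weight_kg"] * 0.453592

# Combine into a single dataset and drop redundant rows
combined = pd.concat([sales_eu, sales_us], ignore_index=True).drop_duplicates()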

Step 3: Data Cleaning

This is the core step.

(a) Handling Missing Data

  • Options:
    • Drop rows/columns (e.g., when more than ~70% of values are missing).
    • Imputation: mean, median, mode, KNN imputation.
    • Domain-specific filling (e.g., 0 for a missing transaction amount).
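
A sketch of each option independently, with hypothetical file and column names:

import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.read_csv("data.csv")   # hypothetical file

# Drop columns where more than 70% of values are missing
df = df.loc[:, df.isna().mean() <= 0.70]

# Domain-specific fill: a missing transaction amount means no transaction
df["transaction_amount"] = df["transaction_amount"].fillna(0)

# Median imputation for one numeric column (.ravel() flattens the 2-D output)
df["income"] = SimpleImputer(strategy="median").fit_transform(df[["income"]]).ravel()

# KNN imputation across all numeric columns
num_cols = df.select_dtypes("number").columns
df[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df[num_cols])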

(b) Handling Duplicates

  • Remove exact duplicates.
  • For near-duplicates: clustering or fuzzy matching.
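
A sketch of both, assuming the rapidfuzz library for fuzzy string matching (file and column names hypothetical):

import pandas as pd
from rapidfuzz import fuzz

df = pd.read_csv("companies.csv")   # hypothetical file

# Exact duplicates: keep the first occurrence
df = df.drop_duplicates()

# Near-duplicates: flag name pairs with high fuzzy similarity
# (pairwise comparison is O(n^2) — fine for small data, use clustering/blocking at scale)
names = df["name"].tolist()
near_dupes = [(a, b) for i, a in enumerate(names) for b in names[i + 1:] if fuzz.ratio(a, b) > 90]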

(c) Outlier Detection

  • Statistical methods: z-score, IQR.
  • ML methods: Isolation Forest, DBSCAN.
  • Decision: remove or cap outliers depending on domain.
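
A sketch of both approaches on a hypothetical 'amount' column:

import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("transactions.csv")   # hypothetical file
x = df["amount"]

# Statistical: the IQR rule flags points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = x.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

# ML-based: Isolation Forest labels anomalies as -1
ml_outliers = IsolationForest(contamination=0.01, random_state=42).fit_predict(df[["amount"]]) == -1

# Cap instead of remove: clip to the IQR bounds
df["amount"] = x.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)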

(d) Noise Reduction

  • Fix spelling errors and inconsistent labels (e.g., map “NYC” and “New York City” to one canonical label).
  • Apply text normalization for NLP: lowercasing, stemming, lemmatization.
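
A sketch of label canonicalization and basic text normalization; the mapping and column names are hypothetical, and lemmatization assumes NLTK with the WordNet data downloaded:

import pandas as pd
from nltk.stem import WordNetLemmatizer   # requires nltk.download("wordnet")

df = pd.read_csv("reviews.csv")   # hypothetical file

# Canonicalize inconsistent labels with an explicit mapping
df["city"] = df["city"].replace({"NYC": "New York City", "N.Y.C.": "New York City"})

# Lowercase, strip punctuation, then lemmatize each token
df["text"] = df["text"].str.lower().str.replace(r"[^\w\s]", "", regex=True)
lemmatizer = WordNetLemmatizer()
df["text"] = df["text"].apply(lambda t: " ".join(lemmatizer.lemmatize(w) for w in t.split()))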

Step 4: Data Transformation

  • Scaling/Normalization:
    • Min-Max scaling, Standardization (z-score).
    • Required for distance- and margin-based algorithms like SVM and K-means.
  • Encoding categorical data:
    • One-hot encoding, label encoding, embeddings.
  • Feature engineering:
    • Derived features (e.g., “Age group” from “DOB”).
  • Time-series preparation:
    • Rolling windows, lag features, seasonality decomposition.
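
A compact sketch of these transformations on a hypothetical sales table:

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("sales.csv")   # hypothetical: 'amount', 'region', 'dob', 'date' columns

# Scaling: z-score standardization (.ravel() flattens the 2-D output)
df["amount_std"] = StandardScaler().fit_transform(df[["amount"]]).ravel()

# Encoding: one-hot encode a categorical column
df = pd.get_dummies(df, columns=["region"])

# Feature engineering: derive an age group from date of birth
age_years = (pd.Timestamp.now() - pd.to_datetime(df["dob"])).dt.days // 365
df["age_group"] = pd.cut(age_years, bins=[0, 18, 35, 60, 120], labels=["minor", "young", "adult", "senior"])

# Time series: lag feature and 7-step rolling mean
df = df.sort_values("date")
df["amount_lag1"] = df["amount"].shift(1)
df["amount_roll7"] = df["amount"].rolling(window=7).mean()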

Step 5: Data Balancing

  • Imbalanced datasets lead to biased models.
  • Techniques:
    • Oversampling: SMOTE, ADASYN.
    • Undersampling: random undersampling, Tomek links.
    • Cost-sensitive learning: penalize misclassification of minority class.
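
A sketch using the imbalanced-learn (imblearn) library on synthetic data:

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

# Synthetic imbalanced dataset (~95% / 5%) for illustration
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=42)

# Oversample the minority class with SMOTE
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))

# Cost-sensitive alternative: weight classes inversely to their frequency
clf = LogisticRegression(class_weight="balanced").fit(X, y)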

Step 6: Data Splitting

  • Split into:
    • Training set (60–80%).
    • Validation set (10–20%).
    • Test set (10–20%).
  • Advanced: Cross-validation (k-fold, stratified CV).
  • Ensure no data leakage (e.g., future information or test-set rows leaking into training).
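
A sketch of a 70/15/15 stratified split plus a k-fold setup, assuming a DataFrame df with a 'label' column:

from sklearn.model_selection import train_test_split, StratifiedKFold

# Two-stage split: 70% train, then split the remainder evenly into validation and test
train, temp = train_test_split(df, test_size=0.30, stratify=df["label"], random_state=42)
val, test = train_test_split(temp, test_size=0.50, stratify=temp["label"], random_state=42)

# Stratified k-fold preserves class ratios in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)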

3. Advanced Considerations

(a) Dataset Bias

  • Check for demographic imbalance (gender, age, region).
  • Bias mitigation: re-sampling, re-weighting, fairness constraints.
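
A minimal re-weighting sketch, assuming a hypothetical 'gender' column; most scikit-learn estimators accept the resulting weights via fit(..., sample_weight=...):

import pandas as pd

df = pd.read_csv("applicants.csv")   # hypothetical file

# Check demographic balance
print(df["gender"].value_counts(normalize=True))

# Re-weight: each row gets a weight inversely proportional to its group's frequency
group_freq = df["gender"].map(df["gender"].value_counts(normalize=True))
df["sample_weight"] = 1.0 / group_freq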

(b) Large-Scale Data

  • For big datasets → distributed processing with Spark, Dask, Ray.
  • For LLM training → dataset sharding + streaming.
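
Two sketches: Dask for out-of-core tabular work, and Hugging Face streaming for large text corpora (both libraries assumed installed; file pattern hypothetical):

import dask.dataframe as dd
from datasets import load_dataset

# Dask: pandas-like operations on CSVs too large for memory
ddf = dd.read_csv("logs_*.csv")   # hypothetical glob pattern
daily = ddf.groupby("date")["requests"].sum().compute()

# Streaming: iterate over a huge corpus without downloading it fully
stream = load_dataset("allenai/c4", "en", split="train", streaming=True)
first_example = next(iter(stream))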

(c) Domain-Specific Cleaning

  • NLP: remove stopwords, punctuation, normalize emojis.
  • Computer Vision: image resizing, augmentation (rotation, blur, color jitter).
  • Time Series: seasonal decomposition, anomaly filtering.
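
For the vision case, a typical resize-plus-augmentation pipeline with Albumentations and OpenCV (image path hypothetical):

import albumentations as A
import cv2

transform = A.Compose([
    A.Resize(224, 224),
    A.Rotate(limit=15, p=0.5),
    A.Blur(blur_limit=3, p=0.3),
    A.ColorJitter(p=0.3),
])

image = cv2.imread("sample.jpg")   # hypothetical image
augmented = transform(image=image)["image"]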

(d) Data Augmentation

  • Helps small datasets generalize better.
  • Examples:
    • NLP → back-translation, synonym replacement.
    • Vision → flips, rotations, GAN-based synthetic images.
    • Tabular → noise injection, feature synthesis.
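
For the tabular case, a minimal noise-injection sketch with NumPy and pandas (file hypothetical):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.read_csv("train.csv")   # hypothetical small dataset

# Duplicate rows, adding small Gaussian noise (1% of each feature's std) to numeric columns
num_cols = df.select_dtypes("number").columns
noisy = df.copy()
noisy[num_cols] += rng.normal(0, 0.01 * df[num_cols].std().to_numpy(), size=df[num_cols].shape)

augmented = pd.concat([df, noisy], ignore_index=True)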

4. Tools & Libraries

  • Python: Pandas, NumPy, scikit-learn, Dask.
  • NLP: NLTK, spaCy, Hugging Face Datasets.
  • Vision: OpenCV, Albumentations.
  • Time-series: tsfresh, Prophet.

5. Example: NLP Dataset Cleaning (Practical)

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

# Load dataset
df = pd.read_csv("reviews.csv")

# Remove exact duplicates
df = df.drop_duplicates()

# Fill missing reviews with the most frequent value
# (.ravel() flattens the 2-D output of fit_transform back to one column)
imputer = SimpleImputer(strategy="most_frequent")
df['review'] = imputer.fit_transform(df[['review']]).ravel()

# Normalize text: lowercase, then strip everything except letters and whitespace
df['review'] = df['review'].str.lower().str.replace(r'[^a-z\s]', '', regex=True)

# Train-test split (80/20)
train, test = train_test_split(df, test_size=0.2, random_state=42)

Summary

  • Dataset preparation & cleaning = foundation of reliable AI/ML models.
  • Includes collection, integration, cleaning, transformation, balancing, and splitting.
  • Advanced tasks: bias mitigation, domain-specific cleaning, augmentation.
  • Final output: a high-quality, well-structured dataset ready for model training.