Dataset Preparation & Cleaning: a critical step in any AI/ML project, often taking 60–70% of total project time.
1. Importance of Dataset Preparation
- ML models are only as good as the data they are trained on.
- Poor-quality data → biased, inaccurate, or unstable models.
- Clean, balanced, and well-structured data → higher performance, better generalization, ethical AI.
2. Steps in Dataset Preparation
Step 1: Data Collection
- Sources:
- Public datasets (Kaggle, Hugging Face Datasets, UCI Repository).
- APIs (Twitter API, OpenAI API logs, etc.).
- Enterprise databases (CRM, ERP, medical records).
- Web scraping.
- Ensure data privacy & compliance (GDPR, HIPAA).
Step 2: Data Integration
- Combine data from multiple sources → single dataset.
- Handle issues:
- Schema mismatch (different column names, formats).
- Data redundancy (duplicate entries).
- Unit consistency (e.g., kg vs. lbs, INR vs. USD).
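A minimal pandas sketch of these fixes; the file names, columns, and units are illustrative assumptions, not a fixed recipe:
import pandas as pd
# Hypothetical sources with mismatched schemas and units
crm = pd.read_csv("crm.csv")   # columns: cust_id, name, weight_lbs
erp = pd.read_csv("erp.csv")   # columns: customer_id, region
# Schema mismatch: align the join key's name across sources
crm = crm.rename(columns={"cust_id": "customer_id"})
# Unit consistency: convert pounds to kilograms (1 lb = 0.45359237 kg)
crm["weight_kg"] = crm["weight_lbs"] * 0.45359237
crm = crm.drop(columns=["weight_lbs"])
# Integrate into a single dataset and drop redundant entries
merged = crm.merge(erp, on="customer_id", how="outer")
merged = merged.drop_duplicates(subset="customer_id")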
Step 3: Data Cleaning
This is the core step.
(a) Handling Missing Data
- Options:
- Drop rows/columns (e.g., when more than 70% of values are missing).
- Imputation: mean, median, mode, KNN-imputation.
- Domain-specific filling (e.g., 0 for missing transaction).
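A short pandas/scikit-learn sketch of these options; the file and column names are hypothetical:
import pandas as pd
from sklearn.impute import KNNImputer
df = pd.read_csv("data.csv")  # hypothetical file
# Option 1: drop columns where more than 70% of values are missing
df = df.loc[:, df.isna().mean() <= 0.70]
# Option 2: simple imputation, e.g. median for a skewed numeric column
df["income"] = df["income"].fillna(df["income"].median())
# Option 3: domain-specific fill, e.g. a missing transaction means none occurred
df["transaction_amount"] = df["transaction_amount"].fillna(0)
# Option 4: KNN imputation, estimating gaps from the 5 nearest rows
df[["age", "income"]] = KNNImputer(n_neighbors=5).fit_transform(df[["age", "income"]])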
(b) Handling Duplicates
- Remove exact duplicates.
- For near-duplicates: clustering or fuzzy matching.
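Exact duplicates are one pandas call; for near-duplicates, a lightweight sketch using the standard library's difflib (a production pipeline would use a dedicated fuzzy-matching library) looks like this:
import pandas as pd
from difflib import SequenceMatcher
df = pd.read_csv("customers.csv")  # hypothetical file with a 'name' column
# Exact duplicates: all columns identical
df = df.drop_duplicates().reset_index(drop=True)
# Near-duplicates: flag name pairs above a similarity threshold
def is_near_duplicate(a, b, threshold=0.9):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold
names = df["name"].tolist()
pairs = [(i, j) for i in range(len(names))
         for j in range(i + 1, len(names))
         if is_near_duplicate(names[i], names[j])]
# 'pairs' holds candidate duplicates for review or clustering
Note the pairwise scan is O(n^2); on large tables, block by a cheap key (e.g., first letter) before comparing.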
(c) Outlier Detection
- Statistical methods: z-score, IQR.
- ML methods: Isolation Forest, DBSCAN.
- Decision: remove or cap outliers depending on domain.
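Both approaches in a short sketch; the 'price' column and contamination rate are assumptions:
import pandas as pd
from sklearn.ensemble import IsolationForest
df = pd.read_csv("data.csv")  # hypothetical file
# Statistical: IQR rule flags values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["iqr_outlier"] = ~df["price"].between(lower, upper)
# ML-based: Isolation Forest labels anomalies as -1
iso = IsolationForest(contamination=0.01, random_state=42)
df["iso_outlier"] = iso.fit_predict(df[["price"]]) == -1
# Domain decision: cap (winsorize) rather than remove
df["price_capped"] = df["price"].clip(lower, upper)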
(d) Noise Reduction
- Fix spelling errors, inconsistent labels (e.g., “NYC”, “New York City”).
- Apply text normalization for NLP: lowercasing, stemming, lemmatization.
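A small sketch of label harmonization plus normalization, using NLTK's Porter stemmer; the column names and label map are illustrative:
import pandas as pd
from nltk.stem import PorterStemmer
df = pd.read_csv("reviews.csv")  # hypothetical file
# Harmonize inconsistent labels with an explicit mapping
df["city"] = df["city"].replace({"NYC": "New York City", "new york": "New York City"})
# Normalize text: lowercase, then stem each token
stemmer = PorterStemmer()
df["review"] = df["review"].str.lower().apply(
    lambda text: " ".join(stemmer.stem(tok) for tok in text.split())
)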
Step 4: Data Transformation
- Scaling/Normalization:
- Min-Max scaling, Standardization (z-score).
- Needed for distance- and margin-based algorithms such as SVM, k-means, and KNN (see the sketch after this list).
- Encoding categorical data:
- One-hot encoding, label encoding, embeddings.
- Feature engineering:
- Derived features (e.g., “Age group” from “DOB”).
- Time-series preparation:
- Rolling windows, lag features, seasonality decomposition.
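The transformations above in one pandas/scikit-learn sketch; the file and column names are assumptions:
import pandas as pd
from sklearn.preprocessing import StandardScaler
df = pd.read_csv("sales.csv")  # hypothetical file
# Scaling: z-score standardization of numeric features
df[["price", "quantity"]] = StandardScaler().fit_transform(df[["price", "quantity"]])
# Encoding: one-hot encode a categorical column
df = pd.get_dummies(df, columns=["region"], prefix="region")
# Feature engineering: derive an age group from date of birth
age_years = (pd.Timestamp.now() - pd.to_datetime(df["dob"])).dt.days // 365
df["age_group"] = pd.cut(age_years, bins=[0, 18, 35, 60, 120],
                         labels=["minor", "young_adult", "adult", "senior"])
# Time series: lag feature and 7-step rolling mean
df = df.sort_values("date")
df["sales_lag_1"] = df["sales"].shift(1)
df["sales_roll_7"] = df["sales"].rolling(window=7).mean()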
Step 5: Data Balancing
- Imbalanced datasets bias models toward the majority class.
- Techniques:
- Oversampling: SMOTE, ADASYN.
- Undersampling: random undersampling, Tomek links.
- Cost-sensitive learning: penalize misclassification of minority class.
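Two of these techniques in code, on a synthetic imbalanced dataset (SMOTE assumes the imbalanced-learn package is installed):
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE  # assumes imbalanced-learn is installed
# Synthetic stand-in: ~95% majority class, ~5% minority
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=42)
# Oversampling: SMOTE synthesizes new minority-class samples
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))
# Cost-sensitive alternative: weight classes instead of resampling
clf = LogisticRegression(class_weight="balanced").fit(X, y)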
Step 6: Data Splitting
- Split into:
- Training set (60–80%).
- Validation set (10–20%).
- Test set (10–20%).
- Advanced: Cross-validation (k-fold, stratified CV).
- Ensure no data leakage (e.g., future information or test-set statistics leaking into training).
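A sketch of a 60/20/20 three-way split plus stratified k-fold, on synthetic data:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold
X, y = make_classification(n_samples=1000, random_state=42)  # synthetic stand-in
# Carve out the test set first, then split the remainder into train/validation
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)  # 0.25 * 0.8 = 0.2
# Stratified k-fold preserves class ratios in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    pass  # fit and evaluate a model per fold here
To avoid leakage, fit scalers and imputers on the training split only, then apply them to validation and test.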
3. Advanced Considerations
(a) Dataset Bias
- Check for demographic imbalance (gender, age, region).
- Bias mitigation: re-sampling, re-weighting, fairness constraints.
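One simple mitigation as a sketch: inverse-frequency sample re-weighting (the 'gender' column is a hypothetical example of a sensitive attribute):
import pandas as pd
df = pd.read_csv("applicants.csv")  # hypothetical file
# Inverse-frequency weights: under-represented groups count more during training
group_freq = df["gender"].value_counts(normalize=True)
df["sample_weight"] = df["gender"].map(lambda g: 1.0 / group_freq[g])
# Most scikit-learn estimators accept these via fit(..., sample_weight=...)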
(b) Large-Scale Data
- For big datasets → distributed processing with Spark, Dask, Ray.
- For LLM training → dataset sharding + streaming.
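A minimal Dask sketch of out-of-core processing (assumes the dask package is installed; the shard paths are illustrative):
import dask.dataframe as dd
# Lazily read a folder of CSV shards; nothing loads until .compute()
ddf = dd.read_csv("data/part-*.csv")
ddf = ddf.drop_duplicates()
daily_mean = ddf.groupby("date")["amount"].mean().compute()  # runs in parallel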
(c) Domain-Specific Cleaning
- NLP: remove stopwords, punctuation, normalize emojis.
- Computer Vision: image resizing, augmentation (rotation, blur, color jitter).
- Time Series: seasonal decomposition, anomaly filtering.
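For the vision case, a minimal OpenCV sketch (assumes opencv-python is installed; the file name is hypothetical):
import cv2
img = cv2.imread("sample.jpg")                        # hypothetical image file
img = cv2.resize(img, (224, 224))                     # uniform model input size
blurred = cv2.GaussianBlur(img, (5, 5), 0)            # blur augmentation
rotated = cv2.rotate(img, cv2.ROTATE_90_CLOCKWISE)    # rotation augmentation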
(d) Data Augmentation
- Helps small datasets generalize better.
- Examples:
- NLP → back-translation, synonym replacement.
- Vision → flips, rotations, GAN-based synthetic images.
- Tabular → noise injection, feature synthesis.
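A tabular example: Gaussian noise injection with NumPy (the 0.05 scale factor is an arbitrary assumption):
import numpy as np
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))  # stand-in for a small feature matrix
# Add small per-feature Gaussian perturbations to create new samples
noise = rng.normal(scale=0.05 * X.std(axis=0), size=X.shape)
X_augmented = np.vstack([X, X + noise])  # doubles the training set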
4. Tools & Libraries
- Python: pandas, NumPy, scikit-learn, Dask.
- NLP: NLTK, spaCy, Hugging Face datasets.
- Vision: OpenCV, Albumentations.
- Time series: tsfresh, Prophet.
5. Example: NLP Dataset Cleaning (Practical)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
# Load dataset
df = pd.read_csv("reviews.csv")
# Remove exact duplicate rows
df = df.drop_duplicates().reset_index(drop=True)
# Handle missing values (.ravel() flattens the imputer's 2-D output)
imputer = SimpleImputer(strategy="most_frequent")
df['review'] = imputer.fit_transform(df[['review']]).ravel()
# Normalize text: lowercase, keep only letters and whitespace
df['review'] = df['review'].str.lower().str.replace(r'[^a-z\s]', '', regex=True)
# Train-test split (80/20)
train, test = train_test_split(df, test_size=0.2, random_state=42)
✅ Summary
- Dataset preparation & cleaning = foundation of reliable AI/ML models.
- Includes collection, integration, cleaning, transformation, balancing, and splitting.
- Advanced tasks: bias mitigation, domain-specific cleaning, augmentation.
- Final output: a high-quality, well-structured dataset ready for model training.