Data Description
File Information
This dataset simulates the complexities of real-world medical data to provide hands-on experience in data cleaning and preprocessing for AI in healthcare. It deliberately retains common data issues found in real clinical settings, serving as a practical sandbox for:
- 🚮 Data Cleaning Mastery: Tackle duplicates, inconsistent resolutions, and naming heterogeneity.
- ⚖️ Class Imbalance Solutions: Experiment with techniques like SMOTE or augmentation.
- 🎯 Medical AI Readiness: Prepare raw clinical data for XAI (Explainable AI) models.
Content
11,784 ultrasound images, intentionally left uncleaned:
infected/: 6,784 images (PCOS-positive cases)
noninfected/: 5,000 images (healthy ovaries)
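As a rough illustration of what these counts imply for training, the sketch below computes each class's share and a simple inverse-frequency class weight. The counts are the ones stated above, before any deduplication; the function names are hypothetical, and inverse-frequency weighting is only one possible balancing approach, not a method prescribed by the dataset.

```python
# Class counts as stated in the dataset description (before deduplication).
CLASS_COUNTS = {"infected": 6784, "noninfected": 5000}


def class_shares(counts):
    """Return each class's fraction of the total number of images."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}


if __name__ == "__main__":
    for name, share in class_shares(CLASS_COUNTS).items():
        print(f"{name}: {share:.1%}")
    print(balanced_class_weights(CLASS_COUNTS))
```

Because the folders contain duplicates, these shares will likely shift after cleaning, so it makes sense to recompute them on the deduplicated file list.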
🚩 Real-World Challenges Included:
- Duplicates: 1,956+ groups of identical images (intra-class & cross-class)
- Multi-Resolution Mix: Images from 255x247px to 984x848px
- Metadata Gaps: No clinical patient data (simulating HIPAA-restricted scenarios)
- Class Imbalance: 57.6% PCOS-positive vs 42.4% healthy (6,784 vs 5,000 images)
- Noise Artifacts: Blurring, rotation variants, and naming inconsistencies
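One way to start on the duplicate problem is to group files by a hash of their bytes. The sketch below is a minimal example under the assumption that duplicates are byte-identical files; copies that were re-encoded, resized, or rotated would need a perceptual or pixel-level comparison, which is not shown here. The helper names are hypothetical; the `infected/` and `noninfected/` folder names come from the listing above.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicate_groups(root):
    """Group all files under root by content hash; keep groups of 2+ files."""
    by_hash = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_hash[file_digest(path)].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]


def is_cross_class(group):
    """True if a duplicate group spans more than one class folder."""
    return len({p.parent.name for p in group}) > 1
```

Cross-class groups deserve special attention: the same image labeled both PCOS-positive and healthy is a label conflict, not just redundancy, so such files are usually reviewed or dropped rather than simply deduplicated.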
Sources & Methodology
- Curated from: Retrospective ultrasound studies across 3 clinics (2018-2022)
- Ethical Compliance: Patient identifiers removed, DICOM metadata stripped
- Annotated: By radiology residents under consultant supervision
Inspiration
Born from the need to bridge the gap between:
- 📊 Clean Tutorial Datasets (MNIST, CIFAR)
- 🏥 Messy Real Clinical Data
Use this to practice the unglamorous but critical 80% of ML work: data wrangling!
Potential Applications
- 🚮 Data cleaning pipelines for medical imaging
- 🔍 Duplicate detection algorithms
- ⚖️ Benchmarking class-balancing techniques
- 📏 Resolution standardization methods
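For resolution standardization, one common option is to scale each image to fit a fixed square while keeping its aspect ratio, then pad the remainder. The sketch below only computes the geometry for that step. The 224-pixel target and the function name are assumptions chosen for illustration, and the actual pixel resampling would be done with an imaging library of your choice.

```python
def letterbox_geometry(width, height, target=224):
    """Fit (width, height) inside a target x target square, keeping aspect ratio.

    Returns (new_width, new_height, pad_left, pad_top), where the padding
    centers the resized image inside the square.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    # Scale so that the longer side exactly matches the target size.
    scale = target / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return new_w, new_h, pad_left, pad_top


if __name__ == "__main__":
    # The smallest and largest resolutions mentioned in the description.
    print(letterbox_geometry(255, 247))  # (224, 217, 0, 3)
    print(letterbox_geometry(984, 848))  # (224, 193, 0, 15)
```

Padding instead of stretching avoids distorting anatomical proportions, at the cost of some unused border pixels; whether that trade-off suits a given model is a judgment call.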
Note to Practitioners
"The true test of an ML engineer isn't model architecture choice, but transforming messy data into trainable gold." Use these imperfections as your training ground.
Verification Report
The following data verification reports are provided by the seller:

PCOS-XAI Ultrasound: Real-World Training Dataset
372.28 MB