Datasets for Breast Cancer Detection with Mammography and MWI

Explore diverse imaging databases to develop robust AI models for breast cancer detection

Key Insights

Comprehensive Mammography Datasets: Datasets such as CBIS-DDSM, OPTIMAM, and ADMANI provide extensive collections of mammography images with detailed pathological annotations.
Emerging Microwave Imaging Data: Public datasets for Microwave Imaging (MWI) are scarce. UM-BMID is one of the few openly available, and further prototype data is usually obtained through collaboration with research groups.
Hybrid Imaging Approaches: Pairing mammography datasets with MRI-based datasets such as MAMA-MIA and the Duke Breast Cancer MRI Dataset can improve model robustness by exposing models to more than one modality.

Overview of Data Sources

In developing artificial intelligence (AI) algorithms for breast cancer detection, high-quality datasets play a crucial role in training and validating deep learning models. Mammography has been a fundamental application area, with numerous datasets comprising scanned film mammograms and digital mammograms, complemented by dynamic imaging collections such as DCE-MRI. Microwave imaging (MWI) is an emerging field that promises complementary information by using microwave signals to differentiate between healthy and malignant tissues.

This guide is designed for researchers and engineers interested in exploring datasets that support training and research in breast cancer detection. Below, we detail several key datasets and data resources available for both mammography and microwave imaging modalities, along with pointers to further collaboration opportunities for MWI datasets.

Mammography Datasets

Digital and Film Mammography

Mammography remains one of the primary imaging techniques used globally for early breast cancer detection. There are multiple robust datasets prepared for AI research that encompass a variety of imaging methodologies and comprehensive clinical annotations.

CBIS-DDSM

CBIS-DDSM (Curated Breast Imaging Subset of DDSM) is an updated, standardized subset of the Digital Database for Screening Mammography (DDSM). The parent DDSM comprises 2,620 scanned film mammography studies covering normal, benign, and malignant cases with verified pathology; CBIS-DDSM is a curated selection of those studies, converted to DICOM and supplied with updated region-of-interest segmentations and pathology labels. The collection is distributed through The Cancer Imaging Archive (TCIA), and a copy is hosted on Kaggle. Its detailed annotations make it a common starting point for training and benchmarking machine learning and deep learning models.

OPTIMAM Mammography Image Database (OMI-DB) and ADMANI

OMI-DB (from the OPTIMAM project) and ADMANI are large mammography resources from the United Kingdom and Australia, respectively.
• OMI-DB offers approximately 3 million mammography images sourced from over 170,000 women, collected across several years, making it one of the largest repositories in the field.
• ADMANI includes around 4.4 million images with detailed annotations and longitudinal data records, supporting both retrospective studies and the development of predictive diagnostic models.

VinDr-Mammo

VinDr-Mammo includes 5,000 four-view exams featuring comprehensive breast-level assessments and annotations. The detailed reports that accompany these images are designed to aid in developing computer-aided detection (CAD) tools, offering a rich dataset for both binary classification and lesion localization tasks.
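A four-view exam normally holds a craniocaudal (CC) and a mediolateral oblique (MLO) image of each breast, so image-level model outputs have to be rolled up before they can be compared with breast-level assessments. The sketch below shows one simple way to do that, taking the maximum score over a breast's views. The record keys (`study_id`, `laterality`, `view`, `score`) and the scores are hypothetical and are not the dataset's actual schema.

```python
from collections import defaultdict


def breast_level_scores(image_records):
    """Aggregate per-image suspicion scores to one score per (study, breast).

    Each record is a dict with hypothetical keys: 'study_id', 'laterality'
    ('L' or 'R'), 'view' ('CC' or 'MLO'), and 'score' in [0, 1].
    The breast-level score is the maximum over that breast's views.
    """
    grouped = defaultdict(list)
    for rec in image_records:
        grouped[(rec["study_id"], rec["laterality"])].append(rec["score"])
    return {key: max(scores) for key, scores in grouped.items()}


# One illustrative four-view exam with made-up model scores.
exam = [
    {"study_id": "s1", "laterality": "L", "view": "CC", "score": 0.10},
    {"study_id": "s1", "laterality": "L", "view": "MLO", "score": 0.35},
    {"study_id": "s1", "laterality": "R", "view": "CC", "score": 0.80},
    {"study_id": "s1", "laterality": "R", "view": "MLO", "score": 0.65},
]
scores = breast_level_scores(exam)
print(scores)
```

Taking the maximum is only one possible aggregation rule; averaging the two views or learning the combination are equally plausible choices.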

MRI and Hybrid Imaging Datasets

Expanding Modalities to Enhance AI Models

Although traditional mammography datasets have been instrumental in advancing breast cancer detection, complementary modalities such as DCE-MRI (Dynamic Contrast-Enhanced Magnetic Resonance Imaging) provide additional insights. Including MRI datasets can improve the performance of an AI system by letting it learn from contrast-uptake and volumetric information that mammograms do not capture.

MAMA-MIA Dataset

The MAMA-MIA dataset is a large-scale multicenter breast cancer DCE-MRI repository containing 1,506 cases with expert tumor segmentations. It is particularly suited for benchmarking and developing deep learning algorithms that incorporate dynamic imaging, making it a beneficial dataset for researchers aiming to combine MRI features with mammography data.
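Benchmarking against expert segmentations usually comes down to an overlap metric. As a minimal sketch, assuming binary masks stored as nested lists of 0/1 for a single 2D slice, the Dice coefficient can be computed as follows. The masks here are toy examples, and the choice of Dice is ours rather than something the dataset prescribes.

```python
def dice_coefficient(pred, truth):
    """Dice overlap between two binary masks given as nested lists of 0/1."""
    flat_pred = [v for row in pred for v in row]
    flat_truth = [v for row in truth for v in row]
    if len(flat_pred) != len(flat_truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(p * t for p, t in zip(flat_pred, flat_truth))
    total = sum(flat_pred) + sum(flat_truth)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks: 3 foreground pixels each, 2 of them shared.
predicted = [[0, 1, 1],
             [0, 1, 0]]
expert = [[0, 1, 0],
          [0, 1, 1]]
dice = dice_coefficient(predicted, expert)
print(f"Dice = {dice:.3f}")
```

For a 3D MRI volume the same function can be applied after flattening all slices into one list of rows.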

Duke Breast Cancer MRI Dataset

This dataset from Duke University provides MRI scans of 922 patients diagnosed with breast cancer. It includes annotations such as tumor bounding boxes, along with clinical data for each patient. The combination of imaging, clinical metadata, and tumor location annotations makes this dataset a valuable resource for training AI models that aim to interpret complex imaging studies.
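Bounding-box annotations are typically used to cut a tumor-centered sub-volume out of each scan before training. The following sketch shows the idea on a tiny synthetic volume stored as nested lists. The box format (half-open index ranges keyed by `z`, `y`, `x`) is a hypothetical convention for illustration and should be checked against the dataset's own annotation files.

```python
def crop_volume(volume, box):
    """Crop a 3D volume (list of slices, each a list of rows) to a box.

    `box` is a hypothetical dict of half-open index ranges:
    {'z': (z0, z1), 'y': (y0, y1), 'x': (x0, x1)}.
    """
    (z0, z1), (y0, y1), (x0, x1) = box["z"], box["y"], box["x"]
    return [[row[x0:x1] for row in sl[y0:y1]] for sl in volume[z0:z1]]


# Synthetic 2 x 3 x 4 volume whose voxel value encodes its own index.
volume = [[[100 * z + 10 * y + x for x in range(4)]
           for y in range(3)]
          for z in range(2)]
roi = crop_volume(volume, {"z": (0, 1), "y": (1, 3), "x": (2, 4)})
print(roi)
```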

Microwave Imaging (MWI) Datasets

Emergent Technologies and Data Challenges

Microwave Imaging (MWI) leverages non-ionizing electromagnetic waves to probe breast tissues. Unlike traditional imaging methodologies, MWI aims to differentiate malignant from healthy tissue based on differences in their dielectric properties. Although this field remains nascent compared to mammography, a small number of experimental studies have published initial datasets and measurement results.
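To make the role of dielectric contrast concrete, the sketch below evaluates the textbook normal-incidence reflection coefficient at a flat interface between two non-magnetic media, Gamma = (sqrt(eps1) - sqrt(eps2)) / (sqrt(eps1) + sqrt(eps2)). The permittivity values are illustrative placeholders chosen by us, not measurements, and real tissue is lossy and frequency-dependent, so this is a simplification.

```python
import cmath


def reflection_coefficient(eps_r_from, eps_r_to):
    """Normal-incidence reflection coefficient at a flat interface.

    The wave travels from a medium with relative permittivity eps_r_from
    into one with eps_r_to. Both media are assumed non-magnetic; complex
    permittivities may be passed to model loss.
    """
    n1 = cmath.sqrt(eps_r_from)
    n2 = cmath.sqrt(eps_r_to)
    return (n1 - n2) / (n1 + n2)


# Illustrative relative permittivities only (assumptions, not measured data).
EPS_FATTY = 5.0
EPS_GLANDULAR = 40.0
EPS_TUMOUR = 50.0

for name, eps in [("fatty", EPS_FATTY), ("glandular", EPS_GLANDULAR)]:
    gamma = reflection_coefficient(eps, EPS_TUMOUR)
    print(f"{name} -> tumour: |Gamma| = {abs(gamma):.3f}")
```

Under these assumed values the reflection from a tumour embedded in fatty tissue is much stronger than from one embedded in glandular tissue, which illustrates why the achievable contrast depends on the surrounding tissue.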

University of Manitoba Breast Microwave Imaging Dataset (UM-BMID)

The UM-BMID dataset is one of the few resources dedicated solely to microwave imaging in breast cancer detection. It consists of S-parameter measurements obtained from experimental scans of MRI-derived breast phantoms. These data are instrumental in exploring the feasibility and diagnostic performance of microwave imaging techniques. Because public datasets remain limited, researchers interested in MWI are encouraged to collaborate with academic institutions or companies that specialize in this technology.
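S-parameters are recorded in the frequency domain, and a common first processing step in radar-style MWI is to transform them to the time domain so that reflections appear as peaks at their round-trip delay. The sketch below does this with a direct inverse discrete Fourier transform on synthetic data. The frequency grid, the single ideal reflector, and all names are assumptions for illustration and do not describe the dataset's actual file layout or scan parameters.

```python
import cmath
import math


def to_time_domain(s_params, f_start_hz, f_step_hz):
    """Inverse DFT of uniformly sampled frequency-domain measurements.

    Returns (times, signal), where times span one unambiguous window
    of length 1 / f_step_hz.
    """
    n = len(s_params)
    dt = 1.0 / (n * f_step_hz)
    times = [i * dt for i in range(n)]
    signal = []
    for t in times:
        acc = sum(
            s * cmath.exp(2j * math.pi * (f_start_hz + k * f_step_hz) * t)
            for k, s in enumerate(s_params)
        )
        signal.append(acc / n)
    return times, signal


# Synthetic S11: one ideal reflector with amplitude 0.5 and delay TAU.
N, F0, DF = 64, 1e9, 100e6          # hypothetical sweep: 64 points from 1 GHz
TAU = 20 / (N * DF)                 # delay placed exactly on time bin 20
s11 = [0.5 * cmath.exp(-2j * math.pi * (F0 + k * DF) * TAU) for k in range(N)]

times, sig = to_time_domain(s11, F0, DF)
peak = max(range(N), key=lambda i: abs(sig[i]))
print(f"peak at bin {peak}, t = {times[peak] * 1e9:.3f} ns")
```

For real sweeps with many frequency points an FFT routine would replace the explicit double loop; the direct sum is used here only to keep the example dependency-free.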

Future MWI datasets are likely to emerge as the technology continues to be developed and refined. Engaging with ongoing research, attending conferences, and networking with teams working on prototypes such as MARIA and MammoWave can offer valuable access to early-stage imaging data.

Integrating Data for Enhanced AI Training

Synergistic Use of Multi-Modality Data

When selecting datasets, it is essential to consider that each imaging modality provides complementary information. Models trained on a mixture of scanned film mammograms, annotated digital mammograms, and MRI studies can learn richer feature sets, which may improve detection accuracy. Integrating MWI data, even though it is currently limited, may add information that conventional imaging does not capture, with the potential to reduce false negatives and highlight early-stage abnormalities.

Integrating multi-modality data requires harmonizing different data formats, managing diverse annotations, and addressing computational challenges. Researchers often use deep convolutional neural networks (CNNs) and transfer learning to overcome these challenges. For instance, networks pretrained on large mammography datasets can be fine-tuned with additional MRI or MWI data, with the aim of building AI systems that generalize better and achieve higher specificity.
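Managing diverse annotations often starts with mapping each source's label vocabulary onto one shared target. The sketch below shows the pattern with two sources. The `cbis_ddsm` entries follow the style of pathology strings used in that dataset's metadata, while `site_b`, the record keys, and the decision to collapse everything to a binary `malignant` flag are hypothetical and should be adapted to each dataset's documentation.

```python
# Per-source mapping from raw label strings to a shared binary target.
LABEL_MAPS = {
    # Assumed pathology strings in the style of CBIS-DDSM metadata.
    "cbis_ddsm": {"MALIGNANT": 1, "BENIGN": 0, "BENIGN_WITHOUT_CALLBACK": 0},
    # Hypothetical second source with its own vocabulary.
    "site_b": {"pos": 1, "neg": 0},
}


def harmonize(records, label_maps):
    """Translate source-specific labels; collect records that cannot be mapped."""
    unified, skipped = [], []
    for rec in records:
        mapping = label_maps.get(rec["source"], {})
        raw = rec["label"]
        if raw in mapping:
            unified.append({
                "case_id": rec["case_id"],
                "source": rec["source"],
                "malignant": mapping[raw],
            })
        else:
            skipped.append(rec)
    return unified, skipped


records = [
    {"case_id": "a1", "source": "cbis_ddsm", "label": "BENIGN_WITHOUT_CALLBACK"},
    {"case_id": "b7", "source": "site_b", "label": "pos"},
    {"case_id": "b9", "source": "site_b", "label": "indeterminate"},
]
unified, skipped = harmonize(records, LABEL_MAPS)
print(unified)
print("unmapped:", [r["case_id"] for r in skipped])
```

Returning the unmapped records instead of silently dropping them makes it easier to audit how much data each mapping decision discards.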

Data Harmonization and Preprocessing

Combining datasets from various imaging modalities requires meticulous preprocessing steps such as image normalization, resolution matching, and artifact removal. The preprocessing pipelines often involve:

Normalization: Adjusting image intensities to compensate for differences in acquisition parameters.
Segmentation: Employing automated tools to delineate regions of interest such as tumors or lesions.
Alignment: Registration of images from different modalities to a common coordinate system, which is crucial for supervised training.

The workflow is designed to facilitate the extraction of high-quality, representative features from each dataset, thereby enhancing the performance of the AI model across diverse populations and imaging techniques.
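The normalization step listed above can be sketched as percentile clipping followed by rescaling to the range [0, 1], so that images acquired with different bit depths or exposure settings end up on a comparable scale. This is a minimal example on nested lists; the percentile defaults and the nearest-rank percentile rule are our assumptions, not settings taken from any of the datasets.

```python
def normalize_intensities(pixels, low_pct=1.0, high_pct=99.0):
    """Clip a 2D image to [low_pct, high_pct] percentiles and scale to [0, 1]."""
    flat = sorted(v for row in pixels for v in row)
    if not flat:
        raise ValueError("image is empty")

    def percentile(p):
        # Nearest-rank percentile on the sorted pixel values.
        return flat[round((len(flat) - 1) * p / 100.0)]

    lo, hi = percentile(low_pct), percentile(high_pct)
    if hi == lo:
        # Constant image: no contrast to preserve.
        return [[0.0 for _ in row] for row in pixels]
    return [[(min(max(v, lo), hi) - lo) / (hi - lo) for v in row]
            for row in pixels]


# Toy 12-bit and 8-bit images land on the same [0, 1] scale.
img_12bit = [[0, 2048], [4095, 1024]]
img_8bit = [[0, 128], [255, 64]]
norm_12 = normalize_intensities(img_12bit)
norm_8 = normalize_intensities(img_8bit)
print(norm_12)
print(norm_8)
```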

Comparative Table of Datasets

Dataset Name	Modality	Key Features	Access
CBIS-DDSM	Mammography	Curated subset of DDSM (2,620 scanned film studies with normal, benign, and malignant cases); pathology-verified; ROI annotations	Kaggle
OPTIMAM/ADMANI	Mammography	Millions of images; long-term patient data; detailed annotations	N/A (Access through formal request)
VinDr-Mammo	Mammography	5,000 four-view exams; detailed breast-level assessments and annotations	N/A (Request platform-specific access)
MAMA-MIA	DCE-MRI	1,506 cases; dynamic contrast-enhanced images; expert tumor segmentations	GitHub
Duke Breast Cancer MRI Dataset	DCE-MRI	MRI scans of 922 patients; tumor bounding box annotations; clinical data	Duke University
UM-BMID	Microwave Imaging (MWI)	S-parameter measurements; experimental scans of MRI-derived phantoms	IEEE Dataport
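When planning experiments it can help to keep the table above as a small machine-readable catalog. The sketch below encodes the rows as a list of dictionaries and filters them by modality. The field names and the `on_request` flag are our own shorthand for the access column, not an official schema.

```python
# Rows of the comparison table as a simple catalog (field names are ours).
DATASETS = [
    {"name": "CBIS-DDSM", "modality": "Mammography", "on_request": False},
    {"name": "OPTIMAM/ADMANI", "modality": "Mammography", "on_request": True},
    {"name": "VinDr-Mammo", "modality": "Mammography", "on_request": True},
    {"name": "MAMA-MIA", "modality": "DCE-MRI", "on_request": False},
    {"name": "Duke Breast Cancer MRI Dataset", "modality": "DCE-MRI", "on_request": False},
    {"name": "UM-BMID", "modality": "Microwave Imaging (MWI)", "on_request": False},
]


def by_modality(catalog, modality):
    """Return the names of all datasets whose modality matches (case-insensitive)."""
    wanted = modality.lower()
    return [d["name"] for d in catalog if d["modality"].lower() == wanted]


mammo = by_modality(DATASETS, "mammography")
direct = [d["name"] for d in DATASETS if not d["on_request"]]
print("Mammography:", mammo)
print("No access request needed:", direct)
```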

Enhancing Your Research Strategy

Curating Multi-Modal Data for Superior AI Performance

When developing AI for detecting breast cancer, it is vital to design a research strategy that leverages the strengths of both traditional imaging and emerging methods. Utilizing a variety of datasets not only increases the diversity of the training data but also aids in creating models that can adapt to real-world variability. This multi-modality approach is particularly useful given the following considerations:

Increased Robustness: Training on data from different modalities and imaging systems helps the model generalize across varied patient populations and acquisition settings.
Enhanced Feature Learning: Models trained on diverse data can learn features related to contrast differences, morphological variations, and tumor heterogeneity, which can improve diagnostic accuracy.
Mitigating False Negatives: Integrating data from MWI may capture subtle differences in tissue properties that are not evident in mammograms, which could reduce the risk of missing early-stage cancers.

By combining mammography, DCE-MRI, and emerging MWI datasets, researchers can push forward the boundaries of AI diagnostics in oncology. Collaboration with research institutions and AI consortia often provides early access to novel datasets and supports the development of sophisticated diagnostic tools.

References

MAMA-MIA Dataset - GitHub
Duke Breast Cancer MRI Dataset - Duke University
Advanced MRI Breast Lesions - TCIA
UM-BMID - IEEE Dataport
CBIS-DDSM - Kaggle