High-quality datasets are central to training and validating the deep learning models behind artificial intelligence (AI) algorithms for breast cancer detection. Mammography has been the main application area, with public datasets ranging from scanned film mammograms to digital images, while dynamic imaging such as DCE-MRI is covered by separate collections. Microwave imaging (MWI) is an emerging modality that uses microwave signals to distinguish healthy from malignant tissue and may offer complementary information.
This guide is designed for researchers and engineers interested in exploring datasets that support training and research in breast cancer detection. Below, we detail several key datasets and data resources available for both mammography and microwave imaging modalities, along with pointers to further collaboration opportunities for MWI datasets.
Mammography remains one of the primary imaging techniques used globally for early breast cancer detection. There are multiple robust datasets prepared for AI research that encompass a variety of imaging methodologies and comprehensive clinical annotations.
The CBIS-DDSM (Curated Breast Imaging Subset of DDSM) dataset is an updated and standardized subset of the Digital Database for Screening Mammography (DDSM). The parent DDSM comprises 2,620 scanned film mammography studies that include normal, benign, and malignant cases confirmed by pathology reports; CBIS-DDSM is a curated selection from that collection. The dataset is available on Kaggle and is widely used for training machine learning models because of its detailed annotations, pathology confirmations, and accompanying metadata.
OPTIMAM, distributed as the OPTIMAM Mammography Image Database (OMI-DB), and ADMANI are large mammography resources from the United Kingdom and Australia, respectively.
• OMI-DB offers approximately 3 million mammography images sourced from over 170,000 women, collected across several years, making it one of the largest repositories in the field.
• ADMANI includes around 4.4 million images with detailed annotations and longitudinal data records, supporting both retrospective studies and the development of predictive diagnostic models.
VinDr-Mammo includes 5,000 four-view exams featuring comprehensive breast-level assessments and annotations. The detailed reports that accompany these images are designed to aid in developing computer-aided detection (CAD) tools, offering a rich dataset for both binary classification and lesion localization tasks.
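Before training on four-view exams, it is often useful to check that each study actually contains the standard views (CC and MLO for each breast). The sketch below is one way to do that; the field names `study_id`, `laterality`, and `view` are hypothetical and should be replaced with the column names in the dataset's own metadata.

```python
from collections import defaultdict

# The four standard screening views: (laterality, view position).
REQUIRED_VIEWS = {("L", "CC"), ("L", "MLO"), ("R", "CC"), ("R", "MLO")}

def complete_exams(rows):
    """Group image rows by study and keep studies that have all four views."""
    views = defaultdict(set)
    for row in rows:
        views[row["study_id"]].add((row["laterality"], row["view"]))
    return sorted(s for s, v in views.items() if REQUIRED_VIEWS <= v)

# Hypothetical metadata: study "A" is complete, study "B" has a single image.
rows = [
    {"study_id": "A", "laterality": lat, "view": view}
    for lat in ("L", "R")
    for view in ("CC", "MLO")
]
rows.append({"study_id": "B", "laterality": "L", "view": "CC"})

usable = complete_exams(rows)
```

Here `usable` contains only study "A"; incomplete studies can then be excluded or handled separately.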
Although mammography datasets have been instrumental in advancing breast cancer detection, other modalities such as DCE-MRI (Dynamic Contrast-Enhanced Magnetic Resonance Imaging) provide additional information. Including MRI datasets may improve an AI system's performance by letting it learn from complementary views of the disease.
The MAMA-MIA dataset is a large-scale multicenter breast cancer DCE-MRI repository containing 1,506 cases with expert tumor segmentations. It is well suited to developing and benchmarking deep learning algorithms for dynamic imaging, which makes it useful for researchers who want to combine MRI features with mammography data.
The Duke Breast Cancer MRI dataset from Duke University provides MRI scans of 922 patients diagnosed with breast cancer. It includes tumor bounding-box annotations along with clinical data. The combination of imaging, clinical metadata, and lesion annotations makes it a strong resource for training AI models that interpret complex imaging studies.
Microwave imaging (MWI) uses non-ionizing electromagnetic waves to probe breast tissue. Unlike conventional imaging, MWI may be able to differentiate malignant tissue based on its dielectric properties. The field is still nascent compared to mammography, but a few datasets and experimental studies have published initial measurement results.
The UM-BMID (University of Manitoba Breast Microwave Imaging Dataset) is one of the few public resources dedicated solely to microwave imaging for breast cancer detection. It consists of S-parameter measurements obtained from experimental scans of MRI-derived breast tissue models (phantoms). These data help in exploring the feasibility and diagnostic performance of microwave imaging techniques. Because public datasets remain limited, researchers interested in MWI are encouraged to collaborate with academic institutions or companies that specialize in this technology.
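To make the S-parameter idea concrete, the sketch below converts a frequency-domain reflection sweep into a time-domain response with an inverse discrete Fourier transform, one common first step in radar-based reconstruction. It runs on synthetic data and does not reflect UM-BMID's actual file format or processing chain: the function names, the 64-point sweep, and the single ideal reflector are all assumptions made for illustration.

```python
import cmath

def to_time_domain(s_params):
    """Inverse DFT of a uniformly sampled frequency sweep (O(N^2), for clarity)."""
    n = len(s_params)
    return [
        sum(s * cmath.exp(2j * cmath.pi * k * t / n) for k, s in enumerate(s_params)) / n
        for t in range(n)
    ]

def synthetic_reflection(n_freqs, delay_bins):
    """Hypothetical S11 sweep for one ideal reflector: a pure linear phase ramp."""
    return [cmath.exp(-2j * cmath.pi * k * delay_bins / n_freqs) for k in range(n_freqs)]

# A reflector whose round-trip delay corresponds to time bin 10.
sweep = synthetic_reflection(n_freqs=64, delay_bins=10)
response = to_time_domain(sweep)

magnitudes = [abs(v) for v in response]
peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
```

The linear phase ramp in the frequency domain becomes a single peak in the time domain, so `peak_bin` recovers the simulated delay. Measured sweeps would additionally need calibration and windowing, which are omitted here.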
Future MWI datasets are likely to emerge as the technology continues to be developed and refined. Engaging with ongoing research, attending conferences, and networking with teams working on prototypes such as MARIA and MammoWave can offer valuable access to early-stage imaging data.
When selecting datasets, keep in mind that each imaging modality provides complementary information. Models trained on a mixture of scanned film mammograms, digital mammograms with comprehensive annotations, and MRI studies can learn richer feature sets, which may improve detection accuracy. Integrating MWI data, even though it is currently limited, can add information that conventional imaging does not capture, potentially reducing false negatives and highlighting early-stage abnormalities.
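One simple way to combine modalities is late fusion: each modality-specific model outputs a malignancy probability and the scores are merged afterwards. The sketch below averages the scores in logit space with per-modality weights. The function name, the weights, and the example probabilities are illustrative assumptions, not values derived from any dataset discussed here.

```python
import math

def fuse_probabilities(probs, weights):
    """Weighted average of per-modality probabilities in logit space.

    probs and weights are dicts keyed by modality. Modalities missing from
    probs (for example, no MRI for this patient) are simply skipped.
    """
    eps = 1e-6  # keeps the logit finite for probabilities of exactly 0 or 1
    total, weight_sum = 0.0, 0.0
    for modality, p in probs.items():
        p = min(max(p, eps), 1 - eps)
        w = weights[modality]
        total += w * math.log(p / (1 - p))
        weight_sum += w
    if weight_sum == 0:
        raise ValueError("no modality scores to fuse")
    return 1 / (1 + math.exp(-total / weight_sum))

# Hypothetical weights; in practice these would be tuned on validation data.
weights = {"mammography": 0.5, "dce_mri": 0.4, "mwi": 0.1}
fused = fuse_probabilities({"mammography": 0.30, "dce_mri": 0.80}, weights)
```

Because the weights are renormalized over the modalities that are present, the same function handles patients with only a subset of the imaging studies.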
Integrating multi-modality data requires harmonizing different data formats, reconciling annotation schemes, and managing the added computational cost. Researchers often use deep convolutional neural networks (CNNs) and transfer learning to address these challenges. For instance, networks pretrained on large mammography datasets can be fine-tuned with additional MRI or MWI data, with the aim of building AI systems that generalize better and achieve higher specificity.
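Harmonizing annotations usually starts with mapping each dataset's native labels onto one shared schema. The sketch below shows one possible shape for that step. The `Sample` record, the `LABEL_MAPS` table, and the label strings are hypothetical stand-ins; the real column names and values should be taken from each dataset's documentation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    dataset: str
    modality: str
    patient_id: str
    label: int  # 1 = malignant, 0 = not malignant

# Hypothetical per-dataset label vocabularies mapped to a shared binary label.
LABEL_MAPS = {
    "cbis-ddsm": {"MALIGNANT": 1, "BENIGN": 0},
    "duke-mri": {"cancer": 1},
}
MODALITIES = {"cbis-ddsm": "mammography", "duke-mri": "dce-mri"}

def harmonize(dataset, patient_id, raw_label):
    """Convert one dataset-specific record into the shared schema."""
    try:
        label = LABEL_MAPS[dataset][raw_label]
    except KeyError:
        raise ValueError(f"unmapped label {raw_label!r} for dataset {dataset!r}") from None
    # Prefix the ID so patients from different sources cannot collide.
    return Sample(dataset, MODALITIES[dataset], f"{dataset}:{patient_id}", label)

records = [
    harmonize("cbis-ddsm", "P_00001", "MALIGNANT"),
    harmonize("duke-mri", "001", "cancer"),
]
```

Failing loudly on unmapped labels is a deliberate choice in this sketch: silently defaulting an unknown label to "benign" would corrupt the training set.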
Combining datasets from different imaging modalities requires careful preprocessing. Pipelines typically include image normalization, resolution matching, and artifact removal.
The workflow is designed to facilitate the extraction of high-quality, representative features from each dataset, thereby enhancing the performance of the AI model across diverse populations and imaging techniques.
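The normalization and resolution-matching steps can be sketched as follows. This is a dependency-free illustration on a tiny nested-list image; the function names, the percentile clipping bounds, and the nearest-neighbour resampling are assumptions chosen for brevity. A real pipeline would more likely use an array library with proper interpolation, and artifact removal is left out entirely.

```python
def normalize(image, low_pct=1.0, high_pct=99.0):
    """Clip intensities to a percentile window, then rescale to [0, 1]."""
    flat = sorted(v for row in image for v in row)
    lo = flat[int((len(flat) - 1) * low_pct / 100)]
    hi = flat[int((len(flat) - 1) * high_pct / 100)]
    if hi == lo:  # constant image: avoid division by zero
        return [[0.0 for _ in row] for row in image]
    return [[(min(max(v, lo), hi) - lo) / (hi - lo) for v in row] for row in image]

def resize_nearest(image, out_h, out_w):
    """Resample to a common grid with nearest-neighbour lookup."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

# Hypothetical 2x4 image with 12-bit intensities.
raw = [[0, 1000, 2000, 4095],
       [0, 1000, 2000, 4095]]
prepared = resize_nearest(normalize(raw, 0.0, 100.0), out_h=4, out_w=2)
```

After this step every image shares the same intensity range and grid size, regardless of the scanner or dataset it came from.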
Dataset Name | Modality | Key Features | Access Link |
---|---|---|---|
CBIS-DDSM | Mammography | Curated subset of DDSM (2,620 scanned film mammography studies); pathology-verified; varied cases (normal, benign, malignant) | Link |
OPTIMAM/ADMANI | Mammography | Millions of images; long-term patient data; detailed annotations | N/A (Access through formal request) |
VinDr-Mammo | Mammography | 4-view exams; detailed breast-level assessments and annotations | N/A (Request platform-specific access) |
MAMA-MIA | DCE-MRI | 1,506 cases; dynamic contrast-enhanced images; expert tumor segmentations | Link |
Duke Breast Cancer MRI Dataset | DCE-MRI | MRI scans of 922 patients; tumor bounding box annotations; clinical data | Link |
UM-BMID | Microwave Imaging (MWI) | S-parameter measurements; experimental imaging data from MRI-derived models | Link |
When developing AI for detecting breast cancer, it is vital to design a research strategy that leverages the strengths of both traditional imaging and emerging methods. Using a variety of datasets not only increases the diversity of the training data but also helps create models that can adapt to real-world variability.
By combining mammography, DCE-MRI, and emerging MWI datasets, researchers can push forward the boundaries of AI diagnostics in oncology. Collaboration with research institutions and AI consortia often provides early access to novel datasets and supports the development of sophisticated diagnostic tools.