\melbaid

YYYY:NNN \melbaauthorsKeyvan Farahani and Linmin Pei \firstpageno1 \melbayear2025 \datesubmittedyyyy-m1-d1 \datepublishedyyyy-m2-d2 \melbaspecialissueMedical Imaging with Deep Learning (MIDL) 2020 \melbaspecialissueeditorsMarleen de Bruijne, Tal Arbel, Ismail Ben Ayed, Hervé Lombaert \ShortHeadingsMELBA Journal Sample ArticleKeyvan Farahani and Linmin Pei \affiliations\num1 \addrFrederick National Laboratory for Cancer Research, Frederick, MD, 21702, USA
\num2 \addrNational Cancer Institute, National Institutes of Health (NIH), Bethesda, MD 20892, USA
\num3 \addrUniversity of Arkansas for Medical Sciences, Little Rock, Arkansas, USA
\num4 \addrPixelMed Publishing, Bangor, PA, USA
\num5 \addrEssential Software Inc. 9711 Washingtonian Blvd, Suite 550, Gaithersburg, MD, 20878
\num6 \addrAmazon Web Services (AWS)
\num7 \addrUniversity of California, Los Angeles, CA, USA
\num8 \addrCedars-Sinai Medical Center, Los Angeles, CA, USA
\num9 \addrStanford University, Palo Alto, CA, USA
\num10 \addrBiomedical Engineering & Imaging Institute, Icahn School of Medicine at Mount Sinai, New York NY 10029, USA
\num11 \addrDepartment of Radiology, Icahn School of Medicine at Mount Sinai, New York NY 10029, USA
\num12 \addrCenter for Diagnostic and Therapeutic Radiology, Department of Radiology, University Medical Center Freiburg, Hugstetter Straße 55, D-79106 Freiburg, Germany
\num13 \addrDivision of Medical and Biological Informatics, German Cancer Research Center, Heidelberg 69120, Germany
\num14 \addrPattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital, Heidelberg 69120, Germany
\num15 \addrGE Healthcare, Dublin, Ireland
\num16 \addrGE Healthcare, Milano, Italy
\num17 \addrGE Healthcare, Waukesha, USA
\num18 \addrShanghaiTech University, Shanghai 201210, China
\num19 \addrVanderbilt University Medical Center, Nashville TN 37203, USA
\num20 \addrDivision of Radiotherapy and Imaging, Institute of Cancer Research, London UK
\num21 \addrNational Tsing Hua University, Hsinchu 300044, Taiwan
\num22 \addrImpact Business Information Solutions, Inc, Princeton NJ 08540, USA
\num23 \addrMoffitt Cancer Center, Tampa FL 33612, USA
\num24 \addrUniversity of South Florida, Tampa FL 33620, USA
\num25 \addrRowan University, Glassboro NJ 08028, USA
\num26 \addrEllumen, Inc. Silver Spring, MD 20910
\num27 \addrDeloitte Consulting LLP, New York, NY, USA
\num28 \addrSage Bionetworks
\num* \addr(Corresponding Author) National Heart, Lung, and Blood Institute, National Institutes of Health (NIH), Bethesda, MD 20892, USA

Medical Image De-Identification Benchmark Challenge

\firstnameLinmin \surnamePei\aff1\orcid0000-0001-6135-9429    \nameGranger Sutton\aff2\orcid0000-0001-7498-8048    \nameMichael Rutherford\aff3\orcid0000-0003-2665-753X    \nameUlrike Wagner\aff1\orcid0000-0002-3230-5058    \nameTracy Nolan\aff3\orcid0000-0002-7023-7586    \nameKirk Smith\aff3\orcid0000-0002-8735-7576    \namePhillip Farmer\aff3\orcid0000-0003-1448-1346    \namePeter Gu\aff5    \nameAmbar Rana\aff5    \nameKailing Chen\aff5    \nameThomas Ferleman\aff5    \nameBrian Park\aff5    \nameYe Wu\aff5    \nameJordan Kojouharov\aff6    \nameGargi Singh\aff6    \nameJon Lemon\aff6    \nameTyler Willis\aff6    \nameMilos Vukadinovic\aff7, 8\orcid0000-0002-0431-1493    \nameGrant Duffy\aff8    \nameBryan He\aff9\orcid0000-0002-6150-761X    \nameDavid Ouyang\aff8\orcid0000-0002-3813-7518    \nameMarco Pereañez\aff10\orcid0000-0002-6447-4956    \nameDaniel Samber\aff10\orcid0000-0002-9241-2330    \nameDerek A. Smith\aff10\orcid0000-0002-8879-281X    \nameChristopher Cannistraci\aff10\orcid0009-0003-0900-8252    \nameZahi Fayad\aff10\orcid0000-0002-3439-7347    \nameDavid S. Mendelson\aff11\orcid0000-0002-1555-8002    \nameMichele Bufano\aff12\orcid0009-0000-5067-9814    \nameElmar Kotter\aff12\orcid0000-0001-9022-6000    \nameHamideh Haghiri\aff13\orcid0009-0006-7308-4235    \nameRajesh Baidya\aff13\orcid0009-0003-5235-1769    \nameStefan Dvoretskii\aff13\orcid0000-0001-7769-0167    \nameKlaus H. Maier-Hein\aff13, 14\orcid0000-0002-6626-2463    \nameMarco Nolden\aff13, 14\orcid0000-0001-9629-0564    \nameChristopher Ablett\aff15    \nameSilvia Siggillino\aff16    \nameSandeep Kaushik\aff17\orcid0000-0003-0654-0799    \nameHongzhu Jiang\aff18    \nameSihan Xie\aff18    \nameZhiyu Wan\aff18, 19\orcid0000-0003-3752-5778    \nameAlex Michie\aff20\orcid0000-0003-1685-6092    \nameSimon J Doran\aff20\orcid0000-0001-8569-9188    \nameAngeline Aurelia Waly\aff21    \nameFelix A. 
Nathaniel Liang\aff21    \nameHumam Arshad Mustagfirin\aff21    \nameMichelle Grace Felicia\aff21    \nameKuo Po Chih\aff21    \nameRahul Krish\aff22    \nameGhulam Rasool\aff23\orcid0000-0001-8551-0090    \nameNidhal Bouaynaya\aff25    \nameNikolas Koutsoubis\aff23, 24    \nameKyle Naddeo\aff25    \nameKartik Pandit\aff24    \nameTony O’Sullivan\aff22    \nameRaj Krish\aff22    \nameQinyan Pan\aff26\orcid0009-0006-8294-6167    \nameScott Gustafson\aff26    \nameBenjamin Kopchick\aff27\orcid0000-0003-1125-0155    \nameLaura Opsahl-Ong\aff27    \nameAndrea Olvera-Morales\aff27    \nameJonathan Pinney\aff27    \nameKathryn Johnson\aff27    \nameTheresa Do\aff27    \nameJuergen Klenk\aff27    \nameMaria Diaz\aff28    \nameArti Singh\aff28    \nameRong Chai\aff28\orcid0000-0003-3049-7241    \nameDavid A. Clunie\aff4\orcid0000-0002-2406-1145    \nameFred Prior\aff3\orcid0000-0002-6314-5683    \nameKeyvan Farahani\aff*\orcid0000-0003-2111-1896
Abstract

The de-identification (deID) of protected health information (PHI) and personally identifiable information (PII) is a fundamental requirement for sharing medical images, particularly through public repositories, to ensure compliance with patient privacy laws. In addition, preservation of non-PHI metadata to inform and enable downstream development of imaging artificial intelligence (AI) is an important consideration in biomedical research. In 2024, the National Cancer Institute (NCI), in collaboration with the Society for Medical Image Computing and Computer Assisted Intervention (MICCAI), and Sage Bionetworks, conducted the Medical Image de-identification Benchmark (MIDI-B) Challenge. The goal of MIDI-B was to provide a standardized platform for benchmarking of DICOM image deID tools based on a set of rules conformant to the HIPAA Safe Harbor regulation, the DICOM Attribute Confidentiality Profiles, and best practices in preservation of research-critical metadata, as defined by The Cancer Imaging Archive (TCIA). The challenge employed a large, diverse, multi-center, and multi-modality set of real de-identified radiology images with synthetic PHI/PII inserted.
The MIDI-B Challenge consisted of three phases: training, validation, and test. Eighty individuals registered for the challenge. In the training phase, we encouraged participants to tune their algorithms using their in-house or public data. The validation and test phases utilized the DICOM images containing synthetic identifiers (of 216 and 322 subjects, respectively). Ten teams successfully completed the test phase of the challenge. To measure success of a rule-based approach to image deID, scores were computed as the percentage of correct actions from the total number of required actions. The scores ranged from 97.91% to 99.93%. Participants employed a variety of open-source and proprietary tools with customized configurations, large language models, and optical character recognition (OCR). In this paper we provide a comprehensive report on the MIDI-B Challenge’s design, implementation, results, and lessons learned.

doi:
10.59275/j.melba.2024-AAAA
keywords:
Medical image de-identification, protected health information, personally identifiable information, HIPAA Safe Harbor regulation, multi-center, multi-modality, synthetic PHI, DICOM de-identification, optical character recognition (OCR).
volume: 2

1 Introduction


A fundamental requirement for sharing medical images, particularly through public image repositories that are openly accessible without restrictions, is the de-identification (deID) of protected health information (PHI) and personally identifiable information (PII) from the Digital Imaging and Communications in Medicine (DICOM) header and the pixel matrix of all images. Government regulations in most jurisdictions, including the United States and the European Union, prohibit the release of PHI and PII, but do not specify the mechanism used to adhere to the regulations (Moore et al., 2015). Two examples of government privacy regulations are the Privacy Rule of the Health Insurance Portability and Accountability Act (HIPAA) in the United States, and the General Data Protection Regulation (GDPR) in the European Union (Catelli and Esposito, 2023). By the early 2000s, healthcare and research organizations began standardizing deID practices aligned with privacy regulations. In the 2010s, semi-automated deID tools were developed as part of the move to electronic health records (Garfinkel et al., 2023). The growth in artificial intelligence and machine learning (AI/ML) technologies over the past several years has spurred interest in scalable automated solutions for image deID (Monteiro et al., 2017; Rempe et al., 2024; Aasen and Mathisen, 2021).

The DICOM standard, which is widely adopted for medical imaging, embeds patient-related information as metadata in the DICOM files. For some modalities, such as ultrasound, manufacturers also embed certain PHI/PII, such as patient name and date of birth, in the pixel matrix of the images (Bidgood Jr et al., 1997; Mildenberger et al., 2002). DICOM data elements used as attributes of information objects are identified by numeric tags (group number and element number), with even-numbered groups representing standard attributes and odd-numbered groups representing private attributes specific to manufacturers. In a DICOM medical image, many identifiers are stored as predefined attributes in the DICOM header, such as Patient’s Name (0010, 0010) and Patient’s Birth Date (0010, 0030). PHI/PII can be present not only in such structured DICOM attributes intended for the purpose, but also in free-text fields, such as Additional Patient History (0010, 21B0). Both the header and the pixel data of DICOM images must, therefore, be de-identified. Figure 1 shows a DICOM image before and after deID. The well-defined structure of fixed DICOM attributes allows semi-automatic deID to remove the corresponding PHI/PII with relative reliability, although verification is still required, using tools such as the Radiological Society of North America (RSNA) Clinical Trial Processor (CTP) (Freymann et al., 2012), DICOM Library (Macdonald et al., 2024), and the XNAT platform (Clunie et al., 2024). Recently, medical image deID technologies have evolved from strict rule-based systems to include hybrid approaches, such as deep-learning-based systems for optical character recognition in pixel images and large language models (LLMs) for detection of PHI/PII within free-text fields (Langlois et al., 2024; Kopchick et al., 2022).
The DICOM deID process must follow a set of deID standards, such as those in DICOM PS3.15 (National Electrical Manufacturers Association (2025), NEMA), that specify which PHI/PII fields to remove or modify, and must satisfy HIPAA deID requirements.
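The even/odd group convention described above can be illustrated with a minimal sketch. The helper names `parse_tag` and `is_private` are hypothetical, as is the private tag (0009, 1001) used in the example. The sketch relies only on the rule that odd group numbers denote private attributes.

```python
def parse_tag(tag: str) -> tuple[int, int]:
    """Parse a tag written as '(gggg, eeee)' into (group, element) integers."""
    group_hex, element_hex = tag.strip("() ").split(",")
    return int(group_hex.strip(), 16), int(element_hex.strip(), 16)


def is_private(tag: str) -> bool:
    """Odd group numbers denote private (manufacturer-specific) attributes."""
    group, _element = parse_tag(tag)
    return group % 2 == 1


# (0009, 1001) is a hypothetical private tag chosen only for illustration.
for tag in ("(0010, 0010)", "(0010, 21B0)", "(0009, 1001)"):
    kind = "private" if is_private(tag) else "standard"
    print(tag, kind)
```

A deID tool would typically apply different policies to the two classes, since private attributes are not rigidly defined and their content cannot be inferred from the tag alone.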

Figure 1: A DICOM image pre (top) and post (bottom) de-identification from the MIDI-B dataset. Left: the DICOM pixel image. Right: the DICOM metadata. In the bottom image, changes after de-identification are highlighted in red boxes.

However, challenges remain in DICOM header deID because some DICOM attributes allow free text that can contain PHI/PII, and DICOM private vendor data elements are not rigidly defined (Macdonald et al., 2024). This means that such free-text data elements need to be completely removed or replaced if they cannot be cleaned. Furthermore, the need to maximize data retention for downstream analysis, differences among image modalities, and variation in DICOM implementations make DICOM deID more difficult. Although some DICOM image deID techniques have been published (Clunie et al., 2025), most datasets used to train and validate these methods are privately owned and not publicly available. To the best of our knowledge, there is no publicly available large dataset containing both before and after deID data with which researchers and healthcare professionals can evaluate the deID algorithms they develop. In addition, there is no gold standard for handling free text in DICOM attributes during deID.

Therefore, we conducted the Medical Image DeID Benchmark (MIDI-B) challenge, to help guide the assessment and further development of medical image deID tools for patient privacy protection in shared data. In this challenge, we used images infused with synthetic PHI/PII that constitute the MIDI dataset, a small portion of which has previously been released (Rutherford et al., 2021b, a; National Cancer Institute (2025), NCI). To continue serving the research community, the Sage Bionetworks Synapse platform used for the MIDI-B Challenge was repurposed after the challenge as a benchmarking service open to the public for evaluating deID algorithms.

2 Methods

2.1 Data generation

Providing data for an image deID benchmark is challenging because actual PHI/PII cannot be publicly released, and gold standard deID answer keys for synthetic PHI/PII datasets are not available. Most deID algorithms have been developed in house using private datasets containing PHI/PII and evaluated through manual inspection by experts. This approach limits the development of these algorithms to those who have access to such datasets. Developers without these in-house resources can benefit from realistic DICOM datasets that contain synthetic facsimiles of PHI/PII rather than actual PHI/PII, together with an automated mechanism for evaluating their results. Through a separate project we created the MIDI Dataset, a large portion of which was used in the MIDI-B Challenge. The MIDI Dataset is a large and diverse set of clinical DICOM images from multiple modalities and submitting sites, sourced from original, already de-identified images in The Cancer Imaging Archive (TCIA) (also available in the National Cancer Institute Imaging Data Commons (IDC)), then infused with synthetic identifiers (PHI/PII). Imaging modalities represented in the MIDI Dataset include computed radiography (CR), magnetic resonance (MR), computed tomography (CT), positron emission tomography (PET), digital X-Ray (DX), structured report (SR), mammography (MG), and ultrasound (US). In the MIDI-B Challenge, most images are from the MR, CT, and PET modalities. Details of the design and construction of the MIDI Dataset and the ground truth answer keys are described elsewhere (Rutherford et al., 2021b). The portions of the MIDI dataset used for the MIDI-B challenge were partitioned into validation and test sets, which are outlined in Rutherford et al., included in this special issue.

2.2 Challenge design

The MIDI-B Challenge consisted of three phases: Training (3 months), Validation (1 month), and Test (1 week). The challenge was implemented on the Synapse platform (Sage Bionetworks). Participants were required to register for the challenge. All operations for the MIDI-B challenge, including access to validation and test data and submission of results, were handled through the Synapse platform. Participants were allowed to download the appropriate dataset for each phase of the challenge through the Synapse platform.

2.3 Training phase

The MIDI-B Challenge participants were expected to train their models on their own (in house) DICOM images. However, to familiarize participants with the dataset used for validation and testing, they were pointed to a small subset of the MIDI dataset (before and after deID) available through The Cancer Imaging Archive (TCIA) and Imaging Data Commons (IDC) (Rutherford et al., 2021b).

2.4 Validation phase

During the validation phase, DICOM images from 216 patients of the MIDI dataset were available to the participants as the validation dataset. The validation dataset contained a total of 23,921 instances, from 241 studies and 280 series. The distribution of the validation dataset modalities is shown in Table 1.

Table 1: Data distribution of the validation dataset by modality
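\begin{tabular}{lrrrrrrrrr}
\hline
 & CR & MR & CT & PET & DX & SR & MG & US & Total \\
\hline
Patients & 22 & 52 & 39 & 29 & 22 & 21 & 25 & 24 & 234* \\
Studies & 23 & 55 & 48 & 37 & 25 & 22 & 25 & 25 & 260* \\
Series & 24 & 60 & 49 & 48 & 27 & 22 & 25 & 25 & 280 \\
Instances & 26 & 3,511 & 4,406 & 15,724 & 34 & 22 & 31 & 167 & 23,921 \\
\hline
\end{tabular}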

*: The sum of patient/study numbers does not match the number of patients/studies, because a patient/study could have images from multiple modalities.

2.5 Test phase

To advance to the test phase, each participating team was required to submit a short manuscript describing their algorithm and results for publication in the MICCAI associated proceedings in the Springer Lecture Notes in Computer Science. The manuscripts were reviewed by members of the challenge organizing committee and comments were provided to the authors. During the test phase, DICOM images from 322 patients of the MIDI dataset were available to the participants as the test dataset. The test dataset contained a total of 29,660 instances, from 364 studies and 428 series. The test dataset has the same image modalities as the validation dataset. The distribution of the test dataset modalities is shown in Table 2.

Table 2: Data distribution of the test dataset by image modality
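\begin{tabular}{lrrrrrrrrr}
\hline
 & CR & MR & CT & PET & DX & SR & MG & US & Total \\
\hline
Patients & 33 & 79 & 60 & 44 & 32 & 31 & 37 & 36 & 352* \\
Studies & 38 & 85 & 74 & 62 & 34 & 34 & 37 & 36 & 400* \\
Series & 39 & 88 & 76 & 74 & 35 & 42 & 38 & 36 & 428 \\
Instances & 40 & 5,407 & 7,429 & 16,538 & 58 & 42 & 44 & 102 & 29,660 \\
\hline
\end{tabular}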

*: The sum of patient/study numbers does not match the number of patients/studies, because a patient/study could have images from multiple modalities.

2.6 Answer key

The answer keys for the MIDI dataset (divided into validation and test datasets) are each a database of the correct deID actions to be taken with respect to the value of various DICOM data elements. Each answer key is used to evaluate DICOM deID and contains information on the tag, tag_name, value, action category, action_text, answer_category, etc. The key includes eight actions to be performed on the DICOM file header and two actions to be performed on the pixel data. The header actions are <date_shifted>, <patid_consistent>, <tag_retained>, <text_notnull>, <text_removed>, <text_retained>, <uid_changed>, and <uid_consistent>. The pixel data actions are <pixels_hidden> and <pixels_retained>. The action definitions used in the MIDI-B Challenge are listed below in Table 3.

Table 3: Action definition used in the MIDI-B Challenge
\begin{tabular}{lp{7cm}l}
\hline
Action & Description & Score \\
\hline
\texttt{<date\_shifted>} & The date was shifted using a specified shift value. & 0: failed; 1: passed \\
\texttt{<patid\_consistent>} & This ensures patient IDs are consistent with the patient ID mapping file. & 0: failed; 1: passed \\
\texttt{<pixels\_hidden>} & The burned-in PHI/PII tokens within the coordinates specified in the answer key are hidden. & $score=\frac{\#_{removed\_tokens}}{\#_{total\_tokens}}$, $score\in[0,1]$ \\
\texttt{<pixels\_retained>} & This ensures pixels are unchanged. & 0: failed; 1: passed \\
\texttt{<tag\_retained>} & The data element with this tag is retained and present in the DICOM header. & 0: failed; 1: passed \\
\texttt{<text\_notnull>} & The value of the data element with this tag is not null or zero length value. & 0: failed; 1: passed \\
\texttt{<text\_removed>} & The PHI/PII tokens specified are removed from the data element value for this tag. & $score=\frac{\#_{removed\_tokens}}{\#_{total\_tokens}}$, $score\in[0,1]$ \\
\texttt{<text\_retained>} & The tokens (non-PHI/PII) specified are retained in the data element value for this tag. & $score=\frac{\#_{retained\_tokens}}{\#_{total\_tokens}}$, $score\in[0,1]$ \\
\texttt{<uid\_changed>} & Each UID is changed. & 0: failed; 1: passed \\
\texttt{<uid\_consistent>} & Ensures all UIDs are consistent with the UID mapping file. & 0: failed; 1: passed \\
\hline
\end{tabular}
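The token-based partial-credit scores in Table 3 can be sketched as follows. This is an illustration only and not the MIDI-B validation script. The function names, the example strings, and the use of simple substring matching to decide whether a token is present are all assumptions.

```python
def text_removed_score(value: str, phi_tokens: list[str]) -> float:
    """Fraction of PHI/PII tokens that no longer appear in the de-identified value."""
    if not phi_tokens:
        return 1.0
    removed = sum(1 for token in phi_tokens if token not in value)
    return removed / len(phi_tokens)


def text_retained_score(value: str, keep_tokens: list[str]) -> float:
    """Fraction of non-PHI/PII tokens that still appear in the de-identified value."""
    if not keep_tokens:
        return 1.0
    retained = sum(1 for token in keep_tokens if token in value)
    return retained / len(keep_tokens)


# Hypothetical de-identified value: "Jane" and the phone number were removed,
# but the surname "Doe" was left behind.
deidentified = "CT CHEST for Doe"
print(text_removed_score(deidentified, ["Jane", "Doe", "555-0100"]))
print(text_retained_score(deidentified, ["CT", "CHEST"]))
```

In this example two of the three PHI/PII tokens are gone, so the removal score is 2/3, while both research-relevant tokens survive, so the retention score is 1.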

2.7 Challenge submission

In the validation phase and test phase, participants were required to submit two mapping files, patient ID and Unique Identifier (UID), along with de-identified images to the submission portal. The mapping files allowed correspondence between before and after files to be established. The two mapping files are table lists in CSV format, indicating the linkage of the patient IDs or UIDs before and after deID. The MIDI-B validation script (Rutherford et al., 2021b) evaluated each submission using the appropriate answer key, mapping files, and DICOM images.
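As a rough illustration of how such a mapping file might be consumed, the sketch below loads a CSV into a dictionary and rejects mappings in which one original identifier is linked to two different new identifiers. The column names `id_old` and `id_new` and the function name `load_mapping` are hypothetical. The actual column layout required by the challenge is not specified here.

```python
import csv
import io


def load_mapping(csv_text: str, old_col: str = "id_old", new_col: str = "id_new") -> dict[str, str]:
    """Read an old-to-new identifier mapping and check that it is consistent."""
    mapping: dict[str, str] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        old, new = row[old_col], row[new_col]
        # setdefault stores the first new ID seen; a later, different one is an error.
        if mapping.setdefault(old, new) != new:
            raise ValueError(f"{old!r} maps to more than one new identifier")
    return mapping


# Hypothetical patient ID mapping file content.
example = "id_old,id_new\nP001,X9\nP002,X3\n"
print(load_mapping(example))
```

The same function would serve for a UID mapping file, since both files express a one-to-one correspondence between identifiers before and after deID.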

2.8 Evaluation metric

We evaluated the deID results against various externally defined requirements, including the HIPAA Privacy Rule’s “Safe Harbor” method (18 elements) (U.S. Department of Health and Human Services, 2025), the DICOM PS3.15 Basic Confidentiality Profile (National Electrical Manufacturers Association (2025), NEMA), and TCIA’s best practices (The Cancer Imaging Archive, 2025), as proxies for the real-world risk of re-identification, which is indeterminable.

For each action type, we list the description and the scoring in Table 3. The instance-based accuracy was computed as the percentage of correctly executed actions across all instances, including partial credit as shown in Table 3:

\[
accuracy=\frac{sum\ of\ scores\ for\ all\ actions}{number\ of\ total\ actions}\times 100\%.
\]

Two measurement strategies were used to assess the performance of the submissions: series-based and instance-based. The series-based method calculates the action accuracy as a percentage at the series level, while the instance-based method calculates it at the instance level. For the series-based method, an error for an action in any instance of the series counts as a complete error for that action in the series (unlike at the instance level, no partial credit was given). Series-based scoring was used for final rankings in MIDI-B to normalize across modalities with very different numbers of instances per series (such as multiple slices in one series that share similar characteristics). The assumption was that the DICOM data elements and burned-in pixel text would be the same across different instances of the same series and hence have the same PHI/PII actions. This assumption was only partially true. We did not expect series-based and instance-based scoring to produce exactly the same rankings, only similar ones, which turned out to be the case. To be more informative to participants during the test phase, we also provided instance-based results.
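The difference between the two strategies can be sketched as follows. This is a simplified illustration, not the MIDI-B validation script. The record layout `(series, instance, tag, action, score)` and the grouping of series-level actions by `(series, tag, action)` are assumptions.

```python
from collections import defaultdict


def instance_accuracy(records):
    """Mean of per-instance action scores (partial credit allowed), in percent."""
    return 100.0 * sum(score for *_rest, score in records) / len(records)


def series_accuracy(records):
    """An action counts as correct for a series only if every instance scored 1."""
    all_correct = defaultdict(lambda: True)
    for series, _instance, tag, action, score in records:
        key = (series, tag, action)
        all_correct[key] = all_correct[key] and score == 1.0
    return 100.0 * sum(all_correct.values()) / len(all_correct)


# Hypothetical scores for one series with two instances and two actions.
records = [
    ("S1", "I1", "(0008,103E)", "text_removed", 1.0),
    ("S1", "I2", "(0008,103E)", "text_removed", 0.5),  # partial credit
    ("S1", "I1", "(0008,0020)", "date_shifted", 1.0),
    ("S1", "I2", "(0008,0020)", "date_shifted", 1.0),
]
print(instance_accuracy(records))
print(series_accuracy(records))
```

Here the single partially correct instance lowers the instance-based accuracy to 87.5%, but it turns the whole `text_removed` action for the series into an error, so the series-based accuracy is 50%.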

Our evaluation method generated detailed Scoring and Discrepancy Reports, as described in the Appendix (A.1.1, A.1.2, A.1.3, and A.2). The Scoring Report consisted of three sheets: Scoring, Actions, and Categories. The Scoring sheet provided the number of correctly executed actions against the total number of required actions, which was used in computing the accuracy of each algorithm (Table A1). The Actions sheet summarized the number of errors, correct actions, and totals for each action (Table A2). The Categories sheet enumerated the number of errors, correct actions, and totals for each type and subcategory (Table A3). The Discrepancy Report, focused on pixel deID, listed the differences between the de-identified images and the corresponding Answer key (Table B1).

3 Results

A total of 80 individuals registered for the MIDI-B challenge and were allowed to participate in the validation and test phases.

3.1 Validation phase

A total of 108 attempts from 17 individuals were submitted to the Validation Submission Portal. The performance for all submissions in the validation phase is shown in Figure 2. The series-based performance across all submissions was $96.49\%\pm 4.78\%$.

Figure 2: The boxplot of the series-based performances for all submissions in the validation phase.

3.2 Test phase

Each team was allowed to submit only once during the test phase. The series-based performance for all teams is shown in Table 4, and the series-based performance across all teams was $99.53\%\pm 0.60\%$.

Table 4: Performances using the series-/instance- methods in the test phase
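\begin{tabular}{lrrrrrrrrrr}
\hline
Method & T-01 & T-02 & T-03 & T-04 & T-05 & T-06 & T-07 & T-08 & T-09 & T-10 \\
\hline
Series-based* & 0.9987 & 0.9993 & 0.9908 & 0.9955 & 0.9991 & 0.9992 & 0.9791 & 0.9968 & 0.9988 & 0.9958 \\
Instance-based & 0.9973 & 0.9983 & 0.9894 & 0.9906 & 0.9989 & 0.9990 & 0.9617 & 0.9919 & 0.9988 & 0.9836 \\
\hline
\end{tabular}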

*: Used for the test phase leaderboard in the MIDI-B Challenge.

The series-based method captured the unique action errors for each series but did not evaluate the performance at the instance level. Therefore, we also computed the accuracy using the instance-based method, as shown in Table 4. The instance-based performance across all teams was $99.09\%\pm 1.15\%$. The rankings based on the instance level are not exactly the same as the rankings based on the series level. Additionally, we performed an Analysis of Variance (ANOVA) to assess significant differences between the performances of the series-based and instance-based methods. The ANOVA test revealed a significant difference with a $p$-value of $0.04458$. As mentioned earlier, the final rankings for the validation and test phases of the MIDI-B Challenge were based on the series-based method.
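The summary statistics can be recomputed from the values in Table 4, as in the sketch below. The text does not state which ANOVA design was used. The sketch assumes a repeated-measures design in which each team contributes one score per method. With two conditions, the F statistic of that design equals the squared paired t statistic, with 1 and n-1 degrees of freedom.

```python
from statistics import mean

# Per-team accuracies copied from Table 4 (T-01 ... T-10).
series = [0.9987, 0.9993, 0.9908, 0.9955, 0.9991, 0.9992, 0.9791, 0.9968, 0.9988, 0.9958]
instance = [0.9973, 0.9983, 0.9894, 0.9906, 0.9989, 0.9990, 0.9617, 0.9919, 0.9988, 0.9836]

# Paired differences between the two scoring methods for each team.
diffs = [s - i for s, i in zip(series, instance)]
n = len(diffs)
mean_diff = mean(diffs)
var_diff = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)  # sample variance

# Assumed design: repeated measures with two conditions, so F = t_paired ** 2.
f_stat = n * mean_diff ** 2 / var_diff

print(f"mean series-based accuracy:   {mean(series):.5f}")
print(f"mean instance-based accuracy: {mean(instance):.5f}")
print(f"F statistic with (1, {n - 1}) degrees of freedom: {f_stat:.3f}")
```

A p-value would then be read from the F distribution with the stated degrees of freedom, for example with a statistics package.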

We summarize the detailed action errors based on the series-level scoring report for all teams in Table 5. The bold numbers indicate the best performance across all teams for each action type. Overall, most teams performed well in actions related to <date_shifted>, <patid_consistent>, <pixels_hidden>, <uid_changed>, and <uid_consistent>, but not all did well for other actions. Notably, no team achieved the best performance across all action categories. In Appendix A.3, we also provide the instance-based results in Table C1. Errors per type for both series-based and instance-based methods are summarized in Table C2 and Table C3.

Table 5: The summary of action errors for all teams in series-based report during the test phase
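\begin{tabular}{rl*{10}{r}r}
\hline
 & Action & T-01 & T-02 & T-03 & T-04 & T-05 & T-06 & T-07 & T-08 & T-09 & T-10 & Total Actions* \\
\hline
1 & \texttt{<date\_shifted>} & \textbf{1} & 3 & 3 & 16 & \textbf{1} & 2 & 18 & 3 & 2 & 2 & 2,306 \\
2 & \texttt{<patid\_consistent>} & 93 & \textbf{0} & 35 & 14 & \textbf{0} & \textbf{0} & \textbf{0} & \textbf{0} & \textbf{0} & \textbf{0} & 429 \\
3 & \texttt{<pixels\_hidden>} & \textbf{0} & 11 & 12 & 1 & \textbf{0} & \textbf{0} & 8 & \textbf{0} & 1 & 3 & 15 \\
4 & \texttt{<pixels\_retained>} & 32 & 7 & 259 & \textbf{0} & 34 & \textbf{0} & 864 & 69 & \textbf{0} & \textbf{0} & 29,471 \\
5 & \texttt{<tag\_retained>} & 10 & \textbf{0} & 1,069 & 545 & \textbf{0} & 8 & 187 & 89 & 19 & 89 & 121,690 \\
6 & \texttt{<text\_notnull>} & 74 & 74 & 638 & 340 & 68 & \textbf{12} & 71 & 74 & 106 & 71 & 85,323 \\
7 & \texttt{<text\_removed>} & 323 & 142 & 420 & 386 & \textbf{103} & 326 & 245 & 1,338 & 341 & 421 & 5,816 \\
8 & \texttt{<text\_retained>} & 208 & 196 & 2,526 & 1,131 & 310 & \textbf{131} & 3,587 & 310 & 201 & 1,863 & 254,949 \\
9 & \texttt{<uid\_changed>} & 1 & \textbf{0} & 128 & 2 & 1 & 4 & 116 & \textbf{0} & \textbf{0} & \textbf{0} & 40,633 \\
10 & \texttt{<uid\_consistent>} & 1 & \textbf{0} & 268 & 203 & 1 & 4 & 7,019 & \textbf{0} & \textbf{0} & \textbf{0} & 40,633 \\
\hline
 & Total & 743 & 433 & 5,358 & 2,638 & 518 & 487 & 12,115 & 1,883 & 670 & 2,449 & 581,265 \\
\hline
\end{tabular}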

The best performance in each action type across the ten teams is in bold.

*: The number in the last column is the total number for each action type in the answer key.

4 Discussion

In this section, we describe the general performance of teams with respect to the action errors in terms of the meta-data and pixel image deID, and share some lessons learned from the MIDI-B challenge.

Several participants noted that the datasets provided by MIDI-B are particularly valuable for DICOM de-identification. However, generating high-quality datasets presents several challenges. The most obvious one is creating realistic datasets that do not contain actual PHI/PII. One participant shared their approach to generating their own training dataset. In addition to creating the synthetic dataset, to evaluate performance an answer key is needed. There are many options for how to de-identify a dataset, which complicates the answer key generation and its verification. Furthermore, writing an algorithm to score a de-identified DICOM image against the answer key is not trivial. Many participants designed algorithms based on original in-house DICOM images with PHI/PII and validated them through manual curation.

One potential solution is to have the algorithms flag the PHI/PII for human review, with the review feedback incorporated into the continuous training. This approach works better for false positives, but false negatives may not be flagged and could be missed during review. One participant suggested assigning likelihood measures instead of using a strict threshold, so that false/true negatives near the threshold could be flagged and reviewed.
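The likelihood-based suggestion could look roughly like the sketch below, in which candidates with a likelihood between two cut-offs are routed to human review instead of being decided automatically. The function name `triage`, the cut-off values, and the example candidates are hypothetical and do not describe any participant's system.

```python
def triage(candidates, remove_above=0.9, keep_below=0.1):
    """Split (text, likelihood) pairs into automatic and human-review decisions."""
    decisions = {"remove": [], "review": [], "keep": []}
    for text, likelihood in candidates:
        if likelihood >= remove_above:
            decisions["remove"].append(text)   # confidently PHI/PII
        elif likelihood <= keep_below:
            decisions["keep"].append(text)     # confidently not PHI/PII
        else:
            decisions["review"].append(text)   # uncertain: send to a human
    return decisions


# Hypothetical model outputs: likelihood that each token is PHI/PII.
candidates = [("Jane Doe", 0.98), ("MASS", 0.03), ("Dr. Ward", 0.55)]
print(triage(candidates))
```

Widening the review band sends more borderline negatives to a reviewer, at the cost of a larger manual workload.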

4.1 Meta-data deID

No team performed perfectly or scored the best across all action types. There was more consistent performance in some action types than others and this was mostly true for action types on which most teams scored well. We provide our analysis of action-type-specific issues below.

The three action types on which most submitters performed perfectly or nearly perfectly were <patid_consistent>, <uid_changed>, and <uid_consistent>, which measure very similar PHI/PII deID steps. Unique identifiers (UIDs) that are PHI/PII need to be replaced consistently (the same old UID always mapped to the same new UID) throughout the set of DICOM files in order to preserve referential integrity. This should be a fairly straightforward coding technique, usually involving hashing or table lookups. Any errors should be simple coding errors or algorithmic shortcomings.
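A hashing-based replacement might be sketched as below. This is one possible approach and not a description of any team's implementation. The function name `remap_uid`, the salt, and the choice to place a 128-bit hash value under the `2.25` UID root are assumptions.

```python
import hashlib


def remap_uid(old_uid: str, salt: str = "project-secret") -> str:
    """Deterministically derive a replacement UID from the original UID."""
    digest = hashlib.sha256((salt + old_uid).encode("utf-8")).digest()
    # 128 bits give at most 39 decimal digits, so the result stays within
    # the 64-character limit for DICOM UIDs.
    return "2.25." + str(int.from_bytes(digest[:16], "big"))


# Hypothetical original UIDs: the same input always yields the same output,
# so references between files remain intact after replacement.
study_uid = "1.2.840.99999.1.1"
print(remap_uid(study_uid))
print(remap_uid(study_uid) == remap_uid(study_uid))
```

A table lookup achieves the same consistency by storing each old-to-new pair the first time it is generated. The hash variant avoids keeping state but requires the salt to stay secret.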

The <date_shifted> action also requires a straightforward algorithmic substitution, in which all dates in DICOM date data elements for a given patient are shifted by the same amount to retain the temporal relationship between different imaging studies of that patient. The shift amount is arbitrary and should differ between patients to reduce the risk of reverse engineering it. The validation script used for the MIDI-B challenge checked that dates were changed, but not whether they were shifted consistently for the same patient or differently for different patients. No team scored perfectly for this action type. We see two possible reasons for this: the date fields in the fixed DICOM data elements are not consistently formatted, and some of the dates checked by the validation script contain free-text data.
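A per-patient date shift of this kind can be sketched as follows. This is a minimal illustration that assumes well-formed DICOM DA values (YYYYMMDD). The function names, the secret, and the maximum shift are hypothetical. Because the shift is derived from a hash, two patients can occasionally receive the same shift.

```python
import hashlib
from datetime import datetime, timedelta


def patient_shift_days(patient_id: str, secret: str = "project-secret", max_days: int = 365) -> int:
    """Derive a fixed shift between 1 and max_days days from the patient ID."""
    digest = hashlib.sha256((secret + patient_id).encode("utf-8")).digest()
    return 1 + int.from_bytes(digest[:4], "big") % max_days


def shift_da(value: str, days: int) -> str:
    """Shift a DICOM DA value (YYYYMMDD) back in time by the given number of days."""
    shifted = datetime.strptime(value, "%Y%m%d") - timedelta(days=days)
    return shifted.strftime("%Y%m%d")


# Hypothetical patient with two studies 30 days apart.
days = patient_shift_days("P001")
print(shift_da("20240115", days), shift_da("20240214", days))
```

Because both dates move by the same number of days, the 30-day interval between the two studies is preserved. Free-text or malformed date values would raise an error in `strptime` and need separate handling.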

For the <tag_retained> action only two teams scored perfectly, and no teams scored perfectly for the <text_notnull> action. These action types are measuring DICOM standard compliance rather than PHI/PII removal. A DICOM file needs to remain DICOM compliant to be maximally useful for downstream processing/analysis. The validation script is therefore penalizing each noncompliant data element. Unfortunately, the MIDI dataset used as input for the MIDI-B challenge is not itself perfectly DICOM compliant because of the imperfect original data or synthetic data generation. So, while the intent was to penalize only PHI/PII removal that was handled improperly resulting in DICOM noncompliant data elements (i.e., made no worse by deID), the validation script actually penalized all noncompliance, including pre-existing issues. The submitters were not explicitly tasked with making noncompliant DICOM files compliant and many did not. All submitters were essentially penalized the same based on this factor.

The most challenging action types were <text_removed> and <text_retained>. They measure PHI/PII removal from free text DICOM data elements. The <text_removed> action measures whether PHI/PII that should be removed is completely removed. The <text_retained> action measures how much text that is not PHI/PII is removed by mistake (i.e., text whose removal is undesirable because it reduces research utility). There is often an algorithmic tradeoff: labeling text as PHI/PII more aggressively reduces false negatives at the expense of increasing false positives. Team 02 had the fewest combined errors for these two action categories and the best balance between the two action type errors. Most errors in the <text_removed> and <text_retained> actions were due to text being wrongly retained or removed, either entirely or partially. For example, in some submissions, the text in Series Description (0008, 103E) was still present even though the <text_removed> action required its removal. In another case, the Study Description (0008, 1030) text “<BREAST^ROUTINE for MASS for 311-25-3722>” was incorrectly trimmed to “<BREAST^ROUTINE for 311-25-3722>”, which counted as a <text_retained> error. In these cases, the improper operations resulted in only a partial score for the actions.

Only two teams successfully completed the <tag_retained> action. For example, the value of Acquisition Number (0020, 0012) was deleted by some teams.

For the <text_notnull> action, none of the submissions were entirely successful for all DICOM images. For example, there was no value for Image Type (0008, 0008) in some submissions. The existence of free text made the action more challenging.

4.2 Pixel deID

Optical character recognition (OCR) techniques and tools were widely used by MIDI-B participants to de-identify the burned-in PHI/PII from the pixel images. For the <pixels_hidden> action, four teams performed well. The errors occurred due to false positives or false negatives in the deID process of burned-in text. However, it is difficult to draw firm conclusions about performance because only a very small number of pixel images with burned-in PHI/PII were present in the MIDI dataset and evaluated. For the <pixels_retained> action, half of the submissions failed to perform correctly. While nearly all teams achieved $98+\%$ accuracy in the total score, there are still many errors in pixel image deID, including false positives (FP) and false negatives (FN). Figure 3 demonstrates several instances of improper pixel image deID. The top-left image shows the original image. The top-right image depicts the post-deID image, which includes false negatives (the PHI/PII information remains in the top red box) and false positives (some white blocks appear in the bottom red box). The bottom-left image shows a different post-deID of the same image with false negatives, where the date of birth still appears in the red box. Finally, the bottom-right image demonstrates over-removal because of the false positives, where the anatomical structure image has been altered beyond the removal of PHI/PII information.
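The redaction step that follows OCR can be sketched as below. The sketch assumes that an OCR engine has already returned bounding boxes and that a separate step has selected the boxes containing PHI/PII; the `Box` layout and the function `redact_boxes` are hypothetical, and a plain nested list stands in for the DICOM pixel array.

```python
from typing import List, Tuple

# (row_start, col_start, row_end, col_end), with the end indices exclusive
Box = Tuple[int, int, int, int]


def redact_boxes(pixels: List[List[int]], boxes: List[Box], fill: int = 0) -> List[List[int]]:
    """Return a copy of a 2D pixel grid with each box overwritten by `fill`.

    Boxes are clipped to the image bounds, and the input grid is left
    unmodified so the original can still be compared against the result.
    """
    out = [row[:] for row in pixels]
    n_rows = len(out)
    n_cols = len(out[0]) if out else 0
    for r0, c0, r1, c1 in boxes:
        for r in range(max(r0, 0), min(r1, n_rows)):
            for c in range(max(c0, 0), min(c1, n_cols)):
                out[r][c] = fill
    return out
```

In these terms, a false negative is a PHI/PII region that never reaches `boxes`, and a false positive is a box that covers non-PHI text or anatomy, which is the over-removal case.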

Figure 3: Failed pixel image de-identification of a case from the MIDI-B Challenge submissions. Top-left: pre-de-identified pixel image. Top-right: de-identification with false negatives and false positives. Bottom-left: de-identification with false negatives. Bottom-right: over-removal.

Improper pixel image deID may be partial, such as the retention of partial PHI/PII information or the removal of non-PHI/PII information. In the top row of Figure 4, the deID process result contains false negatives. For example, only the last name is de-identified in the example, while the first name remains. In the bottom row of Figure 4, the text “SEMI-UPRIGHT” at the position of the bottom red box has been misrecognized as PHI/PII information and then incorrectly redacted.

Figure 4: Two examples of improper pixel image de-identification. Top-row: only the first name is de-identified. Bottom-row: non-PII information is removed.

As mentioned, a limitation of the current MIDI-B dataset is the small number of images with PHI/PII data burned into the pixels. This limitation arose partly from the effort required to construct such images and partly from the effort needed to evaluate them for deID. As a result, the scoring greatly underweighted the action of <pixels_hidden>.

4.3 Lessons learned

The MIDI-B challenge was the first of its kind, and as such, during the course of the challenge and afterward, we learned many lessons and noted areas that could be improved in the future.

The first lesson was how best to evaluate the action accuracy. We decided to calculate accuracy by dividing the total number of correct actions by the total number of actions, treating all action types equally. This was problematic in two ways. It weights all types equally even if some types of actions/errors are more impactful for either PHI/PII retention or utility preservation. Deciding on a weighting scheme for each type is neither obvious nor impartial, so we chose not to assign any arbitrary scaling.

The second lesson was the impact of the implicit weighting that occurs based on the number of occurrences of each action type in the dataset. Assuming that each action type is important, then performance for that action type should have an equivalent impact on the final score. Our scoring approach assigned more weight to types which have more actions. This was partially addressed by providing detailed scoring breakdowns by type to the participants. A simple correction for the final score is to normalize the score by actions per type to give equal weight to each type. The scoring function would then be $Accuracy=\frac{1}{N}\sum_{i=1}^{N}\frac{C_{i}}{S_{i}}$, where $N$ is the number of action types, and $C_{i}$ and $S_{i}$ are the sum of scores for all actions and the total number of actions of the $i^{th}$ action type, respectively, as shown in Table 6. For all teams the normalized series-based scores are lower than the unnormalized series-based scores used in the MIDI-B Challenge. This implies that all teams did better on the action types which were more prevalent than on the less prevalent action types. Alternatively, the accuracy formula could incorporate a scaling of the perceived importance of each type.
With this approach, after normalizing by number of actions in a type, each type could also be assigned an arbitrary weight depending on its perceived importance, with more critical types receiving higher weights. The resulting scoring function becomes $Accuracy=\sum_{i=1}^{N}\left(\frac{C_{i}}{S_{i}}\times\omega_{i}\right)$, where $C_{i}$, $S_{i}$, and $\omega_{i}$ are the sum of scores, the total number of actions, and the corresponding weight of the $i^{th}$ action type, respectively.
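The three scoring variants discussed above can be compared with a short sketch. The function names and the toy counts are illustrative assumptions and are not challenge data; the weighted variant additionally assumes that the weights sum to 1 so that the score stays in [0, 1], which the text does not state.

```python
from typing import Mapping, Tuple

# action type -> (C_i, S_i): sum of scores and number of actions of that type
Counts = Mapping[str, Tuple[float, int]]


def pooled_accuracy(counts: Counts) -> float:
    """Unnormalized score: all correct actions over all actions."""
    total_score = sum(c for c, _ in counts.values())
    total_actions = sum(s for _, s in counts.values())
    return total_score / total_actions


def normalized_accuracy(counts: Counts) -> float:
    """Equal weight per action type: (1/N) * sum_i C_i / S_i."""
    return sum(c / s for c, s in counts.values()) / len(counts)


def weighted_accuracy(counts: Counts, weights: Mapping[str, float]) -> float:
    """Importance-weighted score: sum_i (C_i / S_i) * w_i.

    Assumes the weights sum to 1 (an assumption of this sketch).
    """
    return sum(weights[name] * c / s for name, (c, s) in counts.items())


# Toy numbers only: one frequent type done well, one rare type done poorly.
toy = {"frequent_type": (990.0, 1000), "rare_type": (5.0, 10)}
```

With these toy counts the pooled score is dominated by the frequent type, while the normalized score exposes the weak performance on the rare type, mirroring the drop between the two rows of Table 6.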

Table 6: Performance comparison using different scoring methods for the series-based results in the test phase.
Score T-01 T-02 T-03 T-04 T-05 T-06 T-07 T-08 T-09 T-10
Overall score 99.87% 99.93% 99.08% 99.55% 99.91% 99.92% 97.91% 99.92% 99.88% 99.58%
Normalized score 97.24% 92.39% 90% 98.09% 99.79% 99.42% 91.95% 99.42% 98.72% 97.18%

*: The best performance across the ten teams is in bold.

Another lesson learned pertained to the selection of the measurement method. As discussed in the section on measurement metrics, we chose the series-based method for the challenge. This method highlights the uniqueness of errors at the series level, allowing us to assess performance by focusing on overarching patterns and trends across a series of actions. While this method has its advantages, it also has limitations. Specifically, it may not effectively capture all errors that occur at the instance level. Errors that are isolated or specific to individual instances could be overlooked or under-weighted, potentially leading to an incomplete understanding of the system’s overall accuracy. This trade-off underscores the importance of carefully considering the goals and scope of the evaluation when selecting a measurement method.

5 Summary of participants’ methods in the test phase

The ten teams that completed the MIDI-B Challenge employed similar approaches in their implementations while also introducing their own innovations. Generally, the teams divided the work into two parts: text processing of the DICOM header data elements, and character recognition followed by text processing of the pixel image. Most teams primarily used regular expressions and keywords to identify PHI/PII in free text data elements rather than a machine learning approach. They also used various open-source character recognition packages. All participants claimed to adhere to the DICOM PS3.15 confidentiality standard that pertains to data elements known to potentially contain PHI/PII. Most teams also followed the TCIA’s best practices guidelines. The participants identified handling private data elements as the most significant challenge. In this regard, the TCIA’s private tag dictionary (The Cancer Imaging Archive, 2025) was especially helpful.

Many participants needed to fine-tune their algorithms based on the MIDI-B Challenge datasets and scoring criteria. The MIDI dataset contained many private data elements that were unfamiliar to most participants and the scoring criteria required keeping most of the information in these private data elements. DICOM PS3.15 provides multiple options for handling PHI/PII, ranging from conservative approaches, in which most free text data elements that might contain PHI/PII are removed or replaced, to less conservative or more sophisticated approaches, where free text data elements are retained to address specific use cases and may be redacted by searching for specific types of PHI/PII.

The primary method used for PHI/PII detection within the free text data elements was regular expressions, supplemented with contextual keywords to indicate PHI/PII such as names and addresses. Most participants joined the challenge with a predetermined set of regular expressions but found it necessary to tune the regular expressions based on the validation dataset. Several participants noted that their regular expressions had been built up over years of experience at their own institutions with their own data. The need to tune these regular expressions again based on the validation dataset suggested that they may not generalize well to other datasets outside of the challenge. One obvious example of this limitation is the language used with the validation and test datasets, which was English. One participant did show a training method that could be used for other languages and had been demonstrated for their method.
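To make the regular-expression approach with contextual keywords concrete, the following is a minimal sketch. Both patterns and the `redact` helper are hypothetical and deliberately narrow; they are not taken from any participant's rule set, and real rule sets are far larger.

```python
import re

# Identifier with a 3-2-4 digit layout, as in the Study Description example.
ID_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Contextual keyword: a capitalized word (or two) is treated as a name only
# when it directly follows a cue such as "Patient" or "Name".
NAME_AFTER_CUE = re.compile(
    r"\b(?P<cue>(?i:patient|name)\s*[:=]?\s*)"
    r"(?P<name>[A-Z][a-z]+(?:\s[A-Z][a-z]+)?)"
)


def redact(text: str, replacement: str = "[REDACTED]") -> str:
    """Replace matched PHI/PII spans and leave all other text untouched."""
    text = ID_PATTERN.sub(replacement, text)
    # Keep the cue word itself; only the name that follows it is replaced.
    return NAME_AFTER_CUE.sub(lambda m: m.group("cue") + replacement, text)
```

Applied to the text "BREAST^ROUTINE for MASS for 311-25-3722", this sketch replaces only the identifier and retains "for MASS". The name pattern also illustrates the generalization problem noted above, since it assumes English cue words and Latin capitalization.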

Two participants used an LLM trained or tuned for medical texts but not specifically on MIDI-B datasets as an additional method beyond regular expressions to detect PHI/PII. This approach seems promising, as a much broader set of training datasets is available for these LLMs to identify PHI/PII in free text. Improved LLM models would still need some form of regression testing when updating deID software. Both implementations combined the LLMs with regular expression-based methods in a hybrid fashion.

6 Conclusion

The increasing demand for sharing of imaging data from clinical sources requires the availability of extremely robust tools for accurate, scalable and automated de-identification of PHI/PII from DICOM image files. The MIDI-B Challenge provided a platform for developers to evaluate the performance of fully or semi-automated DICOM deID tools for removing PHI/PII while preserving the research utility of data, using images with synthetic PHI/PII and a de-identification answer key. The challenge was well received by the community, with many international teams from academia and industry participating in the activity. Participants utilized a wide range of methods, including open-source tools, proprietary tools, customized configurations, large language models, and OCR methods for deID.

Ten teams advanced to the Test phase and were placed on the leaderboard. Being the first of its kind, MIDI-B aimed to address an important requirement for sharing of medical imaging data by standardizing an approach based on real clinical images with synthetic PHI/PII. MIDI-B also provided an opportunity to promote automated approaches to image de-identification and to emphasize the scalability and traceability of the process. We plan to publicly share the MIDI dataset that was used in this challenge as a resource to the community. Evaluation of benchmarks for image de-identification is a complex task, dependent on many factors, including utilization of realistic but synthetic data, contextual sensitivity, and limited availability of definitive guidelines related to private tags. We believe that our approach and the lessons learned will be helpful in future developments of benchmarks for medical image de-identification.

\acks

We gratefully acknowledge Sage Bionetworks for providing the infrastructure and computing resources. We also appreciate the participation of everyone involved in the MIDI-B Challenge.

This work was partially funded by NCI grant U24CA248265 and by Leidos Biomedical Research, Inc., under Prime Contract 75N91019D00024 (Task Orders 75N91020F00003 and 75N91024F00011), issued by the National Institutes of Health/National Cancer Institute. Additional support was provided in part by the computational and data infrastructure, along with staff expertise, from the Mount Sinai Imaging Research Warehouse (MSIRW), the Department of Diagnostic, Molecular and Interventional Radiology and the Biomedical Engineering and Imaging Institute and Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai. This work was also supported by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences.

We further acknowledge support from the Helmholtz Metadata Collaboration (HMC) Hub Health and the RACOON project within the “NUM 2.0” initiative (FKZ: 01KX2121), funded by the German Federal Ministry of Education and Research (BMBF).

\ethics

The work follows appropriate ethical standards in conducting research and writing the manuscript, following all applicable laws and regulations regarding treatment of animals or human subjects.

\coi

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

\data

The data and answer key of the MIDI-B Challenge are publicly available at TCIA at https://doi.org/10.7937/cf2p-aw56. In addition, the validation script is available at MIDI-B_Validation_script on GitHub.

References

  • Aasen and Mathisen (2021) Malik Aasen and Fredrik Fidjestøl Mathisen. De-identification of medical images using object-detection models, generative adversarial networks and perceptual loss. Master’s thesis, The University of Bergen, 2021.
  • Bidgood Jr et al. (1997) W Dean Bidgood Jr, Steven C Horii, Fred W Prior, and Donald E Van Syckle. Understanding and using dicom, the data interchange standard for biomedical imaging. Journal of the American Medical Informatics Association, 4(3):199–212, 1997.
  • Catelli and Esposito (2023) Rosario Catelli and Massimo Esposito. De-identification techniques to preserve privacy in medical records. In Artificial Intelligence in Healthcare and COVID-19, pages 125–148. Elsevier, 2023.
  • Clunie et al. (2024) David Clunie, Fred Prior, Michael Rutherford, Stephen Moore, William Parker, Haridimos Kondylakis, Christian Ludwigs, Juergen Klenk, Bob Lou, Lawrence O’Sullivan, et al. Summary of the national cancer institute 2023 virtual workshop on medical image de-identification—part 1: Report of the midi task group-best practices and recommendations, tools for conventional approaches to de-identification, international approaches to de-identification, and industry panel on image de-identification. Journal of Imaging Informatics in Medicine, pages 1–15, 2024.
  • Clunie et al. (2025) David A. Clunie, Adam Flanders, Andrew Taylor, Bradley Erickson, Brian Bialecki, David Brundage, David Gutman, Fred Prior, James A. Seibert, Jason Perry, Judy Wawira Gichoya, Justin Kirby, Katherine Andriole, Lori Geneslaw, Susan Moore, Thomas J. Fitzgerald, William Tellis, Ying Xiao, and Keyvan Farahani. Report of the medical image de-identification (midi) task group – best practices and recommendations. arXiv preprint arXiv:2303.10473v3, March 2025. URL https://arxiv.org/abs/2303.10473v3. PMID: 37033463; PMCID: PMC10081345.
  • Freymann et al. (2012) John B Freymann, Justin S Kirby, John H Perry, David A Clunie, and C Carl Jaffe. Image data sharing for biomedical research—meeting hipaa requirements for de-identification. Journal of digital imaging, 25(1):14–24, 2012.
  • Garfinkel et al. (2023) Simson Garfinkel, Joseph Near, Aref Dajani, Phyllis Singer, and Barbara Guttman. De-identifying government datasets: Techniques and governance. US Department of Commerce, National Institute of Standards and Technology, 2023.
  • Kopchick et al. (2022) B Kopchick, J Klenk, T Carlson, M Kumpatla, S Klimov, D Mikdadi, Q Pan, S Gustafson, J Kaltman, U Wagner, et al. Medical image de-identification using cloud services. In Medical Imaging 2022: Imaging Informatics for Healthcare, Research, and Applications, volume 12037, pages 122–128. SPIE, 2022.
  • Langlois et al. (2024) Quentin Langlois, Nicolas Szelagowski, Jean Vanderdonckt, and Sébastien Jodogne. Open platform for the de-identification of burned-in texts in medical images using deep learning. In BIOSTEC (1), pages 297–304, 2024.
  • Macdonald et al. (2024) Jacob A Macdonald, Katelyn R Morgan, Brandon Konkel, Kulsoom Abdullah, Mark Martin, Cory Ennis, Joseph Y Lo, Marissa Stroo, Denise C Snyder, and Mustafa R Bashir. A method for efficient de-identification of dicom metadata and burned-in pixel text. Journal of Imaging Informatics in Medicine, 37(5):1–7, 2024.
  • Mildenberger et al. (2002) Peter Mildenberger, Marco Eichelberg, and Eric Martin. Introduction to the dicom standard. European radiology, 12:920–927, 2002.
  • Monteiro et al. (2017) Eriksson Monteiro, Carlos Costa, and José Luís Oliveira. A de-identification pipeline for ultrasound medical images in dicom format. Journal of medical systems, 41(5):89, 2017.
  • Moore et al. (2015) Stephen M Moore, David R Maffitt, Kirk E Smith, Justin S Kirby, Kenneth W Clark, John B Freymann, Bruce A Vendt, Lawrence R Tarbox, and Fred W Prior. De-identification of medical images with retention of scientific research value. Radiographics, 35(3):727–735, 2015.
  • National Cancer Institute (2025) National Cancer Institute (NCI). Pseudo-phi-dicom-data collection, 2025. URL https://portal.imaging.datacommons.cancer.gov/explore/filters/?collection_id=pseudo_phi_dicom_data. Accessed: 2025-04-22.
  • National Electrical Manufacturers Association (2025) National Electrical Manufacturers Association (NEMA). Dicom part 15: Security and system management profiles, section e.2, 2025. URL https://dicom.nema.org/medical/dicom/current/output/chtml/part15/sect_E.2.html. Accessed: 2025-04-22.
  • of Health and Services (2025) U.S. Department of Health and Human Services. De-identification of protected health information, 2025. URL https://www.hhs.gov/hipaa/for-professionals/special-topics/de-identification/index.html. Accessed: 2025-04-28.
  • Rempe et al. (2024) Moritz Rempe, Lukas Heine, Constantin Seibold, Fabian Hörst, and Jens Kleesiek. De-identification of medical imaging data: A comprehensive tool for ensuring patient privacy. arXiv preprint arXiv:2410.12402, 2024.
  • Rutherford et al. (2021a) M. Rutherford, S.K. Mun, B. Levine, W.C. Bennett, K. Smith, P. Farmer, J. Jarosz, U. Wagner, K. Farahani, and F. Prior. Pseudo-phi-dicom-data: A dicom dataset for evaluation of medical image de-identification, 2021a. URL https://www.cancerimagingarchive.net/collection/pseudo-phi-dicom-data/#citations. Accessed: 2025-04-22.
  • Rutherford et al. (2021b) Michael Rutherford, Seong K Mun, Betty Levine, William Bennett, Kirk Smith, Phil Farmer, Quasar Jarosz, Ulrike Wagner, John Freyman, Geri Blake, et al. A dicom dataset for evaluation of medical image de-identification. Scientific Data, 8(1):183, 2021b.
  • The Cancer Imaging Archive (2025) The Cancer Imaging Archive. Submission and de-identification overview, 2025. URL https://wiki.cancerimagingarchive.net/display/Public/Submission+and+De-identification+Overview#SubmissionandDeidentificationOverview-DICOMDe-identificationDetails%23:~:text=Private%20Tags. Accessed: 2025-04-22.
  • U.S. Department of Health and Human Services (2025) U.S. Department of Health and Human Services. The hipaa privacy rule, 2025. URL https://www.hhs.gov/hipaa/for-professionals/special-topics/de-identification/index.html. Accessed: 2025-04-22.

Appendix A

A.1 Scoring Report

A.1.1 Scoring report in the Scoring Report

Please refer to the Table A1.

Table A1: The overall performance table in the Scoring Report
Category Errors Pass Total Score
All

A.1.2 Action report in the Scoring Report

Please refer to the Table A2.

Table A2: The actions table in the Scoring Report
Action Type Errors Pass Total
1 <date_shifted>
2 <patid_consistent>
3 <pixels_hidden>
4 <pixels_retained>
5 <tag_retained>
6 <text_notnull>
7 <text_removed>
8 <text_retained>
9 <uid_changed>
10 <uid_consistent>
Total

A.1.3 Category report in the Scoring Report

Please refer to Table A3.

Table A3: The category table in the Scoring Report
Subcategory Fail Pass Total
1 dicom DICOM-IOD-1
2 dicom DICOM-IOD-2
3 dicom DICOM-P15-BASIC-C
4 dicom DICOM-P15-BASIC-U
5 hipaa HIPAA-A
6 hipaa HIPAA-B
7 hipaa HIPAA-C
8 hipaa HIPAA-D
9 hipaa HIPAA-G
10 hipaa HIPAA-H
11 hipaa HIPAA-R
12 tcia TCIA-P15-BASIC-D
13 tcia TCIA-P15-BASIC-X
14 tcia TCIA-P15-BASIC-X/Z/D
15 tcia TCIA-P15-BASIC-Z
16 tcia TCIA-P15-BASIC-Z/D
17 tcia TCIA-P15-DESC-C
18 tcia TCIA-P15-DEV-C
19 tcia TCIA-P15-DEV-K
20 tcia TCIA-P15-MOD-C
21 tcia TCIA-P15-PAT-K
22 tcia TCIA-P15-PIX-K
23 tcia TCIA-PTKB-K
24 tcia TCIA-PTKB-X
25 tcia TCIA-REV
Total

Appendix B

B.1 Discrepancy Report

Please refer to Table B1.

Table B1: The Discrepancy Report in the feedback
Header Definition Values
index Unique identifier - can be ignored
check_passed Status of check if the action is a success 0 indicates that the action failed
check_score The score for partially correct answers [0, 1]
tag_ds DICOM tag
tag_name DICOM tag name
file_value De-ided value provided by the participants
answer_value Original value for Answer evaluation
action Type of actions to be performed Details in the Table above
action_text Text on which the action should have been performed
category The category for which this validation record is categorized hipaa, dicom, tcia
subcategory The subcategory for which this validation record is subcategorized Table A3
modality Image modality CR, CT, DX, MG, MR, PET, SR, etc.
class SOP Class UID Reference
patient Patient ID
study Study Instance UID
series Series Instance UID
instance SOP Instance UID
file_name File name of the instance

Appendix C

C.1 Result Reports

C.1.1 DeID action error report

Please refer to Table C1.

Table C1: The summary of action errors for all teams in instance-based scoring report in the test phase.
Action T-01 T-02 T-03 T-04 T-05 T-06 T-07 T-08 T-09 T-10
1 <date_shifted> 90 458 458 1,580 90 436 1,510 458 436 436
2 <patid_consistent> 6,870 0 35 14 0 0 0 0 0 0
3 <pixels_hidden> 0 11 11 1 0 0 8 0 1 3
4 <pixels_retained> 32 7 259 0 34 0 864 69 0 0
5 <tag_retained> 16 0 2,684 2,160 0 14 1,821 1,704 34 1,704
6 <text_notnull> 200 200 764 466 192 118 197 200 258 197
7 <text_removed> 2,164 3,510 29,889 4,360 1,455 3,213 20,499 38,663 3,497 28,333
8 <text_retained> 12,470 9,162 49,274 25,172 6,723 3,889 270,552 24,351 5,692 100,566
9 <uid_changed> 91 0 664 182 91 364 116 0 0 0
10 <uid_consistent> 91 0 804 41,661 91 364 11,539 0 0 0
Total 22,024 13,348 84,842 75,596 8,676 8,398 307,106 65,445 9,918 131,239

*: The best performance in each action type across the ten teams is in bold.

C.1.2 DeID category error reports

C.1.3 DeID category errors based on series-level scoring reports

Please refer to Table C2.

Table C2: DeID category errors based on series-level scoring report in the test phase.
Category Subcategory T-01 T-02 T-03 T-04 T-05 T-06 T-07 T-08 T-09 T-10
1 dicom DICOM-IOD-1 82 74 1,226 669 68 20 134 137 112 134
2 dicom DICOM-IOD-2 2 0 481 216 0 0 124 26 13 26
3 dicom DICOM-P15-BASIC-C 93 0 35 14 0 0 0 0 0 0
4 dicom DICOM-P15-BASIC-U 1 0 268 203 1 4 7,019 0 0 0
5 hipaa HIPAA-A 1 14 18 13 2 0 45 2 2 24
6 hipaa HIPAA-B 3 6 12 3 2 2 0 0 2 11
7 hipaa HIPAA-C 1 3 3 16 1 2 18 3 2 6
8 hipaa HIPAA-D 3 2 3 0 0 0 0 0 0 11
9 hipaa HIPAA-G 0 0 0 1 0 0 25 26 0 10
10 hipaa HIPAA-H 0 2 1 0 1 1 33 27 0 14
11 hipaa HIPAA-R 1 0 128 2 1 4 116 0 0 0
12 tcia TCIA-P15-BASIC-D 4 0 59 5 0 0 59 0 0 0
13 tcia TCIA-P15-BASIC-X 0 0 0 0 0 0 0 0 0 5
14 tcia TCIA-P15-BASIC-X/Z/D 0 0 0 0 0 0 0 0 0 0
15 tcia TCIA-P15-BASIC-Z 0 0 0 0 0 0 0 0 0 0
16 tcia TCIA-P15-BASIC-Z/D 0 0 0 0 0 0 0 0 0 0
17 tcia TCIA-P15-DESC-C 25 14 625 57 196 8 1,368 255 44 1,395
18 tcia TCIA-P15-DEV-C 0 0 0 0 0 0 0 0 0 0
19 tcia TCIA-P15-DEV-K 2 6 4 6 0 0 144 2 2 1
20 tcia TCIA-P15-MOD-C 0 0 105 69 0 0 0 0 0 0
21 tcia TCIA-P15-PAT-K 0 1 36 28 0 0 0 0 0 0
22 tcia TCIA-P15-PIX-K 32 7 259 0 34 0 864 69 0 0
23 tcia TCIA-PTKB-K 117 117 324 202 117 117 826 140 117 493
24 tcia TCIA-PTKB-X 233 32 149 284 0 257 0 1,100 257 18
25 tcia TCIA-REV 143 155 1,622 850 95 72 1,340 96 119 301
Total 743 433 5,358 2,638 518 487 12,115 1,883 670 2,449

*: The best performance in each subcategory type across the ten teams is in bold.

C.1.4 DeID category errors based on instance-level scoring reports

Please refer to Table C3.

Table C3: DeID category errors based on instance-level scoring report in the test phase.
Category Subcategory T-01 T-02 T-03 T-04 T-05 T-06 T-07 T-08 T-09 T-10
1 dicom DICOM-IOD-1 214 200 1,378 821 192 132 286 289 269 286
2 dicom DICOM-IOD-2 2 0 2,070 1,805 0 0 1,732 1,615 23 1,615
3 dicom DICOM-P15-BASIC-C 6,870 0 35 14 0 0 0 0 0 0
4 dicom DICOM-P15-BASIC-U 91 0 804 41,661 91 364 11,539 0 0 0
5 hipaa HIPAA-A 1 175 383 22 2 0 7,829 2 80 64
6 hipaa HIPAA-B 171 83 714 78 76 76 0 0 76 29
7 hipaa HIPAA-C 90 458 458 1,580 90 436 1,510 458 436 442
8 hipaa HIPAA-D 201 5 205 0 0 0 0 0 0 33
9 hipaa HIPAA-G 0 0 0 9 0 0 5,289 3,396 0 27
10 hipaa HIPAA-H 0 77 1 0 1 1 3,913 1,738 0 114
11 hipaa HIPAA-R 91 0 664 182 91 364 116 0 0 0
12 tcia TCIA-P15-BASIC-D 191 0 2,567 8 0 0 2,567 0 0 0
13 tcia TCIA-P15-BASIC-X 0 0 0 0 0 0 0 0 0 164
14 tcia TCIA-P15-BASIC-X/Z/D 0 0 0 0 0 0 0 0 0 0
15 tcia TCIA-P15-BASIC-Z 0 0 0 0 0 0 0 0 0 0
16 tcia TCIA-P15-BASIC-Z/D 0 0 0 0 0 0 0 0 0 0
17 tcia TCIA-P15-DESC-C 1,342 304 33,243 1,063 4,685 657 65,450 34,208 2,296 76,496
18 tcia TCIA-P15-DEV-C 0 0 0 0 0 0 0 0 0 0
19 tcia TCIA-P15-DEV-K 98 197 4 6 0 0 6,319 193 193 97
20 tcia TCIA-P15-MOD-C 0 0 105 69 0 0 0 0 0 0
21 tcia TCIA-P15-PAT-K 0 135 36 28 0 0 0 0 0 0
22 tcia TCIA-P15-PIX-K 32 7 259 0 34 0 864 69 0 0
23 tcia TCIA-PTKB-K 2,825 2,825 7,199 22,743 2,825 2,825 47,146 3,724 2,825 23,805
24 tcia TCIA-PTKB-X 1,122 2,622 6,117 4,163 0 3,070 0 19,253 3,070 1,372
25 tcia TCIA-REV 8,683 6,260 28,600 1,344 589 473 152,546 500 650 26,695
Total 22,024 13,348 84,842 75,596 8,676 8,398 307,106 65,445 9,918 131,239

*: The best performance in each subcategory type across the ten teams is in bold.

Appendix D

D.1 Short Paper Summary

D.1.1 Team 01 (Milos Vukadinovic, et al.)

The authors used a fully automated deID pipeline from the InVision Toolkit for DICOM deID. They utilized the Cedars-Sinai Medical Center (CSMC) echocardiography dataset for the initial deID development. As the first step, they implemented several key functions to process DICOM tags in accordance with DICOM confidentiality standards and TCIA best practices. To clean the private tags, the authors systematically iterate through each private tag and determine the appropriate action based on the Private Tag Dictionary. In the next step, the PII information is removed. Furthermore, they use a combination of regex patterns to check for all HIPAA identifiers. To manage PHI/PII-sensitive pixel data, the EasyOCR library is utilized to detect and extract all text from the DICOM images. The system then identifies close matches between the extracted text and the elements in their removed tags library, ensuring that any sensitive information embedded in the image text is effectively removed.
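The idea of matching OCR output against values already removed from the header can be sketched as follows. The summary does not state how close matches were computed, so the use of `difflib.SequenceMatcher`, the normalization step, and the 0.8 cutoff are assumptions, and `find_phi_text` is a hypothetical name; the OCR strings would come from an engine such as EasyOCR.

```python
from difflib import SequenceMatcher
from typing import Iterable, List


def _norm(text: str) -> str:
    """Uppercase, treat the DICOM name separator '^' as a space, collapse whitespace."""
    return " ".join(text.upper().replace("^", " ").split())


def find_phi_text(ocr_strings: Iterable[str], removed_values: Iterable[str],
                  cutoff: float = 0.8) -> List[str]:
    """Return the OCR strings that closely match any value removed from the header.

    Fuzzy matching tolerates small OCR mistakes, such as '0' read for 'O'.
    """
    targets = [_norm(v) for v in removed_values if v.strip()]
    hits = []
    for raw in ocr_strings:
        candidate = _norm(raw)
        if any(SequenceMatcher(None, candidate, t).ratio() >= cutoff for t in targets):
            hits.append(raw)
    return hits
```

The regions of the strings returned by such a function would then be masked in the pixel data, while unrelated burned-in text such as positioning markers is left in place.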

D.1.2 Team 02 (Marco Pereañez, et al.)

The authors developed the Research Image DeID System (RIDS) to address the complexities of DICOM deID in compliance with HIPAA and institutional regulations. RIDS is a comprehensive platform designed to streamline and standardize the deID process for radiology images. It consists of three main subsystems: REDCap, XNAT, and the RIDS deID algorithms. REDCap is a secure web application for building and managing online surveys and databases. XNAT is an open-source imaging informatics software platform dedicated to supporting imaging-based research. The deID algorithm consists of two main parts: removing burned-in PHI/PII from DICOM pixel images with an open-source OCR engine, and stripping PHI/PII from DICOM headers. Specifically, the authors employed Microsoft’s open-source Presidio Python SDK libraries to detect and redact potential PHI/PII from the DICOM pixel array. To strip PHI/PII from the DICOM header, the algorithms are partially based on a set of auxiliary DICOM tag lists, originally derived from the Imaging Research Warehouse (MSIRW), that determine the masking action to take on different DICOM tags. The authors also developed functions based on DICOM tag data types and regular expressions to mask and remove PHI/PII from the DICOM header.

D.1.3 Team 03 (Humam Arshad Mustagfirin, et al.)

The authors developed a comprehensive method for anonymizing DICOM datasets through a combination of custom-built scripts and pre-trained models designed to detect and remove identifiable information. This method aims to identify and eliminate highly sensitive textual data embedded in DICOM images while ensuring that the essential metadata within DICOM files is anonymized. The techniques used are categorized into two major areas: anonymization of DICOM metadata and processing of images using OCR. To anonymize the metadata, several strategies were applied: unique ID hashing, date tag incrementing, free-text redaction, and selective tag preservation. For image anonymization, contrast-limited adaptive histogram equalization (CLAHE) was first applied as an image preprocessing step, followed by the use of Tesseract OCR for text detection and redaction. Finally, multi-frame DICOMs were properly handled to maintain data integrity.
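Two of the metadata strategies named above, unique ID hashing and date tag incrementing, can be sketched as follows. This is an illustration under stated assumptions rather than the team's code: the salt, the `2.25.` UID root, the truncation to 128 bits, and both function names are choices made here.

```python
import hashlib
from datetime import datetime, timedelta


def hash_uid(uid, salt="example-salt"):
    """Map a DICOM UID to a new UID derived from a salted SHA-256 digest.

    The result uses the '2.25.' root followed by a 128-bit integer, which
    keeps it within the 64-character limit for DICOM UIDs.
    """
    digest = hashlib.sha256((salt + uid).encode("utf-8")).digest()
    return "2.25." + str(int.from_bytes(digest[:16], "big"))


def shift_date(da_value, days):
    """Shift a DICOM DA value (YYYYMMDD) by a fixed number of days."""
    shifted = datetime.strptime(da_value, "%Y%m%d") + timedelta(days=days)
    return shifted.strftime("%Y%m%d")
```

Because the hash is deterministic, every reference to the same original UID maps to the same replacement, so links between studies, series, and instances are preserved. Applying the same offset to every date of a patient likewise preserves the intervals between scans.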

D.1.4 Team 04 (Christopher Ablett, et al.)

The team developed a rule-based method for the deID of PHI/PII in DICOM image data. To achieve this, they employed a rigorous deID protocol that goes beyond simple key-coding or basic data-masking techniques. The protocol used RSNA’s open-source CTP tool and applied a tailored version of Part 15, Annex E of the DICOM 2020a standard to de-identify imaging scans and their associated PHI/PII data. The tool classified the ingested data into cohorts based on pre-set whitelist/blacklist filters and sent any data not recognized by the filters to quarantine. Moreover, the CTP tool can be programmed to apply a variety of deID actions, including fully removing certain date values, replacing other values with “dummy” headers, applying hashing and modified hashing techniques, and shifting time/date attributes to mask the original temporal data while still enabling longitudinal analysis. It can also redact personal data burned into imaging scans, which allows the deID to be tailored to the data involved.
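The whitelist/blacklist routing with a quarantine fallback can be pictured with the short sketch below. It does not use CTP's own filter syntax; the function name, the choice of `Modality` as the filtered attribute, and the three outcome labels are assumptions made for illustration.

```python
def classify(record, whitelist, blacklist, attribute="Modality"):
    """Route a record (a dict of attribute values) by a single attribute.

    Returns 'reject' for blacklisted values, 'accept' for whitelisted
    values, and 'quarantine' for anything the filters do not recognize.
    """
    value = record.get(attribute)
    if value in blacklist:
        return "reject"
    if value in whitelist:
        return "accept"
    return "quarantine"
```

The important property is the default: data that matches no filter is held back for review instead of being passed through.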

D.1.5 Team 05 (Hamideh Haghiri, et al.)

The paper introduced a hybrid system combining AI-based and rule-based approaches for de-identifying DICOM files. The approach initially relied on a modified version of an existing DICOM deID framework compliant with the DICOM PS 3.15 confidentiality guidelines. However, due to performance limitations, the authors developed an improved algorithm (DCMTCIADeidentifier) based on The Cancer Imaging Archive’s Safe Harbor method. To enhance precision, they incorporated custom rules for handling sensitive data and integrated AI models, such as RoBERTa, fine-tuned on clinical text datasets, for PHI/PII detection. Optical Character Recognition (OCR) tools such as PaddleOCR were also employed to identify and obscure PHI/PII within image data. To improve performance, the authors refined their approach by applying the transformer model exclusively to free text, while structured data was handled only by rule-based methods. To increase DICOM compliance, the system employed a dciodvfy-based DICOM validator (13) to check the integrity of DICOM files after the deID process. If validation errors arose due to missing attributes, the corresponding tags were added to the DICOM file with an empty value to maintain compliance and completeness.
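The split between model-based and rule-based handling, and the repair of missing attributes, can be sketched as below. This is a simplified illustration, not DCMTCIADeidentifier itself: the set of value representations (VRs) treated as free text is an assumption, the dataset is modeled as a plain dict, and `rule_fn` and `model_fn` are hypothetical callables standing in for the rule engine and the fine-tuned transformer.

```python
# Assumption: these DICOM value representations (VRs) are treated as free text.
FREE_TEXT_VRS = {"LO", "LT", "SH", "ST", "UT"}


def route_element(vr, value, rule_fn, model_fn):
    """Send free-text values to the model and all other values to the rules."""
    if vr in FREE_TEXT_VRS:
        return model_fn(value)
    return rule_fn(vr, value)


def fill_missing(dataset, required_keywords):
    """Add each missing required attribute with an empty value.

    Returns the list of keywords that were added, in the order given.
    """
    added = [kw for kw in required_keywords if kw not in dataset]
    for kw in added:
        dataset[kw] = ""
    return added
```

Restricting the model to free text keeps the comparatively slow and less predictable component away from structured fields, where deterministic rules are sufficient.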

D.1.6 Team 06 (Hongzhu Jiang, et al.)

The authors proposed rule-based DICOM image deID algorithms to protect patient privacy while preserving the utility of medical data for research, diagnostics, and treatment. They also provided a comprehensive overview of the standards and regulations that govern the process. Two categories of deID methods were implemented: simple deID and pseudonymization. The first removes real patient identifiers, while the second replaces identifiers with a pseudonym that is unique to the individual and known within a specified context but not linked to the individual in the external world. For the simple deID method, the Presidio software development kit was applied for pixel masking and text removal. For the pseudonymization method, the procedure included patient ID replacement, UID replacement, and date shifting. The method performed well in the challenge, but its generalization was limited: it was developed on a specific dataset, and its performance might degrade on other data.
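The patient ID replacement step of pseudonymization can be sketched with a simple lookup table. The class name, the `SUBJ` prefix, and the sequential numbering are assumptions for illustration; the summary above does not specify how the team generated pseudonyms.

```python
class Pseudonymizer:
    """Assign each original patient ID a stable, sequential pseudonym."""

    def __init__(self, prefix="SUBJ"):
        self._prefix = prefix
        self._mapping = {}  # original ID -> pseudonym, kept inside the trusted context

    def pseudonym(self, patient_id):
        """Return the pseudonym for an ID, creating one on first use."""
        if patient_id not in self._mapping:
            number = len(self._mapping) + 1
            self._mapping[patient_id] = f"{self._prefix}-{number:05d}"
        return self._mapping[patient_id]
```

Unlike a hash, a sequential pseudonym carries no information derived from the original identifier, so re-identification requires access to the mapping table itself.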

D.1.7 Team 07 (Alex Michie, et al.)

The paper outlined a comprehensive strategy for medical image de-identification on the XNAT platform, which was essential for managing data in multicenter clinical and research studies. The proposed method addressed both DICOM metadata and pixel-based protected health information (PHI) using two key tools: DicomEdit v6.6 for tag de-identification and Microsoft Presidio for pixel-level redaction. The authors discussed several challenges encountered during the process, such as inconsistencies in DICOM series metadata and the complexities of handling “burned-in” PHI in images. They also explored the limitations of their tools, including issues with conditional tag retention and discrepancies in handling private tags. To address these challenges, the authors proposed enhancements, such as improving OCR capabilities in Presidio, refining anonymization scripts, and training models on larger datasets to detect and redact PHI more effectively. They emphasized the importance of automating these processes to reduce manual workloads and improve the scalability of de-identification workflows, particularly in multicenter trials. The study concluded by discussing the broader implications of their work, including the need for ongoing refinement of de-identification tools to meet the evolving demands of clinical research and data sharing.

D.1.8 Team 08 (Michele Bufano, et al.)

The proposed method consisted of an initialization phase and three subsequent steps: classification of the patient collection (which studies, series, and instances were associated with each patient), deID of each DICOM file header, and removal of burned-in sensitive pixel text. In the initialization phase, a keyword-finding algorithm was developed to recover common words adjacent to the names of people, institutions, places, and telephone numbers. The resulting list of keywords was used to recognize potential PHI/PII (sensitive) text, although the paper did not make clear which rule-based method was used to do this. The keyword-finding algorithm was trained on the small, published MIDI-B Challenge dataset and then retrained using the validation dataset. Following DICOM PS 3.15, data elements known to contain only PHI/PII, such as names, dates, addresses, and identifiers, were redacted or modified, but the sensitive text was retained for use in finding PHI/PII in free-text fields and burned-in pixel text. Burned-in pixel text was recognized using keras-ocr. A text-similarity package, Fuzz, was then used to determine which text to remove by comparing it to the sensitive text found in the DICOM data elements. The authors also mentioned training the keyword-finding algorithm on languages other than English to show that the algorithm generalized.
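Since the paper does not specify how the keyword list was built or applied, the following is only one plausible reading of the idea: count the word that immediately precedes each known PHI value, keep the frequent ones, and later flag whatever follows those words. All function names and the `min_count` parameter are hypothetical.

```python
import re
from collections import Counter


def learn_keywords(samples, min_count=2):
    """Collect words that often appear directly before known PHI values.

    `samples` is an iterable of (text, phi_value) pairs.
    """
    counts = Counter()
    for text, phi in samples:
        idx = text.find(phi)
        if idx <= 0:
            continue  # PHI value absent, or nothing precedes it
        words_before = re.findall(r"[A-Za-z]+", text[:idx])
        if words_before:
            counts[words_before[-1].lower()] += 1
    return {word for word, count in counts.items() if count >= min_count}


def flag_candidates(text, keywords):
    """Return the tokens that directly follow a learned keyword."""
    tokens = text.split()
    flagged = []
    for i, token in enumerate(tokens[:-1]):
        if token.strip(":.,").lower() in keywords:
            flagged.append(tokens[i + 1])
    return flagged
```

In this reading, retraining on another language amounts to rerunning `learn_keywords` on labeled samples in that language.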

D.1.9 Team 09 (Rahul Krish, et al.)

The authors developed an advanced medical image deID approach. They employed a dual-path approach that processed both the metadata and the pixel data in DICOM images, ensuring thorough removal of PHI and PII. The metadata path began with a rule-based system that filtered out explicit identifiers, followed by Python’s FuzzyWuzzy package to detect and remove subtle identifiers that may have been spread throughout the text. To enhance the process, they incorporated an ensemble of pre-trained language models to ensure comprehensive metadata deID. For pixel data, they used a custom uncertainty-aware Faster R-CNN model, a two-stage object detection framework, to detect regions within the images that may have contained PHI or PII, such as text embedded directly in the image. To enhance its capability in the deID task, they incorporated Variational Density Propagation (VDP) into the classifier decision head. An important feature of the approach was the handling of uncertainty in text region detection.
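The summary above does not state how the uncertainty estimates were used downstream. One possible policy, shown here purely as an assumption, is to redact a region when the detector is confident that it contains text, and also when the detector is too uncertain to rule it out. The field names, thresholds, and function name are hypothetical.

```python
def regions_to_redact(detections, score_threshold=0.5, max_variance=0.05):
    """Select boxes to redact from detections carrying a score and a variance.

    Each detection is a dict with keys 'box', 'score', and 'variance'.
    A box is kept if the score is high, or if the predictive variance is
    high enough that the low score cannot be trusted.
    """
    selected = []
    for det in detections:
        confident = det["score"] >= score_threshold
        uncertain = det["variance"] > max_variance
        if confident or uncertain:
            selected.append(det["box"])
    return selected
```

A policy of this kind trades extra redaction for a lower risk of missed PHI, which matches the priorities of a deID task.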

D.1.10 Team 10 (Peter Gu, et al.)

The system employed a two-tier architecture on Amazon Web Services (AWS): the Redaction Service and the Evaluation Service. The Redaction Service extracted metadata and pixel data from DICOM files using tools such as Pydicom and Tesseract OCR, applied predefined rules to identify and redact Protected Health Information (PHI) and Personally Identifiable Information (PII), and securely stored the de-identified files in Amazon S3. The Evaluation Service used Amazon Rekognition and Amazon Comprehend Medical to validate the effectiveness of the deID process, generating reports that guided refinement of the redaction process. The rules were updated iteratively: “After the evaluation, a curation step was performed to eliminate false positives. Based on the new findings, the redaction service rule sets were refined and enhanced. This process was repeated multiple times, applying the improved rule sets to both the training and validation datasets to ensure accuracy and completeness”. The system adopted an event-driven, containerized microservice architecture using AWS technologies such as Lambda, S3, and Elastic Container Service to provide scalability, reliability, and compliance. While the method demonstrated high performance in handling metadata, the authors noted limitations in processing pixel-level data and recommended further improvements, particularly in OCR capabilities and in training models for complex scenarios. According to the authors, the approach reduced manual effort, enhanced compliance with privacy standards, and mitigated the risk of data breaches, offering a scalable and efficient solution for protecting patient privacy in medical imaging datasets.