1 Department of Computer Science, University of Tübingen, Tübingen, Germany
2 Else Kröner Fresenius Center for Digital Health, TU Dresden, Dresden, Germany
3 Department of Medicine I, University Hospital Dresden, TU Dresden, Dresden, Germany

Seeing More with Less:
Video Capsule Endoscopy with Multi-Task Learning

Julia Werner (1; ORCID 0009-0006-0279-1776), Oliver Bause (1; ORCID 0009-0003-5388-2959), Julius Oexle (1), Maxime Le Floch (2, 3), Franz Brinkmann (2, 3; ORCID 0000-0002-3474-3115), Jochen Hampe (2, 3; ORCID 0000-0002-2421-6127), Oliver Bringmann (1; ORCID 0000-0002-1615-507X)
Abstract

Video capsule endoscopy has become increasingly important for investigating the small intestine within the gastrointestinal tract. However, the short battery lifetime of such compact sensor edge devices remains a persistent challenge. Integrating artificial intelligence can help overcome this limitation by enabling intelligent real-time decision-making, thereby reducing energy consumption and prolonging battery life. This remains difficult, however, because data are sparse and the limited resources of the device restrict the overall model size. In this work, we introduce a multi-task neural network that combines precise self-localization within the gastrointestinal tract with the detection of anomalies in the small intestine in a single model. Throughout the development process, we consistently restricted the total number of parameters so that such a model can feasibly be deployed in a small capsule. We report the first multi-task results using the recently published Galar dataset, integrating established multi-task methods and Viterbi decoding for subsequent time-series analysis. This outperforms current single-task models and represents a significant advance in AI-based approaches in this field. Our model achieves an accuracy of 93.63% on the localization task and an accuracy of 87.48% on the anomaly detection task. The approach requires only 1 million parameters while surpassing the current baselines.

Keywords:
Video Capsule Endoscopy · Multi-Task Learning · Viterbi decoding.

1 Introduction

Video Capsule Endoscopy (VCE) aims to detect pathological tissue within the gastrointestinal (GI) tract, more specifically the small intestine, by employing a small pill-sized capsule equipped with, among other components, LEDs and a camera [8, 20]. Driven by peristalsis, the capsule traverses the GI tract and captures low-resolution images throughout its journey, which are then sent to an external device for final evaluation by medical doctors. This procedure is essential because it provides images of the small intestine, which is largely inaccessible to standard techniques such as gastroscopy or colonoscopy [5, 18]. Thus, the main objective of VCE is to cover the small intestine, while the esophagus, stomach and colon are of less interest.

Current devices, such as the PillCam SB3 [14], simply transmit all captured images to an external device without prior classification. However, this transmission is very costly, and only those frames originating from the small intestine are actually relevant in VCE. On the other hand, simply storing all images locally on the device until the end of the procedure requires the capsule to be collected afterwards [27], does not improve the battery lifetime and prohibits real-time assessment as well as decision-making. Importantly, because the available energy of such a small sensor edge device is very limited (the PillCam, for example, has a battery lifetime of 8-12 h [14, 15]), it is essential to minimize unnecessary energy usage. Since the traversal time of the capsule varies between patients, for some patients the battery is depleted before the capsule has covered the small intestine. Theoretically, images that are not of interest can be discarded immediately, saving the energy that would otherwise be used for their transmission. To achieve this, the organ that the capsule is currently traversing needs to be classified, and the transmission of images started only after the capsule has exited the stomach.
   Furthermore, jointly addressing organ detection and anomaly detection while continuously limiting the overall model size can improve the entire procedure by ensuring comprehensive coverage of the GI tract and enabling preliminary assessments of potential anomalies prior to final evaluation by physicians. On-site anomaly detection allows real-time decisions to be made, such as raising the frame rate or resolution, which might improve the visualization of important regions. However, AI-based anomaly detection remains a major challenge in this field due to data sparsity and the high level of medical expertise needed to label such data. One possible technique to tackle this is multi-task learning (MTL) [3], which leverages the observation that, in certain cases, simultaneous learning of multiple tasks can produce superior results [3, 26]. This approach relies on the assumption that the tasks being targeted are related. When this condition is met, MTL has the potential to surpass the performance of single-task learning (STL).
   Our Contribution: By applying MTL to VCE, our main objective is to develop a model capable of concurrently classifying the organ section and assessing whether an anomaly is present in the captured image. Precisely determining the capsule’s entry into the small intestine enables suppression of image transmission outside the target region and thus reduces energy consumption. This helps to ensure full small intestine coverage across all patients without exhausting the battery. We report the first multi-task results targeting this localization problem along with anomaly detection and aim to surpass the performance of current state-of-the-art single-task models for both objectives. Since the model must be deployed on a very small device in real-world applications, the feasibility of its hardware implementation is a key consideration. As an initial step, we focus on limiting the overall model size and the total number of multiply-accumulate (MAC) operations compared to the current baselines.

2 Related Work

Multi-Task Studies Targeting the Gastrointestinal Tract:

Recently, several MTL approaches have been developed to capture diseases in the GI tract. For example, [10] targeted the detection of Crohn’s disease using MTL on a private dataset of standard endoscopy images from 15 patients, but relied on very large networks such as DenseNet121 and ResNet50 for this task. Further research has used MTL to detect esophageal lesions [25] and polyps in the colon [19]. Notably, neither study targeted the small intestine. Most recently, [23] applied MTL to magnetically controlled capsule endoscopy for the classification of gastric anatomical sites as well as lesions on a public, self-built dataset. In this work, following the constraints of VCE applications, only datasets containing low-resolution images from VCE studies with a specific focus on the small intestine are applicable.

Localization and Anomaly Detection with VCE Images:

The reported results are predominantly limited to single-task models, which can either identify the anatomical regions or classify lesions in the digestive tract. This follows from the fact that the previously published VCE datasets contain labels for a single task only. For example, the Kvasir-Capsule dataset [18] consists only of images labeled for anomalies but not for organ sections, while the Rhode Island Gastroenterology dataset [4] comprises images labeled for the different organ sections but not for any anomalies. The newly available large Galar dataset [11] incorporates images labeled not only for different GI sections but also for anomalies. Besides [11], [21] were the first to perform anomaly detection on the Galar dataset, using an ensemble model, however without targeting the localization task. We intend to perform MTL on VCE images drawn from this dataset and to outperform current single-task baselines by combining the localization and the anomaly detection task while restricting the overall model size. Since the small intestine is the only part of the GI tract largely inaccessible to standard endoscopes, it is the primary focus of this procedure.

3 Experiments and Methodology

The primary goal of our method is to provide a single, resource-constrained, light-weight neural network that is capable of performing both localization and anomaly detection on VCE images, as presented in Figure 1, in order to ensure coverage of the small intestine before the capsule’s battery is depleted and to provide basic anomaly detection functionality. We start by preparing the dataset to address class imbalances, continue with the neural network training including hard parameter sharing, and finally perform post-processing for the localization task involving an HMM and Viterbi decoding.

Figure 1: The neural network is trained on the Galar dataset with VCE images, using hard parameter sharing with two separate classifier heads, dynamic weight average or task uncertainty weighting of the losses, and post-processing with an HMM and Viterbi decoding.

3.1 Pre-Processing - Galar Dataset

To the best of our knowledge, the Galar dataset [11] is the only publicly available VCE dataset that consists of images labeled for the different GI organs as well as anomalies. Thus, this dataset is well-suited for simulating the VCE environment and evaluating multi-task learning approaches for this application. The authors of the Galar dataset selected the patients with the IDs 61-80 for the test set, representing completely untouched VCE studies from a clinical setting. For better comparability, we evaluate our method on the same test set. The remaining 60 patient studies (IDs 1-60) were used as a development set for training the neural network.

Like many other medical datasets, the Galar dataset is characterized by a large class imbalance. Of all organs available in this dataset, the smallest class (mouth) represents only 0.05% of all images. Moreover, pathological cases represent the minority within the classes, and in addition, there are significant differences in class frequencies among the pathologies themselves. While the smallest anomaly class (erythema) comprises only 0.12% of the data, the largest class (blood) makes up 11.9% of all frames. Additionally, if the capsule travels slowly or remains at one location for some time, many similar images are captured, which can disrupt the training process. To counteract this class imbalance, the majority classes were downsampled as described in the following.

The main objective remains to capture, in principle, all images of interest and transmit them to an on-body device for final evaluation by physicians. We do not aim to provide an extensive evaluation of anomalies on-site, but rather to capture as many of the relevant images as possible, which are then evaluated by medical doctors after transmission. Additionally, basic on-site anomaly detection allows important features to be added, such as an increased resolution or frame rate upon outlier detection, improving the entire procedure. Thus, a detailed analysis of different anomaly types does not need to be conducted on the capsule directly. Instead, using a binary distinction eases the computational workload of the model for this application and also enables size restrictions to be applied. Therefore, all pathologies were combined into a single anomaly class in order to perform binary classification. For each pathology, a random selection of frames was included. Next, all samples from the mouth and esophagus were collected and the remaining section classes were randomly downsampled to obtain a final ratio of 1:1 (normal : anomaly). The data were randomly shuffled and split into a training and a validation set with a ratio of 70:30. The final class distribution of the different organ sections and of the anomaly and normal samples is shown in Table 1.

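Table 1: Class distribution of the used organ sections and anomalies (positive samples) after downsampling to address the occurrence of many related frames. The first "Total" column sums the five organ sections; the second sums the negative and positive samples.

| Subset | Mouth | Esophagus | Stomach | Small Intestine | Colon | Total | Negative | Positive | Total |
|---|---|---|---|---|---|---|---|---|---|
| Train | 1,476 | 1,718 | 11,962 | 32,274 | 26,059 | 73,489 | 39,779 | 40,104 | 79,883 |
| Val | 252 | 132 | 2,598 | 9,478 | 7,633 | 20,093 | 10,748 | 10,757 | 21,505 |
| Test | 265 | 397 | 16,510 | 175,716 | 41,667 | 234,555 | 218,665 | 16,499 | 235,164 |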
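The selection procedure of Section 3.1 can be illustrated with a minimal sketch. It assumes that each frame is a dictionary with a `section` string and a boolean `anomaly` flag, that all anomaly frames passed in have already been selected, and that the 1:1 ratio is reached by sampling normal frames from the remaining sections; the function name, the data layout and the seed are hypothetical and not taken from the paper.

```python
import random


def balance_and_split(frames, keep_sections=("mouth", "esophagus"),
                      train_fraction=0.7, seed=0):
    """Balance normal vs. anomaly frames, then shuffle and split them.

    frames: list of dicts with keys "section" (str) and "anomaly" (bool).
    Returns (train, val) as two lists.
    """
    rng = random.Random(seed)
    anomalies = [f for f in frames if f["anomaly"]]
    normals = [f for f in frames if not f["anomaly"]]

    # Rare sections are kept completely (assumption: only their normal frames
    # need special handling here, anomaly frames are kept anyway).
    always_kept = [f for f in normals if f["section"] in keep_sections]
    candidates = [f for f in normals if f["section"] not in keep_sections]

    # Randomly downsample the remaining normal frames towards a 1:1 ratio.
    n_needed = max(0, len(anomalies) - len(always_kept))
    sampled = rng.sample(candidates, min(n_needed, len(candidates)))

    selected = anomalies + always_kept + sampled
    rng.shuffle(selected)

    # 70:30 split into training and validation data.
    cut = int(round(train_fraction * len(selected)))
    return selected[:cut], selected[cut:]
```

Splitting after shuffling at the frame level mirrors the description in the text; a split by patient would be a different design choice and is not what this sketch does.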

3.2 Neural Network Training with Hard-Parameter Sharing

In the context of capsule endoscopy, multiple authors [23, 22] restricted the total number of parameters to address the constraints of the tiny edge devices used in such procedures, relying on light-weight networks from the MobileNet family. These networks not only perform well in image classification tasks but were also designed for efficient deployment on embedded devices [17, 7]. Hence, in this work, we adopt the MobileNetV3-Small [7] architecture and modify the classification head. Instead of a single classification head, we use separate classification heads tailored to each specific task. Because it keeps the number of parameters small and is straightforward to deploy while still allowing multiple tasks to be classified, we implement hard parameter sharing [2]. In this MTL technique, the hidden layers are shared between all tasks and are followed by task-specific classifier heads.

The pretrained model was fine-tuned for an additional 10 epochs. The learning rate and weight decay were each chosen through a hyperparameter search with Hydra [24] and the well-established hyperparameter optimization framework Optuna [1], using 30 trials each for the localization, the anomaly detection and the multi-task runs. We aimed to employ well-established MTL losses for comparison and to investigate how they influence the final results. Thus, the homoscedastic task uncertainty weighting based on [9] was implemented. The MTL loss $\mathcal{L}$ is assembled for the tasks $i \in \{1, 2\}$, with

$$\mathcal{L} = \sum_{i=1}^{n} \exp(-\log\sigma_i)\,\mathcal{L}_i + \log\sigma_i. \qquad (1)$$
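As a minimal sketch of Eq. (1), assuming that both terms belong inside the sum and that the values $\log\sigma_i$ are plain scalars (in the formulation of [9] they are learned jointly with the network weights), the combination of the task losses could look as follows. The function name and the example values are illustrative only.

```python
import math


def uncertainty_weighted_loss(task_losses, log_sigmas):
    """Combine per-task losses following Eq. (1).

    task_losses: per-task loss values L_i.
    log_sigmas:  per-task values of log(sigma_i); here ordinary floats,
                 whereas in training they would be trainable parameters.
    """
    return sum(
        math.exp(-log_sigma) * loss + log_sigma
        for loss, log_sigma in zip(task_losses, log_sigmas)
    )


# Example: localization loss 1.2 and anomaly loss 0.4 (made-up numbers).
# With log(sigma_i) = 0 the result is simply the sum of both losses.
combined = uncertainty_weighted_loss([1.2, 0.4], [0.0, 0.0])
```

A larger $\log\sigma_i$ scales the corresponding task loss down, while the additive $\log\sigma_i$ term keeps the weights from growing without bound.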

Furthermore, the Dynamic Weight Average Loss (DWA) [13] was implemented to improve the balancing of the loss values for both tasks. Instead of weighting the tasks according to their homoscedastic uncertainty, this loss adjusts its weights based on loss changes between the epochs. This helps to prevent premature neglect of any task and supports the learning of more challenging tasks over extended training periods. The temperature factor $T$ also allows the strength of the weighting to be set manually. The DWA weighting $\lambda_i$ is assembled for the tasks $i \in \{1, 2\}$, with

$$\lambda_i(t) := \frac{I \exp(w_i(t-1)/T)}{\sum_k \exp(w_k(t-1)/T)}, \qquad w_i(t-1) = \frac{\mathcal{L}_i(t-1)}{\mathcal{L}_i(t-2)}. \qquad (2)$$
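Eq. (2) can be sketched in a few lines, reading $I$ as the number of tasks (an assumption, since the text does not define it). The function name and the default temperature of 2.0 are illustrative choices and not values reported in this paper.

```python
import math


def dwa_weights(losses_prev, losses_prev_prev, temperature=2.0):
    """Dynamic weight average following Eq. (2).

    losses_prev:      per-task losses L_i(t-1) of the previous epoch.
    losses_prev_prev: per-task losses L_i(t-2) of the epoch before that.
    Returns one weight lambda_i(t) per task; the weights sum to the
    number of tasks.
    """
    n_tasks = len(losses_prev)
    # w_i(t-1): relative change of each task loss between two epochs.
    ratios = [prev / prev_prev
              for prev, prev_prev in zip(losses_prev, losses_prev_prev)]
    exps = [math.exp(w / temperature) for w in ratios]
    total = sum(exps)
    return [n_tasks * e / total for e in exps]


# Made-up example: task 0 barely improved (1.0 -> 0.95),
# task 1 improved strongly (1.0 -> 0.5), so task 0 gets the larger weight.
weights = dwa_weights([0.95, 0.5], [1.0, 1.0])
```

A larger temperature flattens the softmax and moves the weights towards equal weighting.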

For the anomaly detection task, the cross-entropy loss was replaced with a focal loss [12] to address the pronounced class imbalance between normal and abnormal tissue samples. This allows the model to focus more on the underrepresented class without requiring oversampling of abnormal cases, thereby preserving a realistic class distribution during training. The final predictions were then passed to an HMM for post-processing.
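For illustration, a minimal sketch of a binary focal loss in the form given in [12], $-\alpha_t (1 - p_t)^{\gamma} \log p_t$, for a single frame. The defaults $\gamma = 2$ and $\alpha = 0.25$ are common choices from [12] and are an assumption here; the paper does not state which values were used.

```python
import math


def binary_focal_loss(p_anomaly, is_anomaly, gamma=2.0, alpha=0.25):
    """Focal loss for one sample of a binary classifier.

    p_anomaly:  predicted probability of the anomaly class.
    is_anomaly: ground-truth label (True for anomaly).
    gamma:      focusing parameter; gamma = 0 gives weighted cross-entropy.
    alpha:      weight of the anomaly class (1 - alpha for the normal class).
    """
    # p_t is the probability assigned to the true class.
    p_t = p_anomaly if is_anomaly else 1.0 - p_anomaly
    alpha_t = alpha if is_anomaly else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct prediction contributes far less than a wrong one.
easy = binary_focal_loss(0.9, True)
hard = binary_focal_loss(0.1, True)
```

The factor $(1 - p_t)^{\gamma}$ is what down-weights easy examples, so the many well-classified normal frames dominate the gradient less.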

3.3 Post-Processing - Hidden Markov Model and Viterbi Decoding

The HMM [16] has previously been applied to the problem of localization in video capsule endoscopy [22], using the only dataset available at the time, the Rhode Island VCE dataset [4], which provides labels for 4 organ sections. It has been shown that, specifically for this setting involving time-series data, post-processing the CNN output with an HMM and Viterbi decoding leads to a significant improvement in performance, allowing a more precise determination of when the small intestine is entered than with a CNN alone. This method is particularly well-suited to the problem at hand as it incorporates prior knowledge of the sequential order in which the organs are traversed and their respective lengths.

In this work, we aim to transfer this technique to the newly available Galar dataset that includes class labels for the organs across the GI tract and apply it as a post-processing to the MTL approach. Compared to [22], the number of hidden states is extended to 5, such that we have $\{S_1 = \mathrm{Mouth}, S_2 = \mathrm{Esophagus}, S_3 = \mathrm{Stomach}, S_4 = \mathrm{Small\ intestine}, S_5 = \mathrm{Colon}\}$, with an initial distribution $\pi_i = P(X_1 = S_i)$ for $i \leq 5$. The HMM for this GI setting is visualized in Figure 2.

Figure 2: Hidden Markov Model for the described GI setting.

At each moment $t$ at which the capsule captures an image, an observation $O_t \in \{K_1, \dots, K_m\}$ is obtained, representing a location classified by the neural network. The transition probabilities $a_{ij} = P(X_{t+1} = S_j \mid X_t = S_i)$ of the Markov chain $X_t$ are encoded such that for $X_{t+1}$ only the current organ or the succeeding organ can be the next state, in accordance with the natural anatomy of the GI tract.
The emission probabilities $b_j(k) = P(O_t = K_k \mid X_t = S_j)$ for an observation $O_t \in \{O_1 = \mathrm{Mouth}, O_2 = \mathrm{Esophagus}, O_3 = \mathrm{Stomach}, O_4 = \mathrm{Small\ intestine}, O_5 = \mathrm{Colon}\}$ at time $t$ are inferred from the confusion matrix of the neural network on the training set, as this directly encodes the emission probabilities.
Finally, the Viterbi algorithm [6] calculates the most probable sequence of gastroenterological states $X = (X_1, \dots, X_t)$, based on the evaluations of the neural network $(O_1, \dots, O_t)$, with $\underset{X_1, \dots, X_t}{\arg\max}\; P(X_1, \dots, X_t, O_1, \dots, O_t)$ as described in [22].
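The post-processing described above can be sketched as follows. The sketch assumes a left-to-right transition matrix with a single hypothetical self-transition probability `stay_prob`, emission probabilities obtained by row-normalizing a confusion matrix (rows are true organs, columns are predicted organs) with a small smoothing constant added for numerical safety, and a capsule that always starts in the mouth. All names and numbers are illustrative; the transition and emission values actually used are not given in the text.

```python
import math


def left_to_right_transitions(n_states, stay_prob=0.9):
    """a[i][j]: stay in organ i or move on to organ i + 1, never backwards."""
    a = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        if i == n_states - 1:
            a[i][i] = 1.0  # the last organ cannot be left
        else:
            a[i][i] = stay_prob
            a[i][i + 1] = 1.0 - stay_prob
    return a


def emissions_from_confusion(confusion, eps=1e-6):
    """b[j][k] = P(predicted k | true organ j), row-normalized and smoothed."""
    emissions = []
    for row in confusion:
        total = sum(row) + eps * len(row)
        emissions.append([(count + eps) / total for count in row])
    return emissions


def viterbi(observations, initial, transitions, emissions):
    """Most probable state sequence for a list of observed class indices."""
    def log(p):
        return math.log(p) if p > 0.0 else float("-inf")

    n = len(initial)
    score = [log(initial[j]) + log(emissions[j][observations[0]])
             for j in range(n)]
    backpointers = []
    for obs in observations[1:]:
        new_score, pointers = [], []
        for j in range(n):
            best_i = max(range(n),
                         key=lambda i: score[i] + log(transitions[i][j]))
            new_score.append(score[best_i] + log(transitions[best_i][j])
                             + log(emissions[j][obs]))
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Toy example: 0 = mouth, ..., 4 = colon; made-up confusion matrix.
confusion = [[92 if i == j else 2 for j in range(5)] for i in range(5)]
A = left_to_right_transitions(5)
B = emissions_from_confusion(confusion)
pi = [1.0, 0.0, 0.0, 0.0, 0.0]
noisy = [0, 0, 1, 1, 2, 2, 3, 2, 3, 3, 4, 4]  # one backward "flicker"
decoded = viterbi(noisy, pi, A, B)
```

Because backward transitions have probability zero, the decoded sequence can never return to an earlier organ, which removes isolated misclassifications such as the flicker in the toy input.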

4 Results

Our main objective is to provide a very light-weight multi-task model that improves on the current single-task baselines while reducing the overall complexity and providing useful functionality for VCE devices, predominantly by ensuring coverage of the whole small intestine through precise organ detection and localization of the capsule. Typically, deep neural networks are too large to be deployed on tiny edge devices such as the video capsule used for endoscopies. To address this, we conducted all experiments using the light-weight MobileNetV3-Small [7]. First, the results for the organ detection are presented in Table 2.

Table 2: Results of the multi-task (MT) runs for the localization task in [%] with the MobileNetV3 compared to single-task (ST) baseline results with the best results in bold and the second best results underlined.

| Model | Accuracy | F1 | Precision | Recall | Params | MACs |
|---|---|---|---|---|---|---|
| Baseline [11] | 81 | 71 | - | - | 25 M | 4.087 B |
| Our ST Localization | 85.05 | 60.55 | 54.42 | 86.44 | 1 M | 63.614 M |
| Our MT Results | | | | | | |
| MT UW w/o HMM | 80.96 | 49.30 | 48.50 | 69.58 | 1 M | 63.616 M |
| MT DWA w/o HMM | 86.22 | 61.53 | 54.48 | 87.08 | 1 M | 63.616 M |
| MT DWA Focal w/o HMM | 87.77 | 65.12 | 58.61 | 87.5 | 1 M | 63.616 M |
| MT UW & HMM | 90.48 | 66.22 | 71.57 | 79.14 | 1 M | 63.616 M |
| MT DWA & HMM | 93.48 | 91.02 | 88.56 | 94.34 | 1 M | 63.616 M |
| MT DWA Focal & HMM | 93.63 | 92.41 | 90.40 | 94.94 | 1 M | 63.616 M |

The ST model reaches a lower F1-score than the baseline (60.55% vs. 71%), possibly because the model is less complex; this is notably improved by the MT approach. The proposed MT approach, if combined with Viterbi decoding, shows a strong performance across all tested loss functions. The best results were achieved with the DWA and focal loss, with an accuracy of 93.63% and an F1-score of 92.41%, compared to an accuracy of 81% and an F1-score of 71% for the baseline [11]. We further observe that the uncertainty weighted loss shows inferior results compared to the DWA or DWA with focal loss.

Furthermore, the MT results for the anomaly detection task are shown in Table 3. While all three loss options lead to proficient results, the DWA loss produces the best results compared to the baseline and the ST model (F1-score: 54.38% vs. 37.01% for the baseline and 53.34% for the ST model). Only the recall is inferior to that of the baseline, with 54.69% vs. 60.88%. However, this work provides a VCE-targeted approach that performs the important tasks of localization and anomaly detection within one single model. Notably, both tasks can be conducted with the same model of only 1 M parameters, which is only 25% of the anomaly detection baseline model and merely 4% of the localization baseline model. In addition, the number of MAC operations is reduced from 4 B and 189 M to only 64 M with this approach, lowering the computational workload.

Table 3: Results of the multi-task (MT) runs for the anomaly detection task in [%] with the MobileNetV3 compared to single-task (ST) baseline results.

| Classification | Accuracy | F1 | Precision | Recall | Params | MACs |
|---|---|---|---|---|---|---|
| Baseline [21] | 87.28 | 37.01 | 26.59 | 60.88 | 4 M | 189 M |
| Our ST Anomaly Detection | 84.91 | 53.34 | 52.99 | 54.63 | 1 M | 63.611 M |
| Our MT Results | | | | | | |
| MT UW | 83.69 | 52.94 | 52.71 | 54.66 | 1 M | 63.616 M |
| MT DWA | 87.7 | 54.38 | 54.14 | 54.69 | 1 M | 63.616 M |
| MT DWA Focal | 87.48 | 53.24 | 53.08 | 53.45 | 1 M | 63.616 M |
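The size comparison in this section can be reproduced with a few lines of arithmetic. The sketch below reads the MAC entries of Tables 2 and 3 with a decimal comma, that is, as 4.087 B, 189 M and 63.616 M; this reading is an assumption, chosen because it matches the rounded values of 4 B, 189 M and 64 M quoted in the text. The dictionary keys are illustrative names.

```python
# Parameter counts and MAC operations as reported in Tables 2 and 3
# (MAC values interpreted with a decimal comma, see the assumption above).
PARAMS = {
    "ours": 1e6,
    "anomaly_baseline": 4e6,
    "localization_baseline": 25e6,
}
MACS = {
    "ours_multi_task": 63.616e6,
    "anomaly_baseline": 189e6,
    "localization_baseline": 4.087e9,
}


def percent_of(value, reference):
    """Return value as a percentage of reference."""
    return 100.0 * value / reference


params_vs_anomaly = percent_of(PARAMS["ours"], PARAMS["anomaly_baseline"])
params_vs_localization = percent_of(PARAMS["ours"],
                                    PARAMS["localization_baseline"])
macs_vs_anomaly = percent_of(MACS["ours_multi_task"],
                             MACS["anomaly_baseline"])
macs_vs_localization = percent_of(MACS["ours_multi_task"],
                                  MACS["localization_baseline"])
```

Under this reading, the multi-task model needs about 33.7% of the MAC operations of the anomaly detection baseline and about 1.6% of those of the localization baseline, while a single model replaces both.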

5 Conclusion

This work introduces a light-weight neural network with a total of just 1 M parameters, designed to combine precise self-localization and anomaly detection capabilities through a multi-task learning approach for VCE. We report the first multi-task results using the recently published Galar dataset and integrate various established multi-task methods and Viterbi decoding. With an accuracy of 93.63% and an F1-score of 92.41%, our approach shows superior performance for the localization task compared to the current baseline, while also enabling fundamental anomaly detection functionality. Thus, the presented work improves upon current AI models for VCE by providing precise localization within the GI tract and basic anomaly detection functionalities.

Prospect of application: Once the exit of the stomach has been determined by the multi-task model, additional features such as adapting frame rates (lowered before the small intestine and increased upon anomaly detection) or varying resolutions may be added. The approach can prolong battery life and serve as a starting point for deploying multi-task models on VCE devices.

Acknowledgments

This work has been partly funded by the German Federal Ministry of Research, Technology and Space (BMFTR) in the project MEDGE (16ME0530).

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

References

  • [1] Akiba, T., Sano, S., Yanase, T., Ohta, T., Koyama, M.: Optuna: A next-generation hyperparameter optimization framework. In: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. pp. 2623–2631 (2019)
  • [2] Caruana, R.: Multitask learning: A knowledge-based source of inductive bias. In: Proceedings of the Tenth International Conference on Machine Learning. pp. 41–48. Citeseer (1993)
  • [3] Caruana, R.: Multitask learning. Machine learning 28, 41–75 (1997)
  • [4] Charoen, A., Guo, A., Fangsaard, P., Taweechainaruemitr, S., Wiwatwattana, N., Charoenpong, T., Rich, H.G.: Rhode island gastroenterology video capsule endoscopy data set. Scientific Data 9(1), 602 (2022)
  • [5] Costamagna, G., Shah, S.K., Riccioni, M.E., Foschia, F., Mutignani, M., Perri, V., Vecchioli, A., Brizi, M.G., Picciocchi, A., Marano, P.: A prospective trial comparing small bowel radiographs and video capsule endoscopy for suspected small bowel disease. Gastroenterology 123(4), 999–1005 (2002)
  • [6] Forney, G.D.: The viterbi algorithm. Proceedings of the IEEE 61(3), 268–278 (1973)
  • [7] Howard, A., Sandler, M., Chu, G., Chen, L.C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., et al.: Searching for mobilenetv3. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1314–1324 (2019)
  • [8] Iddan, G., Meron, G., Glukhovsky, A., Swain, P.: Wireless capsule endoscopy. Nature 405(6785), 417–417 (2000)
  • [9] Kendall, A., Gal, Y., Cipolla, R.: Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7482–7491 (2018)
  • [10] Kong, Z., He, M., Luo, Q., Huang, X., Wei, P., Cheng, Y., Chen, L., Liang, Y., Lu, Y., Li, X., et al.: Multi-task classification and segmentation for explicable capsule endoscopy diagnostics. Frontiers in Molecular Biosciences 8, 614277 (2021)
  • [11] Le Floch, M., Wolf, F., McIntyre, L., Weinert, C., Palm, A., Volk, K., Herzog, P., Kirk, S.H., Steinhaeuser, J.L., Stopp, C., et al.: Galar-a large multi-label video capsule endoscopy dataset. medRxiv pp. 2024–09 (2024)
  • [12] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. pp. 2980–2988 (2017)
  • [13] Liu, S., Johns, E., Davison, A.J.: End-to-end multi-task learning with attention. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1871–1880 (2019)
  • [14] Medtronic: PillCam™ SB3 System. https://www.medtronic.com/covidien/en-nz/products/capsule-endoscopy/pillcam-sb-3-system.html/ (2025), [Online; accessed 7-May-2025]
  • [15] Monteiro, S., de Castro, F.D., Carvalho, P.B., Moreira, M.J., Rosa, B., Cotter, J.: Pillcam® sb3 capsule: Does the increased frame rate eliminate the risk of missing lesions? World journal of gastroenterology 22(10), 3066 (2016)
  • [16] Rabiner, L.R.: A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2), 257–286 (1989)
  • [17] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: Mobilenetv2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4510–4520 (2018)
  • [18] Smedsrud, P.H., Thambawita, V., Hicks, S.A., Gjestang, H., Nedrejord, O.O., Næss, E., Borgli, H., Jha, D., Berstad, T.J.D., Eskeland, S.L., et al.: Kvasir-capsule, a video capsule endoscopy dataset. Scientific Data 8(1), 142 (2021)
  • [19] Tang, S., Yu, X., Cheang, C.F., Liang, Y., Zhao, P., Yu, H.H., Choi, I.C.: Transformer-based multi-task learning for classification and segmentation of gastrointestinal tract endoscopic images. Computers in Biology and Medicine 157, 106723 (2023)
  • [20] Thomson, A., Keelan, M., Thiesen, A., Clandinin, M., Ropeleski, M., Wild, G.: Small bowel review: diseases of the small intestine. Digestive diseases and sciences 46, 2555–2566 (2001)
  • [21] Werner, J., Gerum, C., Nick, J., Floch, M.L., Brinkmann, F., Hampe, J., Bringmann, O.: Enhanced anomaly detection for capsule endoscopy using ensemble learning strategies. arXiv preprint arXiv:2504.06039 (2025)
  • [22] Werner, J., Gerum, C., Reiber, M., Nick, J., Bringmann, O.: Precise localization within the gi tract by combining classification of cnns and time-series analysis of hmms. In: International Workshop on Machine Learning in Medical Imaging. pp. 174–183. Springer (2023)
  • [23] Xu, T., Li, Y.Y., Huang, F., Gao, M., Cai, C., He, S., Wu, Z.X.: A multi-task neural network for image recognition in magnetically controlled capsule endoscopy. Digestive Diseases and Sciences pp. 1–9 (2024)
  • [24] Yadan, O.: Hydra - a framework for elegantly configuring complex applications. Github (2019), https://github.com/facebookresearch/hydra
  • [25] Yu, X., Tang, S., Cheang, C.F., Yu, H.H., Choi, I.C.: Multi-task model for esophageal lesion analysis using endoscopic images: classification with image retrieval and segmentation with attention. Sensors 22(1), 283 (2021)
  • [26] Zhang, Y., Yang, Q.: A survey on multi-task learning. IEEE transactions on knowledge and data engineering 34(12), 5586–5609 (2021)
  • [27] Zwinger, L.L., Siegmund, B., Stroux, A., Adler, A., Veltzke-Schlieker, W., Wentrup, R., Jürgensen, C., Wiedenmann, B., Wiedbrauck, F., Hollerbach, S., et al.: Capsocam sv-1 versus pillcam sb 3 in the detection of obscure gastrointestinal bleeding: results of a prospective randomized comparative multicenter study. Journal of clinical gastroenterology 53(3), e101–e106 (2019)