Slot Attention with Re-Initialization and Self-Distillation

Rongzhen Zhao (ORCID 0009-0000-3964-7336), Department of Electrical Engineering and Automation, Aalto University, Espoo, Finland, rongzhen.zhao@aalto.fi
Yi Zhao (ORCID 0009-0002-9979-595X), Department of Electrical Engineering and Automation, Aalto University, Espoo, Finland, yi.zhao@aalto.fi
Juho Kannala (ORCID 0000-0001-5088-4041), Department of Computer Science, Aalto University, Espoo, Finland; Center for Machine Vision and Signal Analysis, University of Oulu, Oulu, Finland, juho.kannala@aalto.fi
Joni Pajarinen (ORCID 0000-0003-4469-8191), Department of Electrical Engineering and Automation, Aalto University, Espoo, Finland, joni.pajarinen@aalto.fi
(2025)
Abstract.

Unlike popular solutions based on dense feature maps, Object-Centric Learning (OCL) represents visual scenes as sub-symbolic object-level feature vectors, termed slots, which are highly versatile for tasks involving visual modalities. OCL typically aggregates object superpixels into slots by iteratively applying competitive cross attention, known as Slot Attention, with the slots as the query. However, once initialized, these slots are reused naively, causing redundant slots to compete with informative ones for representing objects. This often results in objects being erroneously segmented into parts. Additionally, mainstream methods derive supervision signals solely from decoding slots into the input’s reconstruction, overlooking potential supervision based on internal information. To address these issues, we propose Slot Attention with re-Initialization and self-Distillation (DIAS): i) We reduce redundancy in the aggregated slots and re-initialize extra aggregation to update the remaining slots; ii) We drive the bad attention map at the first aggregation iteration to approximate the good at the last iteration to enable self-distillation. Experiments demonstrate that DIAS achieves state-of-the-art on OCL tasks like object discovery and recognition, while also improving advanced visual prediction and reasoning. Our code is available on https://github.com/Genera1Z/DIAS.

Object-Centric Learning, Slot Attention, Object Representation, Visual Prediction, Visual Reasoning
journalyear: 2025
copyright: cc
conference: Proceedings of the 33rd ACM International Conference on Multimedia; October 27–31, 2025; Dublin, Ireland
booktitle: Proceedings of the 33rd ACM International Conference on Multimedia (MM ’25), October 27–31, 2025, Dublin, Ireland
doi: 10.1145/3746027.3755339
isbn: 979-8-4007-2035-2/2025/10
submissionid: mfp3301
ccs: Computing methodologies, Computer vision representations

o1: Slots redundancy reduction yields limited boosts in object recognition, i.e., classification (left) and localization (right), while re-initialization boosts both object discovery and recognition.

o2: The decoding attention map does not always segment/discover objects better than the aggregation’s (left); Attention maps of the 3rd aggregation iteration are almost always better than that of the first (right).

Figure 1. Two key observations inspire our method. (o1) We re-initialize an extra aggregation to update the remaining slots after slots redundancy reduction, instead of decoding the remaining slots directly. (o2) We self-distill the attention map at the first iteration to approximate the almost-always-good attention at the last aggregation iteration, rather than from aggregation attention to decoding attention.

1. Introduction

The human visual system typically decomposes scenes into objects and allocates attention at the object level (Chen, 2012; Kahneman et al., 1992). This efficiently captures scene information, models object attributes and their interrelationships, and ultimately supports complex functionalities such as prediction, reasoning, planning and decision-making (Bar, 2004). Similarly, in Artificial Intelligence (AI), Object-Centric Learning (OCL) seeks to aggregate dense (super-)pixels of images or videos into sparse feature vectors, termed slots, representing corresponding objects in a sub-symbolic manner (Locatello et al., 2020). In multimodal AI systems, object-centric representations can encode visual information more effectively and integrate with other modalities more flexibly than popular approaches based on dense feature maps (Zhang et al., 2024; Wang et al., 2024).

Existing OCL typically follows the encoding-aggregation-decoding paradigm (Zhao et al., 2025b). First, input image or video pixels are encoded into a feature map. Second, object super-pixels in the feature map are iteratively aggregated into corresponding feature vectors, i.e., slots, using competitive cross attention, i.e., Slot Attention (Locatello et al., 2020) or its variants (Biza et al., 2023; Jia et al., 2023). Last, these slots are decoded to reconstruct the input in some format, providing the self-supervision signal.

However, (i1) once initialized, these slots are iterated naively in the aggregation. When there are fewer objects than the predefined number of slots, redundant slots compete with informative ones for object representation (Aydemir et al., 2023; Fan et al., 2024), causing objects to be segmented into parts. Additionally, (i2) mainstream methods (Singh et al., 2022a; Seitzer et al., 2023; Wu et al., 2023b) derive supervision signals solely from reconstructing the input using the aggregated slots, leaving internal information unused. For example, the aggregation attention maps at later iterations segment objects better and can be used to provide an extra supervision signal.

Issues (i1) and (i2) remain unresolved despite a few ingenious attempts. (r1) SOLV (Aydemir et al., 2023) and AdaSlot (Fan et al., 2024) remove possible redundancy in the aggregated slots, but do not further update the remaining slots. As shown in Figure 1 o1, the remaining slots, despite being informative, have already been distracted by redundant slots and are still of low quality in object representation. (r2) SPOT (Kakogeorgiou et al., 2024) drives the aggregation attention map to approximate the whole decoding attention map, but without discarding the latter’s bad compositions. As shown in Figure 1 o2, the decoding attention map, despite being generally better, sometimes has lower accuracy in object segmentation than the aggregation attention map. This is why SPOT has to pretrain an extra copy of the model as the teacher and has to run both the teacher and the student for its offline distillation, at least doubling the training cost.

Our solution is simple yet effective. (s1) We use the remaining slots after slots redundancy reduction to re-initialize an extra aggregation, instead of directly decoding the remaining slots. This improves the slots’ object representation quality. (s2) We use the attention map at the last aggregation iteration to guide that at the first iteration, since the former is almost always better. This enables self-distillation, requiring no extra model copy, no pretraining, and minimal additional cost. Besides, (s3) we implement a generalized auto-regressive (AR) decoder for OCL. Since existing AR decoders’ fixed-order flattening disrupts spatial correlations among superpixels (Singh et al., 2022a, b; Kakogeorgiou et al., 2024), we use random-order flattening to force the decoder to model spatial correlations.

In short, our contributions are (i) re-initialized OCL aggregation, (ii) self-distillation within aggregation iterations, (iii) random auto-regressive decoding, and (iv) new state-of-the-art performance.

Figure 2. Our DIAS introduces three novel designs: (i) Reinitialized aggregation, which improves slots’ object representation quality by re-initializing an extra aggregation to update the remaining slots after slots redundancy reduction; (ii) Self-distilled aggregation, which obtains better internal supervision by approximating the almost-always-better attention map at the last aggregation iteration from that at the first aggregation iteration; and (iii) Random auto-regressive decoding, which enforces the decoder’s modeling of spatial correlations by randomly flattening a 2-dimensional feature map into a 1-dimensional sequence. On the top left is the common OCL architecture, which is adapted from (Zhao et al., 2025b). $\bm{\phi}_{\mathrm{a}}$ is OCL aggregator, $\bm{\phi}_{\mathrm{d}}$ is OCL decoder, and $\bm{\phi}_{\mathrm{r}}$ is slots redundancy reduction. m is the mask token, bos is the Begin-of-Sentence token, and pos emb stands for the position embedding tensor. All notations are defined in the marked sections.

2. Related Work

OCL aggregation. The core module of OCL is Slot Attention (Locatello et al., 2020), an iterative cross attention that takes some feature vectors as the competitive query, with the input’s feature map as the key and value. BO-QSA (Jia et al., 2023) improves the gradient flow on the query vectors for better optimization. ISA (Biza et al., 2023) realizes scaling, rotation and translation invariance in slots with inductive biases. While others learn Gaussian distributions to sample the query vectors, SAVi (Kipf et al., 2022) and SAVi++ (Elsayed et al., 2022) project conditions like object bounding boxes into query vectors. These aggregators are all faced with the issue that redundant slots compete with informative slots to represent single objects as parts. SOLV (Aydemir et al., 2023) merges similar slots after aggregation using agglomerative clustering by some threshold. AdaSlot (Fan et al., 2024) removes redundant slots after aggregation using a learned policy with some regularization. These two recent methods only remove or merge slots but do not update the remaining ones. Our DIAS improves the aggregation by both reducing redundant slots and updating the remaining slots with extra re-initialized aggregation.

OCL decoding. There are mainly three types of decoding for OCL. Mixture-based decoders include CNN (Kipf et al., 2022; Elsayed et al., 2022), MLP (Seitzer et al., 2023) and SlotMixer (Zadaianchuk et al., 2024). The former two broadcast each of the aggregated slots into some spatial dimension, decode them separately and mix the final components as the reconstruction. This means that the computation is proportional to the number of slots (Seitzer et al., 2023), which is very expensive. The last one mixes the slots with a Transformer decoder without self-attention on the query. Such a decoder has to work in lower slot channel dimensions (Zadaianchuk et al., 2024), which limits object representation quality. Auto-regression-based decoders include the causal Transformer decoder (Singh et al., 2022a, b) and the causal Transformer decoder with nine permutations (Kakogeorgiou et al., 2024). They drive the slots to reconstruct the input’s feature map with the causally masked input feature as the query, in a next-token prediction manner. However, flattening a 2-dimensional feature map into a 1-dimensional sequence in a fixed order loses the spatial relationship (Kakogeorgiou et al., 2024). Diffusion-based decoders include different conditional Diffusion models (Wu et al., 2023b; Jiang et al., 2023). They recover the noise from the noise-diffused input with slots as conditions, to drive slots to be informative. However, Diffusion models are very large and sensitive to train (Rombach et al., 2022). Our DIAS realizes general auto-regression via arbitrary ordering. In terms of the reconstruction target, there are also works that introduce quantization (Singh et al., 2022a, b; Zhao et al., 2024, 2025a, 2025b), but we focus on the non-quantized case.

Knowledge distillation. Classical offline knowledge distillation (KD) (Gou et al., 2021) requires a pretrained model as the teacher and a to-be-trained model as the student. The teacher is frozen and runs alongside the student to provide better representations as targets for the corresponding student representations. There is also online distillation (Gou et al., 2021), where no pretraining of a teacher model is needed. The teacher typically has the same architecture as the student, and its weights are usually updated as an exponential moving average (Wikipedia, [n. d.]) of the student’s weights. Self-distillation (Gou et al., 2021) does not need an extra model as the teacher. Instead, intermediate representations in the model’s higher layers are taken as the teacher of the lower layers. This requires the higher layers’ representations to be better than the lower layers’. KD has not been fully explored in the OCL setting, and SPOT (Kakogeorgiou et al., 2024) is the only such work we know of. But SPOT’s “self-distillation” is actually offline distillation, which at least doubles the training cost. Our DIAS realizes exact self-distillation by utilizing the reliably refined attention maps of the aggregation iterations, saving much cost.

3. Proposed Method

We begin with the preliminary of object-centric learning, highlighting the issues in existing OCL methods. Then, we present our method, Slot Attention with re-Initialization and self-Distillation (DIAS). In Figure 2, our method is illustrated and compared with three recent studies that share similar ideas.

3.1. Preliminary

Existing OCL models typically consist of three major modules: encoder, aggregator and decoder. We focus on the latter two.

Given the initial slots, i.e., query vectors $\bm{S}^{0}\in\mathbb{R}^{s\times c}$, a Slot Attention module $\bm{\phi}_{\mathrm{a}}$ aggregates the feature map $\bm{Z}\in\mathbb{R}^{h\times w\times c}$ of an input image or video frame into slots $\bm{S}\in\mathbb{R}^{s\times c}$ after some iterations:

(1) $\bm{S}^{i},\bm{A}^{i}_{\mathrm{a}}=\bm{\phi}_{\mathrm{a}}(\bm{S}^{i-1},\bm{Z})\quad i=1\dots i_{\mathrm{a}}$
(2) $\bm{S}\equiv\bm{S}^{i_{\mathrm{a}}}$

where $i_{\mathrm{a}}$ is the total number of iterations for aggregation, and $i_{\mathrm{a}}=3$ for unconditional queries (Locatello et al., 2020) while $i_{\mathrm{a}}=1$ for conditional queries (Kipf et al., 2022). $\bm{A}^{i}_{\mathrm{a}}\in\mathbb{R}^{s\times h\times w}$ is the aggregation attention map and can be converted to object segmentation with $\mathrm{argmax}(\cdot)$ along dimension $s$. $\bm{\phi}_{\mathrm{a}}$ can be parameterized into different Slot Attention variants, as in Biza et al. (2023); Jia et al. (2023). We call $\bm{S}^{i}$ “query slots” and $\bm{S}$ “aggregated slots”. Note that $\bm{\phi}_{\mathrm{a}}$ enables all query vectors in $\bm{S}^{i}$ to compete with one another for some super-pixels in $\bm{Z}$.
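The iteration in Equations 1 and 2 can be illustrated with a minimal NumPy sketch. It is a simplification under stated assumptions, not the authors' implementation: the learned projections, layer normalization, GRU and MLP of the original Slot Attention are omitted, and the names `slot_attention_step` and `aggregate` are hypothetical. The point it shows is the competition: the softmax is taken over the slot dimension, so slots compete for each super-pixel, and the final attention map yields a segmentation via argmax along $s$.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_step(slots, feats, eps=1e-8):
    """One simplified aggregation iteration (no learned weights).

    slots: (s, c) query slots S^{i-1}; feats: (n, c) flattened feature map, n = h*w.
    Returns updated slots S^i of shape (s, c) and attention A^i_a of shape (s, n).
    """
    c = slots.shape[1]
    logits = slots @ feats.T / np.sqrt(c)        # (s, n) query-key logits
    attn = softmax(logits, axis=0)               # softmax over slots: competition
    # Normalize each slot's attention over super-pixels, then take a weighted mean.
    weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
    return weights @ feats, attn

def aggregate(slots0, feat_map, num_iters=3):
    """Run num_iters iterations (i_a) and derive a segmentation from the last attention."""
    h, w, c = feat_map.shape
    feats = feat_map.reshape(h * w, c)
    slots, attn = slots0, None
    for _ in range(num_iters):
        slots, attn = slot_attention_step(slots, feats)
    seg = attn.argmax(axis=0).reshape(h, w)      # argmax along the slot dimension
    return slots, attn.reshape(-1, h, w), seg

rng = np.random.default_rng(0)
S0 = rng.normal(size=(4, 8))       # s = 4 slots, c = 8 channels (toy sizes)
Z = rng.normal(size=(6, 6, 8))     # h = w = 6
S, A, seg = aggregate(S0, Z)
```

Because the softmax runs over slots, the attention values for each super-pixel sum to one across slots, which is what makes redundant slots take super-pixels away from informative ones.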

To obtain the self-supervision signal, the aggregated slots $\bm{S}$ are decoded by a decoder $\bm{\phi}_{\mathrm{d}}$ to reconstruct the input $\bm{X}$ in some format $\bm{Y}\in\mathbb{R}^{h'\times w'\times c'}$, with some clue $\bm{C}$:

(3) $\bm{Y}',\bm{A}_{\mathrm{d}}=\bm{\phi}_{\mathrm{d}}(\bm{C},\bm{S})$

where $\bm{A}_{\mathrm{d}}\in\mathbb{R}^{s\times h'\times w'}$ is decoding attention map and can be converted to object segmentation masks by binarization; $\bm{Y}'$ is the reconstructed $\bm{Y}$; and $\bm{Y}$ can be $\bm{X}$ (Locatello et al., 2020), $\bm{Z}$ (Seitzer et al., 2023), or $\bm{Z}$’s quantization (Zhao et al., 2025b), etc. $\bm{C}$ is the destructed $\bm{Y}$, which varies across methods.

The supervision signal comes from the reconstruction loss, either Mean Squared Error (MSE) or Cross Entropy (CE):

(4) $l_{\mathrm{recon}}=\mathrm{MSE}(\bm{Y}',\bm{Y})$
(5) $l_{\mathrm{recon}}=\mathrm{CE}(\bm{Y}',\bm{Y})$

The choice depends on specific methods. Mixture- and diffusion-based decoding use Equation 4, while auto-regression-based decoding uses either Equation 4 or Equation 5.
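As a rough illustration of the two reconstruction losses, the sketch below computes MSE on a continuous target and CE on a discrete one. It rests on an assumption not spelled out in this section: that CE is applied when the target is a set of discrete indices (for instance, codes of a quantized feature map), with the decoder emitting one score per class. The function names are hypothetical.

```python
import numpy as np

def mse_loss(pred, target):
    """Mean squared error between a reconstruction Y' and a continuous target Y."""
    return float(np.mean((pred - target) ** 2))

def ce_loss(logits, target_idx):
    """Cross entropy for discrete targets.

    logits: (n, k) decoder scores over k classes for n super-pixels;
    target_idx: (n,) integer class indices (assumed discrete target).
    """
    z = logits - logits.max(axis=1, keepdims=True)            # numerical stability
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(target_idx)), target_idx].mean())

# Toy usage: continuous features for MSE, code indices for CE.
rng = np.random.default_rng(0)
Y = rng.normal(size=(16, 8))
Y_pred = Y + 0.1 * rng.normal(size=(16, 8))
loss_mse = mse_loss(Y_pred, Y)
loss_ce = ce_loss(rng.normal(size=(16, 32)), rng.integers(0, 32, size=16))
```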

Figure 3. Visualization of object discovery.

3.2. Re-Initialized Aggregation

Here is the first issue we address.

When there are fewer objects (plus the background) than the number of slots, redundant slots should be removed, or they will compete for object super-pixels and cause objects to be segmented into parts. SOLV (Aydemir et al., 2023) and AdaSlot (Fan et al., 2024) are the only two studies so far that explore this idea. They both conduct the redundancy reduction $\bm{\phi}_{\mathrm{r}}$ after the aggregation:

(6) $\bm{S}_{\mathrm{r}},\bm{M}_{\mathrm{r}}=\bm{\phi}_{\mathrm{r}}(\bm{S})$
(7) $\bm{Y}',\bm{A}_{\mathrm{d}}=\bm{\phi}_{\mathrm{d}}(\bm{C},\bm{S}_{\mathrm{r}},\bm{M}_{\mathrm{r}})$

where $\bm{S}_{\mathrm{r}}$ is in the same shape as $\bm{S}$ with redundant slots being zeroed out; $\bm{M}_{\mathrm{r}}\in\mathbb{R}^{s}$ tells the decoder $\bm{\phi}_{\mathrm{d}}$ which slots are zeroed. Equation 7 is no different from Equation 3 except masking.

But because the slot aggregation is competitive, the remaining slots have already been affected by the redundant slots and are still of low quality in object representation. As shown in Figure 1 o1, redundancy reduction alone yields some boosts on object discovery but very limited boosts on object recognition.

Thus, we propose using the remaining slots after redundancy reduction to re-initialize an extra aggregation. Our aggregation and decoding are formulated as follows:

(8) $\bm{S}^{i},\bm{A}^{i}_{\mathrm{a}}=\bm{\phi}_{\mathrm{a}}(\bm{S}^{i-1},\bm{Z})\quad i=1\dots i_{\mathrm{a}}-1$
(9) $\bm{S}_{\mathrm{r}},\bm{M}_{\mathrm{r}}=\bm{\phi}_{\mathrm{r}}(\bm{S}^{i_{\mathrm{a}}-1})$
(10) $\bm{S},\bm{A}_{\mathrm{a}}=\bm{\phi}_{\mathrm{a}}(\bm{S}_{\mathrm{r}},\bm{Z},\bm{M}_{\mathrm{r}})$
(11) $\bm{Y}',\bm{A}_{\mathrm{d}}=\bm{\phi}_{\mathrm{d}}(\bm{C},\bm{S},\bm{M}_{\mathrm{r}})$

Here Equation 8 is the initial aggregation, Equation 9 is the redundancy reduction, Equation 10 is the extra aggregation, and Equation 11 is the masked decoding. For the initial aggregation, we use $i_{\mathrm{a}}-1$ iterations, and $i_{\mathrm{a}}\geq 2$. For the redundancy reduction, we adopt SOLV’s solution, agglomerative clustering (Learn, 2025b) by cosine distance given some threshold. For the extra aggregation, we apply the redundancy mask $\bm{M}_{\mathrm{r}}$ to the query-key attention logits so as to discard the effect of those zeroed-out slots (Fan et al., 2024). For the masked decoding, we follow the solutions from SOLV and AdaSlot.

As shown in Figure 1 o1, further updating the remaining slots after redundancy reduction, by re-initializing an extra aggregation, yields boosts in both object discovery and recognition.
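The redundancy reduction and the masked extra aggregation of Equations 9 and 10 can be sketched as follows. This is an illustration under explicit assumptions, not the authors' code: `reduce_redundancy` is a greedy cosine-distance merge that stands in for agglomerative clustering, merged slots are averaged (an assumption), the threshold value is arbitrary, and `masked_slot_attention_step` is a hypothetical, weight-free attention step. What it does show is how the mask $\bm{M}_{\mathrm{r}}$ can be applied to the query-key logits so that zeroed-out slots take no part in the competition.

```python
import numpy as np

def reduce_redundancy(slots, threshold=0.1):
    """Simplified stand-in for phi_r (greedy merge, not agglomerative clustering).

    A slot whose cosine distance to an already kept slot is below `threshold`
    is merged into it (mean of members, an assumption). Returns S_r with the
    merged-away slots zeroed out and a boolean keep mask M_r of shape (s,).
    """
    s = slots.shape[0]
    unit = slots / (np.linalg.norm(slots, axis=1, keepdims=True) + 1e-8)
    dist = 1.0 - unit @ unit.T                    # pairwise cosine distance
    keep = np.zeros(s, dtype=bool)
    groups = {}
    for i in range(s):
        kept_idx = np.flatnonzero(keep)
        close = kept_idx[dist[i, kept_idx] < threshold]
        if close.size == 0:
            keep[i] = True                        # slot i starts a new group
            groups[i] = [i]
        else:
            j = close[np.argmin(dist[i, close])]  # nearest kept slot absorbs slot i
            groups[int(j)].append(i)
    reduced = np.zeros_like(slots)
    for j, members in groups.items():
        reduced[j] = slots[members].mean(axis=0)
    return reduced, keep

def masked_slot_attention_step(slots, feats, keep, eps=1e-8):
    """Extra aggregation: masked slots get -inf logits, hence zero attention."""
    c = slots.shape[1]
    logits = slots @ feats.T / np.sqrt(c)                 # (s, n)
    logits = np.where(keep[:, None], logits, -np.inf)     # apply M_r to the logits
    logits = logits - logits.max(axis=0, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=0, keepdims=True)         # softmax over kept slots only
    weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
    return weights @ feats, attn

# Toy example: the 4th slot is a near-duplicate of the 1st one.
base = np.eye(3, 8)
slots = np.vstack([base, base[:1] + 0.01])
feats = np.random.default_rng(0).normal(size=(36, 8))
S_r, M_r = reduce_redundancy(slots, threshold=0.1)
S, A_a = masked_slot_attention_step(S_r, feats, M_r)
```

In this toy case only the near-duplicate slot is masked, and the remaining three slots are updated without it competing for super-pixels.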

| Method | ClevrTex #slot=11 | | | | COCO #slot=7 | | | | VOC #slot=6 | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | ARI | ARIfg | mBO | mIoU | ARI | ARIfg | mBO | mIoU | ARI | ARIfg | mBO | mIoU |
| SLATE | 17.4 ±2.9 | 87.4 ±1.7 | 44.5 ±2.2 | 43.3 ±2.4 | 17.5 ±0.6 | 28.8 ±0.3 | 26.8 ±0.3 | 25.4 ±0.3 | 18.6 ±0.1 | 26.2 ±0.8 | 37.2 ±0.5 | 36.1 ±0.4 |
| DINOSAUR | 50.7 ±24.1 | 89.4 ±0.3 | 53.3 ±5.0 | 52.8 ±5.2 | 18.2 ±1.0 | 35.0 ±1.2 | 28.3 ±0.5 | 26.9 ±0.5 | 21.5 ±0.7 | 35.2 ±1.3 | 40.6 ±0.6 | 39.7 ±0.6 |
| SlotDiffusion | 66.1 ±1.3 | 82.7 ±1.6 | 54.3 ±0.5 | 53.4 ±0.8 | 17.7 ±0.5 | 28.7 ±0.1 | 27.0 ±0.4 | 25.6 ±0.4 | 17.0 ±1.2 | 21.7 ±1.8 | 35.2 ±0.9 | 34.0 ±1.0 |
| AdaSlot | | | | | 19.1 ±2.3 | 35.2 ±2.0 | 29.4 ±1.2 | 27.5 ±0.8 | | | | |
| SPOT | 65.2 ±11.5 | 89.6 ±2.4 | 56.3 ±4.1 | 55.1 ±3.7 | 20.3 ±0.7 | 41.1 ±0.3 | 30.4 ±0.1 | 29.0 ±0.9 | 25.3 ±5.2 | 36.4 ±2.8 | 44.4 ±1.7 | 43.4 ±2.1 |
| DIAS (image) | 69.4 ±8.7 | 89.9 ±1.3 | 59.4 ±2.9 | 57.8 ±2.4 | 22.1 ±0.8 | 42.7 ±0.2 | 32.8 ±0.1 | 31.3 ±0.4 | 27.3 ±4.6 | 37.7 ±2.5 | 46.1 ±2.1 | 45.0 ±1.7 |

Table 1. Object discovery on images. Using DINO2 ViT s/14 for OCL encoding; input resolution is 256×256 (224×224).

3.3. Self-Distilled Aggregation

Here is the second issue we address.

The key to realizing self-distillation, or distillation in general, is that the teacher should be better than the student during training (although after training the student can be better than the teacher). SPOT (Kakogeorgiou et al., 2024) resorts to pretraining an extra copy of the model as the teacher to realize such a performance gap.

However, the decoding attention map $\bm{A}_{\mathrm{d}}$ generally exhibits better object segmentation than the aggregation attention map $\bm{A}_{\mathrm{a}}$, but not always, as shown in Figure 1 o2 left. In contrast, the attention map at the last aggregation iteration is almost always better than that at the first iteration. Thus, we can enforce additional supervision by forcing the first aggregation attention map to approximate the attention map at the last aggregation iteration.

SPOT (Kakogeorgiou et al., 2024) uses the entire decoding attention map as the target without selection. If some decoding attention compositions are worse than the corresponding aggregation attention, such approximation harms the performance. This is also why they have to pretrain an extra copy of the model and use it as the teacher to guide the training of the (student) model. Such offline distillation at least doubles the training cost. Their approximation is formulated as:

(12) $\bm{M}_{\mathrm{a}}^{*}=\mathrm{Hungarian}(\mathrm{bin}_{s}(\bm{A}_{\mathrm{a}}),\mathrm{bin}_{s}(\bm{A}_{\mathrm{d}}^{\mathrm{teacher}}))$
(13) $l_{\mathrm{approx}}=\mathrm{CE}(\bm{A}_{\mathrm{a}},\bm{M}_{\mathrm{a}}^{*})$

where $\mathrm{bin}_{s}$ is binarization along dimension $s$, with maximum value as one and others as zeros; $\mathrm{Hungarian}(\cdot)$ is Hungarian matching; $l_{\mathrm{approx}}$ is the approximation loss.

To address this issue, we gain the signal for self-distillation from the intrinsic performance gap between the last aggregation iteration and the first. As shown in Figure 1 o2 right, in 90% of cases, the attention map at the last aggregation iteration is better than that at the first iteration.

Our solution is quite simple:

(14) $\bm{M}_{\mathrm{a}}^{*}=\mathrm{Hungarian}(\mathrm{bin}_{s}(\bm{A}_{\mathrm{a}}^{1}),\mathrm{bin}_{s}(\bm{A}_{\mathrm{a}}))$
(15) $l_{\mathrm{approx}}=\mathrm{CE}(\bm{A}_{\mathrm{a}}^{1},\bm{M}_{\mathrm{a}}^{*})$

where $\mathrm{bin}_{s}$ is binarization along dimension $s$, with maximum value as one and others as zeros; $\mathrm{Hungarian}(\cdot)$ is Hungarian matching; $l_{\mathrm{approx}}$ is the approximation loss; $\bm{A}_{\mathrm{a}}^{1}$ is the attention map at the first aggregation iteration, while $\bm{A}_{\mathrm{a}}$ is the attention map at the last aggregation, i.e., after the re-initialization defined in Section 3.2.

As the teacher $\bm{A}_{\mathrm{a}}$ is better than $\bm{A}_{\mathrm{a}}^{1}$ in most cases, which is an intrinsic performance gap, we do not need to pretrain a model as the teacher to produce such a definitive performance gap.
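Equations 14 and 15 can be made concrete with a small NumPy sketch, offered as one plausible reading rather than the authors' implementation. Two assumptions are made and flagged in the code: the matching step reorders the teacher's binarized masks to align with the student's slot order by maximizing pixel overlap, and a brute-force search over permutations stands in for the Hungarian algorithm (practical only for a handful of slots). In training, the target would presumably be treated as a constant, which is also an assumption. All function names are hypothetical.

```python
import itertools
import numpy as np

def binarize(attn):
    """bin_s: one-hot along the slot dimension. attn: (s, n) -> (s, n) of 0/1."""
    onehot = np.zeros_like(attn)
    onehot[attn.argmax(axis=0), np.arange(attn.shape[1])] = 1.0
    return onehot

def match_teacher_to_student(student_bin, teacher_bin):
    """Reorder teacher masks to align with student slots.

    Assumption: alignment maximizes total pixel overlap. Brute force over
    permutations replaces Hungarian matching here; only feasible for small s.
    """
    s = student_bin.shape[0]
    overlap = student_bin @ teacher_bin.T          # (s, s) shared-pixel counts
    best = max(itertools.permutations(range(s)),
               key=lambda p: sum(overlap[i, p[i]] for i in range(s)))
    return teacher_bin[list(best)]

def self_distill_loss(attn_first, attn_last, eps=1e-8):
    """Cross entropy between the first-iteration attention and the matched target."""
    target = match_teacher_to_student(binarize(attn_first), binarize(attn_last))
    return float(-(target * np.log(attn_first + eps)).sum(axis=0).mean())

# Toy example: 3 slots, 6 super-pixels; the student's slots 0 and 1 are swapped.
attn_last = np.array([[0.8, 0.8, 0.1, 0.1, 0.1, 0.1],
                      [0.1, 0.1, 0.8, 0.8, 0.1, 0.1],
                      [0.1, 0.1, 0.1, 0.1, 0.8, 0.8]])
attn_first = attn_last[[1, 0, 2]]
target = match_teacher_to_student(binarize(attn_first), binarize(attn_last))
loss = self_distill_loss(attn_first, attn_last)
```

The matching makes the loss insensitive to slot ordering: here the student agrees with the teacher up to a permutation, so the loss reduces to the negative log of the student's confidence on each super-pixel.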

3.4. Random Auto-Regressive Decoding

We also improve the decoding.

As discussed in Section 2 “OCL decoding”, all three types of OCL decoding have their disadvantages. Mixture-based decoders like CNN (Kipf et al., 2022; Elsayed et al., 2022) and MLP (Seitzer et al., 2023) are too computationally expensive, while SlotMixer (Zadaianchuk et al., 2024) is efficient but has limited object representation quality. Diffusion-based decoders (Jiang et al., 2023; Wu et al., 2023b) are expensive and sensitive to train. Auto-regression-based decoders like the Transformer decoder (Singh et al., 2022a, b) are very efficient, but their next-token prediction has to flatten the 2-dimensional super-pixels into a 1-dimensional sequence in a fixed order, so they do not utilize the spatial correlation among super-pixels. Transformer9 (Kakogeorgiou et al., 2024) utilizes nine handcrafted permutations $\{\bm{I}^{j}\mid j=1\dots 9\}$ to better capture spatial dependencies:

(16) 𝒀bosj=concatn(𝑬bosj,gathern(flattenh,w(𝒀),𝑰j)[:1])\bm{Y}_{\mathrm{bos}}^{j}=\mathrm{concat}_{n}(\bm{E}_{\mathrm{bos}}^{j},\mathrm{gather}_{n}(\mathrm{flatten}_{h,w}(\bm{Y}),\bm{I}^{j})[:-1])bold_italic_Y start_POSTSUBSCRIPT roman_bos end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT = roman_concat start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( bold_italic_E start_POSTSUBSCRIPT roman_bos end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT , roman_gather start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( roman_flatten start_POSTSUBSCRIPT italic_h , italic_w end_POSTSUBSCRIPT ( bold_italic_Y ) , bold_italic_I start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) [ : - 1 ] )
(17) 𝒀j=ϕd(𝒀bosj,𝑴causal){\bm{Y}^{j}}^{\prime}=\bm{\phi}_{\mathrm{d}}(\bm{Y}^{j}_{\mathrm{bos}},\bm{M}_{\mathrm{causal}})bold_italic_Y start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT = bold_italic_ϕ start_POSTSUBSCRIPT roman_d end_POSTSUBSCRIPT ( bold_italic_Y start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_bos end_POSTSUBSCRIPT , bold_italic_M start_POSTSUBSCRIPT roman_causal end_POSTSUBSCRIPT )

where $\bm{I}^{j}\in\mathbb{R}^{n}$ is the $j$th handcrafted super-pixel ordering, with $n=h\times w$; $\bm{E}^{j}_{\mathrm{bos}}\in\mathbb{R}^{c}$ is the $j$th Beginning-of-Sentence (BoS) token (Kakogeorgiou et al., 2024); $\mathrm{flatten}_{h,w}(\cdot)$ flattens the height and width dimensions; $\mathrm{gather}_{n}(\cdot,\cdot)$ selects entries of a tensor along dimension $n$ given the index tensor $\bm{I}^{j}$; $\mathrm{concat}_{n}(\cdot,\cdot)$ concatenates two tensors along dimension $n$; $\bm{Y}^{j}_{\mathrm{bos}}\in\mathbb{R}^{n\times c}$ is the sequence prepended with BoS; $\bm{M}_{\mathrm{causal}}\in\mathbb{R}^{n\times n}$ is the typical causal mask (Kakogeorgiou et al., 2024); and $\bm{\phi}_{\mathrm{d}}$ is parameterized as a Transformer decoder.
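To make Equation 16 concrete, the following is a minimal NumPy sketch of how one permuted, BoS-prefixed input sequence and a causal mask could be built. It is not the authors' or SPOT's implementation: the function names, the toy sizes and the column-major ordering are assumptions chosen only for illustration, and the decoder $\bm{\phi}_{\mathrm{d}}$ itself is omitted.

```python
import numpy as np

def build_bos_sequence(Y, order, bos):
    """Sketch of Eq. 16 for one ordering.

    Y:     (h, w, c) target feature map
    order: (n,) integer permutation of range(h * w), standing in for I^j
    bos:   (c,) BoS embedding, standing in for E_bos^j
    """
    h, w, c = Y.shape
    flat = Y.reshape(h * w, c)      # flatten_{h,w}(Y)
    gathered = flat[order]          # gather_n(., I^j)
    shifted = gathered[:-1]         # [:-1]: the last token is never an input
    return np.concatenate([bos[None, :], shifted], axis=0)  # (n, c)

def causal_mask(n):
    """Boolean (n, n) mask; True marks positions a query may attend to."""
    return np.tril(np.ones((n, n), dtype=bool))

# Toy sizes (assumptions for illustration only).
h, w, c = 2, 3, 4
Y = np.arange(h * w * c, dtype=np.float32).reshape(h, w, c)
# Hypothetical ordering: a column-major scan instead of the row-major default.
order = np.arange(h * w).reshape(h, w).T.reshape(-1)
bos = np.zeros(c, dtype=np.float32)

Y_bos = build_bos_sequence(Y, order, bos)  # decoder input for this ordering
M = causal_mask(h * w)                     # passed alongside Y_bos in Eq. 17
```

Each of the nine orderings would produce its own `Y_bos` in this way; only the index tensor and the BoS embedding change.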

The spatial relationships among super-pixels cannot be sufficiently captured by these nine handcrafted orderings. Therefore, we propose a generalized auto-regressive decoding scheme that accommodates arbitrary super-pixel orderings.

(18) $n^{\#}\sim\mathcal{U}\{0,1,\dots,n\},\quad n=h\times w$
(19) $\bm{I}=\mathrm{shuffle}(\bm{I}^{0})$
(20) $\bm{E}_{\mathrm{p}}^{\#}=\mathrm{gather}_{n}(\mathrm{flatten}_{h,w}(\bm{E}_{\mathrm{p}}),\bm{I}[:n^{\#}+1])$
(21) $\bm{Y}^{\#}=\mathrm{concat}_{n^{\#}}\big(\mathrm{gather}_{n}(\mathrm{flatten}_{h,w}(\bm{Y}),\bm{I}[:n^{\#}]),\ \bm{E}_{\mathrm{m}}\big)$
(22) ${\bm{Y}^{\#}}^{\prime}=\bm{\phi}_{\mathrm{d}}(\bm{E}^{\#}_{\mathrm{p}}+\bm{Y}^{\#})$

Equation 18 obtains a random sequence length $n^{\#}$; Equation 19 samples an arbitrary sequence order; Equation 20 shuffles the position embedding tensor $\bm{E}_{\mathrm{p}}\in\mathbb{R}^{h\times w\times c}$, while Equation 21 shuffles the target tensor $\bm{Y}$, with the mask token $\bm{E}_{\mathrm{m}}\in\mathbb{R}^{c}$ indicating the next token to predict; and Equation 22 takes the positional information and the known tokens to predict the next token.
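The input construction of Equations 18 to 21 can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the released DIAS code: `random_ar_inputs` and all tensor values are hypothetical, the decoder $\bm{\phi}_{\mathrm{d}}$ is left out, and the number of known tokens is capped at $n-1$ (the text samples from $\{0,\dots,n\}$) so that a next position to predict always exists.

```python
import numpy as np

def random_ar_inputs(Y, E_p, E_m, rng):
    """Build one random-order decoder input (sketch of Eqs. 18-22, decoder omitted).

    Y:   (h, w, c) target feature map
    E_p: (h, w, c) position embeddings
    E_m: (c,) mask token marking the position to predict
    """
    h, w, c = Y.shape
    n = h * w
    # Eq. 18: number of known tokens. Assumption: capped at n - 1 here so the
    # slice order[:n_known + 1] always contains the position to predict.
    n_known = int(rng.integers(0, n))
    # Eq. 19: a random ordering of all n positions.
    order = rng.permutation(n)
    # Eq. 20: position embeddings of the known tokens plus the next position.
    pos = E_p.reshape(n, c)[order[: n_known + 1]]
    # Eq. 21: known target tokens followed by the mask token.
    known = Y.reshape(n, c)[order[:n_known]]
    tokens = np.concatenate([known, E_m[None, :]], axis=0)
    # Argument of the decoder in Eq. 22: positional information + known tokens.
    return pos + tokens, order, n_known

rng = np.random.default_rng(0)
h, w, c = 4, 4, 8                       # toy sizes, assumptions only
Y = rng.normal(size=(h, w, c))
E_p = rng.normal(size=(h, w, c))
E_m = rng.normal(size=(c,))
x, order, n_known = random_ar_inputs(Y, E_p, E_m, rng)
```

Because both the ordering and the number of known tokens are redrawn on each call, repeated sampling exposes the decoder to many different subsets of known super-pixels rather than a fixed scan order.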

Accordingly, the reconstruction loss is modified as below:

(23) $l_{\mathrm{recon}}=\mathrm{MSE}({\bm{Y}^{\#}}^{\prime},\bm{Y}^{\#})$

Our random auto-regression compels the decoder $\bm{\phi}_{\mathrm{d}}$ to reconstruct the target $\bm{Y}$ from any subset of known super-pixels at arbitrary positions in the feature map. Combined with the former two techniques, this further boosts OCL performance.

4. Experiment

We conduct the following experiments using three random seeds. We evaluate object representation quality through object discovery and recognition, as well as downstream tasks, including video prediction and reasoning.

4.1. Object Discovery

The attention maps of OCL aggregation and decoding serve as segmentations of the objects and background discovered by the slots. We measure these segmentations with standard metrics, i.e., ARI (Adjusted Rand Index) (Learn, 2025a), ARIfg (foreground Adjusted Rand Index), mBO (mean Best Overlap) (Uijlings et al., 2013) and mIoU (mean Intersection-over-Union) (Learn, [n. d.]). ARI is dominated by larger areas, i.e., the background; mBO only measures the best-overlapping segments; mIoU is the strictest measure.
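As an illustration of how these metrics behave, here is a small standard-library sketch of ARI, ARIfg and mBO on flattened label maps. It is a simplified reading of the metrics, not the evaluation code used in the paper: the background id, the exclusion of the background from mBO and the toy labels are all assumptions, and mIoU (which additionally requires a one-to-one matching between segments) is left out.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true, pred):
    """ARI between two flat label sequences of equal length."""
    total = comb(len(true), 2)
    if total == 0:
        return 1.0
    sum_ij = sum(comb(v, 2) for v in Counter(zip(true, pred)).values())
    sum_a = sum(comb(v, 2) for v in Counter(true).values())
    sum_b = sum(comb(v, 2) for v in Counter(pred).values())
    expected = sum_a * sum_b / total
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

def foreground_ari(true, pred, background=0):
    """ARIfg: ARI restricted to pixels whose ground truth is not background."""
    kept = [(t, p) for t, p in zip(true, pred) if t != background]
    return adjusted_rand_index([t for t, _ in kept], [p for _, p in kept])

def iou(true, pred, t, p):
    inter = sum(1 for a, b in zip(true, pred) if a == t and b == p)
    union = sum(1 for a, b in zip(true, pred) if a == t or b == p)
    return inter / union if union else 0.0

def mean_best_overlap(true, pred, background=0):
    """mBO: for each ground-truth object, the best IoU over predicted segments.

    Assumption: the background segment is excluded from the average.
    """
    gt_ids = sorted(set(true) - {background})
    pred_ids = sorted(set(pred))
    best = [max(iou(true, pred, t, p) for p in pred_ids) for t in gt_ids]
    return sum(best) / len(best)

# Toy 8-pixel example: label 0 is background, labels 1 and 2 are objects.
true = [0, 0, 0, 0, 1, 1, 2, 2]
pred = [5, 5, 5, 5, 6, 6, 6, 7]   # slot ids are arbitrary
ari = adjusted_rand_index(true, pred)
ari_fg = foreground_ari(true, pred)
mbo = mean_best_overlap(true, pred)
```

In this toy case the background is segmented perfectly while the two objects are partly merged, so the full ARI comes out noticeably higher than ARIfg, which matches the remark that ARI is dominated by larger areas.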

For baselines, we use the models from (Zhao et al., 2025b), which reproduces very strong baselines by unifying representative OCL models under current best practices. Among them, SLATE, DINOSAUR, SlotDiffusion and SPOT are designed for images; STEVE and VideoSAUR are for videos; VVO with different decoders supports images or videos. We also include AdaSlot (for images) (Fan et al., 2024) and SOLV (for videos) (Aydemir et al., 2023).

For datasets, we use ClevrTex (the main set for training and the out-of-distribution set for evaluation) (Karazija et al., [n. d.]), COCO (instance segmentation) (Lin et al., 2014), VOC (Everingham et al., 2010), MOVi-D (Research, [n. d.]) and YTVIS (high-quality) (SysCV, [n. d.]). Among them, ClevrTex consists of synthetic images with complex textures; COCO and VOC consist of real-world images; MOVi-D is a set of challenging synthetic videos with up to 20 objects per scene; YTVIS consists of real-world YouTube videos.

As shown in Table 1, with unconditional queries, DIAS outperforms all baselines on both synthetic and real-world images in object discovery under all metrics, achieving new state-of-the-art performance. As shown in Table 2, with conditional queries, the temporal version of DIAS surpasses STEVE by a large margin on synthetic videos; with unconditional queries, it outperforms the state-of-the-art video OCL models. As shown in Table 3, combined with the general improver VVO (Zhao et al., 2025b), DIAS pushes state-of-the-art performance even further.

Specifically, only SOLV, AdaSlot and DIAS (including its temporal version) have a redundant-slot reduction operation, but DIAS is significantly better than the others. Besides, only SPOT and DIAS use distillation, but SPOT has to pretrain an extra copy of the model as the teacher and then run both the teacher and a new copy of the model simultaneously to perform the distillation. By contrast, DIAS runs a single copy of the model without any pretraining, and thus, as shown in Table 4, consumes only average-level training resources.

               ARI         ARIfg       mBO         mIoU
MOVi-D #slot=21 conditional
STEVE          31.2 ±3.8   62.9 ±5.1   22.2 ±2.2   19.9 ±2.4
DIAS (video)   37.2 ±3.5   64.7 ±3.7   25.9 ±2.4   22.7 ±2.6
YTVIS #slot=7 #step=20
VideoSAUR      33.0 ±0.6   49.0 ±0.9   30.8 ±0.4   30.1 ±0.6
SOLV           36.2 ±1.1   50.4 ±1.2   30.7 ±0.8   31.0 ±0.7
DIAS (video)   38.7 ±1.0   52.1 ±0.4   33.3 ±0.7   34.6 ±0.6
Table 2. Object discovery on videos. Using DINO2 ViT s/14 for OCL encoding; input resolution is 256×256 (224×224).
COCO #slot=7
                     ARI         ARIfg       mBO         mIoU
VVO (Tfd)            21.1 ±2.1   31.5 ±1.1   29.6 ±0.7   28.2 ±0.8
VVO (Mlp)            19.2 ±0.4   35.9 ±0.6   28.7 ±0.3   27.4 ±0.3
VVO (Dfz)            18.3 ±0.4   29.0 ±1.0   27.2 ±0.1   25.8 ±0.1
DIAS (image) +VVO    22.9 ±1.7   43.1 ±0.8   33.2 ±0.7   31.7 ±0.5
Table 3. Combined with technique VVO, DIAS pushes SotA forward even further. Using DINO2 ViT s/14 for OCL encoding; input resolution is 256×256 (224×224).
COCO #slot=7
               pretraining          training
               memory    time       memory    time
SPOT           6.4 GB    11.6 h     7.0 GB    12.1 h
DIAS (image)   N/A       N/A        6.5 GB    12.0 h
Table 4. Training cost on COCO. DIAS needs no pretraining stage and thus spends roughly half of SPOT's total training time. Using one RTX 3080; automatic mixed precision.
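As a rough check of the time figures in Table 4, counting wall-clock hours only and assuming that SPOT's pretraining and training stages run one after the other, the ratio of total training time works out as:

```latex
\[
\frac{t_{\mathrm{DIAS}}}{t_{\mathrm{SPOT}}}
= \frac{12.0\,\mathrm{h}}{11.6\,\mathrm{h} + 12.1\,\mathrm{h}}
= \frac{12.0}{23.7}
\approx 0.51
\]
```

Peak training memory is comparable between the two (6.5 GB versus 7.0 GB), so the saving comes almost entirely from skipping the pretraining stage.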

4.2. Object Recognition

With slots representing different objects or the background, we can further evaluate how much object information they aggregate. In other studies, this is also called set prediction (Locatello et al., 2020) or object property prediction (Fan et al., 2024). Following the routine of (Seitzer et al., 2023), we reuse the trained models from Section 4.1 to convert the real-world image dataset COCO into slot representations, and train a small MLP that takes each slot as input and predicts the corresponding object’s class label and bounding box. We use top-1 accuracy (Learn, 2025d) and R2 (Learn, 2025c) to measure classification and regression performance, respectively.
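The two recognition metrics can be sketched in plain Python as below. This is only an illustration of the definitions, assuming that each slot has already been matched to a ground-truth object; the matching step, the toy scores and the single-coordinate R2 (the cited scikit-learn function also handles multi-output targets) are assumptions not taken from the paper.

```python
def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot, for one target dimension."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical per-slot class scores and labels for four matched slots.
scores = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.2, 0.3, 0.5], [0.6, 0.3, 0.1]]
labels = [1, 0, 2, 1]
acc = top1_accuracy(scores, labels)

# Hypothetical normalized values of one bounding-box coordinate.
box_true = [0.1, 0.4, 0.5, 0.8]
box_pred = [0.2, 0.4, 0.4, 0.9]
r2 = r2_score(box_true, box_pred)
```

Higher is better for both metrics; R2 reaches 1 for a perfect regression and drops to 0 when the predictions are no better than always predicting the mean.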

As shown in Table 5, with or without the general improver VVO, our method achieves better classification accuracy and bounding-box regression accuracy on these real-world images. Considering also its superiority in object discovery, we conclude that DIAS represents visual scenes with better object representation quality than all the baselines.

COCO #slot=7
                      class top1↑   bbox R2↑
DINOSAUR + MLP        0.61 ±0.0     0.57 ±0.1
SPOT + MLP            0.67 ±0.0     0.62 ±0.1
DIAS (image) + MLP    0.70 ±0.0     0.63 ±0.0
Table 5. Object recognition on COCO.

4.3. Visual Prediction and Reasoning

Figure 4. Visual prediction (upper) and reasoning (lower).

With improved object representation, DIAS also benefits downstream tasks such as visual prediction and reasoning. Following the routine of (Wu et al., 2023a), we first train DIAS and the baselines on Physion (Bear et al., 2021), a synthetic dataset for video dynamics modeling, and represent the dataset as slots. We then train an object-centric world model, SlotFormer (Wu et al., 2023a), to auto-regressively predict future frames of slots given the initial five frames of slots as input. Based on these predictions, we further train a reasoning model to classify whether the stimulus objects will eventually come into contact.
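The auto-regressive rollout described above can be sketched as follows. This is a schematic under loud assumptions, not SlotFormer's API: `rollout` and `toy_predictor` are hypothetical names, the predictor is a linear extrapolation standing in for a trained world model, and feeding back a sliding window as long as the burn-in is an assumed design choice.

```python
import numpy as np

def rollout(predict_next, burn_in, num_future):
    """Auto-regressively extend a sequence of slot frames.

    predict_next: callable mapping a (t, s, c) history window to the next (s, c) frame
    burn_in:      (t0, s, c) initial frames of slots (five frames in the text)
    num_future:   number of future frames to predict
    """
    frames = list(burn_in)
    window = len(burn_in)
    for _ in range(num_future):
        history = np.stack(frames[-window:], axis=0)
        # Each prediction is appended and reused as input for later steps,
        # which is why errors can accumulate over the rollout.
        frames.append(predict_next(history))
    return np.stack(frames[window:], axis=0)

def toy_predictor(history):
    """Hypothetical stand-in for a trained world model: linear extrapolation."""
    return history[-1] + (history[-1] - history[-2])

# Five burn-in frames of 3 slots with 4 channels each (toy values).
burn_in = np.stack([np.full((3, 4), float(t)) for t in range(5)], axis=0)
future = rollout(toy_predictor, burn_in, num_future=3)
```

A downstream reasoning classifier would then consume the burn-in frames together with `future`, so its accuracy depends on how quickly the rollout drifts from the true dynamics.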

As shown in Figure 4 (upper), in terms of visual prediction, measured by L2-norm-normalized MSE, our combination DIAS+SlotFormer accumulates errors significantly more slowly than the baseline combination VideoSAUR+SlotFormer. As shown in Figure 4 (lower), in terms of visual reasoning, measured by binary classification accuracy, DIAS+SlotFormer achieves better overall reasoning accuracy thanks to the much lower prediction error.

The reasoning accuracy follows a down-up-down trend. Early on, the model’s predictions are accurate, allowing the classifier to quickly infer the final state and achieve high accuracy. As prediction errors start to accumulate, they distract the classifier and reduce accuracy. As more frames are predicted, the model receives additional (though noisy) evidence near the end, which temporarily boosts accuracy, until the accumulated error becomes too large and overwhelms the useful information.

4.4. Ablation Study

redundancy   re-          self-      rand ar    COCO #slot=7
reduction    initializ.   distill.   decoding   ARI         ARIfg
                                                22.1 ±0.8   42.7 ±0.2
                                                21.3 ±0.8   42.0 ±0.4
                                                20.6 ±0.7   41.5 ±0.3
                                                21.4 ±1.7   41.8 ±0.5
                                                19.9 ±1.1   42.1 ±0.7
Table 6. Effects of four techniques.
COCO #slot=7
redundancy reduction   re-initializ.   class top1   bbox R2
                                       0.70 ±0.0    0.63 ±0.0
                                       0.68 ±0.0    0.62 ±0.0
SPOT as a reference                    0.67 ±0.0    0.62 ±0.1
Table 7. Effects of re-initialization.

Here we ablate our model to evaluate how much each of our three techniques contributes to the performance gains. We conduct the ablation on COCO using both the object discovery and recognition tasks.

As shown in Table 6 and Table 7, redundancy reduction in aggregated slots, which is not our contribution, is important to the performance. However, this does not make our re-initialization unnecessary; rather, re-initialization builds on the reduction and improves performance further, especially in object recognition.

As shown in Table 6, our arbitrary-order auto-regressive decoding contributes positively to DIAS. Our self-distillation is also beneficial to both background segmentation performance, i.e., ARI, and foreground segmentation performance, i.e., ARIfg. Our random AR decoding contributes much more to background segmentation.

5. Conclusion

We explored three techniques in the OCL aggregation and decoding modules, i.e., re-initialization in aggregation, self-distillation from the last aggregation attention to the first, and arbitrary-order auto-regressive decoding. With these techniques, DIAS achieves new state-of-the-art results in OCL tasks on both images and videos. Both our re-initialization and our self-distillation still rely on manually set thresholds; removing this dependence is left to future work.

Acknowledgements.
We acknowledge the support of Finnish Center for Artificial Intelligence (FCAI), Research Council of Finland flagship program. We thank the Research Council of Finland for funding the projects ADEREHA (grant no. 353198), BERMUDA (362407), PROFI7 (352788) and MARL (357301). We also appreciate CSC - IT Center for Science, Finland, for granting access to supercomputers Mahti and Puhti, as well as LUMI, owned by the European High Performance Computing Joint Undertaking (EuroHPC JU) and hosted by CSC Finland in collaboration with the LUMI consortium. Furthermore, we acknowledge the computational resources provided by the Aalto Science-IT project through the Triton cluster.

References

  • Aydemir et al. (2023) Görkay Aydemir, Weidi Xie, and Fatma Guney. 2023. Self-Supervised Object-Centric Learning for Videos. Advances in Neural Information Processing Systems 36 (2023), 32879–32899.
  • Bar (2004) Moshe Bar. 2004. Visual Objects in Context. Nature Reviews Neuroscience 5, 8 (2004), 617–629.
  • Bear et al. (2021) Daniel Bear, Elias Wang, Damian Mrowca, Felix Jedidja Binder, Hsiao-Yu Tung, RT Pramod, Cameron Holdaway, Sirui Tao, Kevin A Smith, Fan-Yun Sun, et al. 2021. Physion: Evaluating Physical Prediction from Vision in Humans and Machines. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).
  • Biza et al. (2023) Ondrej Biza, Sjoerd van Steenkiste, Mehdi SM Sajjadi, Gamaleldin Elsayed, Aravindh Mahendran, and Thomas Kipf. 2023. Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames. In International Conference on Machine Learning. 2507–2527.
  • Chen (2012) Zhe Chen. 2012. Object-based attention: A Tutorial Review. Attention, Perception, & Psychophysics 74 (2012), 784–802.
  • Elsayed et al. (2022) Gamaleldin Elsayed, Aravindh Mahendran, Sjoerd Van Steenkiste, et al. 2022. SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos. Advances in Neural Information Processing Systems 35 (2022), 28940–28954.
  • Everingham et al. (2010) Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88 (2010), 303–338.
  • Fan et al. (2024) Ke Fan, Zechen Bai, Tianjun Xiao, Tong He, Max Horn, Yanwei Fu, Francesco Locatello, and Zheng Zhang. 2024. Adaptive Slot Attention: Object Discovery with Dynamic Slot Number. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 23062–23071.
  • Gou et al. (2021) Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao. 2021. Knowledge Distillation: A Survey. International Journal of Computer Vision 129, 6 (2021), 1789–1819.
  • Jia et al. (2023) Baoxiong Jia, Yu Liu, and Siyuan Huang. 2023. Improving Object-centric Learning with Query Optimization. In The Eleventh International Conference on Learning Representations.
  • Jiang et al. (2023) Jindong Jiang, Fei Deng, Gautam Singh, and Sungjin Ahn. 2023. Object-Centric Slot Diffusion. Advances in Neural Information Processing Systems (2023).
  • Kahneman et al. (1992) Daniel Kahneman, Anne Treisman, and Brian J Gibbs. 1992. The Reviewing of Object Files: Object-Specific Integration of Information. Cognitive Psychology 24, 2 (1992), 175–219.
  • Kakogeorgiou et al. (2024) Ioannis Kakogeorgiou, Spyros Gidaris, Konstantinos Karantzalos, and Nikos Komodakis. 2024. SPOT: Self-Training with Patch-Order Permutation for Object-Centric Learning with Autoregressive Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 22776–22786.
  • Karazija et al. ([n. d.]) Laurynas Karazija, Iro Laina, and Christian Rupprecht. [n. d.]. ClevrTex: A Texture-Rich Benchmark for Unsupervised Multi-Object Segmentation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
  • Kipf et al. (2022) Thomas Kipf, Gamaleldin Elsayed, Aravindh Mahendran, et al. 2022. Conditional Object-Centric Learning from Video. International Conference on Learning Representations (2022).
  • Learn ([n. d.]) Sci-Kit Learn. [n. d.]. jaccard_score. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html. Accessed: 2025-04-06.
  • Learn (2025a) Sci-Kit Learn. 2025a. adjusted_rand_score. Accessed: 2025-04-06.
  • Learn (2025b) Sci-Kit Learn. 2025b. AgglomerativeClustering. https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
  • Learn (2025c) Sci-Kit Learn. 2025c. r2_score. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html. Accessed: 2025-04-06.
  • Learn (2025d) Sci-Kit Learn. 2025d. top_k_accuracy_score. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.top_k_accuracy_score.html. Accessed: 2025-04-06.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 740–755.
  • Locatello et al. (2020) Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, et al. 2020. Object-Centric Learning with Slot Attention. Advances in Neural Information Processing Systems 33 (2020), 11525–11538.
  • Research ([n. d.]) Google Research. [n. d.]. Multi-Object Video (MOVi) datasets. https://github.com/google-research/kubric/tree/main/challenges/movi. Accessed: 2025-04-06.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10684–10695.
  • Seitzer et al. (2023) Maximilian Seitzer, Max Horn, Andrii Zadaianchuk, et al. 2023. Bridging the Gap to Real-World Object-Centric Learning. International Conference on Learning Representations (2023).
  • Singh et al. (2022a) Gautam Singh, Fei Deng, and Sungjin Ahn. 2022a. Illiterate DALL-E Learns to Compose. International Conference on Learning Representations (2022).
  • Singh et al. (2022b) Gautam Singh, Yi-Fu Wu, and Sungjin Ahn. 2022b. Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos. Advances in Neural Information Processing Systems 35 (2022), 18181–18196.
  • SysCV ([n. d.]) SysCV. [n. d.]. HQ-YTVIS: High-Quality Video Instance Segmentation Dataset. https://github.com/SysCV/vmt?tab=readme-ov-file#hq-ytvis-high-quality-video-instance-segmentation-dataset. Accessed: 2025-04-06.
  • Uijlings et al. (2013) Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. 2013. Selective Search for Object Recognition. International Journal of Computer Vision 104 (2013), 154–171.
  • Wang et al. (2024) Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. 2024. OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning. arXiv preprint arXiv:2405.01533 (2024).
  • Wikipedia ([n. d.]) Wikipedia. [n. d.]. Exponential Smoothing. https://en.wikipedia.org/wiki/Exponential_smoothing. Accessed: 2025-04-06.
  • Wu et al. (2023a) Ziyi Wu, Nikita Dvornik, Klaus Greff, Thomas Kipf, and Animesh Garg. 2023a. SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models. International Conference on Learning Representations (2023).
  • Wu et al. (2023b) Ziyi Wu, Jingyu Hu, Wuyue Lu, Igor Gilitschenski, and Animesh Garg. 2023b. SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models. Advances in Neural Information Processing Systems 36 (2023), 50932–50958.
  • Zadaianchuk et al. (2024) Andrii Zadaianchuk, Maximilian Seitzer, and Georg Martius. 2024. Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities. Advances in Neural Information Processing Systems 36 (2024).
  • Zhang et al. (2024) Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, and Shuicheng Yan. 2024. OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding. Advances in Neural Information Processing Systems 37 (2024), 71737–71767.
  • Zhao et al. (2024) Rongzhen Zhao, Vivienne Wang, Juho Kannala, and Joni Pajarinen. 2024. Grouped Discrete Representation for Object-Centric Learning. arXiv preprint arXiv:2411.02299 (2024).
  • Zhao et al. (2025a) Rongzhen Zhao, Vivienne Wang, Juho Kannala, and Joni Pajarinen. 2025a. Multi-Scale Fusion for Object Representation. ICLR (2025).
  • Zhao et al. (2025b) Rongzhen Zhao, Vivienne Wang, Juho Kannala, and Joni Pajarinen. 2025b. Vector-Quantized Vision Foundation Models for Object-Centric Learning. arXiv preprint arXiv:2502.20263 (2025).