
Slot Attention with Re-Initialization and Self-Distillation

Rongzhen Zhao (ORCID 0009-0000-3964-7336), Department of Electrical Engineering and Automation, Aalto University, Espoo, Finland, rongzhen.zhao@aalto.fi
Yi Zhao (ORCID 0009-0002-9979-595X), Department of Electrical Engineering and Automation, Aalto University, Espoo, Finland, yi.zhao@aalto.fi
Juho Kannala (ORCID 0000-0001-5088-4041), Department of Computer Science, Aalto University, Espoo, Finland; Center for Machine Vision and Signal Analysis, University of Oulu, Oulu, Finland, juho.kannala@aalto.fi
Joni Pajarinen (ORCID 0000-0003-4469-8191), Department of Electrical Engineering and Automation, Aalto University, Espoo, Finland, joni.pajarinen@aalto.fi

(2025)

Abstract.

Unlike popular solutions based on dense feature maps, Object-Centric Learning (OCL) represents visual scenes as sub-symbolic object-level feature vectors, termed slots, which are highly versatile for tasks involving visual modalities. OCL typically aggregates object superpixels into slots by iteratively applying competitive cross attention, known as Slot Attention, with the slots as the query. However, once initialized, these slots are reused naively, causing redundant slots to compete with informative ones for representing objects. This often results in objects being erroneously segmented into parts. Additionally, mainstream methods derive supervision signals solely from decoding slots into the input’s reconstruction, overlooking potential supervision based on internal information. To address these issues, we propose Slot Attention with re-Initialization and self-Distillation (DIAS): i) We reduce redundancy in the aggregated slots and re-initialize extra aggregation to update the remaining slots; ii) We drive the bad attention map at the first aggregation iteration to approximate the good at the last iteration to enable self-distillation. Experiments demonstrate that DIAS achieves state-of-the-art on OCL tasks like object discovery and recognition, while also improving advanced visual prediction and reasoning. Our code is available on https://github.com/Genera1Z/DIAS.

Object-Centric Learning, Slot Attention, Object Representation, Visual Prediction, Visual Reasoning

Published in: Proceedings of the 33rd ACM International Conference on Multimedia (MM ’25), October 27–31, 2025, Dublin, Ireland. DOI: 10.1145/3746027.3755339. ISBN: 979-8-4007-2035-2/2025/10. CCS concepts: Computing methodologies → Computer vision representations.

Figure 1. Two key observations inspire our method. (o1) We re-initialize an extra aggregation to update the remaining slots after slots redundancy reduction, instead of decoding the remaining slots directly. (o2) We self-distill the attention map at the first iteration to approximate the almost-always-good attention at the last aggregation iteration, rather than from aggregation attention to decoding attention.

1. Introduction

The human visual system typically decomposes scenes into objects and allocates attention at the object level (Chen, 2012; Kahneman et al., 1992). This efficiently captures scene information, models object attributes and their interrelationships, and ultimately supports complex functionalities such as prediction, reasoning, planning and decision-making (Bar, 2004). Similarly, in Artificial Intelligence (AI), Object-Centric Learning (OCL) seeks to aggregate dense (super-)pixels of images or videos into sparse feature vectors, termed slots, representing corresponding objects in a sub-symbolic manner (Locatello et al., 2020). In multimodal AI systems, object-centric representations can encode visual information more effectively and integrate with other modalities more flexibly than popular approaches based on dense feature maps (Zhang et al., 2024; Wang et al., 2024).

Existing OCL typically follows the encoding-aggregation-decoding paradigm (Zhao et al., 2025b). First, input image or video pixels are encoded into a feature map. Second, object super-pixels in the feature map are iteratively aggregated into corresponding feature vectors, i.e., slots, using competitive cross attention, i.e., Slot Attention (Locatello et al., 2020) or its variants (Biza et al., 2023; Jia et al., 2023). Last, these slots are decoded to reconstruct the input in some format, providing the self-supervision signal.

However, (i1) once initialized, these slots are iterated naively in the aggregation, and when there are fewer objects than the predefined number of slots, redundant slots compete with informative ones for object representation (Aydemir et al., 2023; Fan et al., 2024), causing objects to be segmented into parts. Additionally, (i2) mainstream methods (Singh et al., 2022a; Seitzer et al., 2023; Wu et al., 2023b) derive supervision signals solely from reconstructing the input using the aggregated slots, leaving internal information unutilized. For example, the aggregation attention maps at later iterations have better object segmentation, which can be utilized to provide an extra supervision signal.

Issues (i1) and (i2) remain unresolved despite just a few ingenious attempts. (r1) SOLV (Aydemir et al., 2023) and AdaSlot (Fan et al., 2024) remove possible redundancy in the aggregated slots, but without further updating the remaining. As shown in Figure 1 o1, the remaining slots, despite being informative, are already distracted by redundant slots and are still of low quality in object representation. (r2) SPOT (Kakogeorgiou et al., 2024) drives the aggregation attention map to approximate the whole decoding attention map, but without discarding the latter’s bad compositions. As shown in Figure 1 o2, the decoding attention map, despite being generally better, sometimes has lower accuracy in object segmentation than the aggregation attention map. This is why SPOT has to pretrain an extra copy of the model as the teacher and has to run both the teacher and student for their offline distillation, at least doubling the training cost.

Our solution is simple yet effective. (s1) We use the remaining slots after slot redundancy reduction to re-initialize an extra aggregation. This improves the slots’ object representation quality, instead of directly decoding the remaining slots. (s2) We use the attention map at the last aggregation iteration to guide that at the first iteration, where the guidance is almost always better. This enables self-distillation, requiring no extra model copy, no pretraining, and minimal additional cost. Besides, (s3) we implement a generalized auto-regressive (AR) decoder for OCL. Since existing AR decoders’ fixed-order flattening disrupts spatial correlations among super-pixels (Singh et al., 2022a, b; Kakogeorgiou et al., 2024), we use random-order flattening to force the decoder to model spatial correlations.

In short, our contributions are (i) re-initialized OCL aggregation, (ii) self-distillation within aggregation iterations, (iii) random auto-regressive decoding; (iv) new state-of-the-art performance.

2. Related Work

OCL aggregation. The core module of OCL is Slot Attention (Locatello et al., 2020), an iterative cross attention that takes some feature vectors as the competitive query, with the input’s feature map as the key and value. BO-QSA (Jia et al., 2023) improves the gradient flow on the query vectors for better optimization. ISA (Biza et al., 2023) realizes scaling, rotation and translation invariance in slots with inductive biases. While others learn Gaussian distributions to sample the query vectors, SAVi (Kipf et al., 2022) and SAVi++ (Elsayed et al., 2022) project conditions like object bounding boxes into query vectors. These aggregators are all faced with the issue that redundant slots compete with informative slots to represent single objects as parts. SOLV (Aydemir et al., 2023) merges similar slots after aggregation using agglomerative clustering by some threshold. AdaSlot (Fan et al., 2024) removes redundant slots after aggregation using a learned policy with some regularization. These two recent methods only remove or merge slots but do not update the remaining ones. Our DIAS improves the aggregation by both reducing redundant slots and updating the remaining slots with extra re-initialized aggregation.

OCL decoding. There are mainly three types of decoding for OCL. Mixture-based decoders include CNN (Kipf et al., 2022; Elsayed et al., 2022), MLP (Seitzer et al., 2023) and SlotMixer (Zadaianchuk et al., 2024). The former two broadcast each of the aggregated slots into some spatial dimension, decode them separately and mix the final components as the reconstruction. This means that the computation is proportional to the number of slots (Seitzer et al., 2023), which is very expensive. The last one mixes the slots with a Transformer decoder without self-attention on the query. Such a decoder has to work in lower slot channel dimensions (Zadaianchuk et al., 2024), which limits object representation quality. Auto-regression-based decoders include the causal Transformer decoder (Singh et al., 2022a, b) and the causal Transformer decoder with nine permutations (Kakogeorgiou et al., 2024). They drive the slots to reconstruct the input’s feature map with the causally masked input feature as the query, in a next-token prediction manner. However, flattening a 2-dimensional feature map into a 1-dimensional sequence in a fixed order loses the spatial relationship (Kakogeorgiou et al., 2024). Diffusion-based decoders include different conditional diffusion models (Wu et al., 2023b; Jiang et al., 2023). They recover the noise from the noise-diffused input with slots as conditions, to drive slots to be informative. However, diffusion models are very large and sensitive to train (Rombach et al., 2022). Our DIAS realizes general auto-regression via arbitrary ordering. In terms of the reconstruction target, there are also works that introduce quantization (Singh et al., 2022a, b; Zhao et al., 2024, 2025a, 2025b), but we focus on the non-quantized case.

Knowledge distillation. Classical offline knowledge distillation (KD) (Gou et al., 2021) requires a pretrained model as the teacher and a to-be-trained model as the student. The teacher is frozen and runs at the same time as the student to provide better representations as targets for the corresponding student representations. There is also online distillation (Gou et al., 2021), where no pretraining of a teacher model is needed. The teacher model typically has the same architecture as the student, and its weights are usually updated as an exponential moving average (Wikipedia, [n. d.]) of the student’s weights. Self-distillation (Gou et al., 2021) does not need an extra model as the teacher. Instead, intermediate representations in the model’s higher layers are taken as the teacher of the lower layers. This requires the higher layers’ representations to be better than the lower layers’. KD has not been fully explored in the OCL setting, and SPOT (Kakogeorgiou et al., 2024) is the only such work we know of for now. But SPOT’s “self-distillation” is actually offline distillation, which at least doubles the training cost. Our DIAS realizes exact self-distillation by utilizing the progressively refined attention maps of the aggregation iterations, saving much cost.

3. Proposed Method

We begin with the preliminary of object-centric learning, highlighting the issues in existing OCL methods. Then, we present our method, Slot Attention with re-Initialization and self-Distillation (DIAS). In Figure 2, our method is illustrated and compared with three recent studies that share similar ideas.

3.1. Preliminary

Existing OCL models typically consist of three major modules: encoder, aggregator and decoder. We focus on the latter two.

Given the initial slots, i.e., query vectors $\bm{S}^{0}\in\mathbb{R}^{s\times c}$ , a Slot Attention module $\bm{\phi}_{\mathrm{a}}$ aggregates the feature map $\bm{Z}\in\mathbb{R}^{h\times w\times c}$ of an input image or video frame, into slots $\bm{S}\in\mathbb{R}^{s\times c}$ after some iterations:

(1)

\bm{S}^{i},\bm{A}^{i}_{\mathrm{a}}=\bm{\phi}_{\mathrm{a}}(\bm{S}^{i-1},\bm{Z})\quad i=1...i_{\mathrm{a}}

(2)

\bm{S}\equiv\bm{S}^{i_{\mathrm{a}}}

where $i_{\mathrm{a}}$ is the total number of iterations for aggregation, and $i_{\mathrm{a}}=3$ for unconditional queries (Locatello et al., 2020) while $i_{\mathrm{a}}=1$ for conditional queries (Kipf et al., 2022). $\bm{A}^{i}_{\mathrm{a}}\in\mathbb{R}^{s\times h\times w}$ is the aggregation attention map and can be converted to object segmentation with $\mathrm{argmax(\cdot)}$ along dimension $s$ . $\bm{\phi}_{\mathrm{a}}$ can be parameterized into different Slot Attention variants like in Biza et al. (2023); Jia et al. (2023). We call $\bm{S}^{i}$ “query slots” and $\bm{S}$ “aggregated slots”. Note that $\bm{\phi}_{\mathrm{a}}$ makes all query vectors in $\bm{S}^{i}$ compete with one another for super-pixels in $\bm{Z}$ .
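The iteration in Equations 1 and 2 can be illustrated with a minimal NumPy sketch. It is a simplification under stated assumptions rather than the authors' implementation: the learned query/key/value projections, the GRU and the MLP of Slot Attention are omitted, and the function names `slot_attention_step` and `aggregate` are hypothetical. What it keeps is the competitive part, i.e., the softmax taken over the slot dimension $s$ instead of over the super-pixels.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_step(slots, feats):
    """One simplified step phi_a(S^{i-1}, Z) -> (S^i, A^i_a).

    slots: (s, c) query slots; feats: (h, w, c) feature map Z.
    Assumption: no learned projections, GRU or MLP (omitted for brevity).
    """
    h, w, c = feats.shape
    z = feats.reshape(h * w, c)                 # flatten super-pixels: (n, c)
    logits = slots @ z.T / np.sqrt(c)           # query-key logits: (s, n)
    attn = softmax(logits, axis=0)              # softmax over slots: slots compete
    weights = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)  # normalize per slot
    new_slots = weights @ z                     # weighted mean of super-pixels: (s, c)
    return new_slots, attn.reshape(-1, h, w)

def aggregate(init_slots, feats, num_iters=3):
    """Run i_a iterations and return the aggregated slots S and the last attention map."""
    slots, attn = init_slots, None
    for _ in range(num_iters):
        slots, attn = slot_attention_step(slots, feats)
    return slots, attn

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 4, 8))     # toy feature map, h = w = 4, c = 8
S0 = rng.normal(size=(3, 8))       # s = 3 initial query slots
S, A = aggregate(S0, Z, num_iters=3)
segmentation = A.argmax(axis=0)    # (h, w) object segmentation via argmax over s
```

Because each super-pixel's attention sums to one across slots, adding a redundant slot necessarily takes attention mass away from the others, which is the competition effect discussed above.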

To obtain the self-supervision signal, the aggregated slots $\bm{S}$ are decoded by a decoder $\bm{\phi}_{\mathrm{d}}$ to reconstruct the input $\bm{X}$ in some format $\bm{Y}\in\mathbb{R}^{h^{\prime}\times w^{\prime}\times c^{\prime}}$ , with some clue $\bm{C}$ :

(3)

\bm{Y}^{\prime},\bm{A}_{\mathrm{d}}=\bm{\phi}_{\mathrm{d}}(\bm{C},\bm{S})

where $\bm{A}_{\mathrm{d}}\in\mathbb{R}^{s\times h^{\prime}\times w^{\prime}}$ is decoding attention map and can be converted to object segmentation masks by binarization; $\bm{Y}^{\prime}$ is the reconstructed $\bm{Y}$ ; and $\bm{Y}$ can be $\bm{X}$ (Locatello et al., 2020), $\bm{Z}$ (Seitzer et al., 2023), or $\bm{Z}$ ’s quantization (Zhao et al., 2025b), etc. $\bm{C}$ is the destructed $\bm{Y}$ , which varies across methods.

The supervision signal comes from the reconstruction loss, either Mean Squared Error (MSE) or Cross Entropy (CE):

(4)

l_{\mathrm{recon}}=\mathrm{MSE}(\bm{Y}^{\prime},\bm{Y})

(5)

l_{\mathrm{recon}}=\mathrm{CE}(\bm{Y}^{\prime},\bm{Y})

The choice depends on specific methods. Mixture- and diffusion-based decoding use Equation 4, while auto-regression-based decoding uses either Equation 4 or Equation 5.

3.2. Re-Initialized Aggregation

Here is the first issue we address.

When there are fewer objects (plus the background) than the number of slots, redundant slots should be removed, or they will compete for object super-pixels and cause objects to be segmented into parts. SOLV (Aydemir et al., 2023) and AdaSlot (Fan et al., 2024) are the only two studies so far that explore this idea. They both conduct the redundancy reduction $\bm{\phi}_{\mathrm{r}}$ after the aggregation:

(6)

\bm{S}_{\mathrm{r}},\bm{M}_{\mathrm{r}}=\bm{\phi}_{\mathrm{r}}(\bm{S})

(7)

\bm{Y}^{\prime},\bm{A}_{\mathrm{d}}=\bm{\phi}_{\mathrm{d}}(\bm{C},\bm{S}_{\mathrm{r}},\bm{M}_{\mathrm{r}})

where $\bm{S}_{\mathrm{r}}$ is in the same shape as $\bm{S}$ with redundant slots being zeroed out; $\bm{M}_{\mathrm{r}}\in\mathbb{R}^{s}$ tells the decoder $\bm{\phi}_{\mathrm{d}}$ which slots are zeroed. Equation 7 is no different from Equation 3 except masking.

But as slot aggregation is competitive, the remaining slots have already been affected by the redundant slots and are still of low quality in object representation. As shown in Figure 1 o1, redundancy reduction alone yields some gains in object discovery but very limited gains in object recognition.

Thus, we propose using the remaining slots after redundancy reduction to re-initialize an extra aggregation. Our aggregation and decoding are formulated as below:

(8)

\bm{S}^{i},\bm{A}^{i}_{\mathrm{a}}=\bm{\phi}_{\mathrm{a}}(\bm{S}^{i-1},\bm{Z})\quad i=1...i_{\mathrm{a}}-1

(9)

\bm{S}_{\mathrm{r}},\bm{M}_{\mathrm{r}}=\bm{\phi}_{\mathrm{r}}(\bm{S}^{i_{\mathrm{a}}-1})

(10)

\bm{S},\bm{A}_{\mathrm{a}}=\bm{\phi}_{\mathrm{a}}(\bm{S}_{\mathrm{r}},\bm{Z},\bm{M}_{\mathrm{r}})

(11)

\bm{Y}^{\prime},\bm{A}_{\mathrm{d}}=\bm{\phi}_{\mathrm{d}}(\bm{C},\bm{S},\bm{M}_{\mathrm{r}})

Here Equation 8 is the initial aggregation, Equation 9 is the redundancy reduction, Equation 10 is the extra aggregation, and Equation 11 is the masked decoding. For the initial aggregation, we use $i_{\mathrm{a}}-1$ iterations, and $i_{\mathrm{a}}\geq 2$ . For the redundancy reduction, we adopt SOLV’s solution, agglomerative clustering (Learn, 2025b) by cosine distance given some threshold. For the extra aggregation, we apply the redundancy mask $\bm{M}_{\mathrm{r}}$ to the query-key attention logits so as to discard the effect of those zeroed-out slots (Fan et al., 2024). For the masked decoding, we follow the solutions from SOLV and AdaSlot.

As shown in Figure 1 o1, with further updating the remaining slots after redundancy reduction by re-initializing extra aggregation, we can yield boosts in both object discovery and recognition.
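The pipeline of Equations 9 and 10 can be sketched as follows. This is an illustrative NumPy sketch, not the authors' code: `reduce_redundancy` uses a simple greedy centroid-linkage merge by cosine distance as a stand-in for the agglomerative clustering adopted from SOLV, the threshold value is arbitrary, the learned parts of Slot Attention are omitted, and both function names are hypothetical. The mask here is `True` for kept slots, which is one possible encoding of $\bm{M}_{\mathrm{r}}$.

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def reduce_redundancy(slots, threshold=0.2):
    """phi_r sketch: repeatedly merge the closest slot pair while its cosine
    distance is within the threshold. Returns S_r (merged-away slots zeroed)
    and a boolean keep mask M_r of shape (s,)."""
    slots = slots.copy()
    s = slots.shape[0]
    keep = np.ones(s, dtype=bool)
    counts = np.ones(s)                      # number of original slots in each cluster
    while keep.sum() > 1:
        idx = np.flatnonzero(keep)
        best_i, best_j, best_d = -1, -1, np.inf
        for a in range(len(idx)):
            for b in range(a + 1, len(idx)):
                d = cosine_distance(slots[idx[a]], slots[idx[b]])
                if d < best_d:
                    best_i, best_j, best_d = idx[a], idx[b], d
        if best_d > threshold:
            break                            # no remaining pair is similar enough
        total = counts[best_i] + counts[best_j]
        slots[best_i] = (counts[best_i] * slots[best_i] + counts[best_j] * slots[best_j]) / total
        counts[best_i] = total
        slots[best_j] = 0.0                  # zero out the merged-away slot
        keep[best_j] = False
    return slots, keep

def masked_attention_step(slots, feats, keep):
    """Extra aggregation sketch: removed slots are masked in the query-key logits."""
    h, w, c = feats.shape
    z = feats.reshape(h * w, c)
    logits = slots @ z.T / np.sqrt(c)        # (s, n)
    logits[~keep] = -np.inf                  # masked slots receive zero attention
    logits = logits - logits.max(axis=0, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=0, keepdims=True)   # softmax over remaining slots
    weights = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)
    return weights @ z, attn.reshape(-1, h, w)

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 4, 8))
S_prev = 2.0 * np.eye(4, 8)                  # three distinct slots ...
S_prev[3] = [2.0, 0.1, 0, 0, 0, 0, 0, 0]     # ... plus a near-duplicate of slot 0
S_r, M_r = reduce_redundancy(S_prev, threshold=0.2)
S, A_a = masked_attention_step(S_r, Z, M_r)  # re-initialized extra aggregation
```

In this toy case the near-duplicate slot is merged into slot 0, and the extra aggregation then lets the three remaining slots compete for all super-pixels without the removed one.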

	ClevrTex #slot=11				COCO #slot=7				VOC #slot=6
	ARI	ARI_fg	mBO	mIoU	ARI	ARI_fg	mBO	mIoU	ARI	ARI_fg	mBO	mIoU
SLATE	17.4 ±2.9	87.4 ±1.7	44.5 ±2.2	43.3 ±2.4	17.5 ±0.6	28.8 ±0.3	26.8 ±0.3	25.4 ±0.3	18.6 ±0.1	26.2 ±0.8	37.2 ±0.5	36.1 ±0.4
DINOSAUR	50.7 ±24.1	89.4 ±0.3	53.3 ±5.0	52.8 ±5.2	18.2 ±1.0	35.0 ±1.2	28.3 ±0.5	26.9 ±0.5	21.5 ±0.7	35.2 ±1.3	40.6 ±0.6	39.7 ±0.6
SlotDiffusion	66.1 ±1.3	82.7 ±1.6	54.3 ±0.5	53.4 ±0.8	17.7 ±0.5	28.7 ±0.1	27.0 ±0.4	25.6 ±0.4	17.0 ±1.2	21.7 ±1.8	35.2 ±0.9	34.0 ±1.0
AdaSlot	–	–	–	–	19.1 ±2.3	35.2 ±2.0	29.4 ±1.2	27.5 ±0.8	–	–	–	–
SPOT	65.2 ±11.5	89.6 ±2.4	56.3 ±4.1	55.1 ±3.7	20.3 ±0.7	41.1 ±0.3	30.4 ±0.1	29.0 ±0.9	25.3 ±5.2	36.4 ±2.8	44.4 ±1.7	43.4 ±2.1
DIAS image	69.4 ±8.7	89.9 ±1.3	59.4 ±2.9	57.8 ±2.4	22.1 ±0.8	42.7 ±0.2	32.8 ±0.1	31.3 ±0.4	27.3 ±4.6	37.7 ±2.5	46.1 ±2.1	45.0 ±1.7

Table 1. Object discovery on images. Using DINO2 ViT s/14 for OCL encoding; input resolution is 256$\times$256 (224$\times$224).

3.3. Self-Distilled Aggregation

Here is the second issue we address.

The key to realizing self-distillation, or distillation in general, is that the teacher should be better than the student during training (after training, of course, the student can be better than the teacher). SPOT (Kakogeorgiou et al., 2024) resorts to pretraining an extra copy of the model as the teacher to realize such a performance gap.

The decoding attention map $\bm{A}_{\mathrm{d}}$ generally exhibits better object segmentation than the aggregation attention map $\bm{A}_{\mathrm{a}}$ , but not always, as shown in Figure 1 o2 left. In contrast, the attention map at the last aggregation iteration is almost always better than that at the first iteration. Thus, we can enforce additional supervision by forcing the first aggregation attention map to approximate the attention map at the last aggregation iteration.

SPOT (Kakogeorgiou et al., 2024) uses the whole decoding attention map as the target without selection. If some decoding attention components are worse than the corresponding aggregation attention, such approximation harms the performance. This is also why SPOT has to pretrain an extra copy of the model and use it as the teacher to guide the training of the (student) model. Such offline distillation at least doubles the training cost. SPOT's distillation is formulated as:

(12)

\bm{M}_{\mathrm{a}}^{*}=\mathrm{Hungarian}(\mathrm{bin}_{s}(\bm{A}_{\mathrm{a}}),\mathrm{bin}_{s}(\bm{A}_{\mathrm{d}}^{\mathrm{teacher}}))

(13)

l_{\mathrm{approx}}=\mathrm{CE}(\bm{A}_{\mathrm{a}},\bm{M}_{\mathrm{a}}^{*})

To address this issue, we obtain the signal for self-distillation from the intrinsic performance gap between the last aggregation iteration and the first. As shown in Figure 1 o2 right, in 90% of cases, the attention map at the last aggregation iteration is better than that at the first iteration.

And our solution is quite simple:

(14)

\bm{M}_{\mathrm{a}}^{*}=\mathrm{Hungarian}(\mathrm{bin}_{s}(\bm{A}_{\mathrm{a}}^{1}),\mathrm{bin}_{s}(\bm{A}_{\mathrm{a}}))

(15)

l_{\mathrm{approx}}=\mathrm{CE}(\bm{A}_{\mathrm{a}}^{1},\bm{M}_{\mathrm{a}}^{*})

where $\mathrm{bin}_{s}$ is binarization along dimension $s$ , with maximum value as one and others as zeros; $\mathrm{Hungarian}(\cdot)$ is Hungarian matching; $l_{\mathrm{approx}}$ is the approximation loss; $\bm{A}_{\mathrm{a}}^{1}$ is the attention map at the first aggregation iteration while $\bm{A}_{\mathrm{a}}$ is the attention map at the last aggregation – after re-initialization defined in Section 3.2.

As the teacher $\bm{A}_{\mathrm{a}}$ is better than $\bm{A}_{\mathrm{a}}^{1}$ in most cases, which is an intrinsic performance gap, we do not need to pretrain a model as the teacher to produce such a performance gap.
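Equations 14 and 15 can be made concrete with the following NumPy sketch. It is a hedged illustration, not the authors' implementation: the matching cost (pixel overlap between binarized masks) is an assumption since the text does not specify it, the exhaustive search over permutations is a small-scale stand-in for Hungarian matching that returns the same optimal assignment for a handful of slots, and the function names are hypothetical.

```python
import itertools
import numpy as np

def binarize_over_slots(attn):
    """bin_s: one-hot along the slot dimension (1 at the argmax slot, 0 elsewhere)."""
    s = attn.shape[0]
    winners = attn.argmax(axis=0)                                  # (h, w)
    return (np.arange(s)[:, None, None] == winners[None]).astype(float)

def match_teacher_to_student(student_bin, teacher_bin):
    """Reorder teacher masks to best overlap the student masks.
    Exhaustive search over permutations; a stand-in for Hungarian matching."""
    s = student_bin.shape[0]
    overlap = np.einsum("ihw,jhw->ij", student_bin, teacher_bin)  # (s, s) shared pixels
    best_perm = max(
        itertools.permutations(range(s)),
        key=lambda p: sum(overlap[i, p[i]] for i in range(s)),
    )
    return teacher_bin[list(best_perm)]

def self_distill_loss(attn_first, attn_last, eps=1e-8):
    """l_approx = CE(A_a^1, M_a^*), averaged over super-pixels.
    The target M_a^* is treated as a constant (no gradient to the teacher)."""
    target = match_teacher_to_student(
        binarize_over_slots(attn_first), binarize_over_slots(attn_last)
    )
    return float(-(target * np.log(attn_first + eps)).sum(axis=0).mean())

# Toy example: 2 slots on a 2x2 map; slot 0 owns the left column.
left = np.array([[1.0, 0.0], [1.0, 0.0]])
A_last = np.stack([0.9 * left + 0.1 * (1 - left), 0.1 * left + 0.9 * (1 - left)])
A_first = np.stack([0.7 * left + 0.3 * (1 - left), 0.3 * left + 0.7 * (1 - left)])
loss = self_distill_loss(A_first, A_last)
```

Minimizing this loss pushes the first-iteration attention toward the sharper last-iteration segmentation, and the loss is unchanged if the teacher's slots arrive in a different order, since the matching step realigns them.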

3.4. Random Auto-Regressive Decoding

We also improve the decoding.

As discussed in Section 2 “OCL decoding”, all three types of OCL decoding have their disadvantages. Mixture-based decoders like CNN (Kipf et al., 2022; Elsayed et al., 2022) and MLP (Seitzer et al., 2023) are too computationally expensive, while SlotMixer (Zadaianchuk et al., 2024) is efficient but has limited object representation quality. Diffusion-based decoders (Jiang et al., 2023; Wu et al., 2023b) are expensive and sensitive to train. Auto-regression-based decoders like the Transformer decoder (Singh et al., 2022a, b) are very efficient, but their next-token prediction has to flatten the 2-dimensional super-pixels into a 1-dimensional sequence in a fixed order, so they do not utilize the spatial correlation among super-pixels. Transformer9 (Kakogeorgiou et al., 2024) utilizes nine handcrafted permutations $\{\bm{I}^{j}|j=1...9\}$ to better capture spatial dependencies:

(16)

\bm{Y}_{\mathrm{bos}}^{j}=\mathrm{concat}_{n}(\bm{E}_{\mathrm{bos}}^{j},\mathrm{gather}_{n}(\mathrm{flatten}_{h,w}(\bm{Y}),\bm{I}^{j})[:-1])

(17)

{\bm{Y}^{j}}^{\prime}=\bm{\phi}_{\mathrm{d}}(\bm{Y}^{j}_{\mathrm{bos}},\bm{M}_{\mathrm{causal}})

where $\bm{I}^{j}\in\mathbb{R}^{n}$ is the $j$ th handcrafted super-pixel ordering, with $n=h\times w$ ; $\bm{E}^{j}_{\mathrm{bos}}\in\mathbb{R}^{c}$ is the $j$ th Beginning-of-Sentence (BoS) token (Kakogeorgiou et al., 2024); $\mathrm{flatten}_{h,w}(\cdot)$ flattens height and width; $\mathrm{gather}_{n}(\cdot,\cdot)$ selects from the tensor along dimension $n$ given index tensor $\bm{I}^{j}$ ; $\mathrm{concat}_{n}(\cdot,\cdot)$ concatenates two tensors along dimension $n$ ; $\bm{Y}^{j}_{\mathrm{bos}}\in\mathbb{R}^{n\times c}$ is the sequence prepended with BoS; $\bm{M}_{\mathrm{causal}}\in\mathbb{R}^{n\times n}$ is the typical causal mask (Kakogeorgiou et al., 2024); and $\bm{\phi}_{\mathrm{d}}$ is parameterized as a Transformer decoder.

The spatial relationships among super-pixels cannot be sufficiently captured by these nine handcrafted orderings. Therefore, we propose a generalized auto-regressive decoding scheme that accommodates arbitrary super-pixel orderings.

(18)

\quad n^{\#}\sim\mathcal{U}\{0,1,\dots,n\},~~n=h\times w,

(19)

\bm{I}=\mathrm{shuffle}(\bm{I}^{0})

(20)

\bm{E}_{\mathrm{p}}^{\#}=\mathrm{gather}_{n}(\mathrm{flatten}_{h,w}(\bm{E}_{\mathrm{p}}),\bm{I}[:n^{\#}+1])

(21)

\bm{Y}^{\#}=\mathrm{concat}_{n^{\#}}(\mathrm{gather}_{n}(\mathrm{flatten}_{h,w}(\bm{Y}),\bm{I}[:n^{\#}]),\bm{E}_{\mathrm{m}})

(22)

{\bm{Y}^{\#}}^{\prime}=\bm{\phi}_{\mathrm{d}}(\bm{E}^{\#}_{\mathrm{p}}+\bm{Y}^{\#})

Equation 18 obtains a random sequence length $n^{\#}$ ; Equation 19 samples an arbitrary sequence order; Equation 20 shuffles the position embedding tensor $\bm{E}_{\mathrm{p}}\in\mathbb{R}^{h\times w\times c}$ , while Equation 21 shuffles the target tensor $\bm{Y}$ , with mask token $\bm{E}_{\mathrm{m}}\in\mathbb{R}^{c}$ indicating the next token to predict; and Equation 22 takes the positional information and the known tokens to predict the next token.

Accordingly, the reconstruction loss is modified as below:

(23)

l_{\mathrm{recon}}=\mathrm{MSE}({\bm{Y}^{\#}}^{\prime},\bm{Y}^{\#})

Our random auto-regression compels the decoder $\bm{\phi}_{\mathrm{d}}$ to reconstruct the target $\bm{Y}$ using whatever subset of known super-pixels from arbitrary positions in the feature map. This, combined with the former two techniques, further boosts OCL performance.
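How the decoder input of Equations 18 to 22 is assembled can be sketched in NumPy as below. This is an illustrative sketch under assumptions, not the authors' code: the decoder $\bm{\phi}_{\mathrm{d}}$ itself and the slots are left out, the function name `build_random_ar_input` is hypothetical, and $n^{\#}$ is drawn from $\{0,\dots,n-1\}$ here so that a next token always exists and the shapes in Equations 20 and 21 line up. Taking the token at position $\bm{I}[n^{\#}]$ as the prediction target is likewise an assumption based on the description of the mask token.

```python
import numpy as np

def build_random_ar_input(Y, E_p, E_m, rng):
    """Build the decoder input E_p^# + Y^# for one random ordering.

    Y:   (h, w, c) reconstruction target
    E_p: (h, w, c) position embeddings
    E_m: (c,) mask token marking the next token to predict
    """
    h, w, c = Y.shape
    n = h * w
    n_known = int(rng.integers(0, n))        # n^#: number of known tokens, 0..n-1
    order = rng.permutation(n)               # I = shuffle(I^0)
    y_flat = Y.reshape(n, c)
    p_flat = E_p.reshape(n, c)
    pos = p_flat[order[: n_known + 1]]       # Eq. 20: positions of known + next token
    tokens = np.concatenate(                 # Eq. 21: known tokens followed by mask token
        [y_flat[order[:n_known]], E_m[None]], axis=0
    )
    decoder_input = pos + tokens             # argument of phi_d in Eq. 22
    target = y_flat[order[n_known]]          # assumed target: the next token in the order
    return decoder_input, target, order, n_known

rng = np.random.default_rng(0)
Y = rng.normal(size=(4, 4, 8))
E_p = rng.normal(size=(4, 4, 8))
E_m = rng.normal(size=(8,))
x, target, order, n_known = build_random_ar_input(Y, E_p, E_m, rng)
```

Since both the order and the number of known tokens are resampled, the decoder sees the position of the token to predict but not its content, from any subset of already-known positions.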

4. Experiment

We conduct the following experiments using three random seeds. We evaluate object representation quality through object discovery and recognition, as well as downstream tasks, including video prediction and reasoning.

4.1. Object Discovery

The attention maps of OCL aggregation and decoding are the segmentation of the objects and background that are discovered by slots. We measure these segmentations using typical metrics, i.e., ARI (Adjusted Rand Index) (Learn, 2025a), ARI_fg (foreground Adjusted Rand Index), mBO (mean Best Overlap) (Uijlings et al., 2013) and mIoU (mean Intersection-over-Union) (Learn, [n. d.]). ARI is dominated by larger areas, i.e., the background; mBO only measures the best overlapped segmentation; mIoU is the strictest measurement.
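As an illustration of why mBO "only measures the best overlapped segmentation", here is a minimal NumPy sketch of the metric. It is an assumption-laden sketch rather than the evaluation code used in the experiments: the function name is hypothetical, and whether the background segment is counted varies across implementations (here every ground-truth segment is counted).

```python
import numpy as np

def mean_best_overlap(pred_seg, true_seg):
    """For each ground-truth segment, take the best IoU over all predicted
    segments, then average over ground-truth segments."""
    best_ious = []
    for t in np.unique(true_seg):
        t_mask = true_seg == t
        best = 0.0
        for p in np.unique(pred_seg):
            p_mask = pred_seg == p
            inter = np.logical_and(t_mask, p_mask).sum()
            union = np.logical_or(t_mask, p_mask).sum()
            best = max(best, inter / union)
        best_ious.append(best)
    return float(np.mean(best_ious))

# Ground truth: two segments. Prediction: the second one is split into two parts.
true_seg = np.array([[0, 0, 1, 1],
                     [0, 0, 1, 1]])
pred_seg = np.array([[0, 0, 1, 2],
                     [0, 0, 1, 2]])
mbo = mean_best_overlap(pred_seg, true_seg)  # (1.0 + 0.5) / 2 = 0.75
```

The split object scores an IoU of only 0.5 with its best-matching part, which is how the part-wise segmentation caused by redundant slots shows up in this metric.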

For baselines, we use the models from Zhao et al. (2025b), which reproduces very strong baselines by unifying representative OCL models under current best practices. Among them, SLATE, DINOSAUR, SlotDiffusion and SPOT are designed for images; STEVE and VideoSAUR are for videos; VVO with different decoders supports images or videos. We also include AdaSlot (for images) (Fan et al., 2024) and SOLV (for videos) (Aydemir et al., 2023).

For datasets, we use ClevrTex (the main set for training and the out-of-distribution set for evaluation) (Karazija et al., [n. d.]), COCO (instance segmentation) (Lin et al., 2014), VOC (Everingham et al., 2010), MOVi-D (Research, [n. d.]) and YTVIS (high-quality) (SysCV, [n. d.]). Among them, ClevrTex is synthetic images with complex textures; COCO and VOC are real-world images; MOVi-D is a set of challenging synthetic videos with up to 20 objects per scene; YTVIS is real-world YouTube videos.

As shown in Table 1, with unconditional queries, our DIAS outperforms all the baselines on both synthetic and real-world images in object discovery under all the metrics, achieving new state-of-the-art performance. As shown in Table 2, with conditional queries, the temporal version of DIAS surpasses STEVE by a large margin on synthetic videos; with unconditional queries, DIAS outperforms the state-of-the-art video OCL models. As shown in Table 3, combined with the general improver VVO (Zhao et al., 2025b), DIAS pushes state-of-the-art performance forward even further.

Specifically, only SOLV, AdaSlot and DIAS, as well as the temporal version of DIAS, have the redundant slot reduction operation, but DIAS is significantly better than the others. Besides, only SPOT and DIAS have the distillation operation, but SPOT has to pretrain an extra copy of the model as the teacher and then run both the teacher and a new copy of the model simultaneously to do the distillation. By contrast, DIAS only needs to run a single copy of the model without any pretraining, and thus, as shown in Table 4, consumes only average-level training resources.

	ARI	ARI_fg	mBO	mIoU
	MOVi-D #slot=21 conditional
STEVE	31.2 ±3.8	62.9 ±5.1	22.2 ±2.2	19.9 ±2.4
DIAS video	37.2 ±3.5	64.7 ±3.7	25.9 ±2.4	22.7 ±2.6
	YTVIS #slot=7 #step=20
VideoSAUR	33.0 ±0.6	49.0 ±0.9	30.8 ±0.4	30.1 ±0.6
SOLV	36.2 ±1.1	50.4 ±1.2	30.7 ±0.8	31.0 ±0.7
DIAS video	38.7 ±1.0	52.1 ±0.4	33.3 ±0.7	34.6 ±0.6

Table 2. Object discovery on videos. Using DINO2 ViT s/14 for OCL encoding; input resolution is 256$\times$256 (224$\times$224).

	COCO #slot=7
	ARI	ARI_fg	mBO	mIoU
VVO Tfd	21.1 ±2.1	31.5 ±1.1	29.6 ±0.7	28.2 ±0.8
VVO Mlp	19.2 ±0.4	35.9 ±0.6	28.7 ±0.3	27.4 ±0.3
VVO Dfz	18.3 ±0.4	29.0 ±1.0	27.2 ±0.1	25.8 ±0.1
DIAS image +VVO	22.9 ±1.7	43.1 ±0.8	33.2 ±0.7	31.7 ±0.5

Table 3. Combined with technique VVO, DIAS pushes SotA forward even further. Using DINO2 ViT s/14 for OCL encoding; input resolution is 256$\times$256 (224$\times$224).

	COCO #slot=7
	pretraining		training
	memory	time	memory	time
SPOT	6.4 GB	11.6 h	7.0 GB	12.1 h
DIAS image	N/A	N/A	6.5 GB	12.0 h

Table 4. DIAS needs roughly half of SPOT’s total training cost, because SPOT requires pretraining a teacher before training. Using one RTX 3080; automatic mixed precision.

4.2. Object Recognition

With slots representing different objects or background, we can further evaluate how much object information they aggregate. In other studies, this is also called set prediction (Locatello et al., 2020) or object property prediction (Fan et al., 2024). Following the routine of (Seitzer et al., 2023), we reuse the trained models from Section 4.1 to extract the real-world image dataset COCO into slots representation, and train a small MLP to take each slot as input to predict the corresponding object’s class label and bounding box. We use metrics top1 (Learn, 2025d) and R2 (Learn, 2025c) to measure the classification and regression performance.

As shown in Table 5, whether or not the general improver VVO is used, our method achieves clearly better classification accuracy and bounding box regression accuracy on these real-world images. Considering also its superiority in object discovery, we conclude that DIAS represents visual scenes with better object representation quality than all the baselines.

			COCO #slot=7
			class top1 $\uparrow$	bbox R2 $\uparrow$
DINOSAUR	+	MLP	0.61 ±0.0	0.57 ±0.1
SPOT	+	MLP	0.67 ±0.0	0.62 ±0.1
DIAS image	+	MLP	0.70 ±0.0	0.63 ±0.0

Table 5. Object recognition on COCO.

4.3. Visual Prediction and Reasoning

With improved object representation, our DIAS also helps downstream tasks like visual prediction and reasoning. Following the routine from Wu et al. (2023a), we first train our DIAS and other baselines on Physion (Bear et al., 2021), which is a synthetic video dynamics modeling dataset, and represent the dataset as slots. Then we train an object-centric world model, SlotFormer (Wu et al., 2023a), to auto-regressively predict future frames of slots, given the initial five frames of slots as the input. Based on such predictions, we further train a reasoning model to classify whether the stimuli objects will eventually come into contact.

As shown in Figure 4 upper, in terms of visual prediction, measured by L2 norm normalized MSE, our combination DIAS+SlotFormer accumulates errors significantly slower than the baseline combination VideoSAUR+SlotFormer. As shown in Figure 4 lower, in terms of visual reasoning, measured by binary classification accuracy, our combination DIAS+SlotFormer achieves better overall reasoning accuracy thanks to the much lower prediction error.

The reasoning accuracy follows a down-up-down pattern for the following reason. Early on, the model’s predictions are accurate, allowing the classifier to quickly infer the final state and achieve high accuracy. As prediction errors start to accumulate, these errors distract the classifier and reduce accuracy. When more frames are predicted, the model receives additional (though noisy) evidence near the end, which boosts accuracy temporarily, until the accumulated error becomes too great and overwhelms the useful information.

4.4. Ablation Study

redundancy	re-	self-	rand ar	COCO #slot=7
reduction	initializ.	distill.	decoding	ARI	ARI_fg
✓	✓	✓	✓	22.1 ±0.8	42.7 ±0.2
✓		✓	✓	21.3 ±0.8	42.0 ±0.4
		✓	✓	20.6 ±0.7	41.5 ±0.3
✓	✓		✓	21.4 ±1.7	41.8 ±0.5
✓	✓	✓		19.9 ±1.1	42.1 ±0.7

Table 6. Effects of four techniques.

		COCO #slot=7
redundancy reduction	re-initializ.	class top1	bbox R²
✓	✓	0.70 ±0.0	0.63 ±0.0
✓		0.68 ±0.0	0.62 ±0.0
SPOT as a reference		0.67 ±0.0	0.62 ±0.1

Table 7. Effects of re-initialization.

Here we ablate our model to evaluate how much each technique contributes to the performance gains. We conduct the ablation on COCO using both object discovery and recognition tasks.

As shown in Table 6 and Table 7, redundancy reduction in aggregated slots, which is not our contribution, is important to the performance. But, this does not mean that our technique re-initialization is not necessary; instead, re-initialization builds on the reduction and improves it even further, especially in object recognition.

As shown in Table 6, our arbitrary-order auto-regressive decoding contributes positively to the DIAS architecture. Our self-distillation is also beneficial to both the background segmentation performance, i.e., ARI, and the foreground segmentation performance, i.e., ARI_fg. Our random AR decoding contributes much more to background segmentation: removing it lowers ARI from 22.1 to 19.9, while ARI_fg only drops from 42.7 to 42.1.
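To make the comparison in Table 6 easier to read, the following sketch computes how much each ablated variant loses relative to the full model. The mean values are copied from Table 6; the dictionary keys and the helper name `drop_vs_full` are labels introduced here for illustration only.

```python
# Mean (ARI, ARI_fg) on COCO with 7 slots, copied from Table 6.
table6 = {
    "full": (22.1, 42.7),
    "w/o re-initialization": (21.3, 42.0),
    "w/o redundancy reduction and re-initialization": (20.6, 41.5),
    "w/o self-distillation": (21.4, 41.8),
    "w/o random AR decoding": (19.9, 42.1),
}


def drop_vs_full(results):
    """Return the (ARI, ARI_fg) decrease of every variant relative to the full model."""
    full_ari, full_fg = results["full"]
    return {
        name: (round(full_ari - ari, 1), round(full_fg - fg, 1))
        for name, (ari, fg) in results.items()
        if name != "full"
    }


drops = drop_vs_full(table6)
for name, (d_ari, d_fg) in drops.items():
    print(f"{name}: ARI -{d_ari}, ARI_fg -{d_fg}")
```

Read this way, removing random AR decoding costs the most ARI (2.2 points) but comparatively little ARI_fg (0.6 points), whereas removing self-distillation costs 0.7 ARI and 0.9 ARI_fg. These are differences of means only; the standard deviations reported in Table 6 should be kept in mind when interpreting the smaller gaps.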

5. Conclusion

We explored three techniques in the OCL aggregation and decoding modules: re-initialization in aggregation, self-distillation from the last aggregation attention to the first aggregation attention, and arbitrary-order auto-regressive decoding. With these techniques, DIAS achieves new state-of-the-art results on OCL tasks on both images and videos. Our re-initialization still relies on a manually set threshold, and so does our self-distillation. Removing these manual thresholds is left as future work to improve DIAS further.

Acknowledgements.

We acknowledge the support of the Finnish Center for Artificial Intelligence (FCAI), Research Council of Finland flagship program. We thank the Research Council of Finland for funding the projects ADEREHA (grant no. 353198), BERMUDA (362407), PROFI7 (352788) and MARL (357301). We also appreciate CSC - IT Center for Science, Finland, for granting access to supercomputers Mahti and Puhti, as well as LUMI, owned by the European High Performance Computing Joint Undertaking (EuroHPC JU) and hosted by CSC Finland in collaboration with the LUMI consortium. Furthermore, we acknowledge the computational resources provided by the Aalto Science-IT project through the Triton cluster.

References

Aydemir et al. (2023) Görkay Aydemir, Weidi Xie, and Fatma Guney. 2023. Self-Supervised Object-Centric Learning for Videos. Advances in Neural Information Processing Systems 36 (2023), 32879–32899.
Bar (2004) Moshe Bar. 2004. Visual Objects in Context. Nature Reviews Neuroscience 5, 8 (2004), 617–629.
Bear et al. (2021) Daniel Bear, Elias Wang, Damian Mrowca, Felix Jedidja Binder, Hsiao-Yu Tung, RT Pramod, Cameron Holdaway, Sirui Tao, Kevin A Smith, Fan-Yun Sun, et al. 2021. Physion: Evaluating Physical Prediction from Vision in Humans and Machines. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1).
Biza et al. (2023) Ondrej Biza, Sjoerd van Steenkiste, Mehdi SM Sajjadi, Gamaleldin Elsayed, Aravindh Mahendran, and Thomas Kipf. 2023. Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames. In International Conference on Machine Learning. 2507–2527.
Chen (2012) Zhe Chen. 2012. Object-Based Attention: A Tutorial Review. Attention, Perception, & Psychophysics 74 (2012), 784–802.
Elsayed et al. (2022) Gamaleldin Elsayed, Aravindh Mahendran, Sjoerd Van Steenkiste, et al. 2022. SAVi++: Towards End-to-End Object-Centric Learning from Real-World Videos. Advances in Neural Information Processing Systems 35 (2022), 28940–28954.
Everingham et al. (2010) Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision 88 (2010), 303–338.
Fan et al. (2024) Ke Fan, Zechen Bai, Tianjun Xiao, Tong He, Max Horn, Yanwei Fu, Francesco Locatello, and Zheng Zhang. 2024. Adaptive Slot Attention: Object Discovery with Dynamic Slot Number. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 23062–23071.
Gou et al. (2021) Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao. 2021. Knowledge Distillation: A Survey. International Journal of Computer Vision 129, 6 (2021), 1789–1819.
Jia et al. (2023) Baoxiong Jia, Yu Liu, and Siyuan Huang. 2023. Improving Object-centric Learning with Query Optimization. In The Eleventh International Conference on Learning Representations.
Jiang et al. (2023) Jindong Jiang, Fei Deng, Gautam Singh, and Sungjin Ahn. 2023. Object-Centric Slot Diffusion. Advances in Neural Information Processing Systems (2023).
Kahneman et al. (1992) Daniel Kahneman, Anne Treisman, and Brian J Gibbs. 1992. The Reviewing of Object Files: Object-Specific Integration of Information. Cognitive Psychology 24, 2 (1992), 175–219.
Kakogeorgiou et al. (2024) Ioannis Kakogeorgiou, Spyros Gidaris, Konstantinos Karantzalos, and Nikos Komodakis. 2024. SPOT: Self-Training with Patch-Order Permutation for Object-Centric Learning with Autoregressive Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 22776–22786.
Karazija et al. ([n. d.]) Laurynas Karazija, Iro Laina, and Christian Rupprecht. [n. d.]. ClevrTex: A Texture-Rich Benchmark for Unsupervised Multi-Object Segmentation. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
Kipf et al. (2022) Thomas Kipf, Gamaleldin Elsayed, Aravindh Mahendran, et al. 2022. Conditional Object-Centric Learning from Video. International Conference on Learning Representations (2022).
Learn ([n. d.]) Sci-Kit Learn. [n. d.]. jaccard_score. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html. Accessed: 2025-04-06.
Learn (2025a) Sci-Kit Learn. 2025a. adjusted_rand_score. Accessed: 2025-04-06.
Learn (2025b) Sci-Kit Learn. 2025b. AgglomerativeClustering. https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
Learn (2025c) Sci-Kit Learn. 2025c. r2_score. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html. Accessed: 2025-04-06.
Learn (2025d) Sci-Kit Learn. 2025d. top_k_accuracy_score. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.top_k_accuracy_score.html. Accessed: 2025-04-06.
Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V. Springer, 740–755.
Locatello et al. (2020) Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, et al. 2020. Object-Centric Learning with Slot Attention. Advances in Neural Information Processing Systems 33 (2020), 11525–11538.
Research ([n. d.]) Google Research. [n. d.]. Multi-Object Video (MOVi) datasets. https://github.com/google-research/kubric/tree/main/challenges/movi. Accessed: 2025-04-06.
Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10684–10695.
Seitzer et al. (2023) Maximilian Seitzer, Max Horn, Andrii Zadaianchuk, et al. 2023. Bridging the Gap to Real-World Object-Centric Learning. International Conference on Learning Representations (2023).
Singh et al. (2022a) Gautam Singh, Fei Deng, and Sungjin Ahn. 2022a. Illiterate DALL-E Learns to Compose. International Conference on Learning Representations (2022).
Singh et al. (2022b) Gautam Singh, Yi-Fu Wu, and Sungjin Ahn. 2022b. Simple Unsupervised Object-Centric Learning for Complex and Naturalistic Videos. Advances in Neural Information Processing Systems 35 (2022), 18181–18196.
SysCV ([n. d.]) SysCV. [n. d.]. HQ-YTVIS: High-Quality Video Instance Segmentation Dataset. https://github.com/SysCV/vmt?tab=readme-ov-file#hq-ytvis-high-quality-video-instance-segmentation-dataset. Accessed: 2025-04-06.
Uijlings et al. (2013) Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. 2013. Selective Search for Object Recognition. International Journal of Computer Vision 104 (2013), 154–171.
Wang et al. (2024) Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. 2024. OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning. arXiv preprint arXiv:2405.01533 (2024).
Wikipedia ([n. d.]) Wikipedia. [n. d.]. Exponential Smoothing. https://en.wikipedia.org/wiki/Exponential_smoothing. Accessed: 2025-04-06.
Wu et al. (2023a) Ziyi Wu, Nikita Dvornik, Klaus Greff, Thomas Kipf, and Animesh Garg. 2023a. SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models. International Conference on Learning Representations (2023).
Wu et al. (2023b) Ziyi Wu, Jingyu Hu, Wuyue Lu, Igor Gilitschenski, and Animesh Garg. 2023b. SlotDiffusion: Object-Centric Generative Modeling with Diffusion Models. Advances in Neural Information Processing Systems 36 (2023), 50932–50958.
Zadaianchuk et al. (2024) Andrii Zadaianchuk, Maximilian Seitzer, and Georg Martius. 2024. Object-Centric Learning for Real-World Videos by Predicting Temporal Feature Similarities. Advances in Neural Information Processing Systems 36 (2024).
Zhang et al. (2024) Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, and Shuicheng Yan. 2024. OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding. Advances in Neural Information Processing Systems 37 (2024), 71737–71767.
Zhao et al. (2024) Rongzhen Zhao, Vivienne Wang, Juho Kannala, and Joni Pajarinen. 2024. Grouped Discrete Representation for Object-Centric Learning. arXiv preprint arXiv:2411.02299 (2024).
Zhao et al. (2025a) Rongzhen Zhao, Vivienne Wang, Juho Kannala, and Joni Pajarinen. 2025a. Multi-Scale Fusion for Object Representation. ICLR (2025).
Zhao et al. (2025b) Rongzhen Zhao, Vivienne Wang, Juho Kannala, and Joni Pajarinen. 2025b. Vector-Quantized Vision Foundation Models for Object-Centric Learning. arXiv preprint arXiv:2502.20263 (2025).