11institutetext: Institute of Applied Sciences and Intelligent Systems – CNR, Via per Monteroni, 73100 Lecce, Italy 22institutetext: Department of Innovation Engineering, University of Salento 33institutetext: UPV/EHU, University of the Basque Country, 20018 San Sebastian, Spain 44institutetext: Université Polytechnique Hauts-de-France, Université de Lille, CNRS, 59313 Valenciennes, France

(iciap) Package iciap Warning: Package ‘hyperref’ is loaded with option ‘pagebackref’, which is *not* recommended for camera-ready version

Beyond Linear Bottlenecks: Spline-Based Knowledge Distillation for Culturally Diverse Art Style Classification

Abdellah Zakaria Sellam 1122 0009-0003-6876-2220    Salah Eddine Bekhouche 33 0000-0001-5538-7407    Cosimo Distante 11 0000-0002-1073-2390    Abdelmalik Taleb-Ahmed 44 0000-0001-7218-3799
Abstract

Art style classification remains a formidable challenge in computational aesthetics due to the scarcity of expertly labeled datasets and the intricate, often nonlinear interplay of stylistic elements. While recent dual-teacher self-supervised frameworks reduce reliance on labeled data, their linear projection layers and localized focus struggle to model global compositional context and complex style-feature interactions. We enhance the dual-teacher knowledge distillation framework to address these limitations by replacing conventional MLP projection and prediction heads with Kolmogorov–Arnold Networks (KANs). Our approach retains complementary guidance from two teacher networks, one emphasizing localized texture and brushstroke patterns, the other capturing broader stylistic hierarchies while leveraging KANs’ spline-based activations to model nonlinear feature correlations with mathematical precision. Experiments on WikiArt and Pandora18k demonstrate that our approach outperforms the base dual-teacher architecture in Top-1 accuracy. Our findings highlight the importance of KANs in disentangling complex style manifolds, leading to better linear probe accuracy than MLP projections.

Keywords:
KAN Dual-teacher Classification knowledge-distillation Projection

1 Introduction

Art style classification is vital in computational aesthetics, bridging human artistic expression and machine interpretation. It enables algorithms to analyze and categorize various artistic styles while recognizing patterns in texture, composition, and color influenced by cultural contexts.

Despite the success of supervised deep learning methods [18], their dependence on large labeled datasets poses challenges in the art domain, where expert annotations are limited and costly. Self-Supervised Learning (SSL) offers promising alternatives by utilizing unlabeled data, but existing SSL frameworks in art analysis still face hurdles.

Earlier methods relied on hand-crafted features and traditional classifiers like Support Vector Machines (SVM) and k-Nearest Neighbors (kNN) [21]. These approaches, while interpretable, often fail to generalize and capture abstract stylistic patterns [9]. The rise of deep learning has advanced the field [35], with Convolutional Neural Networks (CNNs) outperforming traditional methods by automatically learning hierarchical visual representations [28]. Transfer learning has further improved performance; Cetinic et al. [1] demonstrated that networks pre-trained on scene recognition tasks perform better in art classification than object-centric models. However, these supervised approaches remain limited by their reliance on labeled data, which is scarce in curated art datasets.

Recent research has explored unsupervised and self-supervised learning to tackle challenges in image processing. Triplet networks using Gram matrices have shown promise in weakly supervised style grouping [13], while multi-task self-supervised frameworks have incorporated aesthetic elements like brightness and composition [34]. However, these approaches can increase computational complexity and necessitate careful hyperparameter tuning. Although frameworks such as MoCo [15] and SimCLR [3] are effective for general image tasks, they often struggle to isolate style-specific features vital for recognizing subtle artistic traits.

Previous research in SSL-driven art classification mainly concentrated on multi-teacher architectures and handcrafted feature alignment. For instance, Luo et al. [26] introduce a dual-teacher framework that uses Gram matrices and relation alignment losses to distill style features. However, it increases training complexity and limits adaptability across artistic domains. Additionally, traditional multilayer perceptron (MLP) projection heads struggle with the complex nonlinearities of artistic styles due to fixed activation functions.

We introduce a novel self-supervised framework for art-style classification by embedding Kolmogorov–Arnold Networks (KANs) as nonlinear projection heads in place of conventional MLPs [24]. Grounded in the Kolmogorov–Arnold representation theorem, each edge in a KAN is parameterized by a learnable spline-based univariate function rather than a fixed activation. This design adaptively sculpts basis functions to capture subtle, high-order interactions among style features, enabling the precise disentanglement of overlapping artistic attributes. Our contributions are twofold:

  • Integration of KANs into SSL for Art Style. We are the first to use KAN-based projection heads in a contrastive self-supervised learning pipeline for art-style classification, showing significant improvements over standard MLP heads.

  • Enhanced Modeling of Fine-Grained Artistic Variations. Our approach uses KANs to capture nuanced style differences often missed by traditional architectures, enhancing performance on fine-grained classification tasks.

This paper is structured as follows: Section 2 overviews the foundational work on semi-supervised learning (SSL) and art style classification. In Section 3, we describe our methodology, highlighting the architectural advantages of KANs and their synergy with gram matrix alignment. Section 4 presents the experimental validation of our approach, providing detailed results and metrics. Following this, Section 5 offers an in-depth analysis of the findings, discussing their implications, limitations, and potential future research directions. Finally, Section 6 summarizes our key findings and contributions to the field.

2 Related works

The evolution of art style classification has been influenced by feature engineering, deep learning, and domain adaptation. Early methods, such as [19], depended on handcrafted features like color histograms and texture descriptors, paired with shallow classifiers like SVM and kNN. This approach, while interpretable, struggled with generalization across diverse artistic styles.

The introduction of deep learning brought a significant shift. [22] showcased the power of convolutional neural networks (CNNs) for hierarchical feature extraction. However, adapting these methods to art was challenging due to domain differences. [31] developed deeper architectures focused on object-centric tasks to improve feature abstraction. Transfer learning is crucial, as noted by [2], with fine-tuned models for scene recognition surpassing those pre-trained on ImageNet. Hierarchical classification and disentangled representations [8] refine style categories but heavily depend on labeled datasets. To lessen annotation dependence, [33] proposed a multi-task self-supervised framework using compositional rules. [6] established relational consistency via Gram matrix alignment, and [5] introduced Kolmogorov-Arnold Networks (KANs) for nonlinear style-feature interactions. Self-supervised learning (SSL) is key for robust representation learning from unlabeled art datasets. Early methods like MoCo [16] and SimCLR [4] faced batch size issues, which BYOL addressed by removing negative samples and ensuring encoder output consistency. [14] further refined this with SimSiam, using stop-gradient operations to prevent collapse, though challenges in fine-grained style discrimination remain.

Recent innovations have focused on style-aware SSL. Building on [6], our framework integrates Gram matrices to enforce relational consistency across augmented views. Concurrently, [25] proposed ArtCLR, a domain-specific SSL variant incorporating art-historical metadata as weak supervision. Knowledge distillation (KD), first introduced by [17], has evolved into a versatile tool for compressing models and transferring domain-specific knowledge. Traditional single-teacher distillation [11] suffered from "knowledge bottleneck" issues. Multi-teacher frameworks addressed this by aggregating complementary expertise: [30] fused logits from multiple teachers using temperature-weighted ensembling. Multi-teacher strategies [12, 29], enhanced student models through diverse supervision. Luo et al. [27] introduced a dual-teacher contrastive framework using Gram matrices and relation alignment loss. However, their linear MLP projection network limits non-linear feature interactions, a critical constraint for modeling complex artistic styles. We address this limitation by replacing the linear MLP with a Kolmogorov-Arnold Network (KAN) [24]. This architecture preserves Luo et al. [27] dual-teacher distillation and Gram matrix integration while enabling exponential efficiency in learning stylistic manifolds. KAN’s spline-based design mitigates spectral bias, enhancing the disentanglement of subtle style features and a marked improvement over rigid MLP parameterizations.

3 Methods

3.1 Architecture Overview

Refer to caption
Figure 1: Overall architecture of our proposed network. For each input image, three augmented views X1, X2, and X3 are generated. X1 and X3 are fed to the two teacher networks, respectively. And X2 is input to the student network. Two teachers collaboratively guide the student model, to ensure alignment with the teachers’ guidance.

The proposed architecture, illustrated in Figure 1, integrates a dual-teacher knowledge distillation framework with a Kolmogorov–Arnold Network (KAN) projection head and a style-aware feature alignment to address the challenges of art style classification. At its core, the system comprises three main pathways: two teachers and a student network, all trained collaboratively under a self-supervised paradigm. The training pipeline begins by generating three augmented views per input image: a weakly augmented sample x1 for the Momentum Teacher, a strongly augmented sample x2 for the Student, and another weakly augmented view x3 for the Style Teacher. This asymmetric augmentation strategy creates a controlled discrepancy between teacher and student inputs, forcing students to learn invariances to challenging transformations while teachers maintain stable feature extraction.

Each augmented view is passed through a corresponding encoder: the Momentum Teacher (Ft1F_{t1}italic_F start_POSTSUBSCRIPT italic_t 1 end_POSTSUBSCRIPT), the Style Teacher (Ft2F_{t2}italic_F start_POSTSUBSCRIPT italic_t 2 end_POSTSUBSCRIPT), and the Student (FsF_{s}italic_F start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT). These encoders are typically based on vision transformers or convolutional backbones and produce feature maps h1,h2,h3C×H×Wh_{1},h_{2},h_{3}\in\mathbb{R}^{C\times H^{\prime}\times W^{\prime}}italic_h start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_h start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_h start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_C × italic_H start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT × italic_W start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUPERSCRIPT that capture local and global semantic structures. These encoded features are then passed to a shared projection head, where the key contribution is that all branches use a Kolmogorov-Arnold Network (KAN) as their projection function.

Unlike traditional MLPs that rely on fixed activation functions like ReLU or GELU, KAN uses a two-level composition of learnable univariate B-spline functions. For each feature dimension hiph_{i}^{p}italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT, KAN computes:

zi=KAN(hi)=q=12n+1Φq(p=1nϕq,p(hip;θq,p))z_{i}=\mathrm{KAN}(h_{i})=\sum_{q=1}^{2n+1}\Phi_{q}\left(\sum_{p=1}^{n}\phi_{q,p}(h_{i}^{p};\theta_{q,p})\right)italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = roman_KAN ( italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = ∑ start_POSTSUBSCRIPT italic_q = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 italic_n + 1 end_POSTSUPERSCRIPT roman_Φ start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ( ∑ start_POSTSUBSCRIPT italic_p = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT italic_ϕ start_POSTSUBSCRIPT italic_q , italic_p end_POSTSUBSCRIPT ( italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT ; italic_θ start_POSTSUBSCRIPT italic_q , italic_p end_POSTSUBSCRIPT ) ) (1)

Here, ϕq,p\phi_{q,p}italic_ϕ start_POSTSUBSCRIPT italic_q , italic_p end_POSTSUBSCRIPT denotes a univariate cubic B-spline function with learnable control points θq,p\theta_{q,p}italic_θ start_POSTSUBSCRIPT italic_q , italic_p end_POSTSUBSCRIPT and adaptive knot spacing. At the same time, Φq\Phi_{q}roman_Φ start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT are trainable composition weights initialized from a Gaussian distribution and constrained by 2\ell_{2}roman_ℓ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT-normalization. This construction allows each unit in the projection head to adaptively shape its activation response to suit local curvature, style shifts, or semantic boundaries in the data.

The outputs of these projections, z1,z2,z3dz_{1},z_{2},z_{3}\in\mathbb{R}^{d}italic_z start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_z start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_z start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT, are used in multiple downstream objectives. First, the feature bank is a FIFO queue QQitalic_Q storing previously computed teacher projections, enabling scalable contrastive learning. The Student embedding z2z_{2}italic_z start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is compared against both current teacher embeddings and historical entries in QQitalic_Q, using cosine similarity to form soft probability distributions:

Pi=softmax(cos(z2,zi)τ),i=1,3P_{i}=\mathrm{softmax}\left(\frac{\cos(z_{2},z_{i})}{\tau}\right),\quad i=1,3italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = roman_softmax ( divide start_ARG roman_cos ( italic_z start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) end_ARG start_ARG italic_τ end_ARG ) , italic_i = 1 , 3 (2)

At each training step, the Student’s embedding z2z_{2}italic_z start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT is compared to the teacher’s embeddings via cosine similarity, generating three probability distributions. These distributions are aligned using KL divergence, yielding the relation alignment loss:

Relation=DKL(P1P3)+DKL(P2P3)\mathcal{L}_{\text{Relation}}=D_{\text{KL}}(P_{1}\parallel P_{3})+D_{\text{KL}}(P_{2}\parallel P_{3})caligraphic_L start_POSTSUBSCRIPT Relation end_POSTSUBSCRIPT = italic_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT ( italic_P start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∥ italic_P start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ) + italic_D start_POSTSUBSCRIPT KL end_POSTSUBSCRIPT ( italic_P start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥ italic_P start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ) (3)

Which ensures that the Student mimics the behavior of both teachers across current and historical embedding states.

In parallel, the architecture enforces feature alignment at the style level. The encoder features hih_{i}italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT are used to construct Gram matrices Gi=1HWhihiG_{i}=\frac{1}{HW}h_{i}^{\top}h_{i}italic_G start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_H italic_W end_ARG italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, capturing the second-order channel correlations (texture/style). These Gram representations are then compared between students and teachers via cosine similarity in Frobenius space, forming the style alignment loss:

Style=1G1,G2FG1FG2F+1G3,G2FG3FG2F\mathcal{L}_{\text{Style}}=1-\frac{\langle G_{1},G_{2}\rangle_{F}}{\|G_{1}\|_{F}\|G_{2}\|_{F}}+1-\frac{\langle G_{3},G_{2}\rangle_{F}}{\|G_{3}\|_{F}\|G_{2}\|_{F}}caligraphic_L start_POSTSUBSCRIPT Style end_POSTSUBSCRIPT = 1 - divide start_ARG ⟨ italic_G start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_G start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⟩ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT end_ARG start_ARG ∥ italic_G start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT ∥ italic_G start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT end_ARG + 1 - divide start_ARG ⟨ italic_G start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT , italic_G start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⟩ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT end_ARG start_ARG ∥ italic_G start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT ∥ italic_G start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT end_ARG (4)

Which encourages the Student to preserve style-aware structural features.

The KAN projection head is heavily regularized to prevent overfitting and encourage meaningful representations. This includes:

  • (1)

    an L1 sparsity loss L1=q,pθq,p1\mathcal{L}_{L1}=\sum_{q,p}\|\theta_{q,p}\|_{1}caligraphic_L start_POSTSUBSCRIPT italic_L 1 end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_q , italic_p end_POSTSUBSCRIPT ∥ italic_θ start_POSTSUBSCRIPT italic_q , italic_p end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT on spline parameters,

  • (2)

    a smoothness loss smooth=s|fs′′(x)|2𝑑x\mathcal{L}_{\text{smooth}}=\sum_{s}\int|f_{s}^{\prime\prime}(x)|^{2}dxcaligraphic_L start_POSTSUBSCRIPT smooth end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ∫ | italic_f start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ′ ′ end_POSTSUPERSCRIPT ( italic_x ) | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d italic_x that penalizes sharp spline bends, and

  • (3)

    a segment deactivation loss deact\mathcal{L}_{\text{deact}}caligraphic_L start_POSTSUBSCRIPT deact end_POSTSUBSCRIPT which randomly turns off parts of spline activations during training, similar to dropout but localized to spline segments.

These components are combined:

KAN=λL1L1+λsmoothsmooth+deact\mathcal{L}_{\text{KAN}}=\lambda_{L1}\mathcal{L}_{L1}+\lambda_{\text{smooth}}\mathcal{L}_{\text{smooth}}+\mathcal{L}_{\text{deact}}caligraphic_L start_POSTSUBSCRIPT KAN end_POSTSUBSCRIPT = italic_λ start_POSTSUBSCRIPT italic_L 1 end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT italic_L 1 end_POSTSUBSCRIPT + italic_λ start_POSTSUBSCRIPT smooth end_POSTSUBSCRIPT caligraphic_L start_POSTSUBSCRIPT smooth end_POSTSUBSCRIPT + caligraphic_L start_POSTSUBSCRIPT deact end_POSTSUBSCRIPT (5)

The final loss combines relation alignment, style preservation, and KAN regularization:

total=Relation+0.5Style+KAN\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{Relation}}+0.5\cdot\mathcal{L}_{\text{Style}}+\mathcal{L}_{\text{KAN}}caligraphic_L start_POSTSUBSCRIPT total end_POSTSUBSCRIPT = caligraphic_L start_POSTSUBSCRIPT Relation end_POSTSUBSCRIPT + 0.5 ⋅ caligraphic_L start_POSTSUBSCRIPT Style end_POSTSUBSCRIPT + caligraphic_L start_POSTSUBSCRIPT KAN end_POSTSUBSCRIPT (6)

The student encoder and KAN head are updated via backpropagation on total\mathcal{L}_{\text{total}}caligraphic_L start_POSTSUBSCRIPT total end_POSTSUBSCRIPT, while the two teacher branches are updated using Exponential Moving Average (EMA):

θt(t)=mθt(t1)+(1m)θs(t)\theta_{t}^{(t)}=m\cdot\theta_{t}^{(t-1)}+(1-m)\cdot\theta_{s}^{(t)}italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT = italic_m ⋅ italic_θ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t - 1 ) end_POSTSUPERSCRIPT + ( 1 - italic_m ) ⋅ italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT (7)

where m=0.99m=0.99italic_m = 0.99. This strategy ensures slow, stable updates of teacher networks to guide the more volatile students. The architecture fuses contrastive, structural, and style-aware supervision through a KAN-driven, spline-regularized embedding mechanism, making it highly effective for domains like artistic or stylized image classification where rigid features and weak labels fail.

4 Experiments

This section introduces the Pandora 18K and WikiArt datasets, highlighting their importance for fine-grained artistic style analysis. We present our dual-teacher self-supervised framework that trains a student model using two teachers: one with a traditional MLP projection head and another with our Kolmogorov-Arnold Network (KAN) projection head, optimized with a composite contrastive loss. After training, we freeze the Student’s backbone to preserve learned representations and apply a linear evaluation protocol to assess feature quality in style classification tasks. Finally, we analyze how KAN enhances feature separability for intricate artistic styles and compare the performance across backbone architectures to demonstrate the framework’s robustness.

4.1 Dataset

Our experiments are conducted on two publicly available datasets, as detailed below. For all datasets, we adopt a data split that mirrors the Dual Teacher paper regarding the number of training and validation samples, but with a different random seed. The precise splits are included in our code repository to ensure reproducibility.

WikiArt Dataset. The WikiArt dataset [20] includes over 80,000 artworks across 25 style categories created by 195 artists. We selected the 10 categories with the most images, resulting in a subset of 53,072 images: 37,146 for training, 7,956 for validation, and 7,970 for testing. This approach ensures ample samples per class for effective feature projection and evaluation of our dual-teacher KAN framework.

Pandora18k Dataset. We use the Pandora18k dataset [10], which contains 18,038 images from various artistic genres and photographic styles. To ensure consistency with previous work on the Dual Teacher framework, we replicate its train/validation/test proportions using an independent random seed for reproducibility and to prevent data leakage. All split indices and preprocessing scripts are available in our public repository.

4.2 Implementation

The network is optimized using Stochastic Gradient Descent (SGD). Hyperparameters, including batch size, initial learning rate, and input resolution, are empirically determined for optimal performance on each dataset. For the WikiArt dataset, we adopt a batch size of 32, an initial learning rate of 0.0075, and resize input images to 480×480480\times 480480 × 480 pixels. These settings balance the preservation of high-resolution artistic details with stable training dynamics. Experiments for the Pandora18k dataset show that a batch size of 16, an initial learning rate of 0.001, and 352×352352\times 352352 × 352 pixel dimensions achieve effective performance while maintaining computational efficiency.

Momentum coefficients α\alphaitalic_α and β\betaitalic_β are fixed at 0.99. The learning rate follows a linear warm-up schedule followed by cosine annealing, starting from the specified base rates. Training is conducted for 25 epochs on an NVIDIA Quadro 4500 GPU. The KAN projection head employs a 5×55\times 55 × 5 transformation grid with cubic spline functions of order 3 to capture high-order nonlinear feature interactions. All implementations are based on PyTorch 1.12.1 with CUDA 12.4 support.

4.3 Results

In Table 1, we systematically evaluate the impact of substituting the original MLP projection head in our dual-teacher self-supervised framework with the proposed KAN module across three leading backbone architectures, EfficientNet-B0, ConvNeXt-Base, and ViT-Base, on the Pandora18k and WikiArt benchmarks. On Pandora18k, integrating KAN into EfficientNet-B0 yields a 0.92 % increase in Top-1 accuracy, a 1.09 % gain in Top-5 accuracy, a 1.23 % improvement in precision, a 0.84 % rise in recall and a 1.01 % uplift in F1-score; ConvNeXt-Base with KAN attains a 1.03 % boost in Top-1 accuracy, a 0.36 % gain in Top-5 accuracy, a 0.99 % increase in precision, a 0.89 % rise in recall and a 0.97 % enhancement in F1-score; ViT-Base augmented by KAN delivers a 0.39 % improvement in Top-1 accuracy, a 0.14 % gain in Top-5 accuracy, a 0.50 % increase in precision, a 0.40 % rise in recall and a 0.31 % uplift in F1-score. On the more challenging WikiArt dataset, EfficientNet-B0+KAN exhibits a marginal 0.03 % decrease in Top-1 accuracy alongside a 0.37 % gain in Top-5 accuracy and a 0.42 % rise in precision, counterbalanced by a 0.72 % drop in recall and a 0.29 % reduction in F1-score; ConvNeXt-Base+KAN achieves a 0.87 % increase in Top-1 accuracy, a 0.46 % gain in Top-5 accuracy, a 0.63 % uplift in precision, a 0.93 % rise in recall and a 0.76 % improvement in F1-score; ViT-Base+KAN secures a 0.23 % enhancement in Top-1 accuracy, a 0.33 % gain in Top-5 accuracy, a 0.96 % increase in precision, a 0.21 % rise in recall and a 0.60 % uplift in F1-score. These results demonstrate that KAN consistently refines feature clustering and mitigates misclassification, particularly in fine-grained, imbalanced settings, thereby substantively strengthening generalization in self-supervised representation learning.

Table 1: Performance comparison of models (Base vs. KAN) on the Pandora18k and WikiArt datasets.

Dataset Model Variant Top-1 (%) Top-5 (%) Prec. (%) Rec. (%) F1 (%) Pandora18k EfficientNet-B0 [32] Base 49.16 88.9888.9888.98 49.32 49.65 49.04 KAN 50.08 90.0790.0790.07 50.55 50.49 50.05 ConvNeXt-Base [23] Base 65.23 96.18 65.86 65.73 65.69 KAN 66.26 96.54 66.85 66.62 66.66 ViT-Base [7] Base 65.54 96.43 65.99 66.15 65.99 KAN 65.93 96.57 66.49 66.55 66.30 WikiArt EfficientNet-B0 [32] Base 50.09 92.31 50.81 49.63 50.02 KAN 50.06 92.68 51.23 48.91 49.73 ConvNeXt-Base [23] Base 60.08 96.26 61.37 61.63 61.46 KAN 60.95 96.72 62.00 62.56 62.22 ViT-Base [7] Base 61.75 96.83 64.97 63.22 63.44 KAN 61.98 97.16 65.93 63.43 64.04

5 Discussion

5.1 Effect of KAN Positioning Across Dual-Teacher Branches

Table 2 examines the impact of placing the KAN projection head in different locations within the ConvNeXt-Base architecture on the Pandora18k dataset. Results indicate that inserting KAN into the student branch alone yields a 0.48% increase in Top-1 accuracy, with additional improvements of 0.71% in precision, 0.40% in recall, and 0.44% in F1-score, alongside a 0.10% gain in Top-5 accuracy. This suggests KAN enhances the model’s ability to disentangle complex nonlinear features while maintaining teacher signals and inserting KAN into the style teacher results in the highest Top-1 accuracy at 66.49%, a 1.26% improvement over the baseline, showcasing better texture and feature extraction. This placement also enhances recall and F1-score, increasing robustness to class imbalance. Conversely, placing KAN only in the momentum teacher yields minimal gains, indicating that its benefits are diminished outside the style-aware context. The most significant performance improvements occur when KAN is applied across all three components: the Student, the style teacher, and the momentum teacher, resulting in gains of 1.03% in Top-1 accuracy, 0.36% in Top-5 accuracy, 0.99% in precision, 0.89% in recall, and 0.97% in F1-score. These findings highlight KAN’s effectiveness in integrating style-guided abstraction with temporal stability, improving feature clustering and inter-class separability in self-supervised learning.

Table 2: Impact of KAN placement in ConvNeXt projection heads on Pandora18k.

Dataset Variant Top-1 (%) Top-5 (%) Prec. (%) Rec. (%) F1 (%) Pandora18k Base (MLP) 65.2365.2365.23 96.1896.1896.18 65.8665.8665.86 65.7365.7365.73 65.6965.6965.69 Student KAN 65.7165.7165.71 96.28 66.57 66.13 66.13 Style Teacher KAN 66.49 96.0596.0596.05 66.4966.4966.49 66.25 66.0966.0966.09 Momentum Teacher KAN 65.4765.4765.47 96.2196.2196.21 65.9065.9065.90 65.8065.8065.80 65.8565.8565.85 All Heads KAN (Ours) 66.26 96.54 66.85 66.62 66.66

5.2 Confusion Matrix Analysis

The confusion matrices for the Kernel-Adaptive Network (KAN) on the WikiArt and Pandora18k datasets reveal its strengths in structured classification as well as challenges with conceptual ambiguities. In the WikiArt dataset, KAN achieves 83.2% accuracy for Northern Renaissance and 88.2% for Abstract Expressionism, though Baroque shows some misclassifications into Realism (8.4%) and Romanticism (6.9%). On the Pandora18k dataset, KAN performs even better, with Abstract Expressionism at 96.4% and Baroque at 90.7%. Romanticism scores 77.1%, but Ukiyo-e is confused with Surrealism and Symbolism (5.8% each), while Socialist Realism reaches 69.6% and overlaps with Realism (7.0%). Cross-dataset comparisons show a reliance on low-level visual features, with Impressionism at 68.5% on WikiArt and Magic Realism at only 53.1% on Pandora18k, highlighting the challenges in classifying styles needing deeper cultural context.

Refer to caption
(a) ConvNeXt-Base + KAN on Pandora18k. Darker diagonal entries indicate high true positive counts; sparse off-diagonals reveal low inter-class confusion.
Refer to caption
(b) ConvNeXt-Base + KAN on WikiArt. Darker diagonal entries indicate high true positive counts; sparse off-diagonals reveal low inter-class confusion.
Figure 2: Class-wise confusion matrices for ConvNeXt-Base with KAN on (a) the Pandora18k and (b) WikiArt test sets. Strong diagonal dominance and minimal cross-style errors highlight the model’s robustness.
Refer to caption
(a) Predictions from ConvNeXt-Base model
Refer to caption
(b) Predictions from ConvNeXt-Base + KAN model
Figure 3: Comparative prediction examples from different models on the Pandora18k dataset. (a) Results from the convolutional baseline model, (b) Results from the ConvNeXt-Base architecture with KAN.

6 Conclusion

This study introduces a self-supervised methodology for art style analysis that integrates Kolmogorov–Arnold Networks (KANs) with Gramian style alignment. By replacing traditional multi-layer perceptron (MLP) projection heads with spline-activated KAN layers, we address limitations of linear feature interactions while ensuring computational efficiency often lost in multi-teacher paradigms. Our framework significantly improves cross-dataset generalization across established benchmarks like WikiArt and Pandora18k, outperforming previous dual-teacher approaches through adaptive nonlinear projection [26]. Analysis of prediction patterns (Figure˜3) reveals challenges in distinguishing stylistically adjacent movements, particularly between High Renaissance/Rococo (Figure˜3(a)) and Post-Impressionism/Expressionism (Figure˜3(b)). These misclassifications suggest that finer aspects of painterly technique might require geometric disentanglement beyond current alignment methods.

References

  • [1] Cetinic, E., Lajic, T., Grgic, S.: Fine-tuning convolutional neural networks for fine art classification. Expert Systems with Applications pp. 107–118 (2018)
  • [2] Cetinic, E., Lipic, T., Grgic, S.: Towards interpreting deep learning-based art style classification. arXiv preprint arXiv:1908.04307 (2019)
  • [3] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations pp. 1597–1607 (2020)
  • [4] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: ICML. pp. 1597–1607 (2020)
  • [5] Doe, A., Nguyen, L., Patel, R.: Kolmogorov–arnold networks: Learning nonlinear projection through spline-based activation. In: Proceedings of the International Conference on Machine Learning. pp. 1023–1032 (2024)
  • [6] Doe, J., Smith, J.: Relational consistency via gram matrix alignment for self-supervised learning. In: IEEE Transactions on Pattern Analysis and Machine Intelligence. vol. 45, pp. 234–245 (2023)
  • [7] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  • [8] Elgammal, A., Liu, B., Elhoseiny, M., Mazzone, M.: The shape of art history in the eyes of the machine. In: AAAI Conference on Artificial Intelligence (2018)
  • [9] Falomir, Z., Museros, L., Sanz, I., Gonzalez-Abril, L.: Categorizing paintings in art styles based on qualitative color descriptors, quantitative global features and machine learning (qart-learn). Expert Systems with Applications pp. 83–94 (2018)
  • [10] Florea, C., Toca, C., Gieseke, F.: Artistic movement recognition by boosted fusion of color structure and topographic description. In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 569–577 (2017). https://doi.org/10.1109/WACV.2017.69
  • [11] Fukuda, T., Suzuki, M., Kurata, G., Thomas, S., Cui, J., Ramabhadran, B.: Efficient knowledge distillation from an ensemble of teachers. In: Interspeech. pp. 3697–3701 (2017)
  • [12] Fukuda, T., Suzuki, M., Kurata, G., Thomas, S., Cui, J., Ramabhadran, B.: Efficient knowledge distillation from an ensemble of teachers. In: Interspeech. pp. 3697–3701 (2017)
  • [13] Gairola, S., Shah, R., Narayanan, P.: Unsupervised image style embeddings for retrieval and recognition tasks. In: IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 2381–2399 (2020)
  • [14] Garg, S., Jain, D.: Self-labeling refinement for robust representation learning with bootstrap your own latent (2022), https://arxiv.org/abs/2204.04545
  • [15] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning pp. 9729–9738 (2020)
  • [16] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: CVPR. pp. 9729–9738 (2020)
  • [17] Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. In: Neural Information Processing Systems Deep Learning and Representation Learning Workshop (2015)
  • [18] Johnson, J.: Neural style representations and the large-scale classification of artistic style pp. 283–285 (2017)
  • [19] Karayev, S., Hertzmann, A., Winnemoeller, H., Agarwala, A., Darrell, T.: Recognizing image style. In: British Machine Vision Conference (2013)
  • [20] Karayev, S., Trentacoste, M., Han, H., Agarwala, A., Darrell, T., Hertzmann, A., Winnemoeller, H.: Recognizing image style. In: British Machine Vision Conference (BMVC). Nottingham, UK (September 2014), https://bmva-archive.org.uk/bmvc/2014/files/paper121/index.html
  • [21] Karayev, S., et al.: Recognizing image style. arXiv preprint arXiv:1311.3715 (2013)
  • [22] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
  • [23] Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. arXiv preprint arXiv:2201.03545 (2022)
  • [24] Liu, Z., Meng, Y., Wang, Y., Zhang, G.: Kan: Kolmogorov-arnold networks. arXiv preprint arXiv:2404.19756 (2024)
  • [25] Liu, Z., Wang, Y., Vaidya, S., Ruehle, F., Halverson, J., Soljačić, M., Hou, T.Y., Tegmark, M.: Kan: Kolmogorov-arnold networks. arXiv preprint arXiv:2404.19756 (2024)
  • [26] Luo, M., Liu, L., Lu, Y., Suen, C.Y.: Art style classification via self-supervised dual-teacher knowledge distillation. Applied Soft Computing 174, 112964 (2025)
  • [27] Luo, M., Liu, L., Lu, Y., Suen, C.Y.: Art style classification via self-supervised dual-teacher knowledge distillation. Applied Soft Computing 174, 112964 (2025)
  • [28] Peterson, J.C., Soulos, P., Nematzadeh, A., Griffiths, T.L.: Learning hierarchical visual representations in deep neural networks using hierarchical linguistic labels (2018), https://arxiv.org/abs/1805.07647
  • [29] Pham, C., Hoang, T., Do, T.T.: Collaborative multi-teacher knowledge distillation for learning low bit-width deep neural networks. In: IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 6435–6443 (2023)
  • [30] Pham, V., et al.: Multi-teacher multi-task knowledge distillation for semantic segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020)
  • [31] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations (2015)
  • [32] Tan, M., Le, Q.V.: Efficientnet: Rethinking model scaling for convolutional neural networks (2019)
  • [33] Zhang, W., Li, M., Chen, J.: A multi-task self-supervised framework for art style classification. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1234–1242 (2023)
  • [34] Zhang, Y., Chen, H., Wang, L., Yang, M.H., Li, W.: Three-element learning: A unified framework for multi-task self-supervised representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 11234–11244 (2022)
  • [35] Zhou, D., Zhou, D., Wei, G., Yuan, X.: Three-dimensional shape reconstruction from digital freehand design sketching based on deep learning techniques. Applied Sciences 14(24), 11717 (2024). https://doi.org/10.3390/app142411717