Background & Summary

Similarity perception is an abstract and complex concept that differs widely across mental and computational models, as explored in (computational) neuroscience1,2, computer vision3,4,5,6, or (computational) cognitive science7,8. For mental and conceptual models, similarity refers to resemblance or alikeness and describes groups with some shared properties, as prominently outlined in Wittgenstein’s remarks on family resemblance9,10. Conversely, in computational models, similarity denotes proximity and is conventionally defined as inversely correlated with distance between data points in a metric space.

Computer Vision applications heavily rely on such visual similarity, often utilizing vector embeddings that span a measurable multidimensional space in which images are indexed. In similarity learning, the goal is to train models that accurately capture the underlying similarities between data points, enabling tasks such as image retrieval or classification based on similarity metrics11,12,13,14,15,16,17. However, similarity in these contexts is often treated as a single, implicit notion, overlooking the multifaceted nature of similarity perception that is crucial for informed decision-making when selecting models or methods. For instance, Ref. 18 utilizes CLIP19 and DINO20 to evaluate the subject fidelity of generated images, acknowledging the varying importance of different similarity aspects. It is generally assumed that CLIP captures semantic relationships, while DINO focuses more on visual features. Yet validating such assumptions remains a significant challenge.

Research in quantitative and computational aesthetics21,22,23, as well as the interplay of computation and human cultures24,25, requires reliable benchmark datasets that are interpretable by both machines and humans. Previous work has relied on embeddings of large numbers of well-known artworks26,27 or on synthetic datasets of limited size28,29,30,31.

Benchmark image datasets for perceptual similarity judgment exist, with some relying on annotated text captions of real-world images32, while others utilize synthetic image triplets designed to better align with mental models6. However, these datasets primarily focus on specific tasks or aspects of similarity perception and alignment, such as zero-shot evaluation or similarity metric optimization.

Here we propose Style Aligned Artwork Datasets (SALAD), with the fruit-SALAD serving as an exemplar. This synthetic image dataset comprises 10,000 generated images featuring 10 easily recognizable fruit categories, each represented in 10 visually distinct styles, with 100 instances per category-style combination (see an example set of one instance per combination in Fig. 1). Developed as a benchmark tool rather than for training purposes, the dataset is constructed along two highly controlled property dimensions – semantic (fruit category) and stylistic (artistic style) – that cannot be isolated at this level in existing real-world image datasets and therefore required image generation. The deliberate control over the semantic and stylistic properties of each image facilitates comparative analysis of different image embedding and complexity models, enabling an exploration of their similarity perception that is only possible at scale through synthetic images.

Fig. 1

Overview of the first instance of 10 fruit categories in 10 styles. Columns display fruit categories and rows display style categories, with labels that approximate the style prompts. The full dataset contains 100 instances of each category-style combination, resulting in 10,000 unique fruit depictions. See Fig. 3 for an example of all 100 instances of one combination.

We characterize the dataset through various machine learning models and measures of aesthetic complexity, showcasing how simple pairwise comparisons of image vectors can yield robust, inter-comparable measures. Our examples reveal significant differences in similarity awareness across these methods and models, grounding otherwise anecdotal observations about the effects of model or algorithm design, training data, parameter configuration, and similarity measures. In turn, this approach can be used to guide model training and alignment.

The fruit-SALAD offers opportunities for joint robust quantification and qualitative human interpretation, enhancing both algorithmic and human understanding of the differences between measured vector similarity and perceived visual resemblance across computational and statistical models. This approach allows for a more comprehensive assessment of similarity perception, beyond the scope of existing benchmark datasets, ultimately contributing to a deeper understanding of computational and human similarity perception mechanisms.

Methods

Image generation

We used Stable Diffusion XL (SDXL)33 and StyleAligned34 to create the fruit-SALAD by carefully crafting image generation prompts and supervising the automation process. Diffusion probabilistic models35 are typically trained with the objective of denoising images corrupted by noise. By iteratively refining images starting from random noise, these models can be conditioned on text prompts to generate images. Such text-to-image models, exemplified by DALL-E36, Midjourney, and Stable Diffusion37, have recently gained significant attention in various creative and commercial domains. These models and services have simplified the synthesis of high-quality individual images, enabling unprecedented ease of use through natural language. However, scaling the generation process and achieving stylistically consistent images remain challenging; style alignment methods34 address this by coordinating shared attention across multiple generations based on a reference style image.

We utilized a computational approach to scale the image generation process (see Fig. 2). Initially, we experimented in a trial-and-error fashion with different style prompts in conjunction with different fruit categories, using SDXL33 for image generation. Successful results were selected as style references. We then used style alignment34 to generate multiple instances of different fruits within the same style using diffusion inversion38 of the reference image. Through several iterations and adjustments to the prompts, we refined the process and eventually automated the generation to produce 100 instances for each fruit-style combination (see all 100 instances of one fruit-style example in Fig. 3).
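As an illustration of the first step, the following Python sketch shows how a style reference image could be generated with SDXL via the Hugging Face diffusers library; the model identifier and sampling parameters are illustrative assumptions rather than a verbatim record of our pipeline, and the subsequent style-aligned batch generation relies on the StyleAligned reference implementation34, which is not reproduced here.

import torch
from diffusers import StableDiffusionXLPipeline

# Minimal sketch of step 1 (style reference generation); model identifier and
# parameters are illustrative, not a verbatim record of our pipeline.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# One of the curated style prompts combined with an apple (cf. Fig. 3).
prompt = "watercolor sketch of a gala apple, aquarelle, wet paint"
image = pipe(prompt=prompt, num_inference_steps=50).images[0]
image.save("style_reference_watercolor.png")

# Step 2, the style-aligned generation over all 10 fruit categories via
# diffusion inversion of this reference image, uses the StyleAligned
# shared-attention mechanism and is omitted here.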

Fig. 2

Overview of the image generation process. 1. Style reference image generation with Stable Diffusion XL33 in a manual trial-and-error fashion, using text prompts of a style description in combination with “an apple”. 2. Style-aligned image generation34 based on each style reference image, using diffusion inversion and text prompts iterating over 10 fruit categories with 100 instances each, resulting in 10,000 images. 3. Manual curation with examples of the selection criteria: minor issues that do not impact recognition of category or style were tolerated (green), while major issues rendering an image unrecognizable or inconsistent with its style were rejected (red). The final step includes feature extraction to construct image embeddings for model comparison.

Fig. 3

All instances of fruit category 3 (apple) in style category 1 (Watercolor). Corresponds to 100 dataset files 3_1_0.png to 3_1_99.png. Text prompt: “watercolor sketch of a gala apple, aquarelle, wet paint”.

The fruit prompts and stylistic references we selected were carefully curated to improve the robustness of the style alignment generation method. Among the fruit prompts, we balanced fruit prototypicality against variability across different stylistic prompts to ensure compatibility with generation at scale, while simultaneously covering a wide range of fruit shapes and colors. Similarly, our selection of stylistic references was based on their effectiveness in aligning with the generation space, focusing on those that achieved superior stylistic coherence.

We maintained dataset quality by visually assessing the entire dataset in 100 batches of 10 by 10 image grids and manually replacing images that were inconsistent with the other instances of their combination (see examples of the manual selection criteria in Fig. 2). The final dataset with its category and style classes may therefore be biased by our own aesthetic arbitration, akin to the inherent specificity of a chosen set of handwritten digits39.

Image embeddings

Our exemplary vector embeddings are derived from machine learning models and compression algorithms using various commonly employed methods (Table 1). For the models of refs. 19,20,40,41,42,43 we extracted feature vectors from the flattened last hidden states. For refs. 44,45,46 we used average pooling over the second-to-last layer.
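The following Python sketch illustrates both extraction schemes using the Hugging Face transformers library; the model identifier, preprocessing, and layer handling are illustrative assumptions and may differ in detail from our extraction code.

import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Illustrative sketch of feature extraction; model identifier and preprocessing
# are assumptions, not a verbatim record of our extraction code.
processor = AutoImageProcessor.from_pretrained("facebook/dino-vitb16")
model = AutoModel.from_pretrained("facebook/dino-vitb16").eval()

image = Image.open("3_1_0.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Scheme A: flatten the last hidden states (tokens x hidden dim) into one vector.
flat_vector = outputs.last_hidden_state.flatten().numpy()

# Scheme B: average pooling over the second-to-last layer's hidden states.
pooled_vector = outputs.hidden_states[-2].mean(dim=1).squeeze().numpy()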

Table 1 Pre-trained machine learning models used for feature extraction.

As an example of a quantitative aesthetics measure, we used the Compression Ensembles method21, which captures polymorphic family resemblance via a number of transformations (87 in our implementation). We used GIF image compression ratios, taking advantage of the Lempel–Ziv–Welch algorithm47. We also provide the PNG file sizes for comparison (Table 2).
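As a rough illustration of a single compression-based feature, the sketch below computes a GIF (LZW) compression ratio with Pillow; the Compression Ensembles method applies many such transformations, so this is only an analogue of one component, not the published implementation.

import io
from PIL import Image

def gif_compression_ratio(path: str) -> float:
    # Ratio of LZW-compressed GIF size to uncompressed RGB size; an
    # illustrative analogue of a single compression-based feature.
    image = Image.open(path).convert("RGB")
    raw_size = image.width * image.height * 3  # uncompressed RGB bytes
    buffer = io.BytesIO()
    image.save(buffer, format="GIF")  # GIF uses Lempel-Ziv-Welch compression
    return buffer.tell() / raw_size

print(gif_compression_ratio("3_1_0.png"))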

Table 2 Other methods used for feature extraction.

To provide simple conceptual models for reference, we used binary, one-hot encoded vectors. In this encoding scheme, each vector represents a fruit category or style, with a value of 1 indicating the presence and 0 the absence of the corresponding category or style (Table 2). We deliberately provide a simple conceptual reference to avoid the complications of full-blown conceptual reference models such as the CIDOC-CRM48.
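A minimal sketch of this encoding scheme is given below; the reading of the category-only vector as the released style_blind reference is our assumption and is flagged in the comments.

import numpy as np

NUM_FRUITS, NUM_STYLES = 10, 10

def one_hot(index: int, size: int) -> np.ndarray:
    vector = np.zeros(size, dtype=int)
    vector[index] = 1  # 1 marks presence, 0 absence
    return vector

# Example: 8_1_42.png -> fruit category 8 (avocado), style category 1 (Watercolor).
fruit_vector = one_hot(8, NUM_FRUITS)  # category-only reference (presumably the style_blind variant)
style_vector = one_hot(1, NUM_STYLES)  # style-only reference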

Data Records

The fruit-SALAD_10k is available at Zenodo under record number 1115852249 (https://zenodo.org/records/11158522). The repository includes 10,000 PNG files of fruit images (1024 × 1024 pixels), 10 PNG files of style reference images, 10 CSV files with text prompts, 100 PNG files of grid overview plots (10 × 10 images per instance), 23 CSV vector files, 23 PNG files of model heatmaps, 1 CSV file containing all 23 model vectors, and 1 CSV file with index labels. We provide a detailed overview of all dataset repository files in Supplementary Fig. S2.

The 10,000 fruit image filenames adhere to the format fruit_style_instance.png. For example, the filename 8_1_42.png signifies fruit category 8 (avocado) rendered in style category 1 (Watercolor), instance number 42.
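A small parsing sketch for this naming scheme:

def parse_filename(name: str) -> dict:
    # fruit_style_instance.png, e.g. "8_1_42.png".
    fruit, style, instance = (int(part) for part in name.rsplit(".", 1)[0].split("_"))
    return {"fruit": fruit, "style": style, "instance": instance}

print(parse_filename("8_1_42.png"))  # {'fruit': 8, 'style': 1, 'instance': 42}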

For accessibility, we provide all vector files as comma-separated values (.csv) with image file names as indices.

Technical Validation

Self recognition test

One expects that, despite inevitable variation in similarity perception, the similarity between images of the same category-style combination should be systematically higher than between images of different categories and/or styles. To assess this, we conduct a self-recognition test on the fruit-SALAD_10k dataset. This test involves retrieving the top 100 nearest neighbors for each image and counting how many instances of the same category-style combination are found within this set. The average number of successful retrievals across all 100 instances per model is then calculated. To validate the self-recognition of image instances, we select the maximum values across all computational models (Fig. 4).
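A minimal sketch of this test is shown below, assuming an embeddings matrix (10,000 × D) and an array of category-style labels; the neighbor search shown here (Euclidean nearest neighbors, query image excluded) is an assumption and may differ in detail from our evaluation.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def self_recognition_scores(embeddings: np.ndarray, labels: np.ndarray) -> dict:
    # Retrieve the 100 nearest neighbors of every image (plus the query itself)
    # and count how many share its category-style combination.
    nn = NearestNeighbors(n_neighbors=101).fit(embeddings)
    _, indices = nn.kneighbors(embeddings)
    hits = np.array([
        np.sum(labels[idx[1:]] == labels[i])  # exclude the query image
        for i, idx in enumerate(indices)
    ])
    # Mean number of successful retrievals over the 100 instances per combination.
    return {combo: float(hits[labels == combo].mean()) for combo in np.unique(labels)}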

Fig. 4

Self-recognition tests. Each cell represents the mean number of same-combination instances among the top 100 nearest neighbors of the images of its fruit category (column) and style (row) combination. White cells without values have a perfect score of 100 out of 100 correctly recognized instances. Left: maximum values across all computational models; note that finding same-combination instances among the top 100 of 10,000 images at these rates is far above chance. Right: ResNet50_IN21k as an example model.

If a category-style combination cannot be sufficiently recognized by any of the computational models, we consider the self-recognition test failed. Notably, we found that “apples” and “oranges” in the “Watercolor” style posed the greatest challenge, achieving sufficient accuracy only after several iterations of image generation (see Fig. 3 for all 100 instances of the apple-Watercolor combination).

Model heatmaps

We characterize the dataset, and concurrently exemplify its possible future use, with a set of category- and style-ordered distance matrices that reveal salient differences in category and style similarity weights across various computational models (Supplementary Fig. S1; see examples in Figs. 5 and 6). As a measure of similarity between two sets of images we calculate the average distance between all pairs of their elements. To standardize across embedding spaces, we use the Mahalanobis distance50,51, which normalizes and decorrelates the coordinates.
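A sketch of a single heatmap entry is given below: the mean Mahalanobis distance between two sets of 100 image embeddings, with the inverse covariance estimated from all embeddings of the respective model. The covariance estimation step is our assumption and, for high-dimensional embeddings, would in practice require regularization or dimensionality reduction.

import numpy as np
from scipy.spatial.distance import cdist

def mean_mahalanobis(set_a: np.ndarray, set_b: np.ndarray,
                     all_embeddings: np.ndarray) -> float:
    # Inverse covariance estimated from all embeddings of the model (assumption);
    # high-dimensional embeddings may need regularization in practice.
    inv_cov = np.linalg.pinv(np.cov(all_embeddings, rowvar=False))
    # Mean over all 100 x 100 = 10,000 pairwise distances of one heatmap cell.
    return float(cdist(set_a, set_b, metric="mahalanobis", VI=inv_cov).mean())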

Fig. 5

DINO-ViT-B-16_IN1k heatmaps indicating the mutual Mahalanobis distances of fruit-SALAD images. Each matrix cell corresponds to the mean of all 10,000 distance pairs between the 100 by 100 instances of fruit-SALAD_10k images. Below the diagonal: sorted by style first and fruit category second. Above the diagonal: sorted by fruit category first and style second. The color indicates the pairwise Mahalanobis distance of image embedding vectors obtained from the respective model or algorithm, from low to high (blue to yellow); lower values indicate higher similarity. Because the matrices are symmetric, both orderings are shown in a single figure without loss of information, and the diagonal cells are omitted. See all model heatmaps in Supplementary Fig. S1.

Fig. 6

Heatmaps indicating the mutual Mahalanobis distance of fruit-SALAD_10k images according to different models (see Fig. 5). Top row from left to right: CLIP-ViT-B-16_L400M, DINOv2-B_LVD, CompressionEnsembles. Bottom row from left to right: VGG19_IN1k, ViT-B-32_IN21, style_blind. The matrix ordering is identical.

Model comparison

Each embedding model can be characterized by the set of distances between images in its embedding space. This set of distances can be considered a multidimensional vector characterizing the model. Thus, the different models are represented as vectors in a shared space, which enables their direct comparison. As coordinates we used the standardized pairwise distances between all unique pairs of the 100 fruit category-style combinations, i.e., all entries of the model heatmaps. The principal components of the resulting embedding are shown in Fig. 7.
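A sketch of this comparison, assuming each model's heatmap is available as a 100 × 100 mean-distance matrix:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def compare_models(heatmaps: dict) -> dict:
    # heatmaps maps a model name to its 100 x 100 mean-distance matrix.
    iu = np.triu_indices(100, k=1)  # 4,950 unique combination pairs, diagonal excluded
    names = list(heatmaps)
    vectors = np.stack([heatmaps[name][iu] for name in names])
    coords = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(vectors))
    return dict(zip(names, coords))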

Fig. 7

Relative model comparison using principal component analysis (PCA) based on 23 standardized model vectors of 4,950 dimensions. These dimensions comprise the mutual Mahalanobis distances of all unique pairs of category-style combinations of the fruit-SALAD_10k images, excluding self-pairings. Each pairwise entry is the mean of all 10,000 mutual distances between the 100 by 100 fruit-SALAD image instances.

Investigating the differences in similarity perception can also be accomplished by examining fruit categories and styles through the image embeddings of individual models (Fig. 8). We provide an interactive exploration tool based on the Collection Space Navigator52 to visually compare such projections of model embeddings (https://style-aligned-artwork-datasets.github.io/fruit-explorer).
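A sketch of such a per-model projection, as used for Fig. 8; the exact normalization of the embedding vectors is an assumption here.

import numpy as np
from sklearn.manifold import MDS
from sklearn.preprocessing import normalize

def project_embeddings(embeddings: np.ndarray, seed: int = 0) -> np.ndarray:
    # L2-normalize each image vector and reduce to two dimensions with metric MDS.
    return MDS(n_components=2, random_state=seed).fit_transform(normalize(embeddings))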

Fig. 8

Scatter plots of apples and oranges using multidimensional scaling (MDS) based on normalized image embedding vectors from two different models. Left: CLIP-ViT-B-16_L400M; right: DINO-ViT-B-16_IN1k. Colors indicate fruit categories and dot shapes indicate styles.