Abstract
Text-driven person retrieval aims to find a particular person within a gallery of images using a natural language description of that individual. The main challenge lies in computing a similarity score between a person's image and the given description, which requires uncovering the intricate, underlying connections between image sub-regions and textual phrases at multiple levels. Previous efforts tackled this problem by extracting features with separately pre-trained visual and textual models. However, such models lack the alignment capabilities needed for efficient multimodal matching. Moreover, these approaches rely on prior information to establish explicit part alignments, which can disrupt information within the same modality. To improve on existing state-of-the-art text-to-image retrieval systems, we propose a novel technique for text-image pedestrian matching, referred to as Complete Feature Aggregation and Cross-Knowledge Conversion (CFA-CKC), which integrates diverse supervisory signals from various textual and visual perspectives. Our method achieves new state-of-the-art results on three public datasets: CUHK-PEDES, ICFG-PEDES, and RSTPReid. Experimental results show that CFA-CKC outperforms prior state-of-the-art methods by up to 6.0% in rank-1 accuracy. The code and pre-trained models are available at https://github.com/AIVIETNAMResearch/Pedestrian-Matching.
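As context for the evaluation protocol mentioned above, the sketch below shows how text-to-image retrieval and rank-1 accuracy are typically computed once embeddings exist: each text query is ranked against all gallery images by cosine similarity, and a query counts as correct if the top-ranked image is its ground-truth match. This is a generic illustration, not the CFA-CKC method itself; the function name and array shapes are assumptions for the example.

```python
import numpy as np

def rank1_accuracy(text_emb: np.ndarray, image_emb: np.ndarray,
                   gt_indices: np.ndarray) -> float:
    """Rank-1 retrieval accuracy for text queries against an image gallery.

    text_emb:   (num_queries, dim) text embeddings
    image_emb:  (num_gallery, dim) image embeddings
    gt_indices: (num_queries,) index of the matching gallery image per query
    """
    # L2-normalize so the dot product equals cosine similarity
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sim = t @ v.T                      # (num_queries, num_gallery) similarity matrix
    pred = sim.argmax(axis=1)          # top-1 gallery index for each query
    return float((pred == gt_indices).mean())
```

In practice the embeddings would come from the trained text and image encoders; higher-rank metrics (rank-5, rank-10) follow the same pattern using `argsort` instead of `argmax`.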
Data availability
No datasets were generated or analysed during the current study.
Author information
Authors and Affiliations
Contributions
Bach Hoang Ngo and Minh-Hung implemented and evaluated the performance of the proposed system. Khoa Nguyen Tho Anh and Minh-Duc-Bui handled the steps of collecting, preprocessing, and analyzing the data. Quang-Vinh Dinh and Vinh Dinh Nguyen proposed the idea and solution, and assisted in writing the main manuscript text.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants performed by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ngo, B.H., An, MH., Anh, K.N.T. et al. An efficient framework for text-to-image retrieval using complete feature aggregation and cross-knowledge conversion. Pattern Anal Applic 28, 176 (2025). https://doi.org/10.1007/s10044-025-01554-2