Abstract
Text-driven person retrieval aims to find a particular person within a gallery of images using a natural language description of that individual. The main challenge lies in computing a similarity score between a person's image and the given description, which requires uncovering the intricate, underlying connections between image sub-regions and textual phrases at multiple levels. Previous efforts tackled this problem by extracting features with separately pre-trained visual and textual models. However, such models lack the alignment capabilities needed for efficient multimodal matching. Moreover, these approaches rely on prior information to establish explicit part alignments, which can disrupt information within the same modality. To improve on existing state-of-the-art text-to-image retrieval systems, we propose a novel technique for text-image pedestrian matching, referred to as Complete Feature Aggregation and Cross-Knowledge Conversion (CFA-CKC), which integrates diverse supervisory signals from various textual and visual perspectives. Our method achieves new state-of-the-art results on three public datasets: CUHK-PEDES, ICFG-PEDES, and RSTPReid. Experimental results show that CFA-CKC outperforms prior state-of-the-art methods by up to 6.0% in rank-1 accuracy. The code and pre-trained models are available at https://github.com/AIVIETNAMResearch/Pedestrian-Matching.
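As context for the evaluation protocol mentioned above, the sketch below shows how text-to-image retrieval and rank-1 accuracy are typically computed once embeddings exist: each text query is ranked against all gallery images by cosine similarity, and a query counts as correct if the top-ranked image is its ground-truth match. This is a generic illustration, not the CFA-CKC method itself; the function name and array shapes are assumptions for the example.

```python
import numpy as np

def rank1_accuracy(text_emb: np.ndarray, image_emb: np.ndarray,
                   gt_indices: np.ndarray) -> float:
    """Rank-1 retrieval accuracy for text queries against an image gallery.

    text_emb:   (num_queries, dim) text embeddings
    image_emb:  (num_gallery, dim) image embeddings
    gt_indices: (num_queries,) index of the matching gallery image per query
    """
    # L2-normalize so the dot product equals cosine similarity
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sim = t @ v.T                      # (num_queries, num_gallery) similarity matrix
    pred = sim.argmax(axis=1)          # top-1 gallery index for each query
    return float((pred == gt_indices).mean())
```

In practice the embeddings would come from the trained text and image encoders; higher-rank metrics (rank-5, rank-10) follow the same pattern using `argsort` instead of `argmax`.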
Data availability
No datasets were generated or analysed during the current study.
Author information
Authors and Affiliations
Contributions
Bach Hoang Ngo and Minh-Hung implemented and evaluated the performance of the proposed system. Khoa Nguyen Tho Anh and Minh-Duc-Bui handled the steps of collecting, preprocessing, and analyzing the data. Quang-Vinh Dinh and Vinh Dinh Nguyen proposed the idea and solution, and assisted in writing the main manuscript text.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants performed by any of the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ngo, B.H., An, MH., Anh, K.N.T. et al. An efficient framework for text-to-image retrieval using complete feature aggregation and cross-knowledge conversion. Pattern Anal Applic 28, 176 (2025). https://doi.org/10.1007/s10044-025-01554-2