An efficient framework for text-to-image retrieval using complete feature aggregation and cross-knowledge conversion

Original Article · Pattern Analysis and Applications

Abstract

Text-driven person retrieval aims to find a particular person within a gallery of images using a natural language description of that individual. The main challenge lies in computing a similarity score between a person's image and the given description, which requires uncovering the intricate, latent correspondences between image sub-regions and textual phrases at multiple levels. Previous efforts have tackled this problem with separately pre-trained visual and textual models for feature extraction. However, such methods lack the alignment capabilities needed for efficient multimodal matching. Moreover, they rely on prior information to establish explicit part-level alignments, which can disrupt information within a single modality. To improve on existing state-of-the-art text-to-image retrieval systems, we propose Complete Feature Aggregation and Cross-Knowledge Conversion (CFA-CKC), a technique that strengthens the text-image pedestrian matching pipeline by integrating diverse supervisory signals from multiple textual and visual perspectives. CFA-CKC achieves new state-of-the-art results on all three public benchmarks: CUHK-PEDES, ICFG-PEDES, and RSTPReid. Experimental results show that CFA-CKC outperforms prior state-of-the-art methods by up to 6.0% in rank-1 accuracy. The code and pre-trained models are available at https://github.com/AIVIETNAMResearch/Pedestrian-Matching.
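To make the retrieval objective concrete, the sketch below (a minimal illustration, not the authors' CFA-CKC pipeline) embeds captions and gallery images into a shared space, scores each caption against every image with cosine similarity, and computes the rank-k accuracy used in the evaluation. The random tensors stand in for the outputs of pre-trained textual and visual encoders, and rank_k_accuracy is a hypothetical helper name, not part of the released code.

    # Minimal sketch of text-to-image person retrieval scoring.
    # Random tensors stand in for encoder outputs; in a real system,
    # text_emb and image_emb would come from pre-trained textual and
    # visual backbones projected into a shared embedding space.
    import torch
    import torch.nn.functional as F

    def rank_k_accuracy(sim: torch.Tensor, gt: torch.Tensor, k: int = 1) -> float:
        """sim: (num_queries, num_gallery) similarity matrix.
        gt: (num_queries,) index of the correct gallery image per query."""
        topk = sim.topk(k, dim=1).indices            # top-k gallery indices per query
        hits = (topk == gt.unsqueeze(1)).any(dim=1)  # is the correct image in the top-k?
        return hits.float().mean().item()

    torch.manual_seed(0)
    num_queries, num_gallery, dim = 8, 32, 256
    text_emb = torch.randn(num_queries, dim)    # placeholder caption features
    image_emb = torch.randn(num_gallery, dim)   # placeholder gallery image features
    gt = torch.randint(0, num_gallery, (num_queries,))

    # Cosine similarity is the dot product of L2-normalized embeddings.
    sim = F.normalize(text_emb, dim=1) @ F.normalize(image_emb, dim=1).T

    print(f"Rank-1 accuracy: {rank_k_accuracy(sim, gt, k=1):.3f}")
    print(f"Rank-5 accuracy: {rank_k_accuracy(sim, gt, k=5):.3f}")

Rank-1 accuracy, the metric cited in the abstract, is the fraction of text queries whose correct gallery image receives the highest similarity score.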


Data availability

No datasets were generated or analysed during the current study.


Author information

Contributions

Bach Hoang Ngo and Minh-Hung An implemented and evaluated the proposed system. Khoa Nguyen Tho Anh and Minh-Duc Bui collected, preprocessed, and analyzed the data. Quang-Vinh Dinh and Vinh Dinh Nguyen proposed the idea and solution and assisted in writing the main manuscript text.

Corresponding author

Correspondence to Vinh Dinh Nguyen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ngo, B.H., An, MH., Anh, K.N.T. et al. An efficient framework for text-to-image retrieval using complete feature aggregation and cross-knowledge conversion. Pattern Anal Applic 28, 176 (2025). https://doi.org/10.1007/s10044-025-01554-2

DOI: https://doi.org/10.1007/s10044-025-01554-2
