Cross-modal Graph Matching Network for Image-text Retrieval

Description

Image-text retrieval is a fundamental cross-modal task whose main idea is to learn image-text matching. Generally, according to whether interactions occur during the retrieval process, existing image-text retrieval methods can be classified into independent representation matching methods and cross-interaction matching methods. Independent representation matching methods generate the embeddings of images and sentences independently and are thus convenient for retrieval with hand-crafted matching measures (e.g., cosine or Euclidean distance). Cross-interaction matching methods, in contrast, achieve improvement by introducing interaction-based networks for inter-relation reasoning, yet suffer from low retrieval efficiency. This article aims to develop a method that takes advantage of the cross-modal inter-relation reasoning of cross-interaction methods while being as efficient as the independent methods. To this end, we propose a graph-based Cross-modal Graph Matching Network (CGMN), which explores both intra- and inter-relations without introducing network interaction. In CGMN, graphs are used for both visual and textual representation to achieve intra-relation reasoning across regions and words, respectively. Furthermore, we propose a novel graph node matching loss to learn fine-grained cross-modal correspondence and to achieve inter-relation reasoning. Experiments on the benchmark datasets MS-COCO, Flickr8K, and Flickr30K show that CGMN outperforms state-of-the-art methods in image retrieval. Moreover, CGMN is much more efficient than state-of-the-art methods using interactive matching. The code is available at https://github.com/cyh-sj/CGMN.
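The abstract contrasts hand-crafted matching measures such as cosine similarity with graph-based node-level matching. As a rough illustration of these two ideas only (not the paper's implementation: the function names, the toy identity adjacencies, and the max-over-regions pooling are all assumptions for this sketch; the actual method is in the repository above), the following PyTorch snippet encodes each modality's graph independently with one graph-convolution step and then scores an image-sentence pair by node-level cosine matching.

```python
# A minimal sketch, assuming PyTorch; NOT the paper's implementation.
# Names, toy adjacencies, and pooling are illustrative assumptions.
import torch
import torch.nn.functional as F

def gcn_layer(x, adj, weight):
    """One graph-convolution step: aggregate neighbor features via the
    (normalized) adjacency, then project. x: (n_nodes, d); adj: (n_nodes, n_nodes)."""
    return F.relu(adj @ x @ weight)

def node_matching_score(img_nodes, txt_nodes):
    """Fine-grained cross-modal score: match each word node to its
    best region node by cosine similarity, then average over words."""
    img_nodes = F.normalize(img_nodes, dim=-1)   # (n_regions, d)
    txt_nodes = F.normalize(txt_nodes, dim=-1)   # (n_words, d)
    sim = txt_nodes @ img_nodes.t()              # (n_words, n_regions)
    return sim.max(dim=1).values.mean()

# Toy inputs: 36 region nodes and 12 word nodes with 256-d features.
d = 256
W_v, W_t = torch.randn(d, d) * 0.02, torch.randn(d, d) * 0.02
regions, adj_v = torch.randn(36, d), torch.eye(36)  # real graphs encode intra-relations
words, adj_t = torch.randn(12, d), torch.eye(12)

# Each modality is encoded independently (no cross-network interaction),
# so image embeddings can be precomputed offline for efficient retrieval.
img_nodes = gcn_layer(regions, adj_v, W_v)
txt_nodes = gcn_layer(words, adj_t, W_t)
print(node_matching_score(img_nodes, txt_nodes).item())
```

Because matching here reduces to similarities between independently computed node embeddings, retrieval cost stays close to that of the independent representation methods, which is the efficiency property the abstract emphasizes.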
