Interpreting Art by Leveraging Pre-Trained Models


Description

In many domains, so-called foundation models have recently been proposed. These models are trained on immense amounts of data, resulting in impressive performance on various downstream tasks and benchmarks. Later works focus on leveraging this pre-trained knowledge by combining these models. To reduce data and compute requirements, we utilize and combine foundation models in two ways. First, we use language and vision models to extract and generate a challenging language-vision task in the form of artwork interpretation pairs. Second, we combine and fine-tune CLIP as well as GPT-2 to reduce compute requirements for training interpretation models. We perform a qualitative and quantitative analysis of our data and conclude that generating artwork leads to improvements in visual-text alignment and, therefore, to more proficient interpretation models (our best-performing model can be tried at https://ai-interprets.art/). Our approach addresses how to leverage and combine pre-trained models to tackle tasks where existing data is scarce or difficult to obtain.
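
The abstract does not detail how CLIP and GPT-2 are wired together. As a minimal sketch of one common way to combine a frozen CLIP vision encoder with GPT-2 (in the spirit of prefix-based captioning such as ClipCap, not necessarily the authors' method), the Python code below projects the pooled CLIP image embedding into a short prefix of GPT-2 input embeddings and trains only that small projection, keeping CLIP frozen to reduce compute. The class name ClipPrefixCaptioner, the prefix_len hyperparameter, and the checkpoint names are illustrative assumptions, not taken from the paper.

    # Illustrative sketch only; not the authors' code.
    import torch
    import torch.nn as nn
    from transformers import CLIPVisionModel, GPT2LMHeadModel

    class ClipPrefixCaptioner(nn.Module):
        def __init__(self, prefix_len: int = 10):  # prefix_len is an assumed hyperparameter
            super().__init__()
            self.clip = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
            self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
            self.prefix_len = prefix_len
            # Map the pooled CLIP embedding to prefix_len GPT-2 input embeddings.
            self.project = nn.Linear(
                self.clip.config.hidden_size,
                prefix_len * self.gpt2.config.n_embd,
            )
            # Freeze CLIP so only the projection (and optionally GPT-2) is trained.
            for p in self.clip.parameters():
                p.requires_grad = False

        def forward(self, pixel_values, input_ids, labels=None):
            image_feat = self.clip(pixel_values=pixel_values).pooler_output
            prefix = self.project(image_feat).view(
                -1, self.prefix_len, self.gpt2.config.n_embd
            )
            token_embeds = self.gpt2.transformer.wte(input_ids)
            inputs_embeds = torch.cat([prefix, token_embeds], dim=1)
            if labels is not None:
                # Mask the prefix positions out of the language-modeling loss.
                prefix_labels = torch.full(
                    (input_ids.size(0), self.prefix_len), -100,
                    dtype=torch.long, device=input_ids.device,
                )
                labels = torch.cat([prefix_labels, labels], dim=1)
            return self.gpt2(inputs_embeds=inputs_embeds, labels=labels)

A full training loop would pair such a model with image-text data like the artwork interpretation pairs described in the abstract; that data is not reproduced here.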

Published in

  • IEICE Proceeding Series

IEICE Proceeding Series, Vol. 78, O3-1-1, 2023-07-23

    The Institute of Electronics, Information and Communication Engineers

Details

  • CRID
    1390579842139620864
  • DOI
    10.34385/proc.78.o3-1-1
  • ISSN
2188-5079
Text language code
    en
Data source type
    • JaLC
Abstract license flag
    Not available for use
