Interpreting Art by Leveraging Pre-Trained Models


Description

In many domains, so-called foundation models have recently been proposed. These models are trained on immense amounts of data, resulting in impressive performance on various downstream tasks and benchmarks. Later works focus on leveraging this pre-trained knowledge by combining these models. To reduce data and compute requirements, we utilize and combine foundation models in two ways. First, we use language and vision models to extract and generate a challenging language-vision task in the form of artwork interpretation pairs. Second, we combine and fine-tune CLIP as well as GPT-2 to reduce compute requirements for training interpretation models. We perform a qualitative and quantitative analysis of our data and conclude that generating artwork leads to improvements in visual-text alignment and, therefore, to more proficient interpretation models (our best-performing model can be tried at https://ai-interprets.art/). Our approach addresses how to leverage and combine pre-trained models to tackle tasks where existing data is scarce or difficult to obtain.
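
The abstract does not detail how CLIP and GPT-2 are wired together. As a minimal sketch of one common way to combine a frozen CLIP vision encoder with GPT-2 (in the spirit of prefix-based captioning such as ClipCap, not necessarily the authors' method), the Python code below projects the pooled CLIP image embedding into a short prefix of GPT-2 input embeddings and trains only that small projection, keeping CLIP frozen to reduce compute. The class name ClipPrefixCaptioner, the prefix_len hyperparameter, and the checkpoint names are illustrative assumptions, not taken from the paper.

    # Illustrative sketch only; not the authors' code.
    import torch
    import torch.nn as nn
    from transformers import CLIPVisionModel, GPT2LMHeadModel

    class ClipPrefixCaptioner(nn.Module):
        def __init__(self, prefix_len: int = 10):  # prefix_len is an assumed hyperparameter
            super().__init__()
            self.clip = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
            self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
            self.prefix_len = prefix_len
            # Map the pooled CLIP embedding to prefix_len GPT-2 input embeddings.
            self.project = nn.Linear(
                self.clip.config.hidden_size,
                prefix_len * self.gpt2.config.n_embd,
            )
            # Freeze CLIP so only the projection (and optionally GPT-2) is trained.
            for p in self.clip.parameters():
                p.requires_grad = False

        def forward(self, pixel_values, input_ids, labels=None):
            image_feat = self.clip(pixel_values=pixel_values).pooler_output
            prefix = self.project(image_feat).view(
                -1, self.prefix_len, self.gpt2.config.n_embd
            )
            token_embeds = self.gpt2.transformer.wte(input_ids)
            inputs_embeds = torch.cat([prefix, token_embeds], dim=1)
            if labels is not None:
                # Mask the prefix positions out of the language-modeling loss.
                prefix_labels = torch.full(
                    (input_ids.size(0), self.prefix_len), -100,
                    dtype=torch.long, device=input_ids.device,
                )
                labels = torch.cat([prefix_labels, labels], dim=1)
            return self.gpt2(inputs_embeds=inputs_embeds, labels=labels)

A full training loop would pair such a model with image-text data like the artwork interpretation pairs described in the abstract; that data is not reproduced here.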

Published in

  • IEICE Proceeding Series

IEICE Proceeding Series, Vol. 78, O3-1-1, 2023-07-23

    The Institute of Electronics, Information and Communication Engineers

Details

  • CRID
    1390579842139620864
  • DOI
    10.34385/proc.78.o3-1-1
  • ISSN
2188-5079
Text language code
    en
Data source type
    • JaLC
Abstract license flag
    Not available for use
