I plan to take up Yuxi (Hayden) Liu's Python Machine Learning By Example - Fourth Edition: Unlock machine learning best practices with real-world use cases in a later post; here I introduce Chapter 14, "Building an Image Search Engine Using CLIP: a Multimodal Approach." I mainly show the results; see the book's GitHub repository for the code details.
The figure below is Figure 14.1: CLIP architecture from the book:
[Figure 14.1: CLIP architecture, reproduced from Learning Transferable Visual Models From Natural Language Supervision]
Pre-trained models used to build the CLIP model (a rough sketch of the resulting dual encoder follows the list):
Vision Encoder: ResNet50
Text Encoder: DistilBERT
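
To make this concrete, here is a minimal sketch of such a dual-encoder model, assuming torchvision's ResNet50 and Hugging Face's DistilBertModel as the frozen feature extractors. The class name, projection-head design, and embedding dimension are illustrative choices of mine; the chapter's actual code on GitHub may differ.

```python
# A minimal sketch of a CLIP-style dual encoder with frozen pre-trained backbones.
import torch
import torch.nn as nn
from torchvision import models
from transformers import DistilBertModel

class CLIPStyleModel(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        # Vision encoder: pre-trained ResNet50 with the classification head removed.
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.vision_encoder = nn.Sequential(*list(resnet.children())[:-1])
        # Text encoder: pre-trained DistilBERT.
        self.text_encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")
        # Keep both encoders frozen; only the projection heads are trained.
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.text_encoder.parameters():
            p.requires_grad = False
        # Trainable projection heads mapping both modalities into a shared space.
        self.image_proj = nn.Linear(2048, embed_dim)
        self.text_proj = nn.Linear(768, embed_dim)

    def forward(self, images, input_ids, attention_mask):
        img_feat = self.vision_encoder(images).flatten(1)            # (B, 2048)
        txt_feat = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]                                     # (B, 768), first token
        img_emb = nn.functional.normalize(self.image_proj(img_feat), dim=-1)
        txt_emb = nn.functional.normalize(self.text_proj(txt_feat), dim=-1)
        return img_emb, txt_emb
```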
The training data is the Flickr8k dataset; captions.txt pairs each image file name with a caption:
[Screenshot: output of `less -N captions.txt`, showing the image file name and caption columns]
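
As a small sketch, the caption file can be loaded as below, assuming the common Kaggle layout of Flickr8k in which captions.txt is a CSV with an "image" and a "caption" column; the file path is an assumption.

```python
import pandas as pd

# Load the caption file (five human-written captions per image in this layout).
captions = pd.read_csv("flickr8k/captions.txt")   # path is an assumption
print(captions.head())
```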

To build (train) the CLIP model, I rather recklessly used an M1 MacBook Air (I left it running and it eventually finished; I believe it took well over 10 hours):
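
For reference, one training step along these lines might look like the sketch below, reusing the CLIPStyleModel sketch above together with a symmetric contrastive (CLIP-style) loss. The temperature, optimizer settings, and the use of the "mps" device on Apple silicon are my assumptions, not the book's exact configuration.

```python
# One training step with a symmetric contrastive (InfoNCE-style) loss.
import torch
import torch.nn.functional as F

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model = CLIPStyleModel().to(device)
# Optimize only the trainable parameters (the projection heads).
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
temperature = 0.07

def training_step(images, input_ids, attention_mask):
    img_emb, txt_emb = model(images, input_ids, attention_mask)
    # Cosine-similarity logits between every image and every caption in the batch.
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(images), device=device)
    # Symmetric cross-entropy: match images to captions and captions to images.
    loss = (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```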
Using the trained model, I searched for images with the query "kids jumping into a pool" and displayed the top 2 results:
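
A retrieval step on top of the sketch above could look like the following: embed the gallery images once with the vision branch, embed the query text with the frozen DistilBERT, and rank by cosine similarity. The names image_embeddings and image_paths are hypothetical variables assumed to hold the precomputed gallery embeddings and their file paths.

```python
# Text-to-image retrieval with the trained dual-encoder sketch.
import torch
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

@torch.no_grad()
def search(query, image_embeddings, image_paths, top_k=2):
    # Tokenize the free-text query for the frozen text encoder.
    enc = tokenizer([query], return_tensors="pt",
                    padding=True, truncation=True).to(device)
    txt_feat = model.text_encoder(**enc).last_hidden_state[:, 0]
    txt_emb = torch.nn.functional.normalize(model.text_proj(txt_feat), dim=-1)
    # image_embeddings: (N, embed_dim) tensor precomputed with the vision branch.
    scores = (image_embeddings @ txt_emb.t()).squeeze(1)
    best = scores.topk(top_k).indices.tolist()
    return [image_paths[i] for i in best]

# e.g. search("kids jumping into a pool", image_embeddings, image_paths, top_k=2)
```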
The first hit (i.e., the best match) looks reasonable, but while the second one does contain "kids", it is not an image of them "jumping" into a "pool". Running the same search with a pre-trained model later in this post makes it clear that this result is not great. That said, this model can also be fine-tuned:
The CLIP model we implemented employs the pre-trained ResNet50 model as the vision encoder and the pre-trained DistilBERT model as the text encoder. Remember that we kept the parameters frozen for both ResNet50 and DistilBERT, utilizing them as image and text feature extractors. If desired, you can fine-tune these models by allowing their parameters to be trainable.
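
A minimal sketch of that fine-tuning variant, continuing the model sketch above: unfreeze both encoders and give them a smaller learning rate than the projection heads. The learning rates here are illustrative assumptions.

```python
# Fine-tuning: make the encoder parameters trainable again.
for p in model.vision_encoder.parameters():
    p.requires_grad = True
for p in model.text_encoder.parameters():
    p.requires_grad = True

# Smaller learning rate for the pre-trained backbones, larger for the heads.
optimizer = torch.optim.AdamW([
    {"params": model.vision_encoder.parameters(), "lr": 1e-5},
    {"params": model.text_encoder.parameters(),   "lr": 1e-5},
    {"params": list(model.image_proj.parameters()) +
               list(model.text_proj.parameters()), "lr": 1e-3},
])
```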
Pre-Trained CLIP Model
Here, we use the Vision Transformer (ViT)-based CLIP model. The "B/32" designation refers to the ViT-Base model size with a 32×32 pixel patch size.
As with the previous model, I searched for images with "kids jumping into a pool", this time displaying the top 3 results:
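
As a sketch of this step, the pre-trained ViT-B/32 checkpoint can be loaded and queried as below. I use the Hugging Face transformers CLIP API here, which is an assumption; the book may rely on a different CLIP package, and re-encoding the gallery per query is done only for brevity.

```python
# Text-to-image search with the pre-trained ViT-B/32 CLIP model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_search(query, image_paths, top_k=3):
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[query], images=images,
                       return_tensors="pt", padding=True)
    img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
    # Normalize and rank gallery images by cosine similarity to the query text.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    scores = (img_emb @ txt_emb.t()).squeeze(1)
    best = scores.topk(top_k).indices.tolist()
    return [image_paths[i] for i in best]

# e.g. clip_search("kids jumping into a pool", image_paths, top_k=3)
```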
The results: this model's hits are clearly better. The pre-trained CLIP model excels in both text-to-image and image-to-image search tasks. Notably, the model may not have been specifically trained on the Flickr8k dataset, but it performs well in zero-shot learning, as you saw in this section.
What is notable is that this model has presumably not been trained specifically on the Flickr8k dataset.
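
The quoted passage also mentions image-to-image search; reusing the model and processor loaded in the previous sketch, it reduces to comparing image embeddings directly. The function below is a hypothetical illustration, not the chapter's code.

```python
# Image-to-image search: rank gallery images by similarity to a query image.
@torch.no_grad()
def image_to_image_search(query_path, gallery_paths, top_k=3):
    paths = [query_path] + list(gallery_paths)
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    emb = emb / emb.norm(dim=-1, keepdim=True)
    scores = emb[1:] @ emb[0]   # similarity of each gallery image to the query
    best = scores.topk(top_k).indices.tolist()
    return [gallery_paths[i] for i in best]
```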
I will skip the subsequent "Zero-shot classification" section, but the following point about "knowledge distillation" is interesting:
Zero-shot classification using CLIP is powerful, but its performance is limited by training data used in a pre-trained model. Intuitively, this can be improved by leveraging pre-trained models on larger datasets. Another approach is knowledge distillation, which transfers knowledge from a complex and high-performance model to a smaller and faster model. You can read more about knowledge distillation in Distilling the Knowledge in a Neural Network (2015) by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean (https://arxiv.org/abs/1503.02531).
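
For illustration, a generic knowledge-distillation loss in the spirit of Hinton et al. (2015) can be sketched as follows; the temperature, loss weighting, and the classification setting are my assumptions and are not specific to CLIP.

```python
# Knowledge-distillation loss: train a small student to match a large teacher.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened teacher and student outputs.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```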