Mitsua Japanese CLIP ViT-B-16


This is a Japanese/English bilingual CLIP (Contrastive Language-Image Pre-training) model trained exclusively on opt-in licensed data, openly licensed data, and public domain data. We believe the training data contains no AI-generated content.

Our goal was to train a CLIP model completely from scratch, without using any knowledge from pretrained models. We therefore used neither synthetic (AI-generated) captions nor aesthetic scoring, both of which are commonly adopted for "ethically sourced" open datasets such as PD12M. We also did not apply OpenAI CLIP score filtering, which is used to create LAION and similar datasets. Preprocessing with such models would leak knowledge of copyrighted works into the training data.

Model Details

  • Developed by: ELAN MITSUA Project / Abstract Engine
  • Model type: Contrastive Language-Image Pre-trained Model
  • Language(s): Japanese and English
  • License: CC BY-SA 4.0
  • This means you can use, adapt and redistribute this as long as you give appropriate credit, indicate if changes were made, and distribute any adapted work under the same license.

Usage

  1. Install the Python packages

pip install transformers sentencepiece

  • This model has been verified with transformers==4.40.2.
  2. Run the following code
from PIL import Image
from transformers import AutoProcessor, AutoModel
import io
import requests
import torch

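# use the GPU when available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"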
model = AutoModel.from_pretrained("Mitsua/mitsua-japanese-clip-vit-b-16", trust_remote_code=True).to(device)
processor = AutoProcessor.from_pretrained("Mitsua/mitsua-japanese-clip-vit-b-16", trust_remote_code=True)

# get CC0 licensed image from Wikimedia Commons
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/3/32/Boxer_%28dog%29_%2C_Iran_08.jpg/800px-Boxer_%28dog%29_%2C_Iran_08.jpg"
image = Image.open(io.BytesIO(requests.get(image_url).content))

# we can input either Japanese or English
texts = ["犬", "猫", "人間"]
# texts = ["dog", "cat", "human"]

inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
inputs = {k:v.to(device) for k,v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=-1)
for t, p in zip(texts, probs[0]):
    print(f"'{t}' : {p:.1%}")

The output should look like the following for the Japanese prompts:

'犬' : 95.5%
'猫' : 0.2%
'人間' : 4.3%

and like the following for the English prompts:

'dog' : 99.4%
'cat' : 0.1%
'human' : 0.5%
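
To make the last step of the snippet concrete, here is a minimal sketch of how one row of image-text similarity logits becomes the percentages printed above. The logit values are made up for illustration and are not actual outputs of this model.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one image against three prompts (illustrative only).
texts = ["dog", "cat", "human"]
logits_per_image = [12.0, 5.0, 8.0]

probs = softmax(logits_per_image)
for t, p in zip(texts, probs):
    print(f"'{t}' : {p:.1%}")
```

Because softmax only compares the prompts with each other, the percentages say which prompt fits best among those given, not how well any prompt fits in absolute terms.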

Training Data

Our dataset is a mix of opt-in licensed data, openly licensed data, and public domain data. Pre-filtering based on metadata and captions is applied to exclude potentially rights-infringing, harmful, or NSFW data. For this pre-filtering, we built a database of 146,041 words containing artist names, celebrity names, fictional character names, trademarks, and bad words, based on Wikidata, which is licensed under CC0. Faces are blurred as a preprocessing step.
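
As an illustration of what word-based pre-filtering can look like, the sketch below drops samples whose caption contains a blocklisted entry. The blocklist entries, the `is_blocked` helper, and the matching rules (whole-word matching for ASCII entries, substring matching otherwise) are all assumptions made for this example; the source does not describe the actual matching logic.

```python
import re

# Hypothetical blocklist entries; the real database is far larger.
BLOCKLIST = {"examplebrand", "famous artist", "架空キャラ"}

def is_blocked(caption, blocklist=BLOCKLIST):
    """Return True if the caption contains any blocklisted word or phrase."""
    text = caption.lower()
    for entry in blocklist:
        if entry.isascii():
            # Match whole words or phrases only, to avoid hits inside longer words.
            if re.search(rf"\b{re.escape(entry)}\b", text):
                return True
        elif entry in text:
            # Japanese has no word spacing, so fall back to substring matching.
            return True
    return False

samples = [
    {"caption": "A painting by Famous Artist", "url": "a.jpg"},
    {"caption": "A dog running on the beach", "url": "b.jpg"},
    {"caption": "架空キャラのイラスト", "url": "c.jpg"},
]
kept = [s for s in samples if not is_blocked(s["caption"])]
print([s["url"] for s in kept])
```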

  • Color Multi Fractal DB 1k (CC BY 4.0)
    • Created by ELAN MITSUA Project / Abstract Engine
    • This dataset is used for image encoder (ViT-B) pretraining.
  • VRM Color Concept 550K (CC BY-NC 4.0)
    • Created by ELAN MITSUA Project / Abstract Engine
    • Although this dataset is released under an NC license, we own it, and all assets used in it are under commercially permissive terms (CC0 or explicit permission), so we can use it commercially.
  • "Mitsua Likes" Dataset : Our licensed data from opt-in contributors
    • Contributors Credit (Attribution)
    • All training data can be browsed on our Discord server "Mitsua Contributors"
    • All contributors were screened upon entry and all submitted images were human verified.
    • An AI-generated content detector is used to exclude potentially AI-generated images.
    • "3R" and "3RG" licensed images and their captions are used to train this model.
    • Poly Haven HDRI images licensed under CC0 are used to augment background composition.
  • Localized Narratives (CC BY 4.0)
    • Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, and Vittorio Ferrari, "Connecting Vision and Language with Localized Narratives" ECCV (Spotlight), 2020
    • A subset of images licensed under CC BY 2.0 is used for training.
    • In total, 642,789 images are used for training. All attributions are found here.
  • STAIR Captions (CC BY 4.0)
    • Yuya Yoshikawa, Yutaro Shigeto, and Akikazu Takeuchi, “STAIR Captions: Constructing a Large-Scale Japanese Image Caption Dataset”, Annual Meeting of the Association for Computational Linguistics (ACL), Short Paper, 2017.
    • A subset of images licensed under CC BY 2.0 or CC BY-SA 2.0 is used for training.
    • In total, 26,164 images are used for training. All attributions are found here.
  • Wikimedia Commons Balanced Image-Text Dataset (CC BY-SA 4.0; we curated this dataset and will release it soon.)
    • This is the largest portion of this CLIP model's training data. All images and texts come from Wikimedia Commons, Wikidata, and Japanese/English Wikipedia.
    • Each image is licensed under Public Domain, CC0, CC BY, or CC BY-SA (varies by image).
    • Each text is licensed under either CC0 (from Wikidata and Wikimedia Commons structured data) or CC BY-SA 4.0 (from Wikipedia and Wikimedia Commons non-structured data).
    • Curated by ELAN MITSUA Project / Abstract Engine.
    • All image attributions are found here.
    • How we curated this dataset
      • Problem statement:
      • Our goal in building this dataset is to achieve both quality and copyright/privacy safety:
        1. Creating a rights-cleared and safe-to-use dataset from an uncurated and noisy data source.
        2. Creating a diversified and balanced dataset from an uncurated and noisy data source.
      • Dataset curation:
        1. We used category tags to limit the data to what is safe to use, and then conducted word-based filtering.
        • For public domain data, we used only the following categories: CC-PD-Mark, PD-self, PD-user, PD-author, PD-link, PD-old-70, PD-old-80, PD-old-90, PD-old-100
        • Images with these tags are removed even if they are tagged as public domain: Images with watermarks, PD-algorithm, ~AI-generated works, With trademark, Unidentified logos, License review needed, Deletion requests, Flickr images~, Personality rights warning, Cosplay, Media from YouTube (XXXX=Year)
        • This means we use only public domain data whose copyright has expired globally (US, EU, and Japan) or has been waived directly by the authors, and we use no AI-generated content.
        • To address copyright laundering concerns, we also do not use any data sourced from Flickr. See: Flickr Washing
        • After the category-tag-based filtering, we applied the word-based filtering described above to mitigate possibly rights-infringing or harmful data.
        2. We also improved the quality of our dataset by doing the following, without using a pretrained model.
        • Image deduplication is performed with a simple imagehash algorithm.
        • To build a diversified dataset from limited data sources, we use WordNet and the word-count-based balancing method introduced in the original CLIP paper and in "Demystifying CLIP Data" by Hu Xu et al.
          • Princeton University "About WordNet." WordNet. Princeton University. 2010.
        • To improve caption accuracy, we queried the Commons API for each word in WordNet, sorted the results by relevance, and added additional captions based on the query words.
        • We also machine-translated captions between Japanese and English using our ElanMT model, which is trained exclusively on openly licensed corpora.
  • Art Museums PD Dataset (CC0; we curated this dataset and will release it soon.)
  • Even if a dataset itself is CC-licensed, we did not use it if the images it contains are not properly licensed, are based on unauthorized use of copyrighted works, or are based on the synthetic output of other pretrained models.
  • English captions are translated into Japanese using the ElanMT model, which is trained solely on openly licensed corpora.
  • For additional tagging, the Mitsua Japanese Tagger model, which is trained solely on opt-in / openly licensed data, is used.
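
Two of the curation steps listed above need no pretrained model and can be sketched in plain Python: hash-based image deduplication and word-count-based balancing. This is a simplified illustration under stated assumptions, not the authors' pipeline. The average-hash variant, the already-downscaled grayscale grids, the `cap` threshold, and all function names are hypothetical; the balancing follows the general idea of sub-sampling captions matched to over-represented query words.

```python
import random
from collections import Counter

def average_hash(pixels):
    """Average hash of a small grayscale image given as rows of 0-255 ints."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    # One bit per pixel: 1 if brighter than the mean, else 0.
    return tuple(int(p > mean) for p in flat)

def deduplicate(images):
    """Keep the first image name seen for each distinct hash."""
    seen, unique = set(), []
    for name, pixels in images:
        h = average_hash(pixels)
        if h not in seen:
            seen.add(h)
            unique.append(name)
    return unique

def balance(samples, cap, seed=0):
    """Sub-sample (caption, matched_words) pairs so frequent words are capped."""
    rng = random.Random(seed)
    counts = Counter(w for _, words in samples for w in set(words))
    # Rare words are always kept; frequent ones with probability cap / count.
    keep_prob = {w: min(1.0, cap / c) for w, c in counts.items()}
    # A sample survives if any one of its matched words is drawn.
    return [(c, ws) for c, ws in samples
            if any(rng.random() < keep_prob[w] for w in set(ws))]

# Toy images: "a_copy" is a slightly altered copy of "a".
images = [("a", [[0, 255], [255, 0]]),
          ("a_copy", [[10, 250], [240, 5]]),
          ("b", [[255, 0], [0, 255]])]
unique_names = deduplicate(images)
print(unique_names)

# Toy captions: "dog" dominates, "okapi" is rare.
samples = [(f"a photo of a dog #{i}", ["dog"]) for i in range(1000)]
samples += [(f"a photo of an okapi #{i}", ["okapi"]) for i in range(5)]
after = Counter(w for _, ws in balance(samples, cap=20) for w in ws)
print(after)
```

In this toy run all rare "okapi" captions survive, while the "dog" captions are reduced to roughly the cap, which is the effect the balancing step aims for.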

Training Procedure

As mentioned above, this model does not use any pretrained model and is trained completely from scratch.

  1. Pretrain the image encoder (Vision Transformer)
  • The ViT-B-16 Vision Transformer was pretrained on Color Multi Fractal DB 1k (1 million images, 1k classes) at a resolution of 224x224 for 300 epochs.
  • This model is trained exclusively on 1 million fractal images that are generated solely from mathematical formulas, so no real images or pretrained models are used in this step.
  2. Train the sentencepiece text tokenizer
  • The sentencepiece tokenizer was trained on a licensed corpus with a 64k vocabulary.
  • The training corpus was extracted from the image-text training dataset listed above.
  3. Train the CLIP model
  • The CLIP model is then trained on the licensed + openly licensed + public domain dataset using a contrastive loss.
  • Image encoder: ViT-B-16 initialized with the fractal-pretrained weights from step 1
  • Text encoder: 12-layer masked text transformer with the 64k sentencepiece tokenizer
  • The training dataset consists of approx. 30M images, which is relatively small for CLIP training.
  • Training took approx. 400 H100 GPU hours for 64 epochs.
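
For readers unfamiliar with the contrastive loss mentioned in the last step, here is a minimal pure-Python sketch of the symmetric image-text cross-entropy used in CLIP-style training. It assumes L2-normalized embeddings in which pair i of the batch matches pair i. The `logit_scale` value and the toy vectors are illustrative assumptions, not values taken from this model.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]

def clip_contrastive_loss(image_emb, text_emb, logit_scale=10.0):
    """Symmetric cross-entropy over an N x N image-text similarity matrix."""
    n = len(image_emb)
    logits = [[logit_scale * sum(a * b for a, b in zip(img, txt))
               for txt in text_emb] for img in image_emb]
    # Image-to-text: each row should put its mass on the diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: the same, computed over the columns.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = clip_contrastive_loss(aligned, aligned)     # pairs agree
loss_mismatched = clip_contrastive_loss(aligned, swapped)  # pairs disagree
print(loss_matched, loss_mismatched)
```

The loss is close to zero when each image is most similar to its own caption and grows when the pairing is wrong, which is what pulls matching image and text embeddings together during training.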

Implementation Notes

  • For HF-compatible CLIP modeling, SiglipTextModel is used for the text encoder solely because it is more compatible with our sentencepiece tokenizer.
  • This CLIP model is trained with the standard contrastive loss, not the SigLIP loss, because the SigLIP loss showed no improvement over the CLIP loss in our internal ablation study.
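
For context on the alternative that was not adopted, the sketch below shows the pairwise sigmoid loss in the style of SigLIP: every image-text pair in the batch is scored independently as a match or non-match, instead of being normalized with a softmax across the batch. The `scale` and `bias` values and the toy embeddings are illustrative assumptions, not settings from the ablation study.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_pairwise_loss(image_emb, text_emb, scale=10.0, bias=-10.0):
    """Sum of independent binary losses over all N x N pairs, divided by N."""
    n = len(image_emb)
    total = 0.0
    for i, img in enumerate(image_emb):
        for j, txt in enumerate(text_emb):
            sim = sum(a * b for a, b in zip(img, txt))
            label = 1.0 if i == j else -1.0  # +1 for matching pairs, -1 otherwise
            total += log_sigmoid(label * (scale * sim + bias))
    return -total / n

aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = sigmoid_pairwise_loss(aligned, aligned)
loss_mismatched = sigmoid_pairwise_loss(aligned, swapped)
print(loss_matched, loss_mismatched)
```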

Evaluation

We evaluated Japanese zero-shot classification accuracy.

Dataset

Result

| Model | Training Data | Supported Language | jafood101 | jaflower30 | jafacility20 | jalandmark10 |
|---|---|---|---|---|---|---|
| Mitsua/mitsua-japanese-clip-vit-b-16 | Licensed+PD | Japanese and English | 0.297 | 0.707 | 0.676 | 0.769 |
| rinna/japanese-clip-vit-b-16 | CC12M | Japanese | 0.235 | 0.513 | 0.614 | 0.625 |
| recruit-jp/japanese-clip-vit-b-32-roberta-base | Ja subset of LAION2B-multi | Japanese | 0.502 | 0.556 | 0.647 | 0.803 |
| google/siglip-base-patch16-256-multilingual | WebLI | Multilingual | 0.776 | 0.928 | 0.692 | 0.762 |
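
As a reminder of how a zero-shot accuracy figure like those above is typically computed, the sketch below scores each image against one text prompt per class and counts how often the best-scoring prompt is the correct class. The evaluation protocol behind the reported numbers (prompt templates, preprocessing) is not described here, so this is only a generic illustration with made-up scores.

```python
def zero_shot_accuracy(similarity, labels):
    """Top-1 accuracy: fraction of images whose best-scoring class prompt is correct.

    similarity: one row per image, one score per class prompt.
    labels: the correct class index for each image.
    """
    correct = 0
    for scores, label in zip(similarity, labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)

# Made-up scores for 4 images and 3 class prompts (not real evaluation data).
similarity = [
    [0.31, 0.12, 0.05],
    [0.10, 0.28, 0.11],
    [0.20, 0.25, 0.22],
    [0.02, 0.09, 0.30],
]
labels = [0, 1, 2, 2]
accuracy = zero_shot_accuracy(similarity, labels)
print(accuracy)
```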

Disclaimer

  • The recognition results may be highly inaccurate, harmful, or biased. The model was developed to investigate the performance achievable with only a relatively small amount of licensed data, and it is not suitable for use cases requiring high recognition accuracy. Under Section 5 of the CC BY-SA 4.0 License, ELAN MITSUA Project / Abstract Engine is not responsible for any direct or indirect loss caused by the use of the model.