CrossEncoder based on Qwen/Qwen3.5-0.8B

This is a Cross Encoder model finetuned from Qwen/Qwen3.5-0.8B on the image_to_text and text_to_image datasets using the sentence-transformers library. It computes a relevance score for each input pair, which can be used for reranking and semantic search. The training data listed below pairs images with text descriptions.

Model Details

Model Description

Model Sources

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import CrossEncoder

# Download from the 🤗 Hub
model = CrossEncoder("tomaarsen/reranker-Qwen3.5-0.8B-doodles-image-text-to-text")
# Get scores for pairs of texts
pairs = [
    ['How many calories in an egg', 'There are on average between 55 and 80 calories in an egg depending on its size.'],
    ['How many calories in an egg', 'Egg whites are very low in calories, have no fat, no cholesterol, and are loaded with protein.'],
    ['How many calories in an egg', 'Most of the calories in an egg come from the yellow yolk in the center.'],
]
scores = model.predict(pairs)
print(scores.shape)
# (3,)

# Or rank different texts based on similarity to a single text
ranks = model.rank(
    'How many calories in an egg',
    [
        'There are on average between 55 and 80 calories in an egg depending on its size.',
        'Egg whites are very low in calories, have no fat, no cholesterol, and are loaded with protein.',
        'Most of the calories in an egg come from the yellow yolk in the center.',
    ]
)
# [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
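The list returned by rank has one entry per candidate, each with a corpus_id and a score. As a minimal sketch of that ordering step, assuming candidates are sorted by descending score, the same structure can be built from a plain list of scores. The helper name rank_by_score and the example scores are hypothetical and are not model output.

```python
def rank_by_score(scores, top_k=None):
    """Return [{'corpus_id': i, 'score': s}, ...] sorted by descending score."""
    ranked = sorted(
        ({"corpus_id": i, "score": float(s)} for i, s in enumerate(scores)),
        key=lambda item: item["score"],
        reverse=True,
    )
    # Optionally keep only the best top_k candidates.
    return ranked if top_k is None else ranked[:top_k]


# Hypothetical scores for three candidates (illustrative values only).
example_scores = [0.2, 3.1, -1.4]
ranking = rank_by_score(example_scores)
print([item["corpus_id"] for item in ranking])  # [1, 0, 2]
```

Here corpus_id is the position of the candidate in the input list, so it can be used to look the original candidate back up.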

Evaluation

Metrics

Cross Encoder Reranking

  • Datasets: doodles-image-to-text-eval and doodles-text-to-image-eval
  • Evaluated with CrossEncoderRerankingEvaluator with these parameters:
    {
        "at_k": 10
    }
    
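| Metric  | doodles-image-to-text-eval | doodles-text-to-image-eval |
|---------|----------------------------|----------------------------|
| map     | 0.9883                     | 0.9208                     |
| mrr@10  | 0.9883                     | 0.9208                     |
| ndcg@10 | 0.9913                     | 0.9406                     |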
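For readers unfamiliar with the three metrics, the sketch below computes them for a single query from the binary relevance labels of its candidates, listed in ranked order. It uses the textbook definitions; the evaluator's own implementation may differ in details, and the function names and example labels are hypothetical.

```python
import math


def reciprocal_rank_at_k(labels, k=10):
    """1 / rank of the first relevant item within the top k, else 0."""
    for rank, label in enumerate(labels[:k], start=1):
        if label:
            return 1.0 / rank
    return 0.0


def average_precision(labels):
    """Mean of precision@rank taken at each relevant item."""
    hits, total = 0, 0.0
    for rank, label in enumerate(labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def ndcg_at_k(labels, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    def dcg(ordered):
        return sum(l / math.log2(rank + 1) for rank, l in enumerate(ordered[:k], start=1))

    ideal = dcg(sorted(labels, reverse=True))
    return dcg(labels) / ideal if ideal else 0.0


# Hypothetical query: five candidates, the single relevant one ranked second.
ranked_labels = [0, 1, 0, 0, 0]
print(reciprocal_rank_at_k(ranked_labels))  # 0.5
print(average_precision(ranked_labels))     # 0.5
```

When a query has exactly one relevant candidate, average precision and reciprocal rank coincide, as in the example. That may be why map and mrr@10 are identical in the table above, but this is an inference from the numbers, not something the card states.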

Training Details

Training Datasets

image_to_text

  • Dataset: image_to_text at a575ac6
  • Size: 4,500 training samples
  • Columns: image, text, and label
  • Approximate statistics based on the first 1000 samples:
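    |         | image                           | text                                               | label                  |
    |---------|---------------------------------|----------------------------------------------------|------------------------|
    | type    | PIL.PngImagePlugin.PngImageFile | string                                             | int                    |
    | details |                                 | min: 29 tokens, mean: 33.45 tokens, max: 40 tokens | 0: ~80.00%, 1: ~20.00% |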
  • Samples:
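    | image           | text                                                                                                                          | label |
    |-----------------|-------------------------------------------------------------------------------------------------------------------------------|-------|
    | (image omitted) | a cobain glasses character with a gradient 2 head and purple puffballs hair wearing a white sweater, gradient 4 background   | 1     |
    | (image omitted) | a content character with a orange head and purple long hair wearing a striped sweater, yellow background                     | 0     |
    | (image omitted) | a neutral note character with a orange head and green puffballs hair wearing a combo 2 puffer, light blue background         | 0     |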
  • Loss: BinaryCrossEntropyLoss with these parameters:
    {
        "activation_fn": "torch.nn.modules.linear.Identity",
        "pos_weight": null
    }
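With an Identity activation, the raw logit goes straight into the loss, so the objective is binary cross-entropy on logits. The sketch below writes that out for a single pair in plain Python, using a numerically stable form. The function name is hypothetical, and it assumes pos_weight scales only the positive term, as in torch.nn.BCEWithLogitsLoss.

```python
import math


def bce_with_logits(logit, label, pos_weight=None):
    """Binary cross-entropy on a raw logit for one (pair, label) example."""
    # Stable log(sigmoid(x)) and log(1 - sigmoid(x)) without overflow.
    softplus_tail = math.log1p(math.exp(-abs(logit)))
    log_sigmoid = -(max(-logit, 0.0) + softplus_tail)
    log_one_minus_sigmoid = -(max(logit, 0.0) + softplus_tail)
    # pos_weight=None (as in this card) leaves the positive term unscaled.
    weight = 1.0 if pos_weight is None else pos_weight
    return -(weight * label * log_sigmoid + (1.0 - label) * log_one_minus_sigmoid)


# A logit of 0 means "50/50", so the loss is log(2) for either label.
print(bce_with_logits(0.0, 1.0))  # 0.6931471805599453
```

The label statistics above show roughly four negatives per positive. A pos_weight near 4 would be one way to rebalance that, but this card leaves it at null.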
    

text_to_image

  • Dataset: text_to_image at a575ac6
  • Size: 4,500 training samples
  • Columns: text, image, and label
  • Approximate statistics based on the first 1000 samples:
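    |         | text                                               | image                           | label                  |
    |---------|----------------------------------------------------|---------------------------------|------------------------|
    | type    | string                                             | PIL.PngImagePlugin.PngImageFile | int                    |
    | details | min: 29 tokens, mean: 33.51 tokens, max: 40 tokens |                                 | 0: ~80.00%, 1: ~20.00% |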
  • Samples:
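    | text                                                                                                                        | image           | label |
    |-----------------------------------------------------------------------------------------------------------------------------|-----------------|-------|
    | a cobain glasses character with a gradient 2 head and purple puffballs hair wearing a white sweater, gradient 4 background | (image omitted) | 1     |
    | a cobain glasses character with a gradient 2 head and purple puffballs hair wearing a white sweater, gradient 4 background | (image omitted) | 0     |
    | a cobain glasses character with a gradient 2 head and purple puffballs hair wearing a white sweater, gradient 4 background | (image omitted) | 0     |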
  • Loss: BinaryCrossEntropyLoss with these parameters:
    {
        "activation_fn": "torch.nn.modules.linear.Identity",
        "pos_weight": null
    }
    

Evaluation Datasets

image_to_text

  • Dataset: image_to_text at a575ac6
  • Size: 500 evaluation samples
  • Columns: image, text, and label
  • Approximate statistics based on the first 500 samples:
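    |         | image                           | text                                               | label                  |
    |---------|---------------------------------|----------------------------------------------------|------------------------|
    | type    | PIL.PngImagePlugin.PngImageFile | string                                             | int                    |
    | details |                                 | min: 27 tokens, mean: 33.05 tokens, max: 38 tokens | 0: ~80.00%, 1: ~20.00% |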
  • Samples:
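    | image           | text                                                                                                         | label |
    |-----------------|--------------------------------------------------------------------------------------------------------------|-------|
    | (image omitted) | a content character with a tan head and purple puffballs hair wearing a blue fleece, green background       | 1     |
    | (image omitted) | a grumpy character with a green head and pink hair wearing a light blue puffer, iridescent background       | 0     |
    | (image omitted) | a surprised character with a pale head and green mullet hair wearing a blue backpack, purple background     | 0     |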
  • Loss: BinaryCrossEntropyLoss with these parameters:
    {
        "activation_fn": "torch.nn.modules.linear.Identity",
        "pos_weight": null
    }
    

text_to_image

  • Dataset: text_to_image at a575ac6
  • Size: 500 evaluation samples
  • Columns: text, image, and label
  • Approximate statistics based on the first 500 samples:
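    |         | text                                               | image                           | label                  |
    |---------|----------------------------------------------------|---------------------------------|------------------------|
    | type    | string                                             | PIL.PngImagePlugin.PngImageFile | int                    |
    | details | min: 27 tokens, mean: 33.21 tokens, max: 38 tokens |                                 | 0: ~80.00%, 1: ~20.00% |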
  • Samples:
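    | text                                                                                                    | image           | label |
    |---------------------------------------------------------------------------------------------------------|-----------------|-------|
    | a content character with a tan head and purple puffballs hair wearing a blue fleece, green background  | (image omitted) | 1     |
    | a content character with a tan head and purple puffballs hair wearing a blue fleece, green background  | (image omitted) | 0     |
    | a content character with a tan head and purple puffballs hair wearing a blue fleece, green background  | (image omitted) | 0     |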
  • Loss: BinaryCrossEntropyLoss with these parameters:
    {
        "activation_fn": "torch.nn.modules.linear.Identity",
        "pos_weight": null
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 4
  • num_train_epochs: 1
  • learning_rate: 5e-06
  • warmup_steps: 0.1
  • bf16: True
  • eval_strategy: steps
  • per_device_eval_batch_size: 4
  • prompts: Judge whether the image and text match. Respond with 1 if they match, 0 if they don't.
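These settings can be related to the training length with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the fractional warmup_steps value of 0.1 is read as a share of the total optimizer steps. That last reading is an assumption, not something the card confirms. The schedule function is a hypothetical illustration of a linear warmup followed by linear decay.

```python
import math

# Two training datasets of 4,500 samples each, batch size 4, one epoch.
train_samples = 4500 + 4500
batch_size = 4
num_epochs = 1
total_steps = math.ceil(train_samples / batch_size) * num_epochs

# Assumption: warmup_steps=0.1 means 10% of the total steps.
warmup_fraction = 0.1
warmup_steps = round(total_steps * warmup_fraction)


def linear_schedule_lr(step, peak_lr=5e-6):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


print(total_steps, warmup_steps)  # 2250 225
```

The resulting 2,250 steps match the final step in the training logs.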

All Hyperparameters

  • per_device_train_batch_size: 4
  • num_train_epochs: 1
  • max_steps: -1
  • learning_rate: 5e-06
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: None
  • warmup_steps: 0.1
  • optim: adamw_torch_fused
  • optim_args: None
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • optim_target_modules: None
  • gradient_accumulation_steps: 1
  • average_tokens_across_devices: True
  • max_grad_norm: 1.0
  • label_smoothing_factor: 0.0
  • bf16: True
  • fp16: False
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • use_liger_kernel: False
  • liger_kernel_config: None
  • use_cache: False
  • neftune_noise_alpha: None
  • torch_empty_cache_steps: None
  • auto_find_batch_size: False
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • include_num_input_tokens_seen: no
  • log_level: passive
  • log_level_replica: warning
  • disable_tqdm: False
  • project: huggingface
  • trackio_space_id: trackio
  • eval_strategy: steps
  • per_device_eval_batch_size: 4
  • prediction_loss_only: True
  • eval_on_start: False
  • eval_do_concat_batches: True
  • eval_use_gather_object: False
  • eval_accumulation_steps: None
  • include_for_metrics: []
  • batch_eval_metrics: False
  • save_only_model: False
  • save_on_each_node: False
  • enable_jit_checkpoint: False
  • push_to_hub: False
  • hub_private_repo: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_always_push: False
  • hub_revision: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • restore_callback_states_from_checkpoint: False
  • full_determinism: False
  • seed: 42
  • data_seed: None
  • use_cpu: False
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • dataloader_prefetch_factor: None
  • remove_unused_columns: True
  • label_names: None
  • train_sampling_strategy: random
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • ddp_backend: None
  • ddp_timeout: 1800
  • fsdp: []
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • deepspeed: None
  • debug: []
  • skip_memory_metrics: True
  • do_predict: False
  • resume_from_checkpoint: None
  • warmup_ratio: None
  • local_rank: -1
  • prompts: Judge whether the image and text match. Respond with 1 if they match, 0 if they don't.
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

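| Epoch  | Step | Training Loss | image to text loss | text to image loss | doodles-image-to-text-eval_ndcg@10 | doodles-text-to-image-eval_ndcg@10 |
|--------|------|---------------|--------------------|--------------------|------------------------------------|------------------------------------|
| -1     | -1   | -             | -                  | -                  | 0.8080                             | 0.4314                             |
| 0.1    | 225  | 0.5822        | -                  | -                  | -                                  | -                                  |
| 0.2    | 450  | 0.3373        | -                  | -                  | -                                  | -                                  |
| 0.2502 | 563  | -             | 0.1263             | 0.3633             | 0.9782                             | 0.8903                             |
| 0.3    | 675  | 0.1652        | -                  | -                  | -                                  | -                                  |
| 0.4    | 900  | 0.1480        | -                  | -                  | -                                  | -                                  |
| 0.5    | 1125 | 0.1187        | -                  | -                  | -                                  | -                                  |
| 0.5004 | 1126 | -             | 0.1300             | 0.2718             | 0.9906                             | 0.9136                             |
| 0.6    | 1350 | 0.1268        | -                  | -                  | -                                  | -                                  |
| 0.7    | 1575 | 0.0706        | -                  | -                  | -                                  | -                                  |
| 0.7507 | 1689 | -             | 0.0775             | 0.1671             | 0.9913                             | 0.9374                             |
| 0.8    | 1800 | 0.1097        | -                  | -                  | -                                  | -                                  |
| 0.9    | 2025 | 0.1242        | -                  | -                  | -                                  | -                                  |
| 1.0    | 2250 | 0.0737        | -                  | -                  | -                                  | -                                  |
| -1     | -1   | -             | -                  | -                  | 0.9913                             | 0.9406                             |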
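As a quick consistency check on the log, the Epoch value of each evaluation row is its step divided by the 2,250 total steps. The first and last rows (Epoch -1, Step -1) appear to be evaluations outside the training loop, so their difference gives the overall NDCG@10 change. Reading those two rows as "before" and "after" training is an assumption. The variable names below are hypothetical.

```python
total_steps = 2250
eval_steps = [563, 1126, 1689]

# Fraction of the single epoch completed at each evaluation step.
eval_epochs = [round(step / total_steps, 4) for step in eval_steps]
print(eval_epochs)  # [0.2502, 0.5004, 0.7507]

# NDCG@10 from the first and last rows of the log (values copied from the table).
ndcg_first = {"image-to-text": 0.8080, "text-to-image": 0.4314}
ndcg_last = {"image-to-text": 0.9913, "text-to-image": 0.9406}
gains = {name: round(ndcg_last[name] - ndcg_first[name], 4) for name in ndcg_first}
print(gains)  # {'image-to-text': 0.1833, 'text-to-image': 0.5092}
```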

Framework Versions

  • Python: 3.11.6
  • Sentence Transformers: 5.3.0.dev0
  • Transformers: 5.3.0.dev0
  • PyTorch: 2.10.0+cu128
  • Accelerate: 1.13.0.dev0
  • Datasets: 4.3.0
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}