# colgemma-300m-gguf

GGUF conversion of `bowang0911/colgemma-300m` for use with llama.cpp.

## Files
- `colgemma-300m-q4.gguf`: Gemma3 transformer in GGUF Q4_K_M format (225 MB, recommended)
- `colgemma-300m-f16.gguf`: Gemma3 transformer in GGUF f16 format (584 MB)
- `dense_head.npz`: ColBERT projection layers as NumPy arrays (18 MB): 768 → 3072 → 768 → 128
## Quality

| Variant | Size | Cosine sim vs PyTorch f32 | Cosine sim vs f16 |
|---|---|---|---|
| f16 | 584 MB | 0.992 | – |
| q4 | 225 MB | 0.991 | 0.9997 |
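The similarity figures above can be reproduced by running the same inputs through both pipelines and averaging per-token cosine similarity. A minimal sketch (the comparison inputs are hypothetical; the table's exact evaluation corpus is not specified here):

```python
import numpy as np

def mean_cosine_sim(a, b):
    """Mean per-token cosine similarity between two (n_tokens, dim)
    embedding matrices produced by different precisions of the same
    model on the same input."""
    a_n = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1e-12)
    b_n = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), 1e-12)
    return float(np.mean(np.sum(a_n * b_n, axis=1)))

# Identical embeddings score exactly 1.0; quantization error lowers this.
a = np.array([[3.0, 0.0], [0.0, 2.0]])
print(mean_cosine_sim(a, a))  # 1.0
```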
## Usage

Get per-token embeddings from llama.cpp, then apply the dense head:

```sh
# Per-token embeddings (768-dim)
llama-embedding -m colgemma-300m-q4.gguf --pooling none -p "your text" --embd-output-format json
```
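The JSON output can be loaded into a NumPy array before applying the dense head. A sketch, assuming the payload is an OpenAI-style list with one `"embedding"` entry per token (check your llama.cpp build's actual output layout and adjust the keys if it differs):

```python
import json
import numpy as np

def load_token_embeddings(json_text):
    """Parse llama-embedding JSON output into a (seq_len, dim) float32 array.

    Assumed layout: {"data": [{"embedding": [...]}, ...]}, one entry per
    token when --pooling none is used.
    """
    payload = json.loads(json_text)
    return np.array([row["embedding"] for row in payload["data"]], dtype=np.float32)

# Tiny fake payload (2 tokens, 4 dims) for illustration:
sample = json.dumps({"data": [{"embedding": [0.1, 0.2, 0.3, 0.4]},
                              {"embedding": [0.5, 0.6, 0.7, 0.8]}]})
token_embeddings = load_token_embeddings(sample)
print(token_embeddings.shape)  # (2, 4)
```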
```python
import numpy as np

# Load dense head weights
weights = np.load("dense_head.npz")

# Apply projection: 768 -> 3072 -> 768 -> 128
x = token_embeddings  # (seq_len, 768) from llama.cpp
x = x @ weights["d1_w"].T
x = x @ weights["d2_w"].T
x = x @ weights["d3_w"].T

# L2 normalize
x = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)
```
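ColBERT-style models compare a query and a document by late interaction: each query token takes its maximum similarity over document tokens, and the per-token maxima are summed (MaxSim). A minimal sketch, assuming both sides have been projected to 128 dims and L2-normalized as above:

```python
import numpy as np

def maxsim_score(query_emb, doc_emb):
    """ColBERT late-interaction (MaxSim) score.

    query_emb: (n_query_tokens, dim), L2-normalized rows
    doc_emb:   (n_doc_tokens, dim),   L2-normalized rows
    Returns the sum over query tokens of the max cosine similarity
    against any document token.
    """
    sim = query_emb @ doc_emb.T       # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy example: each query token has an exact match among the doc tokens.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.0, 1.0], [0.7071, 0.7071]])
print(maxsim_score(q, d))  # 2.0
```

Rank documents by this score to use the embeddings for retrieval.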