colgemma-300m-gguf

GGUF conversion of bowang0911/colgemma-300m for use with llama.cpp.

Files

  • colgemma-300m-q4.gguf — Gemma3 transformer in GGUF Q4_K_M format (225 MB, recommended)
  • colgemma-300m-f16.gguf — Gemma3 transformer in GGUF f16 format (584 MB)
  • dense_head.npz — ColBERT projection layers as numpy arrays (18 MB): 768→3072→768→128
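
The 768→3072→768→128 pipeline implies one weight matrix per projection layer, stored as (out_dim, in_dim) so each step is `x @ W.T`. A minimal shape sketch with random stand-in weights (the actual `dense_head.npz` key names follow the usage snippet below; the shapes here are inferred, not read from the file):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in weights with the shapes implied by 768 -> 3072 -> 768 -> 128.
# Each layer is applied as x @ W.T, so W is stored as (out_dim, in_dim).
weights = {
    "d1_w": rng.standard_normal((3072, 768)),
    "d2_w": rng.standard_normal((768, 3072)),
    "d3_w": rng.standard_normal((128, 768)),
}

x = rng.standard_normal((10, 768))  # 10 token embeddings, 768-dim each
for key in ("d1_w", "d2_w", "d3_w"):
    x = x @ weights[key].T

print(x.shape)  # (10, 128)
```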

Quality

Variant  Size    Cosine sim vs PyTorch f32  Cosine sim vs f16
f16      584 MB  0.992                      —
q4       225 MB  0.991                      0.9997
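
The figures above are per-token cosine similarities between a quantized run and a reference run. A minimal sketch of how such a comparison can be computed, given two hypothetical (seq_len, dim) embedding arrays `a` and `b` (the exact evaluation protocol used for the table is not specified here):

```python
import numpy as np

def mean_cosine_sim(a, b):
    """Mean per-token cosine similarity between two (seq_len, dim) arrays."""
    a = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1e-12)
    b = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), 1e-12)
    return float(np.mean(np.sum(a * b, axis=1)))

# Sanity check: identical arrays score 1.0, negated arrays score -1.0.
a = np.ones((4, 8))
print(mean_cosine_sim(a, a))   # 1.0
print(mean_cosine_sim(a, -a))  # -1.0
```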

Usage

Get per-token embeddings from llama.cpp, then apply the dense head:

```shell
# Per-token embeddings (768-dim), one vector per token, emitted as JSON
llama-embedding -m colgemma-300m-q4.gguf --pooling none -p "your text" --embd-output-format json
```

Then apply the projection in Python:

```python
import numpy as np

# Load dense head weights
weights = np.load("dense_head.npz")

# token_embeddings: (seq_len, 768) array parsed from the llama.cpp JSON output
x = token_embeddings

# Apply projection: 768 -> 3072 -> 768 -> 128
x = x @ weights["d1_w"].T
x = x @ weights["d2_w"].T
x = x @ weights["d3_w"].T

# L2-normalize each 128-dim token embedding
x = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)
```
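
The resulting L2-normalized 128-dim token embeddings are intended for ColBERT-style late interaction: each query token is matched against its most similar document token, and the maxima are summed (MaxSim). A minimal sketch with hypothetical query/document arrays, not taken from the files above:

```python
import numpy as np

def maxsim_score(q, d):
    """ColBERT MaxSim: for each query token, take the max cosine similarity
    over document tokens (inputs assumed L2-normalized), then sum over the query."""
    sims = q @ d.T  # (q_len, d_len) matrix of cosine similarities
    return float(sims.max(axis=1).sum())

# Sanity check: scoring a "document" identical to the query gives one
# perfect match (sim = 1.0) per query token, so the score is q_len.
rng = np.random.default_rng(1)
q = rng.standard_normal((5, 128))
q = q / np.linalg.norm(q, axis=1, keepdims=True)
print(maxsim_score(q, q))  # ≈ 5.0
```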