# colgemma-300m-gguf

GGUF conversion of `bowang0911/colgemma-300m` for use with llama.cpp.

## Files
- `colgemma-300m-q4.gguf`: Gemma3 transformer in GGUF Q4_K_M format (225 MB, recommended)
- `colgemma-300m-f16.gguf`: Gemma3 transformer in GGUF f16 format (584 MB)
- `dense_head.npz`: ColBERT projection layers as NumPy arrays (18 MB): 768 → 3072 → 768 → 128
## Quality

| Variant | Size | Cosine sim vs PyTorch f32 | Cosine sim vs f16 |
|---|---|---|---|
| f16 | 584 MB | 0.992 | – |
| q4 | 225 MB | 0.991 | 0.9997 |
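The similarity figures above can be reproduced by running the same inputs through both pipelines and averaging per-token cosine similarity. A minimal sketch (the comparison inputs are hypothetical; the table's exact evaluation corpus is not specified here):

```python
import numpy as np

def mean_cosine_sim(a, b):
    """Mean per-token cosine similarity between two (n_tokens, dim)
    embedding matrices produced by different precisions of the same
    model on the same input."""
    a_n = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), 1e-12)
    b_n = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), 1e-12)
    return float(np.mean(np.sum(a_n * b_n, axis=1)))

# Identical embeddings score exactly 1.0; quantization error lowers this.
a = np.array([[3.0, 0.0], [0.0, 2.0]])
print(mean_cosine_sim(a, a))  # 1.0
```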
## Usage

Get per-token embeddings from llama.cpp, then apply the dense head:

```sh
# Per-token embeddings (768-dim)
llama-embedding -m colgemma-300m-q4.gguf --pooling none -p "your text" --embd-output-format json
```
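The JSON output can be loaded into a NumPy array before applying the dense head. A sketch, assuming the payload is an OpenAI-style list with one `"embedding"` entry per token (check your llama.cpp build's actual output layout and adjust the keys if it differs):

```python
import json
import numpy as np

def load_token_embeddings(json_text):
    """Parse llama-embedding JSON output into a (seq_len, dim) float32 array.

    Assumed layout: {"data": [{"embedding": [...]}, ...]}, one entry per
    token when --pooling none is used.
    """
    payload = json.loads(json_text)
    return np.array([row["embedding"] for row in payload["data"]], dtype=np.float32)

# Tiny fake payload (2 tokens, 4 dims) for illustration:
sample = json.dumps({"data": [{"embedding": [0.1, 0.2, 0.3, 0.4]},
                              {"embedding": [0.5, 0.6, 0.7, 0.8]}]})
token_embeddings = load_token_embeddings(sample)
print(token_embeddings.shape)  # (2, 4)
```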
```python
import numpy as np

# Load dense head weights
weights = np.load("dense_head.npz")

# Apply projection: 768 -> 3072 -> 768 -> 128
x = token_embeddings  # (seq_len, 768) from llama.cpp
x = x @ weights["d1_w"].T
x = x @ weights["d2_w"].T
x = x @ weights["d3_w"].T

# L2 normalize
x = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)
```
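ColBERT-style models compare a query and a document by late interaction: each query token takes its maximum similarity over document tokens, and the per-token maxima are summed (MaxSim). A minimal sketch, assuming both sides have been projected to 128 dims and L2-normalized as above:

```python
import numpy as np

def maxsim_score(query_emb, doc_emb):
    """ColBERT late-interaction (MaxSim) score.

    query_emb: (n_query_tokens, dim), L2-normalized rows
    doc_emb:   (n_doc_tokens, dim),   L2-normalized rows
    Returns the sum over query tokens of the max cosine similarity
    against any document token.
    """
    sim = query_emb @ doc_emb.T       # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy example: each query token has an exact match among the doc tokens.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.0, 1.0], [0.7071, 0.7071]])
print(maxsim_score(q, d))  # 2.0
```

Rank documents by this score to use the embeddings for retrieval.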