EvoQuality: a model that grades AI-generated images so you don't ship the broken ones

If you generate AI images at any scale, you have a problem nobody talks about: a chunk of what you generate is broken. Compression artifacts, blurry edges, garbled content, weird hands. The model just spit out a picture and there’s no easy way to know whether it’s good or trash before you show it to a user.

We released a tool for that. EvoQuality is a small model that looks at one image and rates it from 1 to 5 (1 = visibly broken, 5 = clean and sharp). Our release is the first GGUF conversion of ByteDance’s EvoQuality. It’s Apache-2.0 licensed, runs on a single consumer GPU, and we’re using it in production.

What you can do with it

Filter broken AI-generated images before users see them. Stable Diffusion, Flux, WAN, LTX — whatever you use, occasional outputs are visibly wrong. Pipe each output through EvoQuality, drop anything that scores 1 or 2, retry. The user sees the good takes only.

Curate training data. Scraped a dataset off the internet? Bought a stock photo bundle? Generating synthetic images for a LoRA? A quarter of it is probably garbage. EvoQuality is fast enough to score 100,000 images in about an hour on a single GPU. You set the floor, and anything underneath gets quarantined.

Score image collections. Got 50,000 product photos and want to know which ones are usable? Same recipe.
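The curation recipe above boils down to a threshold split. Here is a minimal sketch, assuming you have already collected one score per image into a dict; `split_by_floor` and the file names are made up for this example.

```python
from typing import Dict, List, Tuple


def split_by_floor(scores: Dict[str, int], floor: int = 3) -> Tuple[List[str], List[str]]:
    """Split image paths into (kept, quarantined) by a minimum score.

    `scores` maps an image path to its 1-5 score. Images scoring at or
    above `floor` are kept; everything else is quarantined, not deleted.
    """
    kept = sorted(path for path, score in scores.items() if score >= floor)
    quarantined = sorted(path for path, score in scores.items() if score < floor)
    return kept, quarantined


# Hypothetical scores, for illustration only.
example_scores = {"a.png": 5, "b.png": 2, "c.png": 3, "d.png": 1}
kept, quarantined = split_by_floor(example_scores, floor=3)
print(kept)         # ['a.png', 'c.png']
print(quarantined)  # ['b.png', 'd.png']
```

Quarantining instead of deleting keeps the decision reversible if you later decide the floor was too strict.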

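Compare outputs from different models. Run the same prompt on three different image generators, score the outputs, and see which model produces the cleanest results. Keep in mind that the model only sees the image, so the score reflects perceptual quality, not how well the image matches the prompt.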
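One way to turn per-image scores into a model comparison is to look at the mean score and the share of outputs that would be filtered out. This is only a sketch: `rank_models`, the model names, and the scores are hypothetical, and the "2 or below counts as broken" cutoff is borrowed from the filtering rule earlier in this post.

```python
from statistics import mean
from typing import Dict, List, Tuple


def rank_models(scores_by_model: Dict[str, List[int]]) -> List[Tuple[str, float, float]]:
    """Return (model, mean score, share scoring <= 2), best mean first."""
    rows = []
    for model, scores in scores_by_model.items():
        broken_share = sum(1 for s in scores if s <= 2) / len(scores)
        rows.append((model, mean(scores), broken_share))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical scores for the same prompts on three generators.
example = {
    "model_a": [4, 5, 3, 4],
    "model_b": [2, 3, 3, 4],
    "model_c": [5, 5, 4, 1],
}
ranking = rank_models(example)
for model, avg, broken_share in ranking:
    print(f"{model}: mean={avg:.2f} broken={broken_share:.0%}")
```

Looking at both numbers matters: in the made-up data above, model_c has a higher mean than model_b but the same share of broken outputs.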

It doesn’t need a reference image. It doesn’t need labels. You feed it a picture and ask “how clean does this look?” and it answers with a number from 1 to 5.

Speed: 200+ tokens per second on a single GPU

We benchmarked it on one of our RTX PRO 6000s (a workstation-class card) using our standard test setup (measure-llamacpp-tps.sh, 128-token outputs). The Q5_K_M variant (the one we shipped to production) is the headline:

Setup	Throughput
Single request at a time	210 tokens/sec
8 requests in parallel	406 tokens/sec aggregate

The model serves an image score in well under a second on average. Even at 8 concurrent requests, each request still averages about 51 tok/s (406 / 8). That’s fast enough to QA every output of a production image-generation service in real time, or to filter 100k training images in about an hour.
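The arithmetic behind those two claims is short enough to spell out. This sketch only restates the numbers from the table and the 100k-per-hour target; the function names are made up. The benchmark measures token throughput on 128-token outputs, while a score is a very short reply, so the images-per-second figure is a requirement to check on your own hardware, not something the table proves.

```python
def per_request_tps(aggregate_tps: float, concurrency: int) -> float:
    """Average tokens/sec each request sees when sharing the GPU."""
    return aggregate_tps / concurrency


def required_images_per_second(n_images: int, hours: float) -> float:
    """Sustained scoring rate needed to finish n_images in the given time."""
    return n_images / (hours * 3600)


print(per_request_tps(406, 8))                             # 50.75
print(round(required_images_per_second(100_000, 1.0), 1))  # 27.8
```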

This isn’t a giant model. EvoQuality is 8 billion parameters (small, by 2026 standards) and the Q5_K_M quant is 5.1 GB on disk. It fits in 8 GB of VRAM with room to spare for context and KV cache.

Quality: matches the original within bench noise

The whole point of using a smaller, quantized model is to save VRAM. But that’s worthless if the quantization breaks the model. We benchmarked all six size variants (the BF16 original plus five smaller ones) against AGIQA-3K, a standard image-quality benchmark with human-rated scores for roughly 3,000 AI-generated images. From its 598-image test split, we sampled 99 images evenly across the full quality range and scored them with each variant.
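For readers who want to build a similar subsample on their own data, here is one way to draw images evenly across a quality range. It is a sketch under stated assumptions, not the procedure used for the numbers in this post: the equal-width bins, the 3 x 33 split, and the name `stratified_sample` are all choices made for this example.

```python
import random
from typing import Dict, List


def stratified_sample(mos: Dict[str, float], n_bins: int, per_bin: int, seed: int = 0) -> List[str]:
    """Pick up to `per_bin` images from each of `n_bins` equal-width score bins.

    `mos` maps an image name to its human rating (mean opinion score).
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    lo, hi = min(mos.values()), max(mos.values())
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width if all ratings match
    bins: List[List[str]] = [[] for _ in range(n_bins)]
    for name in sorted(mos):
        # Clamp so the maximum rating falls into the last bin.
        idx = min(int((mos[name] - lo) / width), n_bins - 1)
        bins[idx].append(name)
    picked: List[str] = []
    for members in bins:
        picked.extend(rng.sample(members, min(per_bin, len(members))))
    return picked


# Hypothetical ratings: 300 images spread evenly between 0.00 and 2.99.
ratings = {f"img{i:03d}": i / 100 for i in range(300)}
sample = stratified_sample(ratings, n_bins=3, per_bin=33)
print(len(sample))  # 99
```

Sampling per bin keeps rare very-good and very-bad images represented, which a plain random draw from a skewed dataset would not guarantee.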

How each size variant matches human ratings

The bars show how closely each variant matches the human ratings. Higher is better. The dashed lines are the upstream paper’s claimed scores on the same kind of benchmark.

Size variant	File size	How well it matches humans	Recommendation
Q8_0	7.5 GB	Identical to the original (lossless)	Use if you have 12 GB+ VRAM
Q5_K_M	5.1 GB	Within noise of the original	What we use in production
Q4_K_M	4.4 GB	Slight measurable hit (1%)	Good for 8 GB GPUs
IQ4_XS	4.0 GB	4% hit, still useful	Smallest, for 6 GB GPUs

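On our sample, Q5_K_M scores above the upstream paper’s reported PLCC (0.800 vs 0.770) and slightly below its reported SRCC (0.716 vs 0.726), so we read the shipped variants as in line with the paper, not as beating it. Our 99-image stratified sample is also not directly comparable to the paper’s test split (details in the methodology section). We picked Q5_K_M as our default because it’s about a third smaller than Q8_0 (5.1 GB vs 7.5 GB) with no meaningful quality loss, which suits the way we run it (more on that below).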

Charts: how every variant performs

Each panel shows a different size variant. The dots are individual images, the line is the variant’s tendency. The closer the line follows the diagonal, the better the model matches human judgment.

Per-variant scatter grid

Most of the five smaller variants are nearly indistinguishable from the original. Only the smallest (IQ4_XS) shows a visible deviation, and only at the high-quality end.

Code: serve it locally in 30 seconds

# Grab the weights
huggingface-cli download Doradus-AI/EvoQuality-IQA-GGUF \
  evoquality-iqa-Q5_K_M.gguf \
  mmproj-evoquality-iqa-f16.gguf \
  --local-dir models

# Start the server (needs llama.cpp installed)
llama-server \
  --model models/evoquality-iqa-Q5_K_M.gguf \
  --mmproj models/mmproj-evoquality-iqa-f16.gguf \
  --host 0.0.0.0 --port 8259 \
  --n-gpu-layers 999

Score an image:

curl -s http://localhost:8259/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "evoquality-iqa",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
        {"type": "text", "text": "Rate the perceptual quality of this image from 1 to 5. Respond with a single integer."}
      ]
    }],
    "max_tokens": 32, "temperature": 0
  }'

The response is a standard chat-completions JSON object. The score is the single digit, 1 through 5, in choices[0].message.content.
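If you’d rather call the server from Python and score a local file, here is a minimal sketch using only the standard library. It assumes the server from the previous step is listening on localhost:8259 and that it accepts base64 data URIs in image_url. The function names (`build_payload`, `parse_score`, `score_image`) are made up for this example.

```python
import base64
import json
import re
import urllib.request
from typing import Optional

PROMPT = ("Rate the perceptual quality of this image from 1 to 5. "
          "Respond with a single integer.")


def build_payload(image_bytes: bytes, mime: str = "image/png") -> dict:
    """Same request as the curl example, with the image inlined as a data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "evoquality-iqa",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
                {"type": "text", "text": PROMPT},
            ],
        }],
        "max_tokens": 32,
        "temperature": 0,
    }


def parse_score(response: dict) -> Optional[int]:
    """Return the first standalone digit 1-5 in the reply, or None if absent."""
    text = response["choices"][0]["message"]["content"]
    match = re.search(r"\b[1-5]\b", text)
    return int(match.group()) if match else None


def score_image(path: str,
                url: str = "http://localhost:8259/v1/chat/completions") -> Optional[int]:
    """Send one local image to the server and return its score."""
    with open(path, "rb") as f:
        payload = build_payload(f.read())
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as resp:
        return parse_score(json.load(resp))
```

With the server running, `score_image("output.png")` would return an integer from 1 to 5, or None if the reply contained no usable digit. Treating None as a failed check is the safer default in a filtering pipeline.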

There’s also a Docker harness if you want a one-line docker compose up.

How we use it in our own stack

We have a lot of pipelines that generate images: WAN 2.2 for video, LTX-2 for shorter clips, TripoSplat for 3D, plus a handful of straight image generators. Every one of those pipelines now passes its output through EvoQuality before it ships. If the score is too low, the pipeline retries; if it’s still low, the user sees a “couldn’t produce a usable image” message instead of a broken one.
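The retry-then-give-up behaviour described above can be sketched as a small wrapper around any generator. This is an illustration, not our production code: `generate_with_qa` is a made-up name, the minimum score of 3 mirrors the "drop anything that scores 1 or 2" rule from earlier, and the attempt limit of 2 is an assumption.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def generate_with_qa(generate: Callable[[], T],
                     score: Callable[[T], Optional[int]],
                     min_score: int = 3,
                     max_attempts: int = 2) -> Optional[T]:
    """Return the first output scoring at least `min_score`, else None.

    A None score (for example, an unparseable reply) counts as a failure.
    """
    for _ in range(max_attempts):
        candidate = generate()
        result = score(candidate)
        if result is not None and result >= min_score:
            return candidate
    return None  # caller shows a "couldn't produce a usable image" message


# Stand-ins for a real generator and scorer, for illustration only.
fake_outputs = iter(["blurry.png", "clean.png"])
fake_scores = {"blurry.png": 1, "clean.png": 4}
print(generate_with_qa(lambda: next(fake_outputs), fake_scores.get))  # clean.png
```

Passing the generator and scorer in as callables keeps the retry policy independent of which image model or scoring endpoint sits behind them.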

We also run it as a pre-filter on training data. When we’re prepping a LoRA for a new art style, we score the source images first, drop the bottom of the distribution, and only feed clean stuff into training.

EvoQuality lives on a small pool of GPUs alongside other models, getting paged in and out on demand. It loads in about 2 seconds when needed and gets swapped out when something else needs the slot. Smaller weights help here — that’s the practical reason we ship Q5_K_M as our default and not the bigger Q8_0.

Under the hood (skip if you’re not into the methodology)

Upstream EvoQuality is a Qwen2.5-VL-7B (about 8B parameters in total) that’s been GRPO-tuned against pairwise majority-vote pseudo-labels. The paper claims PLCC +25% and SRCC +27% over the Qwen2.5-VL-7B baseline.

Our bench: 99 images stratified across the full MOS (mean opinion score) range of the AGIQA-3K test split. PLCC is computed via scipy.stats.pearsonr, SRCC via scipy.stats.spearmanr with tie-corrected ranks. Per-image scores ship as JSONL at benchmarks/. BF16 reads PLCC=0.8033 / SRCC=0.7177; Q5_K_M reads 0.7999 / 0.7158. Against the paper’s 0.770 / 0.726, our PLCC runs above and our SRCC slightly below. The PLCC gap comes from the stratified subsample (i.i.d. sampling on the full 598-image test split lands closer to the paper’s numbers). The relative ordering of the six variants is the reliable signal.
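To make the two metrics concrete, here is a dependency-free sketch of what they measure. The numbers above were computed with scipy, so this is an illustration of the same statistics, not the bench code. The function names and the five-image example are made up. Because the model answers with integers from 1 to 5, ties are common, which is why the rank step averages tied positions.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson linear correlation (PLCC). Assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks where tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation (SRCC): Pearson on tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical human ratings and model scores for five images.
human = [1.2, 2.1, 2.9, 4.0, 4.8]
model = [1, 2, 2, 4, 5]
print(average_ranks(model))  # [1.0, 2.5, 2.5, 4.0, 5.0]
print(f"PLCC={pearson(human, model):.3f} SRCC={spearman(human, model):.3f}")
```

PLCC rewards scores that track the human ratings linearly, while SRCC only cares whether the ordering agrees, so the two can move in different directions.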

Conversion is vanilla llama.cpp/convert_hf_to_gguf.py for BF16 and Q8_0; the K-quants and IQ4_XS come from llama-quantize run against the BF16 GGUF. The vision encoder ships separately as an f16 mmproj sidecar; quantizing it had a larger quality impact than quantizing the language tower in internal tests, so we keep it at f16 across the spread. Full pipeline + scripts: DoradusResearch/EvoQuality-IQA-GGUF. llama.cpp commit b9010-d05fe1d7d or newer handles Qwen2.5-VL mmproj correctly; older revisions can mishandle the vision tower’s tensor metadata.

So if you want to try it

The weights are at huggingface.co/Doradus-AI/EvoQuality-IQA-GGUF. The scripts and Docker harness are at github.com/DoradusResearch/EvoQuality-IQA-GGUF. Pick a quant that fits your VRAM (we recommend Q5_K_M for most people), spin up llama-server, point it at the GGUF + mmproj, send images. You’ll get a number from 1 to 5 back.

If you use it on a dataset that’s not AI-generated images, you’ll likely want to run the bench harness against your own held-out set to confirm the quant still preserves the model’s judgment on your distribution. The harness is in the GitHub repo and takes about 20 minutes of GPU time end-to-end. Apache-2.0, no strings.