Back to Wiki
Wiki Article

Model Comparison Matrix

By u/trashacct383, u/Almarma, u/An-R-Nguyen

Model Comparison Matrix

Source: r/hermesagent community testing and discussion (May 2026) Based on: 121 comments from "What model are you running?" thread + multiple setup discussions


Quick Reference: Best Models by Use Case

Use Case Recommended Model Provider Cost Tier
Daily driver (general tasks) Qwen 3.6-27B Local/vLLM or OpenRouter Free-Paid
Budget option MiniMax M2.7 AIStudio ($10/mo plan) $
Best value cloud model DeepSeek V4 Pro DeepSeek API directly $$
Complex reasoning tasks Qwen 3.6-35B or GPT-5.5 OpenRouter/Cloud $$$
Coding assistant Qwen 3.6-27B (local) + Claude/GPT for review Mixed $$-$$$
Vision/image analysis DeepSeek V4 Flash or Gemini 3.1 Flash Preview Various $$-$$
Auxiliary tasks (search, extraction) DeepSeek V4 Flash or OSS 120B AIStudio/OpenRouter $

Detailed Model Reviews

Qwen 3.6 Series

Qwen 3.6-27B - Community favorite, "custom-made for Hermes"

  • Strengths: Excellent tool calling, agentic workflows, reasoning
  • Context: Up to 128k (some users report degradation past this point)
  • Local setup: vLLM recommended over Ollama for full context support. FP8 quant uses ~60GB VRAM. Q8 GGUF via llama.cpp also viable.
  • Performance: 90+ TPS on single Pro 6000 with MTP=3 (u/trashacct383)
  • Community verdict: "Absolute workhorse" - best balance of capability and cost

Qwen 3.6-35B - Step up from 27B

  • Strengths: Better reasoning, handles complex multi-step tasks
  • Local setup: Requires more VRAM. Q4 quant on RTX 3090 (24GB) gets ~45 TPS with 200k context (u/ObjectiveMediocre748)
  • Community verdict: "122b for tasks that need more detail" - use as upgrade path from 27B

Qwen 3.6 Plus 35B - Cloud variant

  • Strengths: Full capability without local hardware requirements
  • Cost: Competitive on OpenRouter and DeepSeek platforms
  • Community verdict: u/Jonathan_Rivera's favorite for weeks before switching to DS V4 Pro

MiniMax M2.7

MiniMax M2.7 - Budget champion with caveats

  • Strengths: Cheap ($10/mo token plan), decent for basic tasks, good auxiliary model
  • Weaknesses: "All over the place" consistency (u/idefix1515), not top-tier intelligence
  • Best use: Auxiliary tasks, paired with stronger main model for reasoning
  • Community verdict: "Forces me to think more and learn twice" - good for learning, not for complex work

DeepSeek Series

DeepSeek V4 Pro - Current community favorite for cloud

  • Strengths: Excellent capability, cheap via direct API (not OpenRouter), great caching
  • Cost: $1-1.5/day vs $2-3/day on OpenRouter for same usage (u/Almarma)
  • Community verdict: "Really cheap and really efficient using cache" - best cloud value

DeepSeek V4 Flash - Lightweight option

  • Strengths: Very cheap, good for auxiliary tasks and vision
  • Best use: Vision-only tasks, search/extraction, delegated simple work
  • Community verdict: Good auxiliary model, not recommended as main driver

Gemma 4 Series

Gemma 4 (all variants) - Generally NOT recommended for Hermes

  • Weaknesses: Poor agentic performance, weak tool calling
  • Context limitation: Limited context size on local hardware
  • Community verdict: "Tried all Gemma4 models, none was great at Agentic" (u/EmuHefty)

Kimi K2.6

Kimi K2.6 - Solid alternative

  • Strengths: Good general reasoning and tool handling
  • Best use: Medium-tier tasks, monitoring, scraping
  • Community verdict: "Solid all-around" but not the top pick (u/Fair-Yogurtcloset-21)

GPT Series

GPT-5.4 Mini / GPT-5.5 - Premium option

  • Strengths: High capability, reliable tool calling
  • Weaknesses: "Very chatty" (u/Ryankolp), expensive for daily use
  • Best use: Complex tasks where quality matters more than cost
  • Community verdict: Good for specific high-value tasks, not as daily driver

GLM 5.1

GLM 5.1 - Mixed results

  • Issues: "Model generated invalid tool call" errors reported (u/bef349)
  • Status: Overloaded/unstable at time of writing
  • Community verdict: Avoid for now, wait for stability improvements

Provider Comparison

Direct API vs OpenRouter

Factor Direct API OpenRouter
Cost Usually cheaper (no markup) Slightly higher prices
Caching Native caching support Caching may not work as well
Model variety Limited to one provider Access to many models
Reliability Direct connection, fewer hops Additional routing layer
Best for Single-model setups Multi-model experimentation

Community recommendation: Use direct API when you've settled on a model. Use OpenRouter during exploration phase. (u/Maleficent-Anything2)

Ollama Cloud

  • Cost: $20/mo Pro subscription
  • Models: Access to many high-end models
  • Missing: Image generation at time of writing
  • Community verdict: "Great for complex tasks" but image gen gap is a limitation (u/aaronmcbaron)

Model Routing Strategies

  • Main model: Qwen 3.6-27B or DeepSeek V4 Pro
  • Auxiliary model: DeepSeek V4 Flash or MiniMax M2.7
  • Upgrade path: Bump to Qwen 3.6-35B or Claude/GPT for complex tasks

Community Pattern 2: Local + Cloud Hybrid (u/trashacct383)

  • Local: Qwen 3.6-27B via vLLM for daily work
  • Cloud: Claude or GPT for planning and review phases
  • Workflow: Plan with local model -> execute locally -> QC with cloud model

Community Pattern 3: Orchestrator + Worker (u/An-R-Nguyen)

  • Orchestrator profile: Main model handles planning and QC
  • Coder profile: Dedicated coding agent, one-shots requests
  • Pattern: If quality < 80%, nuke and restart rather than fix

Community Pattern 4: Free-Tier Pooling (u/azzbeeter)

  • Tool: llm-keypool proxy
  • Strategy: Rotate across multiple free-tier API keys from different providers
  • Benefit: Zero cost, pooled rate limits
  • Warning: Multiple keys for same provider may violate ToS

Hardware Requirements for Local Models

Model Minimum VRAM Recommended VRAM Quantization
Qwen 3.6-27B (FP8) 48GB 60GB+ FP8 via vLLM
Qwen 3.6-27B (Q8) 32GB 48GB Q8 GGUF via llama.cpp
Qwen 3.6-35B (Q4) 16GB 24GB Q4 GGUF via Ollama/llama.cpp
MiniMax M2.7 Varies Check provider docs Provider-dependent

Note on MoE models: You can offload expert layers to CPU for more context, but expect ~50% TPS reduction. (u/Asleep-Land-3914, u/xeeff)


Model-Specific Issues

Censored vs Uncensored Models

  • Issue: Some Qwen variants refuse browser automation on external portals (e.g., school parent portals)
  • Solution: Use abliterated/uncensored variants for tasks requiring unrestricted access
  • Trade-off: Uncensored models may have slightly reduced accuracy
  • See: Model Variants memory note for specific model names

Context Window Limits

  • Qwen 3.6-27B: Handles 128k well, gradual degradation past that point
  • Ollama reported context: May show lower than actual (e.g., 64k instead of full context)
  • vLLM advantage: Full advertised context available locally

Token Usage Optimization

  • Switch models less frequently - each switch requires re-reading chat history
  • Keep conversations shorter or start new sessions when switching models
  • Use caching-enabled providers (DeepSeek direct API excels here)
  • Set compression at ~70% for long-running sessions

Community Model Testing Results

From the "What model are you running?" thread (121 responses):

Most mentioned models:

  1. MiniMax M2.7 - Budget favorite, widely tested
  2. Qwen 3.6-27B - Local deployment champion
  3. DeepSeek V4 Flash/Pro - Cloud value leader
  4. Kimi K2.6 - Solid alternative
  5. GPT variants - Premium option for specific tasks

Least recommended:

  1. Gemma 4 series - Consistently poor agentic performance
  2. GLM 5.1 - Stability issues at time of writing