Model Comparison Matrix
Source: r/hermesagent community testing and discussion (May 2026)
Based on: 121 comments from the "What model are you running?" thread, plus multiple setup discussions
Quick Reference: Best Models by Use Case
| Use Case | Recommended Model | Provider | Cost Tier |
|---|---|---|---|
| Daily driver (general tasks) | Qwen 3.6-27B | Local/vLLM or OpenRouter | Free-Paid |
| Budget option | MiniMax M2.7 | AIStudio ($10/mo plan) | $ |
| Best value cloud model | DeepSeek V4 Pro | DeepSeek API directly | $$ |
| Complex reasoning tasks | Qwen 3.6-35B or GPT-5.5 | OpenRouter/Cloud | $$$ |
| Coding assistant | Qwen 3.6-27B (local) + Claude/GPT for review | Mixed | $$-$$$ |
| Vision/image analysis | DeepSeek V4 Flash or Gemini 3.1 Flash Preview | Various | $-$$ |
| Auxiliary tasks (search, extraction) | DeepSeek V4 Flash or OSS 120B | AIStudio/OpenRouter | $ |
Detailed Model Reviews
Qwen 3.6 Series
Qwen 3.6-27B - Community favorite, "custom-made for Hermes"
- Strengths: Excellent tool calling, agentic workflows, reasoning
- Context: Up to 128k (some users report degradation past this point)
- Local setup: vLLM is recommended over Ollama for full context support. The FP8 quant uses ~60GB VRAM. A Q8 GGUF via llama.cpp is also viable.
- Performance: 90+ tokens per second (TPS) on a single Pro 6000 with MTP=3 (u/trashacct383)
- Community verdict: "Absolute workhorse" - best balance of capability and cost
Qwen 3.6-35B - Step up from 27B
- Strengths: Better reasoning, handles complex multi-step tasks
- Local setup: Requires more VRAM than the 27B. A Q4 quant on an RTX 3090 (24GB) reaches ~45 TPS with 200k context (u/ObjectiveMediocre748)
- Community verdict: "122b for tasks that need more detail" - use as upgrade path from 27B
Qwen 3.6 Plus 35B - Cloud variant
- Strengths: Full capability without local hardware requirements
- Cost: Competitive on OpenRouter and DeepSeek platforms
- Community verdict: u/Jonathan_Rivera's favorite for weeks before switching to DeepSeek V4 Pro
MiniMax M2.7
MiniMax M2.7 - Budget champion with caveats
- Strengths: Cheap ($10/mo token plan), decent for basic tasks, good auxiliary model
- Weaknesses: Inconsistent output ("all over the place", u/idefix1515), not top-tier intelligence
- Best use: Auxiliary tasks, paired with stronger main model for reasoning
- Community verdict: "Forces me to think more and learn twice" - good for learning, not for complex work
DeepSeek Series
DeepSeek V4 Pro - Current community favorite for cloud
- Strengths: Excellent capability, cheap via direct API (not OpenRouter), great caching
- Cost: $1-1.5/day via the direct API vs $2-3/day on OpenRouter for the same usage (u/Almarma)
- Community verdict: "Really cheap and really efficient using cache" - best cloud value
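To put the reported daily figures on a monthly footing, here is a rough sketch. It assumes a 30-day month and takes u/Almarma's ranges at face value; the function name is illustrative, and actual spend depends on usage and cache hit rate.

```python
def monthly_range(daily_low: float, daily_high: float, days: int = 30) -> tuple[float, float]:
    """Scale a reported daily cost range (USD) to a monthly range."""
    return (daily_low * days, daily_high * days)


direct = monthly_range(1.0, 1.5)      # DeepSeek direct API, as reported
openrouter = monthly_range(2.0, 3.0)  # same usage via OpenRouter, as reported

# Smallest saving: cheapest OpenRouter month vs most expensive direct month.
# Largest saving: most expensive OpenRouter month vs cheapest direct month.
savings_low = openrouter[0] - direct[1]
savings_high = openrouter[1] - direct[0]

print(f"Direct API: ${direct[0]:.0f}-${direct[1]:.0f}/month")
print(f"OpenRouter: ${openrouter[0]:.0f}-${openrouter[1]:.0f}/month")
print(f"Estimated savings: ${savings_low:.0f}-${savings_high:.0f}/month")
```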
DeepSeek V4 Flash - Lightweight option
- Strengths: Very cheap, good for auxiliary tasks and vision
- Best use: Vision-only tasks, search/extraction, delegated simple work
- Community verdict: Good auxiliary model, not recommended as main driver
Gemma 4 Series
Gemma 4 (all variants) - Generally NOT recommended for Hermes
- Weaknesses: Poor agentic performance, weak tool calling
- Context limitation: Limited context size on local hardware
- Community verdict: "Tried all Gemma4 models, none was great at Agentic" (u/EmuHefty)
Kimi K2.6
Kimi K2.6 - Solid alternative
- Strengths: Good general reasoning and tool handling
- Best use: Medium-tier tasks, monitoring, scraping
- Community verdict: "Solid all-around" but not the top pick (u/Fair-Yogurtcloset-21)
GPT Series
GPT-5.4 Mini / GPT-5.5 - Premium option
- Strengths: High capability, reliable tool calling
- Weaknesses: "Very chatty" (u/Ryankolp), expensive for daily use
- Best use: Complex tasks where quality matters more than cost
- Community verdict: Good for specific high-value tasks, not as daily driver
GLM 5.1
GLM 5.1 - Mixed results
- Issues: "Model generated invalid tool call" errors reported (u/bef349)
- Status: Overloaded/unstable at time of writing
- Community verdict: Avoid for now, wait for stability improvements
Provider Comparison
Direct API vs OpenRouter
| Factor | Direct API | OpenRouter |
|---|---|---|
| Cost | Usually cheaper (no markup) | Slightly higher prices |
| Caching | Native caching support | Caching may not work as well |
| Model variety | Limited to one provider | Access to many models |
| Reliability | Direct connection, fewer hops | Additional routing layer |
| Best for | Single-model setups | Multi-model experimentation |
Community recommendation: Use the direct API once you've settled on a model. Use OpenRouter during the exploration phase. (u/Maleficent-Anything2)
Ollama Cloud
- Cost: $20/mo Pro subscription
- Models: Access to many high-end models
- Missing: Image generation at time of writing
- Community verdict: "Great for complex tasks" but image gen gap is a limitation (u/aaronmcbaron)
Model Routing Strategies
Community Pattern 1: Tiered Approach (Most Popular)
- Main model: Qwen 3.6-27B or DeepSeek V4 Pro
- Auxiliary model: DeepSeek V4 Flash or MiniMax M2.7
- Upgrade path: Bump to Qwen 3.6-35B or Claude/GPT for complex tasks
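The tiered approach can be sketched as a small lookup. This is an illustration only: the task categories are hypothetical, and the model strings are the display names used in this document, not verified provider model IDs.

```python
from enum import Enum


class Tier(Enum):
    AUXILIARY = "auxiliary"
    MAIN = "main"
    UPGRADE = "upgrade"


# One representative model per tier, taken from the list above.
# These are display names, not provider model IDs.
TIER_MODELS = {
    Tier.AUXILIARY: "DeepSeek V4 Flash",
    Tier.MAIN: "Qwen 3.6-27B",
    Tier.UPGRADE: "Qwen 3.6-35B",
}

# Hypothetical task categories, chosen for illustration.
AUXILIARY_TASKS = {"search", "extraction", "vision"}
COMPLEX_TASKS = {"complex_reasoning", "multi_step_planning"}


def pick_tier(task_type: str) -> Tier:
    """Map a task category to a tier; anything unrecognized goes to the main model."""
    if task_type in AUXILIARY_TASKS:
        return Tier.AUXILIARY
    if task_type in COMPLEX_TASKS:
        return Tier.UPGRADE
    return Tier.MAIN


def pick_model(task_type: str) -> str:
    """Return the display name of the model assigned to this task's tier."""
    return TIER_MODELS[pick_tier(task_type)]
```

Defaulting unknown tasks to the main model keeps the cheap auxiliary tier limited to work it is known to handle.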
Community Pattern 2: Local + Cloud Hybrid (u/trashacct383)
- Local: Qwen 3.6-27B via vLLM for daily work
- Cloud: Claude or GPT for planning and review phases
- Workflow: Plan with cloud model -> execute locally -> QC with cloud model
Community Pattern 3: Orchestrator + Worker (u/An-R-Nguyen)
- Orchestrator profile: Main model handles planning and QC
- Coder profile: Dedicated coding agent that one-shots requests
- Rule: If quality falls below 80%, discard the result and restart rather than patch it
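A minimal sketch of the restart-instead-of-fix rule, assuming the orchestrator can score a result between 0 and 1. The `generate` and `score` callables, the attempt cap, and the function name are all hypothetical placeholders, not part of any Hermes API.

```python
from typing import Callable, Optional


def one_shot_with_restart(
    generate: Callable[[str], str],
    score: Callable[[str], float],
    request: str,
    threshold: float = 0.80,
    max_attempts: int = 3,
) -> Optional[str]:
    """Regenerate from scratch until a result meets the quality threshold.

    Returns the first acceptable result, or None if every attempt falls short.
    """
    for _ in range(max_attempts):
        result = generate(request)  # fresh attempt; earlier output is discarded
        if score(result) >= threshold:
            return result
    return None
```

The attempt cap is an added safeguard so that a task the coder cannot solve does not loop indefinitely.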
Community Pattern 4: Free-Tier Pooling (u/azzbeeter)
- Tool: llm-keypool proxy
- Strategy: Rotate across multiple free-tier API keys from different providers
- Benefit: Zero cost, pooled rate limits
- Warning: Using multiple keys for the same provider may violate its ToS
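The rotation idea can be illustrated with a simple round-robin pool. This is not the llm-keypool implementation or its API; the class and method names are hypothetical, and the sketch deliberately holds one key per provider in line with the ToS warning.

```python
from itertools import cycle
from typing import Iterator


class KeyPool:
    """Round-robin over (provider, api_key) pairs, one key per provider."""

    def __init__(self, keys_by_provider: dict[str, str]) -> None:
        if not keys_by_provider:
            raise ValueError("at least one provider key is required")
        # A dict keyed by provider name cannot hold two keys for one provider.
        self._cycle: Iterator[tuple[str, str]] = cycle(keys_by_provider.items())

    def next_key(self) -> tuple[str, str]:
        """Return the next (provider, api_key) pair in rotation."""
        return next(self._cycle)
```

A real proxy would also need to skip providers that are currently rate-limited; that logic is omitted here.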
Hardware Requirements for Local Models
| Model | Minimum VRAM | Recommended VRAM | Quantization |
|---|---|---|---|
| Qwen 3.6-27B (FP8) | 48GB | 60GB+ | FP8 via vLLM |
| Qwen 3.6-27B (Q8) | 32GB | 48GB | Q8 GGUF via llama.cpp |
| Qwen 3.6-35B (Q4) | 16GB | 24GB | Q4 GGUF via Ollama/llama.cpp |
| MiniMax M2.7 | Varies | Check provider docs | Provider-dependent |
Note on MoE models: You can offload expert layers to CPU for more context, but expect ~50% TPS reduction. (u/Asleep-Land-3914, u/xeeff)
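As a rough cross-check on the table, weight memory can be estimated as parameter count times bytes per weight. The sketch below assumes about 8 bits per weight for FP8 and Q8 and about 4 bits for Q4, and ignores quantization metadata; the function name is illustrative.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB: parameters x bits per weight / 8."""
    return params_billion * bits_per_weight / 8


print(weight_footprint_gb(27, 8))  # 27B at 8 bits per weight
print(weight_footprint_gb(35, 4))  # 35B at 4 bits per weight
```

This covers weights only. KV cache, activations, and quantization overhead come on top and grow with context length, which is presumably why the recommended figures in the table sit well above the raw weight size.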
Model-Specific Issues
Censored vs Uncensored Models
- Issue: Some Qwen variants refuse browser automation on external portals (e.g., school parent portals)
- Solution: Use abliterated/uncensored variants for tasks requiring unrestricted access
- Trade-off: Uncensored models may have slightly reduced accuracy
- See: Model Variants memory note for specific model names
Context Window Limits
- Qwen 3.6-27B: Handles 128k well, gradual degradation past that point
- Ollama-reported context: May be lower than the model's actual limit (e.g., 64k instead of the full window)
- vLLM advantage: Full advertised context available locally
Token Usage Optimization
- Switch models less frequently - each switch requires re-reading chat history
- Keep conversations shorter or start new sessions when switching models
- Use caching-enabled providers (DeepSeek direct API excels here)
- Set compression at ~70% for long-running sessions
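One way to read the ~70% figure is as a context-usage threshold: once a session has consumed that share of the window, compress the history. That reading is an assumption, and the function below is a hypothetical helper rather than a Hermes setting.

```python
def should_compress(used_tokens: int, context_window: int, threshold: float = 0.70) -> bool:
    """Return True once the session has used at least `threshold` of the context window."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    return used_tokens / context_window >= threshold


# With a 128k window, compression would trigger at 89,600 tokens under this reading.
print(should_compress(90_000, 128_000))
print(should_compress(64_000, 128_000))
```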
Community Model Testing Results
From the "What model are you running?" thread (121 responses):
Most mentioned models:
- MiniMax M2.7 - Budget favorite, widely tested
- Qwen 3.6-27B - Local deployment champion
- DeepSeek V4 Flash/Pro - Cloud value leader
- Kimi K2.6 - Solid alternative
- GPT variants - Premium option for specific tasks
Least recommended:
- Gemma 4 series - Consistently poor agentic performance
- GLM 5.1 - Stability issues at time of writing