
New LLM benchmark: llmtester

themarkymark

Published: 07 Apr 2026 · Updated: 07 Apr 2026

I just published a new LLM benchmark tool called llmtester.

Over 50,000 tests across 13 categories!

This is the easiest way to benchmark LLMs and see where they go wrong.

  • Interactive CLI - Keyboard-driven benchmark selection and configuration
  • Multi-Provider Support - OpenAI, Anthropic, Together.ai, Groq, Fireworks AI, Perplexity, OpenRouter, and any OpenAI-compatible API
  • LLM-as-Judge - Optional secondary model evaluation for code, math, SQL, bash, and truthfulness benchmarks
  • Progress Tracking - Resume interrupted evaluations from where you left off
  • Result Explorer - Built-in TUI to browse past results, filter by pass/fail, and inspect individual responses
  • Config Persistence - Saves provider, endpoint, and model settings between runs
  • Shuffle & Sampling - Run a percentage of each benchmark with optional shuffling for diverse distribution
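The Shuffle & Sampling option can be pictured with a short sketch. This is not llmtester's code: `sampleTests`, the `mulberry32` seeded generator, and the default seed of 42 are assumptions for illustration. It shows one way to run a percentage of a benchmark with an optional, reproducible Fisher-Yates shuffle.

```typescript
// Hypothetical sketch: pick a percentage of a benchmark's test cases,
// optionally shuffled with a seeded PRNG so the same subset can be reproduced.
// This is NOT llmtester's actual implementation.

// mulberry32: a small seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Return roughly `percent`% of `tests` (rounded up, capped at the full set).
function sampleTests<T>(tests: readonly T[], percent: number, shuffle: boolean, seed = 42): T[] {
  if (percent <= 0) return [];
  const pool = [...tests];
  if (shuffle) {
    const rand = mulberry32(seed);
    // Fisher-Yates: walk backwards, swapping each item with a random earlier one.
    for (let i = pool.length - 1; i > 0; i--) {
      const j = Math.floor(rand() * (i + 1));
      [pool[i], pool[j]] = [pool[j], pool[i]];
    }
  }
  const count = Math.min(pool.length, Math.ceil((pool.length * percent) / 100));
  return pool.slice(0, count);
}
```

Seeding the shuffle means two models can be run on the identical subset, which keeps a partial run comparable across models while still spreading the sample across the whole benchmark instead of only its first entries.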

Run the latest version without installing anything: npx llmtester

What makes llmtester so awesome?

50,000+ tests, a second LLM that judges results on the more complex tests, and the ability to fully explore past results.

I got tired of doing everything manually, so I built a full test runner.

It includes tests across many domains: grade-school math, advanced math, reasoning, programming, and even SQL. Some of these tests are impossible to grade without a judge, and llmtester handles all of that for you!
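A judge step like this can be sketched as two pieces: building a grading prompt and reading the judge's verdict. The names `buildJudgePrompt` and `parseVerdict`, the prompt wording, and the PASS/FAIL reply format are all hypothetical, not llmtester's actual prompt or API; the call to the judge model itself is left out.

```typescript
// Hypothetical sketch of the LLM-as-judge pattern; not llmtester's code.
// Any chat-completion client could send the prompt built here to a judge
// model and pass the reply text to parseVerdict.

interface JudgeInput {
  question: string;
  reference: string; // expected answer from the benchmark
  candidate: string; // response from the model under test
}

type Verdict = "PASS" | "FAIL";

function buildJudgePrompt({ question, reference, candidate }: JudgeInput): string {
  return [
    "You are grading an answer to a benchmark question.",
    `Question:\n${question}`,
    `Reference answer:\n${reference}`,
    `Candidate answer:\n${candidate}`,
    "End your reply with PASS if the candidate is correct, otherwise FAIL.",
  ].join("\n\n");
}

// Read the verdict from the judge's last non-empty line. Returns null when the
// reply does not follow the requested format, so the caller can retry or flag it.
function parseVerdict(reply: string): Verdict | null {
  const lines = reply.split(/\r?\n/).map((l) => l.trim()).filter((l) => l.length > 0);
  if (lines.length === 0) return null;
  const matches = lines[lines.length - 1].toUpperCase().match(/\b(PASS|FAIL)\b/g);
  if (!matches) return null;
  return matches[matches.length - 1] as Verdict;
}
```

Only the last line is inspected so that the judge's reasoning (which may mention both words) does not get mistaken for the verdict.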

You can find the package on npm and GitHub.

Written by

Browncoat | Meme Connoisseur | Bitcoin Evangelist | Dev | Gamer | Technical Samurai | AI Nerd | Hive Witness | Black Belt in Dad Jokes
