Understanding the creation and purpose of this benchmark
Inspired by Simon Willison's pelican-riding-a-bicycle benchmark.
The original prompt that started this project was:
Simon Willison has created an amusing yet useful benchmark for LLMs. He gives them the prompt 'Generate an SVG of a pelican riding a bicycle' and displays the results. This benchmark has attracted a lot of attention on Hacker News and elsewhere. However, some people have suggested that LLM makers might be training the models to perform well on this benchmark. I would therefore like to create a similar set of benchmarks that no one has used previously. Like Simon's, the prompt should be of the form 'Generate an SVG of [A] [doing] [B].' [A] should be a natural, organic, living being with a complex shape (moose, starfish, etc.). [B] should be an inorganic, manmade object with a complex shape (picnic bench, bulldozer, Ferris wheel, etc.). And [doing] should be an action for which it would be surprising for beings of category [A] to do with respect to [B]. Please suggest 30 prompts of that form for different, original combinations of [A], [doing], and [B]. Do not repeat any entries in the three categories.
After the 30 prompts had been generated, the following meta prompt was used to produce the Claude Code implementation prompt:
Great. Now create a prompt that I can give to Claude Code to generate all of those images through OpenRouter using the following models: moonshotai/kimi-k2-thinking, anthropic/claude-sonnet-4.5, x-ai/grok-code-fast-1, google/gemini-2.5-pro, deepseek/deepseek-v3.2-exp, z-ai/glm-4.6, qwen/qwen3-vl-235b-a22b-thinking. After it has created those images, it should create a static website allowing users to view and compare the images for each model. Below each image should be the model name. Include a placeholder for my OpenRouter API key in the prompt. The website should be self-contained, using only HTML, JavaScript, and CSS, with no external dependencies. One page on the site should include my original prompt to you above, this meta prompt, and the prompt you create for Claude Code to explain how the site was created.
The complete prompt given to Claude Code to implement this benchmark system:
Create a comprehensive SVG generation benchmark system using OpenRouter to test multiple LLM models. Follow these steps:

**Step 1: Generate SVG Images**

Using the OpenRouter API with API key `YOUR_OPENROUTER_API_KEY_HERE`, generate SVG images for each of the following 30 prompts across 6 different models:

**Models to test:**
- anthropic/claude-sonnet-4.5
- x-ai/grok-code-fast-1
- google/gemini-2.5-pro
- deepseek/deepseek-v3.2-exp
- z-ai/glm-4.6
- qwen/qwen3-vl-235b-a22b-thinking

**Prompts:**
1. Generate an SVG of an octopus operating a pipe organ
2. Generate an SVG of a giraffe assembling a grandfather clock
3. Generate an SVG of a starfish driving a bulldozer
4. Generate an SVG of a moose conducting a carousel
5. Generate an SVG of a flamingo repairing a telescope
6. Generate an SVG of a hedgehog playing an accordion
7. Generate an SVG of a jellyfish piloting a Ferris wheel
8. Generate an SVG of an elephant typing on a typewriter
9. Generate an SVG of a chameleon tuning a grand piano
10. Generate an SVG of a penguin juggling chainsaws
11. Generate an SVG of a sloth steering an excavator
12. Generate an SVG of a dragonfly balancing a chandelier
13. Generate an SVG of a rhinoceros painting a lighthouse
14. Generate an SVG of a seahorse examining a microscope
15. Generate an SVG of a peacock spinning a pottery wheel
16. Generate an SVG of a kangaroo climbing a radio tower
17. Generate an SVG of a lobster polishing a harp
18. Generate an SVG of a porcupine pushing a lawnmower
19. Generate an SVG of a gecko installing a satellite dish
20. Generate an SVG of an iguana carving a totem pole
21. Generate an SVG of an armadillo lifting a drawbridge
22. Generate an SVG of a mantis studying a sextant
23. Generate an SVG of an ostrich pulling a rickshaw
24. Generate an SVG of a squid disassembling a printing press
25. Generate an SVG of a butterfly inspecting a steam engine
26. Generate an SVG of a crab descending a fire escape
27. Generate an SVG of a venus flytrap swallowing a street lamp
28. Generate an SVG of coral cleaning a ship's wheel
29. Generate an SVG of a sea anemone threading a loom
30. Generate an SVG of an orchid supporting a pergola

For each model and prompt combination, make an API call to OpenRouter and save the generated SVG code to appropriately named files (e.g., `model1_prompt1.svg`, `model1_prompt2.svg`, etc.). Extract only the SVG code from the response. Handle errors gracefully and log any failures.

**Step 2: Create Model Metadata**

Create a metadata object with information for each model including:
- Model name (display name)
- Model size (parameters - research if needed, or use "TBD" for unknown)
- Release date (research if needed, or use "TBD" for unknown)

**Step 3: Build Static Website**

Create a self-contained static website with the following features:

**Main Gallery Page (index.html):**
- Grid layout showing all 30 prompts
- For each prompt, display SVGs from all 6 models side-by-side
- Below each SVG: model name
- Model metadata displayed once at the top of the page
- Responsive design that works on desktop and mobile
- Simple navigation and clean styling
- Include prompt text above each row of model outputs

**About/Meta Page (about.html):** Include three sections:
1. **Original Prompt to Claude:** The prompt that started this project
2. **Meta Prompt:** The prompt asking for this Claude Code prompt
3. **Claude Code Implementation Prompt:** This entire prompt

**Technical Requirements:**
- Use only vanilla HTML, CSS, and JavaScript (no frameworks or external libraries)
- Embed SVG files as data or inline them in the HTML
- Use CSS Grid or Flexbox for layout
- Include basic styling with good contrast and readability
- Make the site fully functional when opened from local filesystem (file://)
- Include error handling for missing SVGs

**File Structure:**
```
benchmark/
├── index.html (main gallery)
├── about.html (meta documentation)
├── styles.css (shared styles)
├── svgs/
│   ├── anthropic_claude-sonnet-4.5_prompt1.svg
│   ├── x-ai_grok-code-fast-1_prompt1.svg
│   └── ... (180 generated SVGs total)
└── README.md (instructions)
```

Implement rate limiting and error handling for the API calls. Log progress as you generate each SVG. Create a summary report of successes and failures at the end.
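The generation loop the prompt describes can be sketched in Python. This is an illustrative sketch, not the script Claude Code actually produced: the `extract_svg` and `run_benchmark` helper names, the fixed-delay rate limiting, and the file-naming scheme are assumptions; it targets OpenRouter's OpenAI-compatible chat completions endpoint.

```python
# Illustrative sketch of Step 1 (not the script Claude Code produced).
# Assumes OpenRouter's OpenAI-compatible chat completions endpoint.
import json
import os
import re
import time
import urllib.request

API_URL = "https://openrouter.ai/api/v1/chat/completions"
API_KEY = os.environ.get("OPENROUTER_API_KEY", "YOUR_OPENROUTER_API_KEY_HERE")


def extract_svg(text):
    """Return the first <svg>...</svg> block in a model response, or None."""
    match = re.search(r"<svg\b.*?</svg>", text, re.DOTALL | re.IGNORECASE)
    return match.group(0) if match else None


def generate_svg(model, prompt):
    """Make one OpenRouter call and return the extracted SVG code (or None)."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    request = urllib.request.Request(API_URL, data=payload, headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    })
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return extract_svg(body["choices"][0]["message"]["content"])


def run_benchmark(models, prompts, out_dir="svgs", delay_seconds=2.0):
    """Generate every model/prompt combination, logging progress and failures."""
    os.makedirs(out_dir, exist_ok=True)
    failures = []
    for model in models:
        safe_model = model.replace("/", "_")  # model IDs contain "/"
        for index, prompt in enumerate(prompts, start=1):
            try:
                svg = generate_svg(model, prompt)
                if svg is None:
                    raise ValueError("no <svg> block in response")
                path = os.path.join(out_dir, f"{safe_model}_prompt{index}.svg")
                with open(path, "w", encoding="utf-8") as handle:
                    handle.write(svg)
                print(f"ok: {path}")
            except Exception as error:  # log the failure and keep going
                failures.append((model, index, str(error)))
                print(f"failed: {model} prompt {index}: {error}")
            time.sleep(delay_seconds)  # crude fixed-delay rate limiting
    print(f"done: {len(failures)} failure(s)")
    return failures
```

`run_benchmark` is defined but not invoked here; with a real key exported as `OPENROUTER_API_KEY`, calling it with the six model IDs and the thirty prompts would write the 180 SVG files into `svgs/` and return the list of failures for the summary report.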
This benchmark system was fully implemented by Claude Code (Anthropic's CLI tool for Claude). The implementation included:
Models Tested:
Note: Kimi K2 Thinking was initially planned but was excluded during implementation because of extremely slow response times (60+ seconds per request). The final benchmark covers 9 models: the 6 original models plus Gemini 3.0 Pro Preview, GPT-5.1, and Claude Opus 4.5, which were added later. All 9 achieved 100% success rates (30/30 prompts) after targeted regeneration of initially failed SVGs.
This benchmark serves several purposes:
The benchmark is inspired by and extends Simon Willison's pioneering work in creating simple yet effective benchmarks for evaluating LLM capabilities. You can find Simon's original pelican-riding-a-bicycle benchmark and other experiments on his blog at simonwillison.net.