For months, it's been bothering me: how do I really know if changes to my 3D modelling agent are making it better? Evals (evaluations) are the answer, but building the right kind of eval system for a complex AI agent isn't straightforward.
Evals, for me, are repeatable experiments over the whole agent. Each one uses a prompt set and a specific configuration, runs the workflow end to end, and produces scores I can compare across runs. Each run results in artefacts (code, preview, 3MF) and the metrics that matter, so a change either wins or loses instead of me guessing.
Because they're standardised and easy to run, I started to lean on them heavily. At the time of writing, the suite has produced 1050 generations across 29 runs for $186.26 total (about $6.42 per run). Wall time: 2664m 13s in total, roughly 91m 52s per run. The scale isn't huge, but it's enough signal to steer day-to-day iteration.
I first tried a simple "battleground" where I could compare two different models side-by-side. It was a start, but it quickly fell apart. My agent isn't just a single LLM call. It's a multi-step workflow. What if I want to use a faster, cheaper model just for repairing generated code? What if I tweak the system prompt? The battleground was too rigid.

Existing eval frameworks didn't fit either. Most are designed for a simple call -> response flow. They also focus on generic metrics like "hallucinations." Frankly, I don't care about hallucinations. I care about whether the agent produces a great, printable 3D model. This meant I needed a custom solution.
So, the core problem was building a flexible evaluation system: one that could exercise the whole multi-step workflow, swap out any model or prompt at any step, and score results on the metrics I actually care about.
After a lot of thought, I landed on a solution: the git commit is the single source of truth.
Any change I want to test—a new model, a different prompt, a refactored workflow—gets committed. I can then run an eval against that specific commit. This encapsulates the entire state of the agent and its configuration. Simple and powerful.
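Concretely, a run record only makes sense with the commit hash attached. Something like this shape (the field names here are illustrative, not my actual schema):

```typescript
// Illustrative shape of one eval run, pinned to the commit it was built from.
interface Generation {
  prompt: string;
  artefacts: { code: string; previewPng: string; threeMf: string };
  metrics: { adherence: number; aesthetics: number; seconds: number; usd: number };
}

interface EvalRun {
  commitHash: string; // the single source of truth: agent code + config
  promptSet: "base" | "extra";
  generations: Generation[]; // one entry per prompt x repeat
}

// A stable, human-readable id for a run: short commit hash + prompt set.
function runId(run: EvalRun): string {
  return `${run.commitHash.slice(0, 7)}-${run.promptSet}`;
}
```

Everything else (scores, comparisons, the evals page) hangs off that commit hash, so two runs are comparable exactly when I know what code produced them.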
But what are we actually measuring? For generating 3D models, I defined a few key metrics: prompt adherence, aesthetics, generation speed, and cost.
The first two are subjective. To handle this, I use a two-pronged approach: an AI judge scores each model, and I rate them myself.
To keep my own ratings consistent, I created a clear scale. I think this is super important, because six months from now, I'll have forgotten what a "0.3" in aesthetics meant.
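In code, such a scale might look something like this (the band descriptions below are just an illustration, not my real rubric, apart from the 0.4 "would I show this to a user" cutoff):

```typescript
// Illustrative rating bands for adherence -- ascending cutoffs, each with a
// plain-language description so future-me knows what a number meant.
const ADHERENCE_BANDS: Array<[number, string]> = [
  [0.0, "ignores the prompt entirely"],
  [0.2, "recognisable attempt, key elements missing"],
  [0.4, "acceptable: I'd be okay showing this to a user"],
  [0.7, "good: matches the prompt with minor flaws"],
  [0.9, "excellent: exactly what was asked for"],
];

// Map a numeric score to its band description.
function describe(score: number): string {
  let label = ADHERENCE_BANDS[0][1];
  for (const [cutoff, text] of ADHERENCE_BANDS) {
    if (score >= cutoff) label = text;
  }
  return label;
}
```

Keeping the rubric in the repo means it's versioned with everything else, so a "0.3" from six months ago is still interpretable.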
Okay, so now I have all these metrics. How do I combine them into a single score to tell me if a change is good or bad? With a weighted equation.
This lets me tune what I care about most. For example, I could make human-rated adherence the most important factor and speed secondary, ignoring everything else.

Or I could make cost twice as important as speed.

Here's my current setup for making decisions. The priorities: human-rated adherence matters most, and my own scores count for much more than the AI's.
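The weighted equation itself is just a dot product of weights and normalised metrics. A sketch (the exact numbers below are illustrative, but they reflect the shape of my setup: human adherence dominates, the AI's rating counts for less, cost twice as much as speed):

```typescript
// Per-generation metrics, all normalised to 0..1 where higher is better.
interface Scores {
  humanAdherence: number; // my own rating
  aiAdherence: number;    // the AI judge's rating
  aesthetics: number;
  speed: number;          // e.g. 1 - (seconds / maxSeconds), clamped
  cost: number;           // e.g. 1 - (usd / maxUsd), clamped
}

// Illustrative weights (sum to 1): human adherence dominates, the AI's
// rating counts for less, and cost is twice as important as speed.
const WEIGHTS: Scores = {
  humanAdherence: 0.5,
  aiAdherence: 0.1,
  aesthetics: 0.15,
  speed: 0.083,
  cost: 0.167,
};

function weightedScore(s: Scores): number {
  let total = 0;
  for (const key of Object.keys(WEIGHTS) as (keyof Scores)[]) {
    total += WEIGHTS[key] * s[key];
  }
  return total;
}
```

Changing what I care about is now a one-line edit to `WEIGHTS`, and the change is committed, so old runs can be re-scored under the new priorities.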
One hard-earned lesson from rating 1000+ models: my own scoring drifts. Even with the rubric on screen I often ended up doing a quick pass/fail at the 0.4 threshold (would I be okay showing this to a user or not). That pushed me towards a simpler human signal. I am experimenting with a binary yes/no for adherence, then letting the weighted score use that as a strong feature. It reduces noise, speeds reviews, and matches how I actually make decisions under time pressure. The AI can keep a continuous score; my rating can be crisp and auditable.
Let's see it in action. I recently changed the agent's "reasoning effort" to "high," but I wasn't sure it was actually helping. It was definitely slower and more expensive. Could I get away with "medium" or "low" effort and cut generation time in half?
I have two sets of prompts I use for evals: a "base" set with 7 prompts and an "extra" set with 27. Each prompt is run 3 times to account for variability, so an extra run generates 81 models. I usually start with the base set to get a general idea. If the results are too close to call I run the full extra set. It's not a huge sample size, but it's way better than flying blind. It already costs about $5 per base run and about $15 per extra run, and this is a side project...
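The bookkeeping is trivial but worth pinning down: a run is just the cross product of prompts and repeats. A tiny sketch:

```typescript
// Build the job list for one eval run: every prompt, repeated N times.
function buildRunMatrix(
  prompts: string[],
  repeats: number
): Array<{ prompt: string; attempt: number }> {
  const jobs: Array<{ prompt: string; attempt: number }> = [];
  for (const prompt of prompts) {
    for (let attempt = 1; attempt <= repeats; attempt++) {
      jobs.push({ prompt, attempt });
    }
  }
  return jobs;
}
```

With the extra set that's 27 prompts × 3 attempts = 81 jobs, which is where the per-run cost comes from.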
The results were clear, and the winner was obvious: medium reasoning effort. It matched High's quality at half the cost and time. You can see all the results on the evals page.
Since then GPT-5.1 dropped (18.11.2025). Early runs on the same "medium" reasoning setting show 2-3x faster completions than GPT-5 with slightly lower quality on adherence and aesthetics. On overall capability it sits between Gemini 2.5 Pro and GPT-5 for my workload. Given the speed, I now prefer GPT-5.1 for model generation when I need fast iteration, and fall back to GPT-5 when that last bit of quality matters. Both stay in the harness so the scoreboard decides, not vibes.

I'm constantly tweaking variables like these and re-running evals:
```typescript
const INITIAL_MODEL = "openai/gpt-5";
const INITIAL_REASONING_EFFORT = "medium";
const CODE_ERRORS_MAX = 4;
const FIX_MODEL = "openai/gpt-5-mini";
const FIX_REASONING_EFFORT = "low";
const VISUAL_ADHERENCE_MIN = 0.4;
const VISUAL_ERRORS_MAX = 1;
const VISUAL_EVALUATION_MODEL = "2.5-flash";
const VISUAL_FIX_MODEL = "openai/gpt-5";
const VISUAL_FIX_REASONING_EFFORT = "low";
```
You might be wondering about those VISUAL_ variables. This is where things get interesting. I implemented a "visual check" feature where an AI model looks at a rendered image of the 3D model and scores its prompt adherence. If the score is too low, the agent tries again.
My gut feeling was that it wasn't helping enough to justify the extra time and cost. It was making generations take three times longer. Running the extra prompt set confirmed this suspicion. The visual checks weren't consistent enough. Why? Was the AI's rating just random? Was my threshold too low, causing unnecessary retries?
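For context, the loop those `VISUAL_` constants drive looks roughly like this. The `gen` and `judge` callbacks below are stand-ins for my real generate and render-plus-judge steps:

```typescript
type Gen = (prompt: string, feedback?: string) => Promise<string>;
type Judge = (modelCode: string, prompt: string) => Promise<number>;

const VISUAL_ADHERENCE_MIN = 0.4; // below this, the agent retries
const VISUAL_ERRORS_MAX = 1;      // how many retries are allowed

// Generate a model, judge a render of it, and retry while the score is too
// low. A sketch of the visual-check loop, not the actual implementation.
async function generateWithVisualCheck(
  prompt: string,
  gen: Gen,
  judge: Judge
): Promise<string> {
  let code = await gen(prompt);
  for (let retries = 0; retries < VISUAL_ERRORS_MAX; retries++) {
    const score = await judge(code, prompt);
    if (score >= VISUAL_ADHERENCE_MIN) break; // good enough, stop retrying
    code = await gen(prompt, `previous attempt scored ${score.toFixed(2)}`);
  }
  return code;
}
```

Every retry is another full generation plus a render and a judge call, which is exactly why a noisy judge triples the wall time.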
I'd heard of linear regression but never had a real use for it. Until now. I have pairs of scores for the same model: (AI score, human score). This is a perfect fit.
In simple terms, linear regression finds the straight line that best fits a set of data points. In my case, the data points are the (AI score, human score) pairs. If the AI is a good judge, its scores should have a linear relationship with my scores. For example, when the AI gives a 0.2, I might consistently give a 0.3. When it gives a 0.8, I might give a 0.7. Linear regression finds the formula for that line.
This is useful for two reasons. First, it lets me train a tiny model to predict my score based on the AI's score. More importantly, it gives me a value called R-squared (R²). This tells me how well the AI scores predict the human scores. An R² of 1 means a perfect prediction; 0 means no correlation at all. I'd be happy with anything over 0.4.
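The fit itself is only a few lines of ordinary least squares over the (AI score, human score) pairs. A sketch:

```typescript
// Ordinary least squares fit of human = a * ai + b, plus R-squared.
// Assumes the AI scores are not all identical (sxx would be zero).
function linearFit(
  pairs: Array<[number, number]> // [aiScore, humanScore]
): { a: number; b: number; r2: number } {
  const n = pairs.length;
  const meanX = pairs.reduce((s, [x]) => s + x, 0) / n;
  const meanY = pairs.reduce((s, [, y]) => s + y, 0) / n;

  // Slope and intercept from the covariance and variance of the inputs.
  let sxy = 0;
  let sxx = 0;
  for (const [x, y] of pairs) {
    sxy += (x - meanX) * (y - meanY);
    sxx += (x - meanX) ** 2;
  }
  const a = sxy / sxx;
  const b = meanY - a * meanX;

  // R^2 = 1 - (residual sum of squares / total sum of squares).
  let ssRes = 0;
  let ssTot = 0;
  for (const [x, y] of pairs) {
    ssRes += (y - (a * x + b)) ** 2;
    ssTot += (y - meanY) ** 2;
  }
  return { a, b, r2: 1 - ssRes / ssTot };
}
```

Feed it every (AI, human) pair from a run and `r2` tells you, in one number, how much the AI judge's opinion tracks mine.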
What I found: the best judge for my scores was 2.5-flash. After tuning the visual checks with this data, I ran the benchmarks again.

The initial results showed some improvement in the "pass rate" - the number of models with an adherence score above 0.4. But when I dug deeper with the extra prompt set, the R² was actually closer to 0.1 (or 10%). That's... not great. It means the AI's rating is a poor predictor of my own.
So for now, I've disabled the visual checks. It's not just a matter of waiting for the underlying vision models to get better. There are other things I need to explore, like improving the system prompt for the visual check LLM or rendering the model from multiple angles to give the AI a better view. The good news is, I have the entire system ready to go, so I can easily test these ideas and flip the switch when the checks become reliable enough.
This system isn't perfect, of course. The test dataset is small, so the results aren't statistically bulletproof, but it's a great start. The human evaluation is also just me and sometimes my girlfriend, so there's personal bias. Finally, the AI evaluation is still a work in progress; an R² of 0.1 shows there's a long way to go.
One more caveat: not all historical evals were configured identically. Early on I discovered a few implementation mistakes and some runs used slightly different accounting, so certain costs or timings are off. The /evals page is still useful, but treat older entries as indicative rather than precise. I'm keeping them for posterity and will start tagging them as "Legacy" as the framework stabilises so comparisons stay honest.
This evaluation system has been critical for making informed decisions. It replaces guesswork with a data-driven approach, which is essential when dealing with so many variables.
I can now systematically test hypotheses and quantify the impact of changes.
The framework allows for methodical, iterative improvement. For a solo developer building a complex system, having a robust testing harness like this isn't just a nice-to-have, it's a must.