I recently implemented a feature here on my own blog that uses OpenAI's GPT to help me correct spelling and punctuation in posted blog comments. Because I was curious, and because the scale is so small, I take the same prompt and fire it off three times, once per model. The pseudocode looks like this:
for model in ("gpt-5", "gpt-5-mini", "gpt-5-nano"):
response = completion(
model=model,
api_key=settings.OPENAI_API_KEY,
messages=messages,
)
record_response(response)
The price difference is large. That's easy to measure; it's on their pricing page.
The quality of the responses is harder to measure. I'm still working on that, using my personal judgement to compare the various results.
But the speed difference is fairly large. I measure how long each call takes (there's a timing sketch below the table), so I can calculate the median (P50) and the 90th percentile (P90). The current results are:
    model      |   p50 |   p90
    -----------+-------+-------
    gpt-5      | 27.35 | 43.85
    gpt-5-mini |  9.81 | 16.00
    gpt-5-nano | 24.38 | 32.99
That's in seconds. The smaller the better.
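The timing itself is nothing fancy. Here's a minimal sketch of how each call can be timed, assuming a litellm-style completion() and that record_response is extended (hypothetically) to persist the duration as the took_seconds column used in the query further down:

    import time

    from litellm import completion  # assumption: litellm-style client

    for model in ("gpt-5", "gpt-5-mini", "gpt-5-nano"):
        t0 = time.perf_counter()
        response = completion(
            model=model,
            api_key=settings.OPENAI_API_KEY,
            messages=messages,
        )
        # hypothetical extension of record_response that stores the
        # wall-clock duration of this one call
        record_response(response, took_seconds=time.perf_counter() - t0)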
Caveat: I still consider myself a noob when it comes to using the OpenAI API. What I have is a relatively simple application and the amount of money spent is pennies. There might be ways to tune this. Also, at this point I only have about 40 data points but I'll analyze it again in the future when I have more.
Comments
Note-to-self: the query:
    select
        model,
        count(*),
        percentile_cont(0.5) within group (order by took_seconds) as p50,
        percentile_cont(0.9) within group (order by took_seconds) as p90
    from llmcalls_llmcall
    group by model;
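And if you'd rather compute the same percentiles in Python, here's a sketch using only the standard library (statistics.quantiles with method="inclusive" interpolates linearly, like Postgres's percentile_cont; the durations in the example are made up for illustration):

    import statistics

    def p50_p90(durations: list[float]) -> tuple[float, float]:
        # n=10 yields 9 decile cut points; index 4 is the median (P50)
        # and index 8 is P90.
        deciles = statistics.quantiles(durations, n=10, method="inclusive")
        return deciles[4], deciles[8]

    # example with made-up per-call durations in seconds
    print(p50_p90([9.8, 16.0, 24.4, 27.3, 33.0, 43.8]))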