Collecting eval results for Spark-sized quants of models

That’s interesting - I think I’ve seen that warning too, but hadn’t done much digging into it.

I’ve just set things up on a different machine so it’ll be easier for me to leave these evals running during the day. I’m tweaking the parameters slightly (doing 5 epochs and taking the median result, and increasing the max number of turns and the timeout) in an attempt to reduce the variance between runs, which means I need to re-run the ones I’d previously collected. I’ll do the 35B-FP8 with those options next and post back (it’s been on my list to run one with kv-cache quantization to see what impact it has, so I may as well do it first).

I ran with/without --kv-cache-dtype fp8 and, from the docs, I was expecting the same results (because I assumed auto would use the same dtype as the model weights), but it actually performed slightly better on average:

I feel like this has to just be some variance between the runs. Even though I’m now running 5 epochs and taking the median, these differences don’t seem right.

I think I’m going to have to do some more testing/iterating on number of epochs/samples. I will keep re-running the test with the same model/params a few times until I can be sure the results are coming out fairly consistent.

Thanks for running these tests.

Achieving reproducibility is indeed very challenging. I usually run the task with 8 parallel agents in separate environments, and I thought this would provide enough statistical weight.

However, even with this setup, I sometimes fail to get consistent results when comparing one full run of eight to another.

Before running any more, I’m going to run the same thing a bunch of times to see what kind of variance I see.

InspectAI doesn’t seem to support recording the results for each epoch, it always combines them (even with --no-epochs-reducer, that just prevents merging the same sample across epochs before producing a single overall score), so I’m just running --limit 1-10 --epochs 5 a bunch of times, into a new folder each time, and I’m going to see what difference I get.
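A rough sketch of that repeat loop in Python, in case it is useful to anyone. The task name, model name and folder layout are placeholders I made up, and I am assuming the `inspect eval` CLI takes `--log-dir` to choose where the logs go. The script only builds and prints the commands; `run_all` is there if you want to actually launch them.

```python
import subprocess
from pathlib import Path


def build_commands(task, model, runs, base_dir="logs"):
    """Return one `inspect eval` command per repeat, each with its own log folder."""
    commands = []
    for i in range(1, runs + 1):
        log_dir = Path(base_dir) / f"run-{i:02d}"  # hypothetical folder naming
        commands.append([
            "inspect", "eval", task,
            "--model", model,
            "--limit", "1-10",   # same sample range as in the post above
            "--epochs", "5",
            "--log-dir", str(log_dir),  # assumption: a fresh folder per repeat
        ])
    return commands


def run_all(commands):
    """Run the commands one after another, carrying on if one of them fails."""
    for cmd in commands:
        subprocess.run(cmd, check=False)


# Placeholder task and model names, not real identifiers.
for cmd in build_commands("my_task", "my_provider/my_model", runs=3):
    print(" ".join(cmd))
```

Keeping each repeat in its own folder means the per-run scores can be compared afterwards without one run overwriting another.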

If the difference is large, I’ll go through the logs and try to understand whether it’s really the model, or if there are other factors (like timeouts, or other environmental issues) that might be occurring inconsistently.

I guess if it comes to it, I can just keep increasing the number of epochs. At some point it’s gotta be high enough that the median is a fair reflection of the model and should be fairly repeatable.
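To get a feel for how many epochs that might take, here is a toy simulation. It assumes every attempt at a sample is an independent pass/fail coin flip with the same pass probability, which is certainly a simplification of how a real agent run behaves. All names and numbers are illustrative, not measured.

```python
import random
import statistics


def median_of_epochs(p_pass, epochs, rng):
    """Score one sample as the median of `epochs` pass/fail (1/0) attempts."""
    attempts = [1 if rng.random() < p_pass else 0 for _ in range(epochs)]
    return statistics.median(attempts)


def run_score(p_pass, samples, epochs, rng):
    """Overall score of one run: mean of the per-sample medians."""
    return statistics.mean(
        [median_of_epochs(p_pass, epochs, rng) for _ in range(samples)]
    )


def spread(p_pass, samples, epochs, repeats=200, seed=0):
    """Standard deviation of the overall score across repeated identical runs."""
    rng = random.Random(seed)
    scores = [run_score(p_pass, samples, epochs, rng) for _ in range(repeats)]
    return statistics.pstdev(scores)


# Assumed pass probabilities: 0.8 (fairly reliable) and 0.5 (a true coin flip).
for p_pass in (0.8, 0.5):
    for epochs in (1, 5, 11, 21):
        print(p_pass, epochs, round(spread(p_pass, 10, epochs), 3))
```

Under these assumptions, more epochs tighten the run-to-run spread when a sample is clearly more likely to pass than fail (or the reverse). When the per-sample pass rate sits near 50%, the median of an odd number of epochs is still a coin flip, so extra epochs help much less.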

I ran a test of a small set of tasks (the same benchmarks, but only the first 5 tests in each). Each run is 5 epochs, taking the median result. I ran the whole thing 5 times.

Here are the results… Pretty inconsistent. Because I only did 10 tasks, for pass/fail results each task is worth 10% (hence some round numbers), but I’m surprised that the variance is so high.
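For a sense of scale, the arithmetic behind those round numbers can be written out. This is a back-of-the-envelope sketch that treats each task as an independent pass/fail trial with the same pass rate (an assumption, the real tasks differ in difficulty); the function names are mine.

```python
import math


def score_step(n_tasks):
    """Smallest possible change in the score, in percentage points."""
    return 100 / n_tasks


def standard_error(p_pass, n_tasks):
    """Binomial standard error of the pass rate, in percentage points."""
    return 100 * math.sqrt(p_pass * (1 - p_pass) / n_tasks)


# Illustrative task counts and an assumed 50% true pass rate.
for n_tasks in (10, 100, 1000):
    print(n_tasks, score_step(n_tasks), round(standard_error(0.5, n_tasks), 1))
```

With only 10 tasks, a single task flipping between pass and fail moves the score by 10 points, and under this model the expected run-to-run noise is of a similar size.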

I’m going to repeat the whole experiment again with 10 epochs, and collect the full results this time (I don’t know if it’s the flags I used, or because I used the JSON results, but the output doesn’t contain the actual LLM conversations so I can’t review these… run 5 was terrible for the AssistantBench ones).

If this is normal, I think I’m going to give up :-)

Danny, I do appreciate your progress updates and read them with great interest.

Sometimes the stochastic nature of these models is perplexing. My favourite example is when one agent randomly overwrites input files, then another one thinks the output is the input and goes on a choose-your-own-adventure. You never know where it will turn up and what it will be doing when you finally discover it.

I have the agents pretty buttoned down now, but every now and then one session will really surprise me. Not as often now, but it still happens enough. I suspect something like this is happening inside your tests. And when it does, the model will just fail abysmally to stay on task.

Indeed, but this is why I was running 5 epochs and taking the median result. It should allow for things to go off the rails now and then but still get a reasonable average. If they are so random that in 5 runs the median result is not actually a good average, that concerns me (although, I’ll note that from the logs I’ve been watching, I don’t have a great amount of confidence in the benchmark scoring, and it’s very possible they are coming to correct answers, but subtly different so the scoring doesn’t work).

I’m re-running 10 lots now, with 10 epochs each (just for the AssistantBench ones which had the greatest variance), and I think I have the transcripts being logged (I can’t verify until one finishes), so hopefully I’ll have better info to figure this out.

If it does turn out to be bad scoring on the benchmarks, I’ll probably just drop those benchmarks and start doing testing of benchmarks for consistency before resuming testing models 😆

Alex Ziskind had an interesting test in his latest video.

Yeah, I saw that - it’s an interesting test, but right now I’m using benchmarks that are already integrated into Inspect AI because it’s easy to run many of them all from a single command and with the same results format.

(although, at least one of my issues was caused by the Inspect AI integration of a benchmark, so if it turns out this is the cause of some of the failures, I might rethink that)

Well, I see now why some of these are so inconsistent… One of these benchmarks is checking if the model can produce a specific URL:

I don’t know if the intention was that there would be some kind of web use tool, but Inspect AI doesn’t appear to have given it one - so it just makes up a URL which is wrong (although sometimes the scores are non-zero, like it’s scoring against segments or something). I don’t think it’s expecting the model to just know the URL, because the AssistantBench website specifically talks about web agents. I think this is probably a badly-integrated benchmark (or, it’s not intended to be run from the CLI like this but from code that provides tools).
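To illustrate how a made-up URL could still pick up a non-zero score, here is a toy overlap scorer. It is only a guess at the kind of "scoring against segments" described above, not the benchmark's or Inspect AI's actual scorer, and the URLs are invented examples.

```python
import re
from collections import Counter


def segments(text):
    """Split a string into lowercase alphanumeric segments."""
    return [s for s in re.split(r"[^a-z0-9]+", text.lower()) if s]


def segment_f1(predicted, expected):
    """F1 overlap between the segments of two strings (hypothetical scorer)."""
    pred = Counter(segments(predicted))
    gold = Counter(segments(expected))
    overlap = sum((pred & gold).values())  # segments shared by both strings
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


# Invented URLs: the guess is wrong but shares scheme, domain and one word.
expected = "https://example.com/gyms/downtown-fitness"
guessed = "https://example.com/locations/downtown-gym"
print(round(segment_f1(guessed, expected), 3))
```

If the real scorer does anything along these lines, a hallucinated URL on the right domain would earn partial credit on some runs and none on others, which would add noise unrelated to whether the model could actually find the answer.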

I think I’m gonna have to reconsider the benchmark selection, and maybe reconsider whether Inspect AI is the best way to run them.