Slow inference with Gemma 4 31B model? Optimizations?

You probably should have done your research and never bought them in the first place. If you can host and afford an industrial-grade rack server, you shouldn’t have considered Sparks, as they are home/dev boxes for individuals. I can put my two Sparks in a backpack and work while travelling by plane. Even four would fit. You can’t do that with a server. That’s the whole difference, mate. But at the scale of 4+ Sparks the economics stop making any sense. Just rent a cluster on Lambda or Runpod and run your tasks. You don’t need it running 24x7, and you pay by the minute. Spark is for enthusiasts and tinkerers; it’s far from plug-and-play, and it’s not for a corporate buyer who wants to cheap out on boxes.

Check whether you’re suffering from the power delivery bug; that seems extra low.
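If it helps to put a number on "extra low" before and after any fix, here is a minimal sketch for timing a streamed generation. Everything in it is an assumption for illustration: `measure_tps` and `fake_stream` are hypothetical names, and `fake_stream` only stands in for whatever streaming client your server exposes. Note that the figure includes time to first token, so it will read a bit lower than pure decode speed.

```python
import time
from typing import Callable, Iterable


def measure_tps(generate: Callable[[], Iterable[str]]) -> float:
    """Return tokens per second for one streamed generation.

    `generate` is any zero-argument callable that yields tokens one by one.
    The timing covers the whole call, including time to first token.
    """
    start = time.perf_counter()
    count = 0
    for _ in generate():
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else 0.0


def fake_stream(n: int = 50, delay: float = 0.001) -> Iterable[str]:
    """Hypothetical stand-in for a real streaming client (assumption)."""
    for i in range(n):
        time.sleep(delay)  # simulate per-token latency
        yield f"tok{i}"


if __name__ == "__main__":
    tps = measure_tps(lambda: fake_stream(50, 0.001))
    print(f"{tps:.1f} tokens/s")
```

Swap `fake_stream` for your own streaming call and run it a few times at the same prompt length, since a single run can be noisy.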

On 2x Spark I recommend two picks: Minimax M2.7 AWQ and DeepSeek 4 Flash (FP8/4 straight from DeepSeek).

I’m running the latter right now with a 500K context (very comfy), and with the proper config (MTP) it sustains 40 tps across the whole context length. I showed my recipe here