We’ve trained a new foundation model to power our generative tooling.
Currently, we’re having trouble getting parallel streams to scale on our H100 infrastructure – a single stream already saturates “80% compute” in the profiler, and running a second stream in parallel doesn’t give us much gain in throughput.
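(For concreteness, a minimal sketch of the kind of two-stream launch pattern we mean – the kernel, buffer sizes, and launch configuration below are placeholders, not our actual model code.)

```cpp
// Minimal two-stream sketch (placeholder kernel, not our actual model code).
// Each stream is given independent work so the kernels are eligible to overlap;
// whether they actually do depends on how many SM resources the first stream leaves free.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 24;
    float *a, *b;
    cudaMalloc((void**)&a, n * sizeof(float));
    cudaMalloc((void**)&b, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    dim3 block(256), grid((n + 255) / 256);
    // Launch independent work on each stream. If the first stream's kernels
    // already occupy most of the SMs, the second stream's kernels largely
    // queue behind them, and aggregate throughput barely improves.
    dummyKernel<<<grid, block, 0, s0>>>(a, n);
    dummyKernel<<<grid, block, 0, s1>>>(b, n);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    printf("done\n");
    return 0;
}
```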
We believe we’ve reached the end of what publicly available instrumentation will let us figure out.
Is there a developer relations team at NVIDIA that we can reach out to?
The latter is the natural consequence of the former. How many “percent of compute” are you able to saturate using two streams? With most such metrics, once you reach 85% to 90% utilization (regardless of the number of streams), that’s about as good as it gets.
I take it you would like NVIDIA DevTech to assist you in reducing the amount of computation required to process a single stream?
Friends of ours have had great success getting help identifying inner kernels that don’t map to the currently optimized cases in the internal CUDA compiler pattern matcher. Given that our model is a new architecture, it’s very likely that some part of it doesn’t hit the fully optimized path, so we’re looking for help validating or disproving this hypothesis, and for possible next steps after that.
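(As an illustration of how far public tooling gets us before that point: one way to narrow the question down is to wrap model sections in NVTX ranges so Nsight Systems groups kernel time per section, which at least shows which component’s kernels dominate. A minimal sketch – the section names and kernel are placeholders, not our actual model code:)

```cpp
// Minimal NVTX annotation sketch (section names and kernel are placeholders).
// Build with nvcc; link against NVTX if the toolkit version requires it (e.g. -lnvToolsExt),
// then profile with Nsight Systems to see kernel time grouped under the named ranges.
#include <cuda_runtime.h>
#include <nvToolsExt.h>

__global__ void placeholderKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 22;
    float* buf;
    cudaMalloc((void**)&buf, n * sizeof(float));
    dim3 block(256), grid((n + 255) / 256);

    // Each model section gets its own named range, so the timeline attributes
    // the kernels launched inside it to that section.
    nvtxRangePushA("attention_block");    // hypothetical section name
    placeholderKernel<<<grid, block>>>(buf, n);
    cudaDeviceSynchronize();
    nvtxRangePop();

    nvtxRangePushA("custom_norm_layer");  // hypothetical section name
    placeholderKernel<<<grid, block>>>(buf, n);
    cudaDeviceSynchronize();
    nvtxRangePop();

    cudaFree(buf);
    return 0;
}
```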
There are both DevRel (Developer Relations) and DevTech (Developer Technology) teams at NVIDIA. I’m not aware of public portals to reach out to either one of them. GTC is coming up in March, and there are DevRel and DevTech engineers at that event. NVIDIA usually organizes “ask the experts” sub-events at GTC to facilitate discussions.