An issue with a dominant CUDA stream

I profiled my program with Nsight and found that a single CUDA stream accounts for almost 89% of the entire pipeline, while the other 48 streams together account for only 11%. What could be causing this bottleneck, and what steps can I take to address it?

I have also included a screenshot of the Nsight output. Can you identify any anomalies that can be easily fixed based on this image?

I’m not an expert on CUDA streams, so I am going to point you to the "CUDA Stream" article on Lei Mao's Log Book.

I’ll also ask @jasoncohen to chime in if he has any suggestions.

That being said, you would need to zoom in much further for me to see other anomalies. There could easily be memory transfer or synchronization optimizations available, but you’ll need to zoom in to just a couple of GPU call cycles to see them.
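As a rough illustration of the kind of transfer/sync pattern worth looking for once zoomed in, here is a minimal sketch. It is not a diagnosis of the profile above. It assumes a hypothetical kernel `scale`, an illustrative count of 4 streams, and an arbitrary chunk size; none of these come from the original program. It shows pinned host memory, `cudaMemcpyAsync` issued on non-default streams, and a single synchronization at the end rather than one per iteration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal error check; returns from main on the first failing CUDA call.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s\n",                       \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical kernel standing in for the real per-chunk work.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int kStreams = 4;       // illustrative value, not from the post
    const int kChunk = 1 << 20;   // elements handled by each stream
    const size_t bytes = kChunk * sizeof(float);

    float* host = nullptr;
    float* dev = nullptr;
    // Pinned (page-locked) host memory: async copies from pageable memory
    // may not actually overlap with other work.
    CHECK(cudaMallocHost((void**)&host, kStreams * bytes));
    CHECK(cudaMalloc((void**)&dev, kStreams * bytes));
    for (int i = 0; i < kStreams * kChunk; ++i) host[i] = 1.0f;

    // Non-blocking streams do not implicitly synchronize with the legacy
    // default stream.
    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s)
        CHECK(cudaStreamCreateWithFlags(&streams[s], cudaStreamNonBlocking));

    // Queue copy-in, kernel, and copy-out on each stream without
    // synchronizing inside the loop.
    for (int s = 0; s < kStreams; ++s) {
        float* h = host + s * kChunk;
        float* d = dev + s * kChunk;
        CHECK(cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, streams[s]));
        scale<<<(kChunk + 255) / 256, 256, 0, streams[s]>>>(d, kChunk, 2.0f);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, streams[s]));
    }

    // Single synchronization point after all work has been queued.
    CHECK(cudaDeviceSynchronize());

    // Self-check: every element should have been doubled.
    int bad = 0;
    for (int i = 0; i < kStreams * kChunk; ++i)
        if (host[i] != 2.0f) ++bad;
    printf("mismatches: %d\n", bad);

    for (int s = 0; s < kStreams; ++s) CHECK(cudaStreamDestroy(streams[s]));
    CHECK(cudaFree(dev));
    CHECK(cudaFreeHost(host));
    return bad == 0 ? 0 : 1;
}
```

If the real program issues most kernels or copies without a stream argument, uses pageable host buffers, or calls `cudaDeviceSynchronize` inside its main loop, those would be things to compare against this pattern in the zoomed-in timeline.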