In order to achieve more concurrent stream parallelism I’m using the environment variable CUDA_DEVICE_MAX_CONNECTIONS, which seems to work as of CUDA 12.1. However, I could find traces of this variable being documented in the CUDA Toolkit 5.5 but not in the latest one. Is there a reason for that?
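For reference, here is a minimal sketch of how I’m setting it (the application name `./my_app` is just a placeholder for whatever binary creates the CUDA context):

```shell
# Set the number of host-to-device work queues before the CUDA
# context is created; the variable is read at context creation time.
export CUDA_DEVICE_MAX_CONNECTIONS=32
echo "CUDA_DEVICE_MAX_CONNECTIONS=$CUDA_DEVICE_MAX_CONNECTIONS"
# ./my_app   # placeholder: the application using many concurrent streams
```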
Also, while we are at it, it seems to me that modifying this number is not without consequences for overall performance. The 5.5 toolkit documentation says:
Sets the number of compute and copy engine concurrent connections
(work queues) from the host to each device of compute capability 3.5 and above.
Would it be possible to know a little more about how this variable affects the GPU’s behaviour, so that we can tune it appropriately for a given GPU?
It is still pretty tough to understand why the driver would decide to limit it to 32 compute queues rather than 128 in some other configurations.
My bad, it was indeed in the latest toolkit (and in the most obvious place). Also, the topic you pointed to is very interesting; it should be enough for me to move on with my experiments.