To get more concurrency across streams, I’m using the environment variable CUDA_DEVICE_MAX_CONNECTIONS, which seems to work as of CUDA 12.1. However, I could find this variable documented in the CUDA Toolkit 5.5 but not in the latest one. Is there a reason for that?
Also, while we are at it, it seems to me that changing this number is not without consequences for overall performance. The 5.5 toolkit documentation says:
Sets the number of compute and copy engine concurrent connections
(work queues) from the host to each device of compute capability 3.5 and above.
Would it be possible to know a little more about how this variable affects the GPU’s behaviour, so that we can tune it appropriately for a given GPU?
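For reference, a minimal sketch of how such a tuning sweep could be set up. It assumes the variable is read when CUDA initializes in a process, so each value is tried in a fresh process, and that the accepted range is 1 to 32 (both are assumptions, not confirmed by the quote above). The helper names and the benchmark binary mentioned afterwards are hypothetical.

```python
import os
import subprocess
import time

# Assumption: accepted values are 1 to 32; adjust if your toolkit differs.
MIN_CONNECTIONS, MAX_CONNECTIONS = 1, 32


def env_with_connections(n, base_env=None):
    """Return a copy of the environment with CUDA_DEVICE_MAX_CONNECTIONS set to n."""
    if not MIN_CONNECTIONS <= n <= MAX_CONNECTIONS:
        raise ValueError(f"connections must be in [{MIN_CONNECTIONS}, {MAX_CONNECTIONS}], got {n}")
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_DEVICE_MAX_CONNECTIONS"] = str(n)
    return env


def run_with_connections(cmd, n):
    """Run cmd in a new process with the variable set to n; return wall time in seconds.

    A new process is used on the assumption that the variable only takes
    effect if it is set before CUDA is initialized.
    """
    start = time.perf_counter()
    subprocess.run(cmd, env=env_with_connections(n), check=True)
    return time.perf_counter() - start
```

A sweep could then look like `for n in (1, 2, 4, 8, 16, 32): print(n, run_with_connections(["./my_stream_benchmark"], n))`, where `./my_stream_benchmark` stands in for whatever multi-stream workload is being tuned. Wall time includes process startup, so a timing reported by the benchmark itself would be a better metric if one is available.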