Thx Greg.
I changed the environment variable as you suggested and it worked.
I was also curious whether the concurrency of kernels launched dynamically (from the device) into cudaStreamFireAndForget is limited by CUDA_DEVICE_MAX_CONNECTIONS.
I found that it was not. Even with CUDA_DEVICE_MAX_CONNECTIONS left at the default (8),
I could easily get 32 concurrent kernels by mixing host and dynamic launches.
This makes sense in light of this post, which explains a little more about connections:
"How Many Streams?"
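
For anyone who wants to try this, a minimal sketch of such a test might look like the following. The kernel names (spin, parent), the launch counts, and the spin duration are placeholders, and it assumes CUDA 12 or later built with relocatable device code (something like nvcc -rdc=true -lcudadevrt). It only issues the launches. How many kernels actually overlap has to be checked in a profiler timeline such as Nsight Systems.

```cuda
// Sketch only: mixes host-side stream launches with device-side launches
// into cudaStreamFireAndForget. Names and sizes are illustrative assumptions.
// Assumed build line: nvcc -rdc=true mixed_launch.cu -o mixed_launch -lcudadevrt
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical worker kernel: busy-waits for roughly `cycles` clock ticks so
// that overlapping launches show up clearly in a profiler timeline.
__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

// Hypothetical parent kernel: a single thread launches `n` children, each
// into the fire-and-forget stream, so the children are independent of each
// other and of the parent.
__global__ void parent(int n, long long cycles)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        for (int i = 0; i < n; ++i) {
            spin<<<1, 1, 0, cudaStreamFireAndForget>>>(cycles);
        }
    }
}

int main()
{
    const int hostStreams = 16;            // placeholder count
    const int dynamicLaunches = 16;        // placeholder count
    const long long cycles = 100000000LL;  // placeholder spin duration

    // Non-blocking streams, so they do not synchronize with the default stream.
    cudaStream_t streams[hostStreams];
    for (int i = 0; i < hostStreams; ++i) {
        cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking);
    }

    // Host launches: one worker kernel per stream.
    for (int i = 0; i < hostStreams; ++i) {
        spin<<<1, 1, 0, streams[i]>>>(cycles);
    }

    // Dynamic launches: the parent issues the remaining workers from the device.
    parent<<<1, 1>>>(dynamicLaunches, cycles);

    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    for (int i = 0; i < hostStreams; ++i) {
        cudaStreamDestroy(streams[i]);
    }
    return err == cudaSuccess ? 0 : 1;
}
```

Running it with and without CUDA_DEVICE_MAX_CONNECTIONS set in the environment makes it possible to compare the two timelines.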
Thx again.