I am posting this just as a note, as the issue is solved. I lost one day over this.
I am using streams to overlap communication and computation for kernels that run on multiple GPUs. While converting my code to an asynchronous version, I ran into two issues. Here they are.
First, the calls to cudaStreamCreate must be made in the threads that control the GPUs, not in the main CPU thread. Creating the streams in the main thread results in an invalid handle error. I am using a different stream for each thread, but I don’t know whether that is actually a requirement.
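For anyone who wants to see the first point in code, here is a minimal sketch, not the code from this post. It assumes one host thread per GPU via std::thread, and the names scale, run_on_device and run_on_all_devices are made up for illustration. The idea is that a stream belongs to the device that is current when it is created, so each worker thread selects its GPU first and only then creates the stream it will use.

```cuda
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies every element by `factor`.
__global__ void scale(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Runs entirely in the host thread that controls GPU `dev`.
// Returns true if the round trip through the stream produced the expected data.
bool run_on_device(int dev, int n) {
    if (cudaSetDevice(dev) != cudaSuccess) return false;

    // Created here, in the controlling thread, after the device is selected.
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return false;

    float *h = nullptr, *d = nullptr;
    size_t bytes = n * sizeof(float);
    // Pinned host memory is needed for the copies to be truly asynchronous.
    bool ok = cudaMallocHost((void **)&h, bytes) == cudaSuccess &&
              cudaMalloc((void **)&d, bytes) == cudaSuccess;
    if (ok) {
        for (int i = 0; i < n; ++i) h[i] = 1.0f;
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d, n, 2.0f);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream);
        ok = cudaStreamSynchronize(stream) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i) ok = (h[i] == 2.0f);
    }
    if (d) cudaFree(d);
    if (h) cudaFreeHost(h);
    cudaStreamDestroy(stream);
    return ok;
}

// Starts one host thread per visible GPU and returns how many succeeded,
// or -1 if the device count could not be queried.
int run_on_all_devices(int n) {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;
    std::vector<char> ok(count, 0);
    std::vector<std::thread> threads;
    for (int dev = 0; dev < count; ++dev)
        threads.emplace_back([&ok, dev, n] { ok[dev] = run_on_device(dev, n); });
    for (auto &t : threads) t.join();
    int good = 0;
    for (char c : ok) good += c;
    return good;
}
```

Calling run_on_all_devices(1 << 16) from main should return the same number that cudaGetDeviceCount reports. Whether one stream per thread is strictly required is left open here, as in the post; the sketch simply follows the setup described above.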
Second, I have kernels that use dynamic shared memory and kernels that do not. The ones that use it were launched with “kernel <<< CTAS, THREADS, dyn_mem >>>”, while the ones that do not were launched with “kernel <<< CTAS, THREADS >>>”. When moving to streams, I made the mistake of simply appending stream[i] to each of them, giving “kernel <<< CTAS, THREADS, dyn_mem, stream[i] >>>” and “kernel <<< CTAS, THREADS, stream[i] >>>”, respectively. Both kernels ran, but nothing overlapped. Of course, this was because the third launch parameter is always the dynamic shared memory size, so stream[i] in “kernel <<< CTAS, THREADS, stream[i] >>>” was interpreted as a size and the kernel went into the default stream. The compiler did not complain, because cudaStream_t is an int (at least in the toolkit version I am using). Fast forward half a day of debugging: the correct “kernel <<< CTAS, THREADS, 0, stream[i] >>>” did the trick.
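To make the second point concrete, here is a small sketch of the two launch forms side by side. The kernels reverse_tile and add_one and the helper launch_both are hypothetical and only stand in for "a kernel that uses dynamic shared memory" and "one that does not". The order of the launch parameters is grid, block, dynamic shared memory bytes, stream, so a kernel that needs no dynamic shared memory still has to pass an explicit 0 before the stream.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel that uses dynamic shared memory:
// reverses the elements handled by each block.
__global__ void reverse_tile(int *data) {
    extern __shared__ int tile[];
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[t] = data[base + t];
    __syncthreads();
    data[base + t] = tile[blockDim.x - 1 - t];
}

// Hypothetical kernel that uses no dynamic shared memory.
__global__ void add_one(int *data) {
    data[blockIdx.x * blockDim.x + threadIdx.x] += 1;
}

// Launches both kernels into `stream` and checks the result on the host.
bool launch_both(cudaStream_t stream) {
    const int CTAS = 4, THREADS = 64, n = CTAS * THREADS;
    const size_t bytes = n * sizeof(int);
    int *h = nullptr, *d = nullptr;
    if (cudaMallocHost((void **)&h, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) {
        cudaFreeHost(h);
        return false;
    }
    for (int i = 0; i < n; ++i) h[i] = i;
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);

    // With dynamic shared memory: size as third parameter, stream as fourth.
    size_t dyn_mem = THREADS * sizeof(int);
    reverse_tile<<<CTAS, THREADS, dyn_mem, stream>>>(d);

    // Without dynamic shared memory: the 0 placeholder is still required,
    // otherwise the stream would land in the shared memory size slot.
    add_one<<<CTAS, THREADS, 0, stream>>>(d);

    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream);
    bool ok = cudaStreamSynchronize(stream) == cudaSuccess;

    // Each block was reversed, then every element was incremented by one.
    for (int i = 0; ok && i < n; ++i) {
        int base = (i / THREADS) * THREADS, t = i % THREADS;
        ok = (h[i] == base + (THREADS - 1 - t) + 1);
    }
    cudaFree(d);
    cudaFreeHost(h);
    return ok;
}
```

The helper expects a stream that was created on the current device, for example with cudaStreamCreate, and returns true when both kernels ran in order on that stream.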
Hope this helps someone,