Note on using the streaming API

Hi,

I am posting this just as a note, as the issue is solved. I lost one day over this.

I am using streams to overlap communication and computation for kernels that run on multiple GPUs. While converting my code to an asynchronous version, I ran into two issues. Here they are.

1st issue:
The calls to cudaStreamCreate must be made in the threads controlling the GPUs, not in the main CPU thread. Making them in the main CPU thread results in an invalid handle error. I am using a different stream for each thread, but I don’t know whether this is actually a requirement.
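A minimal sketch of that arrangement, assuming one std::thread per GPU. The kernel `scale` and the helpers `gpu_worker` and `run_on_all_gpus` are made-up names for illustration, not taken from my actual code. Each worker selects its device first and only then creates its own stream.

```cuda
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Runs in the host thread that controls `device`. The stream is created
// here, after cudaSetDevice, and not in the main CPU thread.
static void gpu_worker(int device, int n, int *failed)
{
    *failed = 1;
    if (cudaSetDevice(device) != cudaSuccess) return;

    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return;

    float *h = nullptr, *d = nullptr;
    size_t bytes = n * sizeof(float);
    // Pinned host memory, so the async copies can really be asynchronous.
    bool ok = cudaMallocHost(&h, bytes) == cudaSuccess;
    ok = ok && cudaMalloc(&d, bytes) == cudaSuccess;
    if (ok) {
        for (int i = 0; i < n; ++i) h[i] = 1.0f;
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
        // Third launch argument: dynamic shared memory size (0 here).
        scale<<<(n + 255) / 256, 256, 0, stream>>>(d, n, 2.0f);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream);
        ok = cudaStreamSynchronize(stream) == cudaSuccess
          && h[0] == 2.0f && h[n - 1] == 2.0f;
    }
    if (d) cudaFree(d);
    if (h) cudaFreeHost(h);
    cudaStreamDestroy(stream);
    *failed = ok ? 0 : 1;
}

// Starts one worker thread per visible GPU and returns the number of
// workers that failed, or -1 if the device count cannot be queried.
int run_on_all_gpus(int n)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;

    std::vector<int> failed(count, 0);
    std::vector<std::thread> workers;
    for (int dev = 0; dev < count; ++dev)
        workers.emplace_back(gpu_worker, dev, n, &failed[dev]);
    for (auto &t : workers) t.join();

    int total = 0;
    for (int f : failed) total += f;
    return total;
}
```

Calling `run_on_all_gpus(1 << 16)` from main should return 0 if every GPU finished its copy, kernel and copy back in its own stream.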

2nd issue:
I have kernels that use dynamically allocated shared memory and kernels that do not. The ones that use it were launched with “kernel<<<CTAS, THREADS, dyn_mem>>>”, while the ones that do not were launched with “kernel<<<CTAS, THREADS>>>”. When moving to streams, I made the mistake of just appending stream[i] to each of them, giving “kernel<<<CTAS, THREADS, dyn_mem, stream[i]>>>” and “kernel<<<CTAS, THREADS, stream[i]>>>”, respectively. Both kernels ran, but nothing overlapped. Of course, this was because the third launch parameter is always the dynamic shared memory size, so stream[i] in “kernel<<<CTAS, THREADS, stream[i]>>>” was interpreted as that size and the kernel still ran in the default stream. No complaints from the compiler, because cudaStream_t is an int. Fast forward half a day of debugging, and the correct “kernel<<<CTAS, THREADS, 0, stream[i]>>>” did the trick.
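To make the argument order concrete, here is a small sketch with one kernel that uses dynamic shared memory and one that does not, both launched in the same stream. The kernels `reverse_block` and `add_one` and the helper `launch_both` are hypothetical names used only for this example.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel that uses dynamically sized shared memory:
// reverses the elements handled by each block.
__global__ void reverse_block(int *data)
{
    extern __shared__ int tile[];   // sized by the 3rd launch argument
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[t] = data[base + t];
    __syncthreads();
    data[base + t] = tile[blockDim.x - 1 - t];
}

// Hypothetical kernel with no dynamic shared memory.
__global__ void add_one(int *data)
{
    data[blockIdx.x * blockDim.x + threadIdx.x] += 1;
}

// Launches both kernels in `stream`. Assumes d_data holds exactly
// ctas * threads ints. Returns true if both launches and the final
// synchronization succeed.
bool launch_both(int *d_data, int ctas, int threads, cudaStream_t stream)
{
    size_t dyn_mem = threads * sizeof(int);

    // With dynamic shared memory: size is 3rd, stream is 4th.
    reverse_block<<<ctas, threads, dyn_mem, stream>>>(d_data);

    // Without dynamic shared memory: the 3rd argument must still be
    // given (as 0) so that the stream lands in the 4th position.
    add_one<<<ctas, threads, 0, stream>>>(d_data);

    return cudaGetLastError() == cudaSuccess
        && cudaStreamSynchronize(stream) == cudaSuccess;
}
```

The stream is always the fourth launch parameter, so a kernel that needs no dynamic shared memory still gets an explicit 0 in the third position.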

Hope this helps someone,
Serban