I have adapted the vectorAdd example from the CUDA SDK to try different kernel launch semantics. See code.
Everything works except for the last call to
cudaLaunchCooperativeKernelMultiDevice(launchParams, numDevices)
I get the error "invalid device ordinal". My system has two GPUs installed, and from my understanding the code should run.
Can someone please give me a hint about what is wrong with my code?
What are the two GPUs, specifically?
The GPUs are two Tesla P100-SXM2-16GB.
I managed to get the code working. The problem was that the streams need to be created per device: call cudaSetDevice for each device first, then create that device's stream, and pass that stream in the device's entry of launchParams.
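For anyone hitting the same error, here is a minimal sketch of the per-device stream setup. It is not my original code: the DeviceState struct, the CHECK macro, the grid/block sizes and the zero-filled buffers are placeholders I made up for illustration. It assumes all GPUs are identical and report support for cooperative multi-device launch; note that newer CUDA toolkits mark cudaLaunchCooperativeKernelMultiDevice as deprecated.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Grid-stride vector add, so a small grid can cover all n elements.
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        c[i] = a[i] + b[i];
    }
}

// Hypothetical per-device bookkeeping (not part of the CUDA API).
struct DeviceState {
    cudaStream_t stream;
    float *a, *b, *c;
    int n;
    void* args[4];
};

int main() {
    int numDevices = 0;
    CHECK(cudaGetDeviceCount(&numDevices));

    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    // Sized once up front so the addresses stored in args stay valid.
    std::vector<DeviceState> state(numDevices);
    std::vector<cudaLaunchParams> launchParams(numDevices);

    for (int dev = 0; dev < numDevices; ++dev) {
        // Make this device current BEFORE creating its stream and buffers.
        CHECK(cudaSetDevice(dev));

        int supported = 0;
        CHECK(cudaDeviceGetAttribute(
            &supported, cudaDevAttrCooperativeMultiDeviceLaunch, dev));
        if (!supported) {
            fprintf(stderr, "device %d: no multi-device cooperative launch\n", dev);
            return EXIT_FAILURE;
        }

        DeviceState& s = state[dev];
        s.n = n;
        CHECK(cudaStreamCreate(&s.stream));  // stream now belongs to device dev
        CHECK(cudaMalloc(&s.a, bytes));
        CHECK(cudaMalloc(&s.b, bytes));
        CHECK(cudaMalloc(&s.c, bytes));
        CHECK(cudaMemset(s.a, 0, bytes));    // placeholder input data
        CHECK(cudaMemset(s.b, 0, bytes));

        s.args[0] = &s.a;
        s.args[1] = &s.b;
        s.args[2] = &s.c;
        s.args[3] = &s.n;

        launchParams[dev].func = (void*)vectorAdd;
        launchParams[dev].gridDim = dim3(4);     // small grid: all blocks must be co-resident
        launchParams[dev].blockDim = dim3(256);
        launchParams[dev].args = s.args;
        launchParams[dev].sharedMem = 0;
        launchParams[dev].stream = s.stream;     // per-device, non-default stream
    }

    CHECK(cudaLaunchCooperativeKernelMultiDevice(launchParams.data(),
                                                 numDevices, 0));

    for (int dev = 0; dev < numDevices; ++dev) {
        CHECK(cudaSetDevice(dev));
        CHECK(cudaStreamSynchronize(state[dev].stream));
        CHECK(cudaFree(state[dev].a));
        CHECK(cudaFree(state[dev].b));
        CHECK(cudaFree(state[dev].c));
        CHECK(cudaStreamDestroy(state[dev].stream));
    }
    printf("launched on %d device(s)\n", numDevices);
    return 0;
}
```

Creating all streams while device 0 is current, which is what my first version effectively did, ties every stream to device 0, so the launchParams entry for the second GPU carries a stream from the wrong device.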
In general, it would be nice to have a sample in the SDK showing how to use this function.
Yes, streams (and events) are per-device entities. This is covered in the programming guide as well as in CUDA multi-GPU sample code, e.g. simpleMultiGPU.