cudaLaunchCooperativeKernelMultiDevice fails with invalid device ordinal

I have adapted the vectorAdd example from the CUDA SDK to try different kernel launch semantics. See code.

Everything works except for the last call, to

cudaLaunchCooperativeKernelMultiDevice(launchParams, numDevices)

I get the error "invalid device ordinal". My system has two GPUs installed, and from my understanding the code should run.

Can someone please give me a hint about what is wrong with my code?

What are the two GPUs, specifically?

The GPUs are two Tesla P100-SXM2-16GB.

I managed to get the code working. The problem was that each stream needs to be created on its own device, that is, after a preceding call to cudaSetDevice for that device.

In general, it would be nice if the SDK samples included an example of how to use this function.
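For anyone hitting the same error, here is a minimal sketch of the per-device stream setup described above. It is not the original code from this thread: the kernel scaleKernel, the CHECK macro, and the buffer and grid sizes are hypothetical stand-ins chosen for illustration. It assumes every GPU in the system supports cooperative multi-device launch and that the grid is small enough for all blocks to be resident at once. Note also that newer CUDA toolkits mark cudaLaunchCooperativeKernelMultiDevice as deprecated, so expect a compiler warning there.

```cuda
#include <array>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical stand-in kernel; the same kernel is launched on every device.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    int numDevices = 0;
    CHECK(cudaGetDeviceCount(&numDevices));

    int n = 1024;                    // illustrative problem size per device
    float factor = 2.0f;
    const dim3 block(256), grid(4);  // small grid so all blocks are co-resident

    std::vector<cudaStream_t> streams(numDevices);
    std::vector<float *> buffers(numDevices);
    std::vector<std::array<void *, 3>> args(numDevices);
    std::vector<cudaLaunchParams> launchParams(numDevices);

    for (int dev = 0; dev < numDevices; ++dev) {
        // Select the device first: the stream and the allocation created
        // below then belong to this device, not to device 0.
        CHECK(cudaSetDevice(dev));
        CHECK(cudaStreamCreate(&streams[dev]));
        CHECK(cudaMalloc(&buffers[dev], n * sizeof(float)));
        CHECK(cudaMemset(buffers[dev], 0, n * sizeof(float)));

        // Each device gets its own argument list pointing at its own buffer.
        args[dev] = {&buffers[dev], &n, &factor};

        launchParams[dev].func      = (void *)scaleKernel;
        launchParams[dev].gridDim   = grid;
        launchParams[dev].blockDim  = block;
        launchParams[dev].args      = args[dev].data();
        launchParams[dev].sharedMem = 0;
        launchParams[dev].stream    = streams[dev];  // must not be the NULL stream
    }

    CHECK(cudaLaunchCooperativeKernelMultiDevice(launchParams.data(), numDevices));

    for (int dev = 0; dev < numDevices; ++dev) {
        CHECK(cudaSetDevice(dev));
        CHECK(cudaStreamSynchronize(streams[dev]));
        CHECK(cudaFree(buffers[dev]));
        CHECK(cudaStreamDestroy(streams[dev]));
    }
    std::printf("Cooperative launch completed on %d device(s)\n", numDevices);
    return 0;
}
```

If all the streams are created before the loop (so they all land on the current device, usually device 0), the launch parameters no longer name one stream per distinct device, which is consistent with the invalid device ordinal error reported above.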

Yes, streams (and events) are per-device entities. This is covered in the programming guide as well as in the CUDA multi-GPU sample code, e.g. simpleMultiGPU.