Cooperative kernel launch with CUB does not work

Hi

I have attached a .cu file and a makefile. The code works fine on a Titan RTX but not on a Jetson AGX Xavier. Could someone please help me with this? A kernel without CUB works fine, but when CUB is involved the cooperative kernel launch fails.

Thanks.
Archive.zip (2.75 KB)

Hi,

The CUDA error from cudaLaunchCooperativeKernel can be reproduced in our environment.
We are checking this with our internal team and will share more information later.

Thanks.

Thanks.

Hi, is there any update on this?

Thanks.

Hi,

Sorry for the late update.

You can find some information in our runtime API header:

 /usr/local/cuda-10.0/targets/aarch64-linux/include/driver_types.h 

CUDA error 82 is cudaErrorCooperativeLaunchTooLarge.

 /**
     * This error indicates that the number of blocks launched per grid for a kernel that was
     * launched via either ::cudaLaunchCooperativeKernel or ::cudaLaunchCooperativeKernelMultiDevice
     * exceeds the maximum number of blocks as allowed by ::cudaOccupancyMaxActiveBlocksPerMultiprocessor
     * or ::cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags times the number of multiprocessors
     * as specified by the device attribute ::cudaDevAttrMultiProcessorCount.
     */

The GPU on Xavier has fewer resources than the desktop GPU, so fewer blocks can be resident at the same time.
Could you try lowering the number of blocks to see if it helps?

Thanks.
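
The limit described in the header comment above can be sketched as a small host-side helper. This is only an illustration: the function names are hypothetical, and the two inputs are plain integers that a real program would obtain from cudaOccupancyMaxActiveBlocksPerMultiprocessor (blocks per SM) and the device attribute cudaDevAttrMultiProcessorCount (number of SMs).

```cpp
#include <algorithm>
#include <cassert>

// Upper bound on the grid size of a cooperative launch, following the
// header comment: resident blocks per SM times the number of SMs.
// Both inputs are assumed to come from the CUDA occupancy query and the
// SM-count device attribute; here they are ordinary integers.
int maxCooperativeBlocks(int blocksPerSm, int smCount) {
    assert(blocksPerSm >= 0 && smCount > 0);
    return blocksPerSm * smCount;
}

// Clamp a requested grid size so it never exceeds that bound.
// (Hypothetical helper name, not part of the CUDA API.)
int clampCooperativeGrid(int requestedBlocks, int blocksPerSm, int smCount) {
    assert(requestedBlocks > 0);
    return std::min(requestedBlocks, maxCooperativeBlocks(blocksPerSm, smCount));
}
```

The clamped value would then be passed as the grid dimension to cudaLaunchCooperativeKernel. The kernel itself still has to cover all of its input with the smaller grid.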

Hi,

Yes, I tried lowering the number of blocks and the launch then succeeds, but the results are completely wrong. They do not match the cuBLAS output. Could you please try lowering the number of blocks on your end?

Thanks.

Hi,

Could you share the change you made with us?
Thanks.

Hi,
Attached please find the zip file with the .cu file and makefiles. When you set #define WORKING 0, the cooperative kernel launch gives error code 82. When you set #define WORKING 1, the cooperative kernel launch works, but the printed values differ from the cuBLAS results.

Thanks.

I am unable to attach files and am not sure where the paperclip icon went. Please let me know how to attach them. Otherwise, please replace lines 157-163 with the following:

#if WORKING
    int n_blocks = min(numblocks*8, blocksPerGrid);
    std::printf("n_blocks: %d\n", n_blocks);
    error = cudaLaunchCooperativeKernel(reinterpret_cast<void *>( &sdot<BLOCK_THREADS, ITEMS_PER_THREAD> ),
                                        n_blocks,
                                        BLOCK_THREADS,
                                        args);
    std::printf("numblocks: %d\n", numblocks);
#else
    error = cudaLaunchCooperativeKernel(reinterpret_cast<void *>( &sdot<BLOCK_THREADS, ITEMS_PER_THREAD> ),
                                        blocksPerGrid,
                                        BLOCK_THREADS,
                                        args);
#endif

Also add #define WORKING 1. If it is set to 0, the cooperative launch fails.

Hi,

Sorry, we cannot reproduce the working case.
Both #define WORKING 1 and #define WORKING 0 give us error code 82.

nvidia@xavier:~/topic_115430$ ./sdot
num_sms: 6
blocksPerGrid = 54
n_blocks: 32
numblocks: 4
errorcode: 82
n_blocks: 32
numblocks: 4
errorcode: 82
n_blocks: 32
numblocks: 4
errorcode: 82
n_blocks: 32
numblocks: 4
errorcode: 82
n_blocks: 32
numblocks: 4
errorcode: 82
...

Is there anything we missed?

Thanks.

Hi, are you using a Jetson AGX Xavier? I see your num_sms is 6. It should be 8 for a Jetson AGX Xavier with a Volta GPU. Please let me know.

Thanks.
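
The numbers in the log above can be checked against the limit quoted from driver_types.h. The sketch below assumes that numblocks in the log is the per-SM result of the occupancy query (the thread does not show how it is computed), and the function names are hypothetical. Under that assumption, 6 SMs allow 4 * 6 = 24 blocks, so n_blocks = 32 exceeds the limit, which is consistent with error 82; on a device with 8 SMs the limit would be 32 and the same launch would fit.

```cpp
#include <algorithm>
#include <cassert>

// True if a cooperative launch of gridBlocks blocks stays within
// blocksPerSm * smCount, the bound described in driver_types.h.
// Assumption: blocksPerSm corresponds to "numblocks" in the log.
bool fitsCooperativeLimit(int gridBlocks, int blocksPerSm, int smCount) {
    assert(gridBlocks > 0 && blocksPerSm >= 0 && smCount > 0);
    return gridBlocks <= blocksPerSm * smCount;
}

// Grid size as in the WORKING branch of the posted snippet, but scaled by
// the queried SM count instead of the hard-coded factor 8.
int workingGrid(int blocksPerSm, int smCount, int blocksPerGrid) {
    return std::min(blocksPerSm * smCount, blocksPerGrid);
}
```

With the logged values, workingGrid(4, 6, 54) gives 24, which fits the 6-SM device, whereas the hard-coded numblocks*8 gives 32, which does not.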

Hi,

Sorry for the oversight. We had tested this on a Xavier 8GB device.
We have now confirmed that the incorrect CUB output can be reproduced on a standard Xavier device.

This issue will be passed to our internal team.
We will share more information once we get feedback.

Thanks.

Hi.

Any updates on this?

Hi,

Sorry, we are still working on this issue.
We will keep you updated once we get feedback from the internal team.

Thanks.
