Freeze on call to cudaOccupancyMaxPotentialBlockSize (but not always)

Hi, I have a C++ program using CUDA (12.4 + latest drivers on GCP Ubuntu 22.04) running on an L4 GPU (compiled with CMake, targeting compute capability 8.9 and using the latest apt-available nvcc).
I use a utility function to get launch parameters to process dynamically sized device-side arrays, which looks like this:

std::pair<int, int> get_cuda_max_occupancy(size_t n_items, void* kernel_ptr) {
    // Have CUDA calculate the thread block size
    int mingridsize;
    int threadblocksize;
    cudaOccupancyMaxPotentialBlockSize(&mingridsize, &threadblocksize, kernel_ptr, 0, 0);

    int gridsize = ceil(((double)n_items + (double)threadblocksize - 1) / (double)threadblocksize);
    assert(gridsize * threadblocksize >= n_items);
    return std::make_pair(gridsize, threadblocksize);
}

(While the gridsize calculation there isn’t optimal, that’s not what’s causing me trouble.)
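For reference, the double-based formula above can return one block more than needed (for example, 2 blocks for n_items = 256 with a block size of 256, since 511 / 256 rounds up to 2). A minimal sketch of an integer-only alternative follows; the helper name blocks_for is hypothetical and not part of the original program:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the original program): number of thread
// blocks needed to cover n_items when each block has block_size threads.
std::size_t blocks_for(std::size_t n_items, std::size_t block_size) {
    assert(block_size > 0);
    // Quotient plus one extra block for any remainder. Written this way
    // (rather than (n + b - 1) / b) so the addition cannot overflow.
    return n_items / block_size + (n_items % block_size != 0 ? 1 : 0);
}
```

Note that this returns 0 for n_items = 0, so a caller would need to skip the launch in that case, since a zero-sized grid is not a valid launch configuration.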

When I run my program, the first few calls to the above function complete without issue, and then at a (reproducible) point the process freezes at the CUDA API call (cudaOccupancyblabla…).

I’ve isolated it to freezing only when passed this kernel ptr:

__global__ void inplace_bool_not_gpuarrays_kernel(const GpuArrayView<bool> a_inout) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= a_inout.size) return;
    [tid] = !([tid]);
}

and NOT when passed this kernel (called from the same file, via get_cuda_max_occupancy(number_here, (void*)kernel_name)):

__global__ void inplace_bool_or_gpuarrays_kernel(const GpuArrayView<bool> a_inout, const GpuArrayView<bool> b_in) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= a_inout.size) return;
    [tid] =[tid] ||[tid];
}

I’ve run with all the compute-sanitizer tools, and none of them report any issues. When running under cuda-gdb I can only go down a few frames into the function at the point of freeze, and then I reach headers (since the drivers/cuda API is proprietary), but it never goes back up and returns.
What could be causing this freeze?

In other parts of the program I am using all of the following CUDA features:
reading/writing constant memory (writing using the host API)
dynamic shared memory requests (>48 KB)
async memcpys
unified memory (cudaMallocManaged)
-maxrregcount compilation (all files limited to 64 registers)
but I am not destroying any CUDA contexts (indeed I don’t touch them at all, I just use the implicit default one) in between the calls to the occupancy function.

Hope this rings a bell for someone who knows the deep api magic :)

Have repro’d on a colleague’s machine (identical hardware); it freezes at the same point.

Happily resolved now, no API bugs in sight. It was caused by insufficient error checking (I was under the impression that cudaDeviceSynchronize or compute-sanitizer would catch this, but no).
I had an underflow in a buffer size calculation that caused me to try to allocate an enormous number of bytes (on the order of LONG_MAX) with cudaMallocManaged a few lines before the freezing get_cuda_max_occupancy call, and the next synchronizing API call (which I suppose cudaOccupancyMaxPotentialBlockSize was) probably died for that reason.
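The actual buffer size calculation isn't shown here, so the following is only a sketch of how such an underflow could be caught before the size reaches an allocator. The function name checked_remaining_bytes and its parameters are hypothetical, assuming the size is computed as a difference of two element counts times an element size:

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <optional>

// Hypothetical helper (names are assumptions, not from the original code):
// computes (total - used) * elem_size in bytes and returns std::nullopt
// instead of silently wrapping around.
std::optional<std::size_t> checked_remaining_bytes(std::size_t total,
                                                   std::size_t used,
                                                   std::size_t elem_size) {
    // An unsigned subtraction with used > total would wrap to a huge value.
    if (used > total) return std::nullopt;
    std::size_t count = total - used;
    // Guard the multiplication against overflow as well.
    if (elem_size != 0 &&
        count > std::numeric_limits<std::size_t>::max() / elem_size) {
        return std::nullopt;
    }
    return count * elem_size;
}
```

A caller would only pass the result on to the allocation when the optional holds a value. Checking the cudaError_t returned by the allocation call itself is a complementary guard.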

Hope this helps someone debug their issues in the future :) .
