Kernel launch fails even though the number of threads per block is smaller than the maximum allowed

I queried the maximum number of threads per block on my GPU and got the following message: Maximum number of threads per block: 1024.
I have a kernel defined below:

__global__ void mulFwdRotVecLessSharedMem(HorizRCR* rcrDescrs, const int num)
{
    extern __shared__ cuDoubleComplex prod_vec[]; // data for input vector and product, of size (2*p-1) + (2*p-1) = 4*p-2
    for(int i = 0; i < (num+gridDim.x-1)/gridDim.x; i++)
    {
        int idx = blockIdx.x + i*gridDim.x;
        if(idx < num)
        {
            // set pointers to global memory
            HorizRCR *rcrDescr = rcrDescrs + idx;
            cuDoubleComplex *rotMat = rcrDescr->fwdRotMat;
            cuDoubleComplex *coeffs = rcrDescr->coeffs;
            cuDoubleComplex *prod = rcrDescr->prod;
            int p = rcrDescr->p;

            // set pointers for the rotation block, the vector and the product in shared memory
            // cuDoubleComplex *rotBlock = dataMatVec;
            cuDoubleComplex *vecBlock = prod_vec;
            cuDoubleComplex *prodBlock = prod_vec + (2*p-1);
            
            for(int n = 0; n < p; n++)
            {
                // printf("n = %d\n",n);
                cuDoubleComplex *rotBlock = rotMat + ROTMATSZ(n); // address of the current block
                int mp = (int)threadIdx.y-n, m = (int)threadIdx.x-n;
                if(mp == 0 && abs(m) <= n) vecBlock[m+n] = coeffs[NM2IDX0(n,m)];
                if(m == 0 && abs(mp) <= n) prodBlock[mp+n] = make_cuDoubleComplex(0.0,0.0);
                __syncthreads();
                if(abs(mp) <= n && abs(m) <= n) atomicAdd(prodBlock+(mp+n),cuCmul(rotBlock[IDXC0(mp+n,m+n,2*n+1)],vecBlock[m+n]));
                __syncthreads();
                if(abs(mp) <= n && m == 0) prod[NM2IDX0(n,mp)] = prodBlock[mp+n];
            }
        }
    }
}

where HorizRCR is defined by

struct HorizRCR // data structure for RCR operation in the horizontal pass
{
    cuDoubleComplex *fwdRotMat = NULL;
    cuDoubleComplex *coaxMat = NULL;
    int pCoax = 0;
    cuDoubleComplex *invRotMat = NULL;
    
    cuDoubleComplex *coeffs = NULL;
    cuDoubleComplex *bufferCoeffs = NULL;
    cuDoubleComplex *prod = NULL;
    int p = 0;
    
    cuDoubleComplex *tgt = NULL; // target address where prod is to be added
};

I launched the kernel using the following statement:

dim3 blkDim(2*p-1,2*p-1,1);
dim3 gridDim(numRots,1,1);
mulFwdRotVecLessSharedMem<<<gridDim,blkDim,(4*p-2)*sizeof(cuDoubleComplex)>>>(horizTransDescrs_d,numRots);

The issue was that the kernel launch failed when p = 12 but succeeded as long as p < 12. From my understanding, at p = 12 the number of threads in a block is 23 * 23 = 529 < 1024. Then why did the launch fail?

“Launch failed” is not a specific enough diagnosis. A kernel launch can fail pre-launch by exceeding some resource constraint, of which too many threads is just one of several possibilities. Others are requesting too much shared memory or too many registers, and probably some other reasons I can’t recall right now. It can also fail post-launch, e.g. when its runtime exceeds the operating system’s watchdog timer limit or it encounters an out-of-bounds memory access.

I don’t recall how 2D thread blocks allocate threads. The allocation granularity may be greater than 1, just as it typically is with register allocation. If so, the total number of threads allocated could be greater than the product of the thread block dimensions. Check the documentation to see what it says in this regard (this kind of information would most likely be relegated to one of the appendices of the CUDA Programming Guide).

An easy way to find out yourself is to write a small test program that launches null kernels that do nothing. Vary the thread block size and record which thread block configurations are launchable and which ones return an error.


What is the best way to tell whether a null kernel launch failed? My previous approach was to check the results, since they have mathematical meaning to me.

Kernel launches return an error status: synchronously for pre-launch errors, and asynchronously for post-launch errors. This is covered in the documentation. Status codes can be converted into a descriptive string with the CUDA API utility function cudaGetErrorString(). You could also do a Google search for proper CUDA error checking and should get a bunch of useful links on the first page of results.


I followed your advice, used cudaGetLastError(), and got the message: too many resources requested for launch. I cannot think of any resource that is a function of the variable p other than shared memory. However, from my kernel launch you can see that the amount of shared memory requested is only (4*p-2)*sizeof(cuDoubleComplex), which at p = 12 is only 736 bytes (0.71875 KB), far less than the shared memory available on each SM.

My advice was to explore the thread block configuration issue with null kernels that do not require any resources other than threads, thus assessing the constraints on the particular resource “threads” in isolation. In other words, I proposed the following experiment:

for x in [1, N]
    for y in [1, M]
        launch null kernel with thread block dimensions (x,y)
        record pass/fail based on error status returned from launch
    endfor
endfor

Sorry. I forgot to mention. I also followed this suggestion using this following empty kernel and a test function:

__global__ void voidKernel2D()
{
    
}

void testKernelLaunchLimit()
{
    for(int x = 1; x <= 33; x++)
    {
        for(int y = 1; y <= 33; y++)
        {
            dim3 blockDim(x,y,1);
            voidKernel2D<<<1,blockDim>>>();
            cudaError_t status = cudaGetLastError();
            if(status != cudaSuccess)
            {
                printf("Error: %s at (%d,%d)\n",cudaGetErrorString(status),x,y);
                return;
            }
        }
    }
}

The error message I got is Error: invalid configuration argument at (32,33). Since 32 * 33 = 1056 exceeds the 1024-thread limit, this is the point where a failure is expected. This is why I thought the error was not caused by the block allocating more threads than requested, and why I suspected the actual reason was a resource other than thread allocation.

My own program, which produces a shmoo plot, is shown below. So we now know the issue is not one of thread block organization, because any 2D block with 1024 threads or fewer will launch. Since your original code encounters the error “too many resources requested for launch”, the first thing to check is shared memory usage per thread block, and the second thing to check is register usage.

If you add -Xptxas -v to the nvcc command line, the compiler will tell you how many registers the kernel code needs per thread. Look for a line that starts like so (obviously the number will vary with kernel and target architecture; here it is a null kernel compiled for sm_30):

ptxas info : Used 2 registers,

Note that due to hardware limitations on register allocation, the allocation granularity is usually greater than 1. So if the compiler says the kernel needs 14 registers, but the hardware allocation granularity is 4, each thread will grab 16 registers at kernel launch time. These allocation granularities differ by GPU architecture, and the details are “hidden” in the occupancy calculator spreadsheet that comes with CUDA, which is also conveniently linked here:

https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html

Give that a try and see what it tells you.

#include <stdio.h>
#include <stdlib.h>

#define CHECK_LAUNCH_ERROR(pass)                                      \
do {                                                                  \
    (pass) = 1;                                                       \
    /* Check synchronous errors, i.e. pre-launch */                   \
    cudaError_t err = cudaGetLastError();                             \
    if (cudaSuccess != err) {                                         \
        (pass) = 0;                                                   \
    }                                                                 \
    /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
    err = cudaDeviceSynchronize();                                    \
    if (cudaSuccess != err) {                                         \
        (pass) = 0;                                                   \
    }                                                                 \
} while (0)

__global__ void kernel (void) { }

int main (void)
{
    const int N = 40;
    const int M = 40;
    int res[N][M];
    for (int x = 1; x < N; x++) {
        for (int y = 1; y < M; y++) {
            dim3 dimBlock (x, y);
            dim3 dimGrid (1);
            kernel<<<dimGrid,dimBlock>>>();
            CHECK_LAUNCH_ERROR (res[x][y]);
        }
    }

    printf ("   y ");
    for (int y = 1; y < M; y++) {
        printf ("%2d ", y);
    }
    printf ("\n x\n");
    for (int x = 1; x < N; x++) {
        printf ("%2d    ", x);
        for (int y = 1; y < M; y++) {
            printf ("%c  ", res[x][y] ? 'p' : 'F');
        }
        printf ("\n");
    }
    return EXIT_SUCCESS;
}

This is what I got from make of my kernel before making any modification:

ptxas info    : Compiling entry function '_Z25mulFwdRotVecLessSharedMemP8HorizRCRi' for 'sm_61'
ptxas info    : Function properties for _Z25mulFwdRotVecLessSharedMemP8HorizRCRi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 103 registers, 332 bytes cmem[0], 4 bytes cmem[2]

I then modified the kernel into this form:

__global__ void mulFwdRotVecLessSharedMem(HorizRCR* rcrDescrs, const int num)
{
    extern __shared__ cuDoubleComplex prod_vec[]; // data for input vector and product, of size (2*p-1) + (2*p-1) = 4*p-2
    for(int i = 0; i < (num+gridDim.x-1)/gridDim.x; i++)
    {
        int idx = blockIdx.x + i*gridDim.x;
        if(idx < num)
        {
            // set pointers to global memory
            HorizRCR *rcrDescr = rcrDescrs + idx;
            // cuDoubleComplex *rotMat = rcrDescr->fwdRotMat;
            // cuDoubleComplex *coeffs = rcrDescr->coeffs;
            // cuDoubleComplex *prod = rcrDescr->prod;
            // int p = rcrDescr->p;

            // set pointers for the rotation block, the vector and the product in shared memory
            // cuDoubleComplex *rotBlock = dataMatVec;
            cuDoubleComplex *vecBlock = prod_vec;
            cuDoubleComplex *prodBlock = prod_vec + (2*rcrDescr->p-1);
            
            cuDoubleComplex *rotBlock;
            int mp, m;
            for(int n = 0; n < rcrDescr->p; n++)
            {
                // printf("n = %d\n",n);
                rotBlock = rcrDescr->fwdRotMat + ROTMATSZ(n); // address of the current block
                mp = (int)threadIdx.y-n;
                m = (int)threadIdx.x-n;
                if(mp == 0 && abs(m) <= n) vecBlock[m+n] = rcrDescr->coeffs[NM2IDX0(n,m)];
                if(m == 0 && abs(mp) <= n) prodBlock[mp+n] = make_cuDoubleComplex(0.0,0.0);
                __syncthreads();
                if(abs(mp) <= n && abs(m) <= n) atomicAdd(prodBlock+(mp+n),cuCmul(rotBlock[IDXC0(mp+n,m+n,2*n+1)],vecBlock[m+n]));
                __syncthreads();
                if(abs(mp) <= n && m == 0) rcrDescr->prod[NM2IDX0(n,mp)] = prodBlock[mp+n];
            }
        }
    }
}

You can see that I commented out the unnecessary variables and moved the variable declarations out of the for loop. Surprisingly, the register count decreased significantly:

ptxas info    : Compiling entry function '_Z25mulFwdRotVecLessSharedMemP8HorizRCRi' for 'sm_61'
ptxas info    : Function properties for _Z25mulFwdRotVecLessSharedMemP8HorizRCRi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 54 registers, 332 bytes cmem[0], 4 bytes cmem[2]

And right now the kernel works normally at p = 12 and some larger values of p. However, I still have a concern about the relation between the variable p and the number of registers. Since, before the modification, kernel launches with p < 12 succeeded and only failed at p >= 12, my question is: how can the variable p influence the register usage of each thread? Could declaring variables inside a for loop increase register usage?

Answer: the original kernel used 103 registers per thread. As the variable p increases, the block size increases. When p = 12, the block size is 23 * 23 = 529 threads, so each block requests at least 103 * 529 = 54487 registers. From https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications__technical-specifications-per-compute-capability, we can see that the maximum number of registers in a single block is 64K. Considering the granularity of register allocation, it is highly likely that at p = 12 the number of registers requested exceeded the maximum allowed, causing the launch failure.

I haven’t looked closely at your code. It seems like you may be using more than one variable called p?

If p is a compile-time constant (maybe even a #defined constant that also is used inside a kernel) and is used for loop control, for example, that could have an influence on the code generated after the compiler applies loop unrolling, which in turn can change register usage.

If, however, p is used only to size thread block configurations and the same p is not used inside the kernel at all, then the code generated for the kernel should be completely independent of p, and so should the register usage.

I am not going to work through the details of your code. Between the compiler information about register usage and the occupancy calculator I am confident you can figure out what is happening for each variant of your code.

As a general note, using thread blocks with a large number of threads is rarely a good idea. A good rule of thumb is to pick a thread block size between 128 and 256 threads (ideally a multiple of 32), as this typically allows for higher occupancy and better hardware scheduling efficiency due to the smaller block granularity and avoids most out-of-resources scenarios, like the one you ran into here. With the smaller blocks the grid dimensions will have to become larger, but that has no performance impact I am aware of.


Thanks for all the useful remarks. I have one general question which may not be directly relevant with this post here, but since I observed this phenomenon again in my computation using this kernel, I’d like to take this opportunity to ask.

My question is: does a GPU typically produce slightly different results from a CPU for the same algorithm? One reason for a difference is the order in which additions are executed. However, I checked the logic of my CPU routine and my GPU routine; in a kernel launch with block size 1, the two routines should follow the same logic and order of addition. I still see a slight difference between the results of the two executions. For instance, the maximum magnitude in the array of differences between the two executions is 0.0000000000000004. Although quite small, I don’t have a clue where this difference originates. Could you provide a general answer to this question?

Generally speaking, with floating-point computation you are likely to get different results on different platforms (hardware/software combinations). It’s just that for the past two decades most programmers have lived in an x86 + {gcc | MSVC} monoculture and therefore aren’t very familiar with the issues.

Many programmers make invalid assumptions about which platform is correct when they have two and the results differ. The reality is: if you have two watches showing different times, you can’t tell which one is closer to the true time. You need a higher-precision reference to assess that (e.g. a NIST reference).

NVIDIA provides a whitepaper that addresses specific reasons why the x86 host and the GPU might deliver different results, even though both adhere to the IEEE 754 floating-point standard. Well worth reading:

https://docs.nvidia.com/cuda/floating-point/index.html
