Invalid configuration argument for one kernel but works for another

Hi, I am encountering a strange issue: a kernel launch fails with the error “invalid configuration argument”.

I have a grid/block dim configuration that works for one kernel but not for another. For example, in this section I first calculate the grid/block dims, save them to the object p, and launch the kernel with this configuration. It works fine:

// Calculate the kernel dimensions.
// Note: blockDim holds the grid dimensions (number of blocks) and
// threadDim holds the block dimensions (threads per block).
std::pair<dim3, dim3> dims = getIdealDims(PLACE, qty);
dim3 blockDim = dims.first;
dim3 threadDim = dims.second;
p.kernelDims[0] = blockDim;
p.kernelDims[1] = threadDim;

std::unique_ptr<CudaEventTimer> timer = std::unique_ptr<CudaEventTimer>(new CudaEventTimer());
timer->startTimer();

// Initialize the Place array on device
Place *d_place = nullptr;
CATCH(cudaMalloc((void **)&d_place, qty * sizeof(PlaceType)));
// dim3 members are unsigned, so print them with %u
printf("Running instantiatePlaceArrayKernel with blockDim %u and threadDim %u\n", blockDim.x, threadDim.x);
instantiatePlaceArrayKernel<PlaceType><<<blockDim, threadDim>>>(d_place, qty);
CHECK();
printf("Finished instantiatePlaceArrayKernel with blockDim %u and threadDim %u\n", blockDim.x, threadDim.x);

Later in the program, I retrieve this configuration and launch another kernel. This is where the error occurs:

// Retrieve the stored configuration: dims[0] is the grid, dims[1] is the block
dim3 *dims = deviceInfo->getPlacesKernelDims(handle);
printf("Dims: [%u %u %u] x [%u %u %u]\n", dims[0].x, dims[0].y, dims[0].z, dims[1].x, dims[1].y, dims[1].z);
updateDevAttributesKernel<<<dims[0], dims[1]>>>(d_place, d_attributeTags, d_attributeDevPtrs, d_attributePitch, deviceInfo->countDevPlaces(handle), nAttributes);
// Check for a launch error right after the launch
cudaError_t cudaStatus = cudaGetLastError();
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "updateDevAttributesKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
}

Here is the output for the above code:

DeviceConfig::DeviceConfig: Using device 0: NVIDIA RTX A5000 Compute capability: 8.6
Max threads per block: 1024
Max threads per multiprocessor: 1536
Max Grid Size: 2147483647 x 65535 x 65535

Running instantiatePlaceArrayKernel with blockDim 1027 and threadDim 1024
Finished instantiatePlaceArrayKernel with blockDim 1027 and threadDim 1024

Dims: [1027 1 1] x [1024 1 1]
updateDevAttributesKernel launch failed: invalid configuration argument

I found that the error occurs once the grid dimension exceeds 1024, even though the maximum grid size is far larger than that. Could someone please help me with this? Thanks.
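
For reference, the device-wide limits printed above can be checked on the host before a launch. The sketch below is only an illustration: launchDimsWithinLimits and checkDimsOnDevice are hypothetical helpers that are not part of the original program. They compare a grid/block pair against the maxGridSize, maxThreadsDim and maxThreadsPerBlock fields of cudaDeviceProp and also reject zero-sized dimensions. Passing this check does not guarantee that a launch succeeds, because per-kernel resource limits are not covered here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original program): returns true when every
// grid and block dimension is nonzero and within the limits stored in `prop`.
bool launchDimsWithinLimits(dim3 grid, dim3 block, const cudaDeviceProp &prop)
{
    const unsigned g[3] = {grid.x, grid.y, grid.z};
    const unsigned b[3] = {block.x, block.y, block.z};

    for (int i = 0; i < 3; ++i) {
        // A zero in any dimension is itself an invalid configuration.
        if (g[i] == 0 || b[i] == 0) return false;
        if (g[i] > static_cast<unsigned>(prop.maxGridSize[i])) return false;
        if (b[i] > static_cast<unsigned>(prop.maxThreadsDim[i])) return false;
    }

    // The total number of threads in a block has its own limit.
    const unsigned long long threads =
        static_cast<unsigned long long>(block.x) * block.y * block.z;
    return threads <= static_cast<unsigned long long>(prop.maxThreadsPerBlock);
}

// Hypothetical wrapper: queries `device` and reports whether the pair fits.
bool checkDimsOnDevice(dim3 grid, dim3 block, int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;

    const bool ok = launchDimsWithinLimits(grid, block, prop);
    std::printf("[%u %u %u] x [%u %u %u]: %s\n",
                grid.x, grid.y, grid.z, block.x, block.y, block.z,
                ok ? "within device limits" : "outside device limits");
    return ok;
}
```

Calling checkDimsOnDevice(dims[0], dims[1], 0) just before the failing launch would show whether the retrieved configuration is still the one that was stored.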

You are exceeding other limits, for example the number of registers used per thread, which depends on the individual kernel. Not every kernel can be launched with 1024 threads per block.
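
The per-kernel limits mentioned here can be queried at run time with cudaFuncGetAttributes. The following is a minimal sketch under stated assumptions: exampleKernel is a hypothetical stand-in for the kernel that fails, and maxBlockSizeFor is a hypothetical helper name. It prints the maximum block size, register count and static shared memory that the runtime reports for that kernel on the current device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; substitute the kernel that fails to launch.
__global__ void exampleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: returns the largest block size the runtime reports for
// `kernel` on the current device, or -1 if the query fails.
int maxBlockSizeFor(const void *kernel, const char *name)
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, kernel);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes(%s) failed: %s\n",
                     name, cudaGetErrorString(err));
        return -1;
    }

    std::printf("%s: maxThreadsPerBlock=%d numRegs=%d sharedSizeBytes=%zu\n",
                name, attr.maxThreadsPerBlock, attr.numRegs, attr.sharedSizeBytes);
    return attr.maxThreadsPerBlock;
}
```

A call such as maxBlockSizeFor(reinterpret_cast<const void *>(exampleKernel), "exampleKernel") returns the value to compare against the block size in use. If the reported maximum is below 1024, a 1024-thread block cannot be used for that kernel even though the device allows it.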

The launch configuration changes with the data size.
If I decrease the data size and set the block size to 512, which makes the configuration 1024 x 512 (grid x block), it runs fine.
If I increase the data size and set the block size to 1024, which makes the configuration 1024 x 1024 as shown in the code, it also runs fine.
However, the error occurs when the configuration is 1025 x 512. If I understand correctly, these experiments show that there are enough resources to run the kernel at 1025 x 512, since it already runs with more threads per block at 1024 x 1024.

Please show a minimal complete reproducer for your problem.
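
A reproducer could start from a skeleton like the one below. It is only a sketch: touchKernel is a hypothetical stand-in for the failing kernel, and tryLaunch and runSweep are hypothetical helper names. Each launch first clears any earlier non-sticky error with cudaGetLastError, so that the error read afterwards belongs to that launch, and then tries the configurations mentioned in this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the failing kernel; the real one would go here.
__global__ void touchKernel(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

// Launches touchKernel with a 1D configuration and returns the first error
// seen at launch time or during execution.
cudaError_t tryLaunch(unsigned gridX, unsigned blockX)
{
    const int n = static_cast<int>(gridX * blockX);
    int *d_out = nullptr;
    cudaError_t err = cudaMalloc(&d_out, n * sizeof(int));
    if (err != cudaSuccess) return err;

    cudaGetLastError();  // clear any earlier non-sticky error
    touchKernel<<<dim3(gridX), dim3(blockX)>>>(d_out, n);
    err = cudaGetLastError();                               // launch-time errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // execution-time errors

    cudaFree(d_out);
    return err;
}

// Tries the grid x block pairs from this thread; returns how many failed.
int runSweep()
{
    const unsigned configs[][2] = {{1024, 512}, {1024, 1024}, {1025, 512}, {1027, 1024}};
    int failures = 0;
    for (const auto &c : configs) {
        cudaError_t err = tryLaunch(c[0], c[1]);
        std::printf("%u x %u: %s\n", c[0], c[1], cudaGetErrorString(err));
        if (err != cudaSuccess) ++failures;
    }
    return failures;
}
```

Adding a main that returns runSweep() makes this a complete program. If the stand-in kernel launches with every configuration while the real kernel does not, the cause is more likely in the real kernel's resource usage or in state left over from earlier calls than in the dimensions themselves.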