Limitation on number of device function calls in a kernel? (and its effect on maximum grid size?)

Hello,

This is Mike. I’m a Physics PhD student and am trying to use CUDA for one of my scientific computing research projects.


(short version of the question:)

For a GPU kernel that accepts a rather complicated device lambda as an argument, say integral<<<blocksPerGrid, threadsPerBlock>>>(f, x0, x1), where f can be any device function or a composite of device functions, is this device function expanded inline by the compiler into the kernel code? If so, is there a limit on the number of function calls (or the depth of function calls, which depends on how complex f is) that the kernel can make to f?
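
For reference, the pattern I mean looks roughly like this (a minimal sketch; the midpoint rule and all names here are just illustrative):

template<class F>
__global__ void integral(F f, double x0, double x1, double *out) {
	int i = threadIdx.x + blockIdx.x*blockDim.x;
	int n = gridDim.x*blockDim.x;           //total number of sample points
	double h = (x1 - x0) / n;               //step width
	out[i] = h * f(x0 + (i + 0.5)*h);       //one midpoint-rule sample per thread
}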

I also wonder which GPU resource such a “device function call in a kernel” uses up. Is it shared memory, registers, global memory, or some other part of the GPU’s memory? If it affects shared memory and registers, would the length of the function (or the number of calls to it) limit the total number of threads per block? Would it limit the number of blocks per grid?

Thank you!


(the full question:)

Many thanks for the kind answer in this topic: https://devtalk.nvidia.com/default/topic/1043958/cuda-programming-and-performance/passing-lambda-functions-as-arguments-to-kernels/ — I have now learned to pass device functions/lambdas as arguments to kernels in order to perform generic operations such as integration(function, x0, x1) and integration2d(function, x0, x1, y0, y1).

I have now finished writing a program (part of a larger physical simulation) that takes a function F (which describes a physical field) and integrates it over a 2D plane efficiently with the power of the GPU; a simple timer shows it to be over 20× faster than the CPU (even in double precision on a consumer card, which has only 1/32 DP throughput).

However, I have run into a quite perplexing bug. For 2D integration (say, over the square ((0,1),(0,1))), one can set a fixed number of steps along each axis. For instance, for an 80×80 search region (a step size of 0.0125 in both the x and y directions) and 16×16 threads per block, I can launch a kernel such as integration2d<<<dim3(5,5), dim3(16,16)>>>(function, range, data), with a total of 25 blocks in the grid. The finer the search region (in my simulation I’m using around 10000×10000 steps, so there are about 625×625 blocks in a grid), the more accurate the integration value.
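
Concretely, I derive the launch configuration from the step count roughly like this (a sketch; the constant names are illustrative):

const int STEPS = 10000;                  //integration steps per axis
const dim3 threadsPerBlock(16, 16);
const dim3 blocksPerGrid((STEPS + threadsPerBlock.x - 1) / threadsPerBlock.x,
                         (STEPS + threadsPerBlock.y - 1) / threadsPerBlock.y);  //625x625 here
integration2d<<<blocksPerGrid, threadsPerBlock>>>(function, range, data);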

The kernel takes a function F, which is a two-variable double(double, double) function, but it is rather complex: it is in fact a lambda built from many composite functions, with closures too: auto F = [=] __device__ (double x, double y){ return f(x,y,a,b)*g(x,y,c,d)*h(x,e,f)…; }, where f, g, h, … are all device functions. The kernel also uses a shared cache with one element per thread (here an array of float[256]).
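
Written out, F looks roughly like this (a sketch; the captured parameters a…e are placeholders, and such a __device__ lambda needs nvcc’s --expt-extended-lambda flag):

double a, b, c, d, e;                     //closure parameters, captured by value
auto F = [=] __device__ (double x, double y) {
	return f(x, y, a, b) * g(x, y, c, d) * h(x, e);   //composite of __device__ functions
};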

The problem is that when I use a small number of search steps (i.e., fewer blocks), the program works without problems. But as I increase the number of blocks (i.e., a larger problem size, for a finer search region), say to 1250×1250 blocks per grid with the same 32×32 threads per block, the kernel fails to launch (cudaPeekAtLastError catches “unspecified launch failure”).
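
I check for errors with the standard pattern, roughly like this (a sketch; note that “unspecified launch failure” may only surface at the synchronize step):

Kernel_Integral2d_GPU<<<blocksPerGrid, threadsPerBlock>>>(d_result, 0.0, 1.0, 0.0, 1.0, F, blocksPerGrid);
cudaError_t err = cudaPeekAtLastError();  //catches invalid launch configurations
if (err == cudaSuccess)
	err = cudaDeviceSynchronize();        //catches faults raised while the kernel ran
if (err != cudaSuccess)
	printf("CUDA error: %s\n", cudaGetErrorString(err));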

The card I’m using is a Titan Xp with 12 GB of global memory (perhaps slightly less under the Windows driver); its maximum number of threads per block is 1024 and its maximum grid size is (2147483647, 65535, 65535). I’m on CUDA 9.0 with code generation compute_61, sm_61. A 1250×1250 grid doesn’t seem to exceed the maximum grid size, the shared memory used (256×4 bytes per block) doesn’t exceed 48 KB, and the data passed back has only one element per block (here float[1250*1250]), so it doesn’t exceed global memory either.
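
These limits can also be confirmed programmatically (a minimal sketch):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
	cudaDeviceProp prop;
	cudaGetDeviceProperties(&prop, 0);
	printf("max threads/block: %d\n", prop.maxThreadsPerBlock);
	printf("max grid size: (%d, %d, %d)\n",
		prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
	printf("shared mem/block: %zu bytes\n", prop.sharedMemPerBlock);
	printf("global mem: %zu bytes\n", prop.totalGlobalMem);
	return 0;
}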

I tried tweaking the code and testing, and it seems that the call to F in the kernel is what limits the maximum grid size. The kernel calls F 4 times; if I change that to 1 call, I can choose a 2×2-times-larger grid, and if I remove F from the kernel altogether, the grid size is no longer limited. So it appears that the number of calls to a complex device function (as I have read, the CUDA compiler actually inlines all device functions?) limits the grid size.

However, I don’t really understand this phenomenon. Shouldn’t the problem size determine only the total number of blocks (i.e., the grid size)? I assumed that the usage of, e.g., shared memory and registers on each SM depends only on the kernel function, the called device function F, and the block size (threadsPerBlock), but not on the total number of blocks, since blocks are relatively independent, execute out of order, and could even be executed serially on the same SM, so a larger grid should just mean a longer queue. As it stands, though, the grid size is apparently also affected by the size of the kernel (supposing F is inlined, more calls to F mean a longer kernel), so even with the same kernel function and the same threads per block, having more blocks in a grid can still make the kernel launch fail. May I ask what might be causing this?
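
One way I could test this assumption is to query the compiled kernel’s per-thread/per-block resource usage, which is fixed at compile time and does not mention the grid size at all (a sketch; TestF is a stand-in functor just to instantiate the template):

struct TestF {    //stand-in for the real lambda, only to instantiate the template
	__device__ double operator()(double x, double y) const { return x * y; }
};

cudaFuncAttributes attr;
cudaFuncGetAttributes(&attr, Kernel_Integral2d_GPU<TestF>);
printf("registers/thread:  %d\n", attr.numRegs);
printf("static shared mem: %zu bytes\n", attr.sharedSizeBytes);
printf("local mem/thread:  %zu bytes\n", attr.localSizeBytes);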

Many thanks for your kind help!


P.S. Attached is the kernel code in question (for 2D integration):

template<class F>
__global__ void Kernel_Integral2d_GPU(double *result, double x_l_bound, double x_u_bound, double y_l_bound, double y_u_bound, F f, dim3 BlocksPerGrid) {

	int tidx = threadIdx.x;
	int bidx = blockIdx.x;
	int dimx = blockDim.x;
	int idx = tidx + bidx*dimx;

	int tidy = threadIdx.y;
	int bidy = blockIdx.y;
	int dimy = blockDim.y;
	int idy = tidy + bidy*dimy;

	//extern __shared__ double cache2[];
	__shared__ double cache2[THREAD_PER_BLOCK_X*THREAD_PER_BLOCK_Y];

	double step_size_x = (x_u_bound - x_l_bound) / (INTEGRAL2D_STEP_GPU);
	double step_size_y = (y_u_bound - y_l_bound) / (INTEGRAL2D_STEP_GPU);

	
	if (idx < INTEGRAL2D_STEP_GPU && idy < INTEGRAL2D_STEP_GPU) {

		double x_p = x_l_bound + idx*step_size_x;
		double y_p = y_l_bound + idy*step_size_y;
		double volume = (step_size_x*step_size_y)*0.25*(f(x_p, y_p) + f(x_p + step_size_x, y_p) + f(x_p, y_p + step_size_y) + f(x_p + step_size_x, y_p + step_size_y));
		//double volume = (step_size_x*step_size_y)*f(x_p + 0.5*step_size_x, y_p + 0.5*step_size_y);
                //changing to one function call here would allow a larger grid size (before kernel launch aborts due to large grid size)

		if (volume < 99999 && volume > -99999)
			cache2[tidx + tidy*dimx] = volume;
		else
			cache2[tidx + tidy*dimx] = 0; //guard against runaway values; without this else branch the shared slot would stay uninitialized
	}
	else {
		cache2[tidx + tidy*dimx] = 0; //not all threads in the last block calculate a value, so zero padding is needed
	}
	
	__syncthreads();

	if (tidx == 0 && tidy == 0) {
		//the first thread in each block sums the values in the cache and writes the block's partial sum to global memory;
		//the result array of size BlocksPerGrid.x * BlocksPerGrid.y is then returned to the host CPU
		double sum = 0;
		for (int i = 0; i < dimx*dimy; i++) {
			sum += cache2[i];
		}
		result[bidx + bidy*BlocksPerGrid.x] = sum;
	}

	return;
}
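
For completeness, the host side looks roughly like this (a sketch; the per-block partial sums in result receive a final reduction on the CPU):

dim3 threads(THREAD_PER_BLOCK_X, THREAD_PER_BLOCK_Y);
dim3 blocks((INTEGRAL2D_STEP_GPU + threads.x - 1) / threads.x,
            (INTEGRAL2D_STEP_GPU + threads.y - 1) / threads.y);
int nBlocks = blocks.x * blocks.y;

double *d_result;
cudaMalloc(&d_result, nBlocks * sizeof(double));
Kernel_Integral2d_GPU<<<blocks, threads>>>(d_result, 0.0, 1.0, 0.0, 1.0, F, blocks);

std::vector<double> partial(nBlocks);
cudaMemcpy(partial.data(), d_result, nBlocks * sizeof(double), cudaMemcpyDeviceToHost);
double integral = 0;
for (int i = 0; i < nBlocks; i++) integral += partial[i];   //final reduction on the CPU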

No, it is not expanded inline by the compiler. The pointer to the function is passed as a kernel argument, and the kernel calls through that pointer, much as this would work in host code.
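
For illustration, the explicit form of that mechanism, with a plain device function pointer, looks roughly like this (a sketch; all names are illustrative):

__device__ double square(double x) { return x * x; }
__device__ double (*d_fp)(double) = square;   //device-side pointer to the function

__global__ void apply(double (*fp)(double), double x, double *out) {
	*out = fp(x);                             //indirect call through the pointer
}

//host side: fetch the device function's address, then pass it as a kernel argument
double (*h_fp)(double) = nullptr;
double *d_out = nullptr;
cudaMalloc(&d_out, sizeof(double));
cudaMemcpyFromSymbol(&h_fp, d_fp, sizeof(h_fp));
apply<<<1, 1>>>(h_fp, 3.0, d_out);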