What's wrong with this simple kernel call? Invalid configuration argument (with empty kernel)

Here is my kernel call and the kernel itself is empty at the moment.

dim_x = 256; dim_y = 256; dim_z = 256;

dimBlock.x = 32;
dimBlock.y = 16;
dimBlock.z = 1;

dimGrid.x = dim_x / dimBlock.x;
dimGrid.y = dim_y / dimBlock.y;
dimGrid.z = dim_z / dimBlock.z;

printf("BlockDim %d %d %d GridDim %d %d %d\n", dimBlock.x, dimBlock.y, dimBlock.z, dimGrid.x, dimGrid.y, dimGrid.z);

copy3D<<<dimGrid, dimBlock>>>((unsigned char*)deviceInputVolumePitch3D.ptr,
                              (unsigned char*)deviceLabelVolumePitch3D_1.ptr,
                              deviceInputVolumePitch3D.pitch);

status = cudaGetLastError();
if (status != cudaSuccess)
{
    fprintf(stderr, "%s\n", cudaGetErrorString(status));
}

Printf shows:

BlockDim 32 16 1 GridDim 8 16 256

invalid configuration argument

I know what this error message means, but I don't know where the error comes from. I also know that the limits for dimBlock.x, .y, .z are 512, 512, 64, and my program doesn't hit them. When I set dimGrid.z = 1 the kernel call works. So where is my problem? Am I blind?


There is a limitation on dimGrid as well as dimBlock. It’s in the programming guide. 65535x65535x1 IIRC.

EDIT: Can’t actually see this in there. I’m sure it used to be.
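
In any case, the limits for a given device can be queried at runtime rather than taken from memory; a minimal sketch, assuming device 0:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}

On current (pre-Fermi) hardware the last maxGridSize entry should come back as 1, which is exactly the limit the launch above runs into.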

A thread block can be 3-dimensional, but the grid is only 2-dimensional. That is why it works when you set dimGrid.z = 1.

Thank you and Tigga for your replies.

It should be possible to have a 2-dimensional block and a 3-dimensional grid, at least in theory.

I am using this kernel only to test the rest of the program, and my idea was to copy the 3D array layer by layer; that's why I have a 2D block and a 3D grid.

It's the first time I have heard that the block and grid must be of a particular "dimension" (1D, 2D, 3D). Where can I look up this information in the Programming Guide?

You misunderstand. You cannot have 3D grids at all. That is the source of your problems: you are using an illegal grid configuration.
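
The usual workaround is to fold the z slices into one of the two grid dimensions and recover the slice index inside the kernel. A sketch along those lines (not your actual copy3D; the extra dim_y_blocks parameter and the trivial copy body are made up for illustration):

__global__ void copy3D(const unsigned char *src, unsigned char *dst,
                       size_t pitch, int dim_y_blocks)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;

    // blockIdx.y encodes both the y tile and the z slice
    int tileY = blockIdx.y % dim_y_blocks;
    int z     = blockIdx.y / dim_y_blocks;
    int y     = tileY * blockDim.y + threadIdx.y;

    // pitched 3D layout: each slice is (dim_y_blocks * blockDim.y) rows of 'pitch' bytes
    size_t row = size_t(z) * dim_y_blocks * blockDim.y + y;
    dst[row * pitch + x] = src[row * pitch + x];
}

// Host side:
// dim3 dimBlock(32, 16, 1);
// int blocksY = dim_y / dimBlock.y;                       // 16
// dim3 dimGrid(dim_x / dimBlock.x, blocksY * dim_z, 1);   // 8 x 4096, well under 65535
// copy3D<<<dimGrid, dimBlock>>>(src, dst, pitch, blocksY);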

That explains my problems. Thank you.

Until now I have only worked with 2D configurations … and I used the same ideas for 3D.

Why is there this limitation? Some hardware-specific stuff?

Only NVIDIA knows for sure, but I am guessing their scheduling hardware/software uses a single 32-bit word for the block ID (which gives the 65535 x 65535 limit). It also makes some sense if you bear in mind the original purpose of all the current GPUs, i.e. rendering, which basically resolves down to a 2D grid of pixels for display with underlying 3D calculations for the values of each pixel. So 3D blocks with 2D grids aren't such a surprising choice.

Of course this might all change with the next generation Fermi devices, but that is only speculation…

Thanks again.

What is the best approach to processing 3D data, for example a voxel volume of 256 x 256 x 256?

How do I choose blockDim and gridDim?

blockDim.x should be at least a multiple of 16 for coalescing, that much I know, but what about the other parameters?

I think it will depend a lot on your global memory access patterns.

I have a number of different 3D finite difference discretisation kernels for convection-diffusion equations which require 3 or 5 point stencil data in each spatial direction. For memory coalescing reasons, I wound up using a 2D block and grid structure, and have each thread traverse the third spatial dimension, maintaining a local shared memory cache of the data points necessary for constructing stencils in each block. In that arrangement, some threads per block don't participate other than to do memory reads, and for branch divergence reasons some do redundant calculations which get discarded, but it is still faster overall because it guarantees that all global memory reads and writes are coalesced and keeps divergence and shared memory bank conflicts to a minimum (in the case of the double precision versions of the code).

If your data usage patterns are much closer to 1 global memory read per thread, or you are using textures, it probably doesn’t matter that much. The key thing is multiples of 32 threads per block for scheduling and memory coalescing reasons, and at least 192 threads per multiprocessor to amortise instruction pipelining costs.
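
Applied to the 256 x 256 x 256 volume you mention, that might translate into something like the following (purely illustrative; process3D and its arguments are placeholders, not my actual kernels):

__global__ void process3D(const float *in, float *out, int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // each thread walks the z direction for its (x, y) column
    for (int z = 0; z < nz; ++z)
    {
        size_t idx = (size_t(z) * ny + y) * nx + x;
        out[idx] = in[idx];              // stand-in for the real per-voxel work
    }
}

// dim3 block(32, 8, 1);                        // 256 threads, a multiple of 32
// dim3 grid(256 / block.x, 256 / block.y, 1);  // 8 x 32 blocks, grid stays 2D
// process3D<<<grid, block>>>(d_in, d_out, 256, 256, 256);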

How do you manage your shared memory cache? Is a __syncthreads() necessary in your approach?

Yes, two. At each iteration, all threads in a block do their global reads and shared memory updates and then synchronize. After that, the calculating threads do their work, then the participating threads do their writes, while the idle threads sit at a second thread barrier. Then the loop continues.
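
In outline it looks something like this (a simplified sketch of that structure, not the real code: it uses a 2D 5-point stencil per slice instead of the full 3D stencils, and the names and tile sizes are made up):

#define BX 32           // interior tile width
#define BY 8            // interior tile height
#define HALO 1          // halo width for a 3-point stencil in x and y

// Launch with dim3 block(BX + 2*HALO, BY + 2*HALO, 1) and dim3 grid(nx/BX, ny/BY, 1).
__global__ void stencil_march_z(const float *in, float *out, int nx, int ny, int nz)
{
    __shared__ float tile[BY + 2 * HALO][BX + 2 * HALO];

    // halo threads get coordinates just outside the interior tile
    int x = blockIdx.x * BX + threadIdx.x - HALO;
    int y = blockIdx.y * BY + threadIdx.y - HALO;

    for (int z = 0; z < nz; ++z)
    {
        // 1) every thread does its global read and shared memory update
        if (x >= 0 && x < nx && y >= 0 && y < ny)
            tile[threadIdx.y][threadIdx.x] = in[(size_t(z) * ny + y) * nx + x];
        __syncthreads();                         // first barrier

        // 2) interior threads compute and write; halo threads just wait
        bool interior = threadIdx.x >= HALO && threadIdx.x < BX + HALO &&
                        threadIdx.y >= HALO && threadIdx.y < BY + HALO &&
                        x > 0 && x < nx - 1 && y > 0 && y < ny - 1;
        if (interior)
        {
            float v = tile[threadIdx.y][threadIdx.x - 1] + tile[threadIdx.y][threadIdx.x + 1] +
                      tile[threadIdx.y - 1][threadIdx.x] + tile[threadIdx.y + 1][threadIdx.x] -
                      4.0f * tile[threadIdx.y][threadIdx.x];    // stand-in stencil
            out[(size_t(z) * ny + y) * nx + x] = v;
        }
        __syncthreads();                         // second barrier before the next slice
    }
}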

Thank you for your help.

I read the paper "3D Finite Difference Computation on GPUs using CUDA" and they use the same approach as you.

If you are referring to the Micikevicius paper, then (if my memory serves me correctly) they didn't use the same strategy as what I am describing. In that implementation, they did the differencing in two passes with two separate kernels, first doing a 2D differencing (x-y) pass, then a 1D z pass on the results of the 2D differencing. They can do that because the high-order central scheme they use can be decomposed into a set of partial sums. The difference operators for the hyperbolic terms I use can't easily be done that way, so I do the differencing calculations in a single pass, but using a 2D kernel grid.
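
To make the decomposition idea concrete (my paraphrase, using a high-order Laplacian as a stand-in rather than the paper's exact operator):

$$
\nabla^2 u_{i,j,k} \;\approx\; \underbrace{\sum_{n=-N}^{N} c_n \left(u_{i+n,j,k} + u_{i,j+n,k}\right)}_{\text{pass 1: 2D } x\text{-}y\text{ kernel}} \;+\; \underbrace{\sum_{n=-N}^{N} c_n\, u_{i,j,k+n}}_{\text{pass 2: 1D } z\text{ kernel}}
$$

The same central-difference coefficients apply in each direction, so the two passes just accumulate independent partial sums; the hyperbolic operators I use don't split that cleanly.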

He introduces the idea of the two-pass kernel, but actually he uses a one-pass kernel.

I have done some very similar things recently, though I was using upwind differencing and I think I may have structured it in a slightly different way. What sort of memory bandwidth/FLOPS were you getting?

It has been a while since I did any profiling and I haven't really done all that much in the way of serious optimization, but I seem to recall that the explicit version of the 3D differencing kernel (i.e. one using ghost cells) was hitting about 95 GB/s in single precision when I did look at it in the profiler running on a stock GTX 275. I think the theoretical bandwidth of that card is around 127 GB/s, so there should still be more to be had out of it, but I haven't been using the explicit code all that much in the last couple of months.
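
For what it's worth, a rough effective-bandwidth figure can also be had without the profiler by timing the kernel and dividing the bytes moved by the elapsed time. A minimal sketch (stencil3d and the one-read-plus-one-write-per-cell traffic estimate are placeholders, not my actual code):

#include <cstdio>
#include <cuda_runtime.h>

// placeholder kernel standing in for the real differencing code
__global__ void stencil3d(const float *in, float *out, int nx, int ny, int nz)
{
    // ... real stencil work elided ...
}

int main()
{
    const int nx = 256, ny = 256, nz = 256;
    const size_t n = size_t(nx) * ny * nz;

    float *d_in, *d_out;
    cudaMalloc((void**)&d_in,  n * sizeof(float));
    cudaMalloc((void**)&d_out, n * sizeof(float));

    dim3 block(32, 8, 1);
    dim3 grid(nx / block.x, ny / block.y, 1);    // 2D grid, threads march in z

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    stencil3d<<<grid, block>>>(d_in, d_out, nx, ny, nz);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // assume one read and one write per cell as a lower bound on traffic
    double bytes = 2.0 * double(n) * sizeof(float);
    printf("Effective bandwidth: %.1f GB/s\n", bytes / (ms * 1.0e-3) / 1.0e9);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}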