I know what this error message means, but I don’t know where the error comes from. I also know that the limits for dimBlock x, y, z are 512, 512, 64, and my program doesn’t hit them. When I set dimGrid.z = 1 the kernel call works. So where is my problem? Am I blind?
Only NVIDIA knows for sure, but I am guessing their scheduling hardware/software uses a single 32 bit word for the block ID, i.e. two 16 bit fields, which gives the 65535 x 65535 limit. It also makes some sense if you bear in mind the original purpose of all the current GPUs, i.e. rendering, which basically resolves down to a 2D grid of pixels for display with underlying 3D calculations for the values of each pixel. So 3D blocks with 2D grids isn’t such a surprising choice.
Of course this might all change with the next generation Fermi devices, but that is only speculation…
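For what it is worth, the standard workaround is to keep the launch 2D and recover the third index inside the kernel, either by looping over z within each block or by folding z into grid.y. A minimal sketch of the folding version (the kernel body and array layout are just placeholders, not anyone’s actual code):

    __global__ void kernel3d(float *data, int nx, int ny, int nz)
    {
        int blocksY = (ny + blockDim.y - 1) / blockDim.y;
        int z  = blockIdx.y / blocksY;      // slice index recovered from grid.y
        int yb = blockIdx.y % blocksY;      // y tile within that slice
        int x  = blockIdx.x * blockDim.x + threadIdx.x;
        int y  = yb * blockDim.y + threadIdx.y;

        if (x < nx && y < ny && z < nz)
            data[((size_t)z * ny + y) * nx + x] += 1.0f;   // placeholder work
    }

    // Host side: z is folded into grid.y, grid.z stays at 1.
    // Note (by * nz) must still be below 65536.
    // dim3 block(16, 16);
    // int bx = (nx + 15) / 16, by = (ny + 15) / 16;
    // dim3 grid(bx, by * nz);
    // kernel3d<<<grid, block>>>(d_data, nx, ny, nz);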
I think it will depend a lot on your global memory access patterns.
I have a number of different 3D finite difference discretisation kernels for convection-diffusion equations which require 3 or 5 point stencil data in each spatial direction. For memory coalescing reasons, I wound up using a 2D block and grid structure, with each thread traversing the third spatial dimension and maintaining a local shared memory cache of the data points necessary for constructing stencils in each block. In that arrangement, some threads per block don’t participate other than to do memory reads, and for branch divergence reasons, some do redundant calculations which get discarded, but it is still faster overall because it guarantees all global memory reads and writes are coalesced and keeps divergence and shared memory bank conflicts to a minimum (in the case of the double precision versions of the code).
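Not my actual code, but roughly the shape of it, for a simple 7 point stencil. The tile size, boundary treatment (zero-filled halo) and Laplacian coefficients are all placeholder choices; launch with dim3 block(TILE_X + 2, TILE_Y + 2) and a grid of nx/TILE_X by ny/TILE_Y blocks, assuming divisibility:

    #define TILE_X 16
    #define TILE_Y 16

    // The outer ring of the 18x18 block only loads the x-y halo and never
    // writes a result. Each thread marches along z, keeping its z-1 and z+1
    // values in registers and the current plane (with halo) in shared memory.
    __global__ void stencil3d(const float *in, float *out,
                              int nx, int ny, int nz)
    {
        __shared__ float plane[TILE_Y + 2][TILE_X + 2];

        const float c0 = -6.0f, c1 = 1.0f;   // placeholder Laplacian weights

        int x = (int)(blockIdx.x * TILE_X + threadIdx.x) - 1;
        int y = (int)(blockIdx.y * TILE_Y + threadIdx.y) - 1;
        bool inside = (x >= 0) && (x < nx) && (y >= 0) && (y < ny);
        bool writer = inside &&
                      threadIdx.x >= 1 && threadIdx.x <= TILE_X &&
                      threadIdx.y >= 1 && threadIdx.y <= TILE_Y;

        float below = 0.0f, centre = 0.0f, above;
        if (inside) {
            below  = in[((size_t)0 * ny + y) * nx + x];
            centre = in[((size_t)1 * ny + y) * nx + x];
        }

        for (int z = 1; z < nz - 1; ++z) {
            above = inside ? in[((size_t)(z + 1) * ny + y) * nx + x] : 0.0f;

            // every thread, halo ring included, refreshes the shared cache
            plane[threadIdx.y][threadIdx.x] = inside ? centre : 0.0f;
            __syncthreads();       // tile must be complete before anyone reads

            if (writer)
                out[((size_t)z * ny + y) * nx + x] =
                    c0 * centre +
                    c1 * (plane[threadIdx.y][threadIdx.x - 1] +
                          plane[threadIdx.y][threadIdx.x + 1] +
                          plane[threadIdx.y - 1][threadIdx.x] +
                          plane[threadIdx.y + 1][threadIdx.x] +
                          below + above);
            __syncthreads();       // all reads done before the tile is refilled

            below  = centre;       // slide the z window forward
            centre = above;
        }
    }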
If your data usage patterns are much closer to 1 global memory read per thread, or you are using textures, it probably doesn’t matter that much. The key things are a multiple of 32 threads per block, for scheduling and memory coalescing reasons, and at least 192 threads per multiprocessor to amortise instruction pipelining costs.
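For instance (launch parameters only, with a placeholder kernel called mykernel):

    // 16x16 = 256 threads per block: a multiple of 32, and with two or
    // three blocks resident per multiprocessor you are comfortably past
    // the ~192 threads per SM needed to cover the instruction pipeline.
    dim3 block(16, 16);
    dim3 grid((nx + block.x - 1) / block.x,
              (ny + block.y - 1) / block.y);
    mykernel<<<grid, block>>>(d_in, d_out, nx, ny, nz);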
Yes, two. At each iteration, all threads in a block do their global reads and shared memory updates and then synchronize. After that, the calculating threads do their work, then the participating threads do their writes, while the idle threads sit at a second thread barrier. Then the loop continues.
If you are referring to the Micikevicius paper, then (if my memory serves me correctly) they didn’t use the same strategy as what I am describing. In that implementation, they did the differencing in two passes with two separate kernels: first a 2D differencing (x-y) pass, then a 1D z pass on the results of the 2D differencing. They can do that because the high order central scheme they use can be decomposed into a set of partial sums. The difference operators for the hyperbolic terms I use can’t easily be done that way, so I do the differencing calculations in a single pass, but using a 2D kernel grid.
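Roughly, the trick (my notation, not theirs) is that a central difference operator is a sum of three independent 1D sums:

    D u = Dxx u + Dyy u + Dzz u,   where   Dxx u(x,y,z) = sum over i = -k..k of c_i * u(x+i, y, z)

and similarly for Dyy and Dzz. So the first kernel can accumulate (Dxx + Dyy) u into a temporary array, and the second kernel just adds Dzz u to it.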
I have done some very similar things recently, though I was using upwind differencing and I think I may have structured it in a slightly different way. What sort of memory bandwidth/FLOPS were you getting?
It has been a while since I did any profiling and I haven’t really done all that much in the way of serious optimization, but I seem to recall that the explicit version of the 3D differencing kernel (i.e. one using ghost cells) was hitting about 95 GB/s in single precision when I did look at it in the profiler running on a stock GTX 275. I think the theoretical bandwidth of that card is around 127 GB/s, so there should still be more to be had out of it, but I haven’t been using the explicit code all that much in the last couple of months.
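In case it is useful to anyone reading along: that sort of number is the usual effective-bandwidth estimate, i.e. total bytes read plus written divided by kernel time. A sketch of the measurement, with the launch itself left as a placeholder:

    #include <cuda_runtime.h>

    // time a kernel launch and convert bytes moved into GB/s
    double effective_gbps(double bytes_moved /* reads + writes */)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        // mykernel<<<grid, block>>>(...);   // placeholder launch
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);

        return bytes_moved / (ms * 1.0e6);   // bytes per ms -> GB/s
    }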