using 'z' within the grid size

nuliknol · December 24, 2009, 3:53pm

Hi,

I get this error when I execute my kernel (in emu mode):

cutilCheckMsg() CUTIL CUDA error: Kernel execution failed in file <t.cu>, line 18 : invalid configuration argument.

dim3 grid(512,512,512);

   .....

	kernel<<< grid , 512 >>>(result);

Is this because the z variable in the grid is 512? The docs say z can only be 1, but I do not exceed the 65536x65536 maximum grid size, so why can’t ‘z’ be used? I need x, y and z each to run within the 1-512 range. Is there a way to do this without extra calculations? Right now the only solution seems to be running the kernel with ‘x’ from 1-512 and ‘y’ from 1-65536 and then dividing ‘y’ by 512 to simulate ‘z’, but the division takes many GPU cycles. How do you guys do it?

Thanks in advance for any ideas or suggestions.

king1 · December 24, 2009, 4:05pm

Only 2 dimensions are allowed for the grid, so the extra calculations must be done.

Mr_Nuke · December 25, 2009, 10:44pm

Only the x and y dimensions of the grid can be greater than 1; z must stay at 1.

Simulating z by dividing y seems like a good way to go. If the division takes enough GPU cycles that you notice a performance hit, then your kernel does far too few calculations, and you might consider setting up your configuration differently.
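Here is a rough sketch of how the simulated z could look, not a definitive implementation. It assumes each grid dimension is capped at 65535, so 512 x 512 blocks do not fit into y alone and the 512^3 logical blocks are spread over both x and y of a 2-D grid. Because 512 is a power of two, shifts and masks stand in for the division. The names decodeBlock, probeKernel and launchFlattened are made up for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: turn a flat block number into logical (x, y, z)
// coordinates, each in [0, 512). 512 == 1 << 9, so no division is needed.
__host__ __device__ inline uint3 decodeBlock(unsigned int linear)
{
    uint3 b;
    b.x = linear & 511u;          // lowest 9 bits
    b.y = (linear >> 9) & 511u;   // next 9 bits
    b.z = linear >> 18;           // remaining 9 bits
    return b;
}

// Hypothetical kernel: every block recovers its logical coordinates; the
// block at logical (1, 2, 3) writes them out so the host can check them.
__global__ void probeKernel(unsigned int *result)
{
    // Flat block number inside the 2-D grid.
    unsigned int linear = blockIdx.y * gridDim.x + blockIdx.x;
    uint3 b = decodeBlock(linear);

    if (threadIdx.x == 0 && b.x == 1u && b.y == 2u && b.z == 3u) {
        result[0] = b.x;
        result[1] = b.y;
        result[2] = b.z;
    }
}

// Launch 512^3 = 2^27 logical blocks as an 8192 x 16384 grid
// (8192 * 16384 == 2^27, and both extents stay below 65535).
// d_result must point to at least 3 unsigned ints in device memory.
void launchFlattened(unsigned int *d_result)
{
    dim3 grid(8192, 16384);
    probeKernel<<<grid, 512>>>(d_result);
}
```

The decode costs one multiply-add plus a few shifts and masks per thread, which should be negligible next to any real work the kernel does.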

For example, I worked with a simulation that could be parallelized to n by p threads, but then for every n, the results of p_i had to be summed. Originally, I wanted to have n by p threads, but that meant either 2 separate kernels, or a reduction later on, with insane amounts of memory. Having only n threads proved to be much faster and resource friendly than anticipated.

So another strategy would be to change your approach so that you need one fewer dimension. You could also use a 3-dimensional block with a 2-D grid.
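A minimal sketch of the 3-dimensional block with a 2-D grid, under some assumptions: the block is 8 x 8 x 8 (512 threads), and since the grid has no z extent, the z coordinate comes from threadIdx.z alone, so the depth covered by one launch is limited to the block's z size. The names cellIndex, fillVolume and launchFillVolume are hypothetical.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: row-major offset of cell (x, y, z) in a
// width x height x depth volume.
__host__ __device__ inline int cellIndex(int x, int y, int z,
                                         int width, int height)
{
    return (z * height + y) * width + x;
}

// Hypothetical kernel: one thread per cell of the volume.
__global__ void fillVolume(float *out, int width, int height, int depth)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = threadIdx.z;  // the grid is 2-D, so z is supplied by the block only

    if (x < width && y < height && z < depth)
        out[cellIndex(x, y, z, width, height)] = (float)(x + y + z);
}

// d_out must hold width * height * 8 floats in device memory.
void launchFillVolume(float *d_out, int width, int height)
{
    const int depth = 8;
    dim3 block(8, 8, depth);                      // 8 * 8 * 8 = 512 threads
    dim3 grid((width  + block.x - 1) / block.x,   // round up to cover edges
              (height + block.y - 1) / block.y);
    fillVolume<<<grid, block>>>(d_out, width, height, depth);
}
```

To cover a deeper volume with this layout, one option would be to loop over z slabs inside the kernel or launch it once per slab.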