Limit on the size of data that can be processed by a kernel (newbie question)

Hi all,

I am new to CUDA and not an expert in C either, so for now I am just trying out a few programs (following the series of introductory articles in Dr. Dobb's). I wrote and compiled the code given in the article introducing kernels. The code just takes an array, transfers it to the GPU, increments every element of the array by 1 and sends it back to the host. I was then playing around with this code, changing the size of the array and checking at what size it would stop working. A few things do not seem to make sense, and I would really appreciate it if someone could help me out a bit. I am doing all this using the emulator.
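For reference, the kernel I am using is essentially of this form (a rough sketch from memory, not the exact code from the article; the names are mine):

```cpp
// Rough sketch of the increment kernel described above (names are hypothetical).
__global__ void incrementArray(int *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
    if (idx < n)                                      // guard threads past the end of the array
        a[idx] = a[idx] + 1;
}
```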

If my array size increased beyond 2048 integers, the result I got was an unincremented array. My blocksize was defined to be 4, and the number of blocks was calculated to be as many as needed to accommodate the array. When I increased the blocksize, the limit on the array size also increased: the array could be at most 512 * blocksize elements. Anything larger and my array was returned unincremented.

I figure this is because the kernel is not getting launched, since the available resources are not enough to run it... right?

i) I read in the programming guide that "if the shared memory and registers available per multiprocessor are not sufficient to execute at least one block, the kernel will fail to launch". But if my kernel failing to launch were caused by insufficient shared memory, how come the array size limit increases when I increase the blocksize? Increasing the blocksize means more threads per block, and more threads means more shared memory required. So shouldn't the array size limit decrease when I increase my blocksize, since the amount of shared memory required per block increases?

Basically, what I am saying is: with a smaller blocksize I am dividing the problem into many small blocks, and with a larger blocksize into fewer, larger blocks. Intuitively, it should be easier to execute a smaller block than a larger one, but it seems to be the opposite here.

ii) The only other explanation I can think of for why the kernel fails to launch is that there is a limit on the maximum number of blocks you can use. Max array size = blocksize * 512 implies a maximum of 512 allowed blocks. But I read that the number of blocks in a grid is not constrained by the device and can greatly exceed the number of available multiprocessors. Is this maybe some kind of limitation imposed because I am using the emulator?

I get the feeling that I am missing something very elementary here, but I've spent quite some time thinking and still haven't been able to figure out what... Any help would be greatly appreciated.

best,

Avinash

I have a feeling you confused blocksize and gridsize when calling your kernel. In fact, you would never want to launch blocks of only 4 threads each - very inefficient. On the other hand, it's quite common to launch 512 threads per block, which is also the maximum number of threads per block allowed.

I believe you're actually launching 4 blocks of 512 threads each for the 2048-element array, and then trying to increase the number of threads per block to accommodate the larger array.
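To illustrate, the launch syntax is <<<numberOfBlocks, threadsPerBlock>>> - grid size first, block size second. Something along these lines (a sketch with hypothetical names; your variables will differ):

```cpp
// Correct: grid size (number of blocks) first, threads per block second.
int blockSize = 4;                                // threads per block (as in your test)
int nBlocks   = (N + blockSize - 1) / blockSize;  // enough blocks to cover N elements
incrementArray<<<nBlocks, blockSize>>>(d_a, N);

// Accidentally swapping the two arguments, e.g.
//   incrementArray<<<blockSize, nBlocks>>>(d_a, N);
// makes nBlocks the threads-per-block count, so the launch fails as soon as
// nBlocks exceeds 512 - which is exactly the 512 * blocksize limit you observed.
```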

A simple increment will definitely not consume too many resources, and 512 blocks is quite a small number for a GPU.
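If you want to see the actual limits of your device (or the emulated device), you can query them at runtime; a quick sketch using cudaGetDeviceProperties:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);            // properties of device 0
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max grid size:         %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Shared memory / block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```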

Aargh!!.. I knew it would be something very basic. That was rather boneheaded of me… :">

Thanks a lot for helping me out there. I wasn't able to figure this out and couldn't move ahead without it.

Thanks again.

best,
Avinash