Limit on the size of data that can be processed by a kernel (newbie question)

Hi all,

    I am new to CUDA and not an expert in C either, so for now I am just trying out a few programs (using the series of articles in Dr. Dobb's). I wrote and compiled the code given in the article that introduces kernels. The code just takes an array, transfers it to the GPU, increments every element of the array by 1, and sends it back to the host. I was then playing around with this code, changing the size of the array and checking at what size it would stop working. A few things do not seem to make sense, and I would really appreciate it if someone could help me out a bit. I am doing all this using the emulator.
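For reference, here is a minimal sketch of the kind of program described; the names (`incrementArray`, `blockSize`, etc.) are my own assumptions, not the article's exact code:

```cuda
#include <cstdio>

// Kernel: each thread increments one element of the array.
__global__ void incrementArray(int *a, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)        // guard threads that fall past the end
        a[idx] += 1;
}

int main()
{
    const int N = 2048;
    const int blockSize = 4;
    // Round up so every element is covered by some block.
    const int numBlocks = (N + blockSize - 1) / blockSize;

    int h_a[N];
    for (int i = 0; i < N; ++i) h_a[i] = i;

    int *d_a;
    cudaMalloc((void **)&d_a, N * sizeof(int));
    cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice);

    // First launch argument is the grid size (number of blocks),
    // the second is the block size (threads per block).
    incrementArray<<<numBlocks, blockSize>>>(d_a, N);

    cudaMemcpy(h_a, d_a, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    return 0;
}
```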

If my array size increased beyond 2048 integers, the result I got was an unincremented array. My block size was defined to be 4, and the number of blocks was calculated to be as many as needed to accommodate the array. When I increased the block size, the limit on the array size also increased: the array could be at most 512 * blocksize elements. Any more than that and my array was returned unincremented.

I figure this is because the kernel was not getting launched, since the available resources are not enough to run it… right?

i) I read in the programming guide that “if the shared memory and registers available per multiprocessor are not sufficient to execute at least one block, the kernel will fail to launch”. But if my kernel were failing to launch because of insufficient shared memory, how come the array size limit increases when I increase the block size? When I increase the block size, I am increasing the number of threads per block, and more threads means more shared memory required. So shouldn't the array size limit decrease as I increase my block size, since the amount of shared memory required per block increases?

Basically what I am saying is: when my block size is smaller, I am dividing the problem into a lot of small blocks, and when I increase the block size, I divide it into a smaller number of larger blocks. Intuitively you would think it is easier to execute a smaller block than a larger one, but it seems to be the opposite here.

ii) The only other explanation I can think of for why the kernel fails to launch is that there is a limit on the maximum number of blocks you can use. Max. array size = blocksize * 512 would imply a maximum of 512 allowed blocks. But I read that the number of blocks in a grid is not constrained by the device and can greatly exceed the number of available multiprocessors. Is this maybe some kind of limitation imposed because I am using the emulator?

I get the feeling I am missing something very elementary here, but I've spent quite some time thinking and still can't figure out what… Any help would be greatly appreciated.



I have a feeling you confused block size and grid size when calling your kernel. In fact, you would never want to launch blocks consisting of 4 threads each; that is very inefficient. On the other hand, it's quite common to launch 512 threads per block, which is also the maximum number of threads per block allowed.

I believe you’re actually, by accident, launching 4 blocks of 512 threads each for the 2048-element array, and then trying to increase the number of threads per block to accommodate the larger array.
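In other words, something like this (a hypothetical reconstruction with assumed names, matching the numbers reported above):

```cuda
const int N = 2048;
const int blockSize = 4;
const int numBlocks = (N + blockSize - 1) / blockSize;  // 512 blocks

// Intended launch: grid size first, block size second.
incrementArray<<<numBlocks, blockSize>>>(d_a, N);

// Swapped by accident: 4 blocks of 512 threads each.
// This still covers exactly 4 * 512 = 2048 elements, so it appears
// to work up to N = 2048 -- but any larger N asks for more than
// 512 threads per block, the launch fails, and the array comes
// back unchanged. That also explains why the observed limit was
// 512 * blocksize: here the swapped "thread count" is N / blockSize,
// which must stay at or below 512.
incrementArray<<<blockSize, numBlocks>>>(d_a, N);
```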

A simple increment will most definitely not consume too many resources, and 512 blocks is quite a small number for a GPU.

Aargh! I knew it would be something very basic. That was rather boneheaded of me… :">

Thanks a lot for helping me out there. I wasn't able to figure this out, and I couldn't move ahead without it.

Thanks again.