Are kernels more "primitive" than pixel shaders? (and also, my first cuda program :-P)

I made this small raytracer in CUDA that computes a pixel per kernel. While the performance is very good (in fact, i was amazed that i was adding stuff and there wasn’t much of a difference - although this probably has to do with glDrawPixels being a bottleneck than anything else), i reached to a point where adding more commands made the program to fail (the kernel wouldn’t execute). So i assume i reached some kind of command limit?

I’ve seen raytracers much more complex being done in pixel shaders, so i wonder if kernels aren’t directly mapped to pixel shader functionality but are more primitive in a sense that a single pixel shader can be “executed/converted” to more than one kernel.

The command limit is 2 million instructions. What you probably did is set the block size too high, and adding more code increased your register use until the block was requesting more registers than an MP has. To monitor register use, add the flag “–ptxas-options=-v” to the nvcc call. Registers-per-thread * threads-per-block must be somewhat less than 8192 (G80) or 16384 (G200).

You call your kernel indeed with 512 threads (and 512 bocks), so each block calculates a line, each thread calculates a pixel. But that will indeed only work when your kernel uses a maximum of
8192/512 = 16 (or 32 on GT200) registers.

Ah i see, thank you. So i assume a pixel shader for a 2048x1536 area (for example) is broken to more than one kernel calls, right?


I added a call for a second ray for antialiasing, and…

ptxas info	: Used 40 registers, 28+0 bytes lmem, 24+20 bytes smem, 24 bytes cmem[0], 208 bytes cmem[1]

…40 registers. So this exceeds the 32 registers per thread.

I will try to solve this by doing two calls per frame. However, are there any tips about minimizing register use without resorting to local memory? For instance, does this:

if (...) {

	int a;



if (...) {

	int b;



generate code that will use the same register for both a and b or i’m wasting resources and should have used a single variable outside the ifs?

Please read the programming guide.

I’ve read it but it doesn’t mention anything about the pixel shaders question. Also i don’t remember anything about the last hints part. There is an part on optimization, but i don’t remember it mentioning anything related to this current nvcc implementation. Of course i might be wrong since i read the whole document in a single stroke and i might have missed some things near the end (after reading a document for hours you miss stuff).

Thanks for your help.

After I gave you my answer, it didn’t look like you knew what a “block” was.

There is no set limit on registers. Just on registers used by a block. You can have up to 128 registers, if your block is small enough (64 or 128 threads). You do not need to call the kernel multiple times to process a 2048x1536 image.

if you call 512 * 16 blocks and have 512/16 threads per block, you can use 32*16 registers per thread.

and then just adjust your x & y calculation


global void calc_pixel(int* pixels, float ft)


int y = blockIdx.x>>4;

int x = threadIdx.x + ((blockIdx.x & 0xf)*16);

int idx = y*512 +x;



I knew what a block is, but i might not have worded it properly. Yesterday is the first day i touched CUDA and i’m not very sure about a few things. For example, when i mention “kernel call” i mean this (from my updated code):

calc_pixel<<<512, 128>>>(pixels, 0, itercount, antialias, ft);

calc_pixel<<<512, 128>>>(pixels, 128, itercount, antialias, ft);

calc_pixel<<<512, 128>>>(pixels, 256, itercount, antialias, ft);

calc_pixel<<<512, 128>>>(pixels, 384, itercount, antialias, ft);

So to be sure (and i just read the pages in the programming guide, but just to be sure): each one of these calls creates a grid with with 512 blocks each one containing 128 threads, right? Also these are scheduled to be executed in order - they do not overlap because they have no stream defined, right?

I have no background in parallel programming so all this is new stuff to me, but they look very interesting :-)

You can do it in one kernel call, just call 4 times as many blocks.

But yes, each call creates a grid and they are executed in order. Using streams does not change that, there is currently no option to run more than 1 kernel at the same time.

OOps, i mixed the maximum numbers of the blocks-per-grid and threads-per-block and i never tried any value above 512. Now i did

calc_pixel<<<4096, 64>>>(pixels, 0, itercount, antialias, ft);

and it runs faster (i tried several combinations, this seems to have the best results - at least when using power of two numbers).

However was i right or wrong about my understanding of the previous four calls (i mean about the numbers, etc)?

Oh, i didn’t saw the edit.

Then what does the documentation mean when it writes?

What i thought from this is that the threads in two streams will be executed at the “same time” (not really at the same time though since from what i’ve understood threads are time sliced) in the sense that the scheduler will schedule them more efficient.

I’ve read the page about streams a few times and i’m not sure where streams should be used. Are they only a host-oriented feature so the host can know what the GPU does (via events?) or they affect what the GPU does?

streams are for overlapping calculation on GPU with calculation on CPU. there is a simpleStreams example in the SDK.
will try to post some comments in your forum later today or tomorrow.