I made a small raytracer in CUDA that computes one pixel per kernel thread. While the performance is very good (in fact, I was amazed that I kept adding stuff and there wasn’t much of a difference, although that probably has more to do with glDrawPixels being the bottleneck than anything else), I reached a point where adding more commands made the program fail (the kernel wouldn’t execute). So I assume I hit some kind of command limit?
I’ve seen much more complex raytracers done in pixel shaders, so I wonder if kernels aren’t directly mapped to pixel-shader functionality but are more primitive, in the sense that a single pixel shader may be “executed/converted” as more than one kernel.
The instruction limit is 2 million instructions. What you probably did is set the block size too high, and adding more code increased your register use until the block was requesting more registers than a multiprocessor has. To monitor register use, add the flag “--ptxas-options=-v” to the nvcc call. Registers-per-thread * threads-per-block must be somewhat less than 8192 (G80) or 16384 (GT200).
You do indeed call your kernel with 512 threads (and 512 blocks), so each block calculates a line and each thread calculates a pixel. But that will only work when your kernel uses a maximum of
8192/512 = 16 (or 32 on GT200) registers.
I’ve read it, but it doesn’t mention anything about the pixel-shaders question. Also, I don’t remember anything about the last hints part. There is a part on optimization, but I don’t remember it mentioning anything related to the current nvcc implementation. Of course I might be wrong, since I read the whole document in a single sitting, and I might have missed some things near the end (after reading a document for hours you miss stuff).
After I gave you my answer, it didn’t look like you knew what a “block” was.
There is no fixed limit on registers per thread, just on the registers used by a block. You can have up to 128 registers per thread if your block is small enough (64 or 128 threads). You do not need to call the kernel multiple times to process a 2048x1536 image.
I knew what a block is, but I might not have worded it properly. Yesterday was the first day I touched CUDA, and I’m not yet sure about a few things. For example, when I mention a “kernel call” I mean this (from my updated code):
So, to be sure (I just read the relevant pages in the programming guide, but just to be sure): each one of these calls creates a grid of 512 blocks, each containing 128 threads, right? Also, these calls are scheduled to execute in order; they do not overlap because they have no stream defined, right?
I have no background in parallel programming, so all this is new stuff to me, but it looks very interesting :-)
Then what does the documentation mean when it writes the following?
What I understood from this is that the threads in two streams will be executed at the “same time” (not literally at the same time, since from what I’ve understood threads are time-sliced), in the sense that the scheduler will schedule them more efficiently.
I’ve read the page about streams a few times and I’m still not sure where streams should be used. Are they only a host-oriented feature, so the host can track what the GPU is doing (via events?), or do they actually affect what the GPU does?
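They do affect the device side: within one stream, operations run in issue order; between streams there is no ordering, which is what lets the hardware overlap work where it can (on these early parts, mainly an async copy with a kernel, not two kernels). A minimal sketch (the kernel, sizes, and names are made up for illustration; async copies need pinned host memory):

```cuda
#include <cuda_runtime.h>

__global__ void shade(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;  /* placeholder per-pixel work */
}

int main(void)
{
    const int n = 1 << 16;
    float *h, *d0, *d1;
    cudaMallocHost(&h, 2 * n * sizeof(float));  /* pinned host memory */
    cudaMalloc(&d0, n * sizeof(float));
    cudaMalloc(&d1, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    /* Copy + kernel in s0 run in issue order; the copy queued in s1
       has no ordering against s0, so on hardware that supports it
       it can overlap the kernel running in s0.  With no stream
       argument everything lands in stream 0 and is serialized. */
    cudaMemcpyAsync(d0, h,     n * sizeof(float), cudaMemcpyHostToDevice, s0);
    shade<<<n / 128, 128, 0, s0>>>(d0, n);
    cudaMemcpyAsync(d1, h + n, n * sizeof(float), cudaMemcpyHostToDevice, s1);
    shade<<<n / 128, 128, 0, s1>>>(d1, n);

    cudaDeviceSynchronize();  /* host waits for both streams */

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d0);
    cudaFree(d1);
    cudaFreeHost(h);
    return 0;
}
```

Events are the host-side half of the picture: you record them into a stream to find out when the device has reached that point.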