I’m going to implement a ray tracer. It’s all about binary tree traversal (with a short stack) and some simple computations.
I have two schemes for the implementation:
1. Implement the whole algorithm as one large kernel, so all the intermediate data (for example, the short stack) is stored in local variables.
2. Implement it as a set of small kernels that use global memory to store the intermediate data.
I don’t know which one will perform better.
My guess:
For the first scheme, the kernel may use all the available registers and even local memory for intermediate data storage, but the compiler will try to optimize the register usage.
For the second scheme, each kernel uses fewer registers, but they must all store their intermediate data in global memory, and the compiler won’t have a chance to optimize register usage across kernels.
So the first one should be better.
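To make the first scheme concrete, here is a minimal sketch of a single kernel that keeps its short stack in a per-thread local array. The `Node` layout, the 1D interval test standing in for a bounding-box test, `STACK_SIZE`, and all function names are assumptions made up for illustration; they are not taken from the actual ray tracer.

```cuda
#include <cuda_runtime.h>

// Hypothetical node layout: children are indices into the node array, -1 means "no child".
struct Node {
    float lo, hi;     // 1D interval standing in for a bounding box
    int left, right;
};

constexpr int STACK_SIZE = 16;  // assumed depth of the short stack

// Counts the leaves whose interval contains x.
// The stack is a per-thread local array, so it never touches global memory explicitly.
__host__ __device__ int countHits(const Node* nodes, int root, float x)
{
    int stack[STACK_SIZE];
    int top = 0;
    int hits = 0;
    stack[top++] = root;
    while (top > 0) {
        const Node n = nodes[stack[--top]];
        if (x < n.lo || x > n.hi) continue;                   // prune this subtree
        if (n.left < 0 && n.right < 0) { ++hits; continue; }  // leaf node
        // Simplification: on overflow the child is dropped. A real short-stack
        // scheme would restart or fall back instead.
        if (n.right >= 0 && top < STACK_SIZE) stack[top++] = n.right;
        if (n.left >= 0 && top < STACK_SIZE) stack[top++] = n.left;
    }
    return hits;
}

// One thread per query; the root is assumed to be node 0.
__global__ void traceKernel(const Node* nodes, const float* xs, int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = countHits(nodes, 0, xs[i]);
}

// Host helper: copies the inputs over, launches the kernel, copies the results back.
bool traceOnDevice(const Node* hNodes, int nodeCount, const float* hXs, int* hOut, int n)
{
    Node* dNodes = nullptr; float* dXs = nullptr; int* dOut = nullptr;
    bool ok = cudaMalloc(&dNodes, nodeCount * sizeof(Node)) == cudaSuccess
           && cudaMalloc(&dXs, n * sizeof(float)) == cudaSuccess
           && cudaMalloc(&dOut, n * sizeof(int)) == cudaSuccess;
    if (ok) {
        cudaMemcpy(dNodes, hNodes, nodeCount * sizeof(Node), cudaMemcpyHostToDevice);
        cudaMemcpy(dXs, hXs, n * sizeof(float), cudaMemcpyHostToDevice);
        traceKernel<<<(n + 127) / 128, 128>>>(dNodes, dXs, dOut, n);
        ok = cudaMemcpy(hOut, dOut, n * sizeof(int), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dNodes); cudaFree(dXs); cudaFree(dOut);  // cudaFree(nullptr) is a no-op
    return ok;
}
```

Because the stack is indexed dynamically, the compiler will usually place it in local memory rather than registers, which is part of what the question above is asking about.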
If you can reuse the data you place in shared memory, it is better to have one large kernel (not considering the issues you might have because of the Timeout Detection and Recovery (TDR) mechanism - you can always disable it). You will still be able to divide your algorithm by declaring device functions.
If you cannot reuse the data in shared memory, then you should probably split things into several kernels.
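As a rough illustration of the "large kernel divided into device functions" idea, the sketch below runs two stages inside one kernel and passes data between them through shared memory only. The stage names, the `TILE` size, and the arithmetic are hypothetical placeholders, not anything from the ray tracer being discussed.

```cuda
#include <cuda_runtime.h>

constexpr int TILE = 128;  // assumed block size; the kernel must be launched with blockDim.x == TILE

// Stage 1 (hypothetical): square each element in place in shared memory.
__device__ void stageSquare(float* tile)
{
    tile[threadIdx.x] *= tile[threadIdx.x];
}

// Stage 2 (hypothetical): combine each element with its neighbour,
// reading values produced by stage 1 without a round trip through global memory.
__device__ float stageAddNeighbour(const float* tile)
{
    int j = (threadIdx.x + 1) % blockDim.x;
    return tile[threadIdx.x] + tile[j];
}

__global__ void fusedKernel(const float* in, float* out, int n)
{
    __shared__ float tile[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // pad the tail with zeros
    __syncthreads();
    stageSquare(tile);
    __syncthreads();                             // stage 2 reads other threads' results
    float r = stageAddNeighbour(tile);
    if (i < n) out[i] = r;
}

// Host helper: runs fusedKernel over n floats and returns true on success.
bool runFused(const float* hIn, float* hOut, int n)
{
    float* dIn = nullptr; float* dOut = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);
    fusedKernel<<<(n + TILE - 1) / TILE, TILE>>>(dIn, dOut, n);
    bool ok = cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dIn);
    cudaFree(dOut);
    return ok;
}
```

With the split-kernel scheme, the squared values would have to be written to global memory by one kernel and read back by the next; here they stay in shared memory for the lifetime of the block.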
I have a very similar question in another thread … The basic issue is “intermediate data” … In my code, I can reuse all of the intermediate data, so I definitely want a large kernel that keeps all data on-chip and throws away intermediate data by overwriting it (by staying on-chip I am sure to avoid memory access bottlenecks). But the complication I have is that in some cases, where “verbose output” is needed, I want to copy the intermediate data to device memory RAM at each step, before it is overwritten. And another thing: what if the intermediate data is large (1-5KB) and kept in a single struct (class)? Can it all fit on the chip without causing unsolvable problems, like not being able to launch enough threads, or unnecessarily suffering from “thrashing” due to the hardware automatically doing “register spilling”?
Regarding RAM: You could allocate pinned, mapped host memory and simply set values from the kernel as if you were storing data to global memory. Sending data to host RAM is slow, but since you do not read it back again, hopefully a “fire-and-forget” mechanism will kick in, and you would end up with a kernel running in parallel with the data transfer between GPU and host. Once the data is en route to the host, you can overwrite the shared memory with new stuff and continue your work normally. If that works out, I believe this is the best option for you.
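A minimal sketch of that idea, assuming a device that supports mapped host memory: the kernel name, the helper name, and the value written are placeholders for whatever intermediate data you want to log. On older setups you may also need to call `cudaSetDeviceFlags(cudaDeviceMapHost)` before any other CUDA call; this sketch does not do that.

```cuda
#include <cuda_runtime.h>

// The kernel writes straight into host memory through the mapped pointer,
// exactly as it would write to a global-memory buffer.
__global__ void writeVerbose(float* hostMapped, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) hostMapped[i] = 2.0f * i;  // placeholder for an intermediate value
}

// Returns true if every value arrived in host memory as expected.
bool runMappedExample(int n)
{
    float* hPtr = nullptr;  // host-side view of the buffer
    float* dPtr = nullptr;  // device-side alias of the same buffer

    // Pinned host allocation that is also mapped into the device address space.
    if (cudaHostAlloc((void**)&hPtr, n * sizeof(float), cudaHostAllocMapped) != cudaSuccess)
        return false;
    if (cudaHostGetDevicePointer((void**)&dPtr, hPtr, 0) != cudaSuccess) {
        cudaFreeHost(hPtr);
        return false;
    }

    writeVerbose<<<(n + 127) / 128, 128>>>(dPtr, n);

    // The host must not read the buffer until the kernel has finished.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (hPtr[i] == 2.0f * i);

    cudaFreeHost(hPtr);
    return ok;
}
```

Whether the writes really overlap with computation the way a “fire-and-forget” store would is something you would have to measure; the sketch only shows the allocation and pointer plumbing.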
Actually, I’m wondering how the compiler allocates registers and memory. Is it true that the more local variables there are, the more registers get used? When the compiler knows it will run out of registers, how does it solve the problem? Does it try to move some variables to device memory, or does it just leave the problem to the hardware (say, register spilling)?
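One way to look into this empirically: registers are assigned at compile time, and whatever does not fit is spilled to per-thread local memory, which lives in device memory. Compiling with `nvcc -Xptxas -v` prints the register count and spill sizes per kernel, and the same numbers can be read at run time with `cudaFuncGetAttributes`. The sketch below assumes a made-up kernel (`stackKernel`) with a dynamically indexed local array, which the compiler typically places in local memory; the helper name is also hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: the array is indexed with a runtime value,
// so it usually cannot be kept entirely in registers.
__global__ void stackKernel(const int* idx, float* out, int n)
{
    float stack[32];
    for (int k = 0; k < 32; ++k) stack[k] = 0.5f * k;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = stack[idx[i] & 31];
}

// Queries the compiled kernel's per-thread resource usage and prints it.
cudaError_t reportKernelResources(cudaFuncAttributes* attr)
{
    cudaError_t err = cudaFuncGetAttributes(attr, stackKernel);
    if (err == cudaSuccess) {
        printf("registers per thread: %d, local memory per thread: %zu bytes\n",
               attr->numRegs, attr->localSizeBytes);
    }
    return err;
}
```

If the register count turns out to limit how many threads you can launch, `__launch_bounds__` on the kernel or the `-maxrregcount` compiler option can cap it, at the cost of more spilling.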
To Cygnus X1:
Mapped memory won’t help much in my situation, since there isn’t much host/device data transfer. Most of the intermediate data stays in device memory and is used by the following kernels (for scheme 2).