Possible CUDA improvements

I’ll keep this short and bulleted:

  1. Say I have a kernel that I pass cudaMallocPitch data. If I have more than 16 threads per block and each one is accessing a unique “row” (idx * pitch). I’m creating non-coalesced memory accesses. Further cudaMallocPitch, from my observation consumes pitch * height. So really if you’re on a larger memory device and use this on say 1024 byte width rows, you’re shooting yourself in the foot (For algorithms that require walking by row linearly). Although I’ve been able to solve all but one or two instances with complicated algorithmic changes. Further if you flip it than you’re grinding on one bank, which defeats the point of pitch.

Solution: Keep a per host application context of memory allocated via various cudaMalloc’s and reuse wasted alignment space.

  1. The amount of work that one has to put into modifying NVCC PTX output before passing kernels to ptxas. I find the PTX ISA 1.1 inaccurate, so this takes a while to familiarize yourself with (ie: still says it supports add.v4.s32 in numerous places). Why do I have to modify the NVCC output? A few things, refusal to respect align(x) on structures for anything more than address alignment. Say I have the structure: struct align(16) test { int x, y; float z, w; }. NVCC refuses to coalesce the load into one ld..v4 or st..v4. This vastly impacts performance. Second, NVCC doesn’t utilize setp in most applicable locations. Even using the PTX ISA C translation of the instruction with a predicated else conditional does not yield a setp instruction.

Solution: Write an appropriate optimizer for NVCC with regards to loads and stores, possibly coalescing then splitting load results, implement per the CUDA language ref support for align, and implement an aggressive peephole optimizer for using predicated logic. Further implement an equivalent of __asm() with __ptx() or some other keywords. People like options.

  1. In device warp/thread management would yield a huge performance gain. For example say I have a block of 128 threads, randomly 50% take one large branch, 50% take another. If there were a way to regroup threads into appropriate warps based on their branch decision, this would be a huge performance gain. Typically I’ve dealt with this by splitting kernels into stages, scanning, regrouping, then finishing the kernel, but typically this costs the same performance if not worse.

Solution: Have an in kernel voting scheme with respect to regrouping of warps (all those that say X group regroup into a warp, all those that say Y regroup into a warp, etc). Or auto-manage it based on execution path. Doesn’t appear to be the case on my 8800GTX or GTX 280.

Example of poorly optimized NVCC PTX output (yes the -O3 flag was passed):

    ld.local.u32    %r28, [%rd19+0];        // id:140 __cuda___cuda___cudaparm__Z11cump_addsubi6cump_sS_S____val_paramsrc21648+0x0
    st.local.u32    [%rd20+0], %r28;        // id:139 __cuda___cuda__temp_mcselect14880+0x0
    ld.local.u32    %r29, [%rd19+4];        // id:140 __cuda___cuda___cudaparm__Z11cump_addsubi6cump_sS_S____val_paramsrc21648+0x0
    st.local.u32    [%rd20+4], %r29;        // id:139 __cuda___cuda__temp_mcselect14880+0x0
    ld.local.u32    %r30, [%rd19+8];        // id:140 __cuda___cuda___cudaparm__Z11cump_addsubi6cump_sS_S____val_paramsrc21648+0x0
    st.local.u32    [%rd20+8], %r30;        // id:139 __cuda___cuda__temp_mcselect14880+0x0
    ld.local.u32    %r31, [%rd19+12];       // id:140 __cuda___cuda___cudaparm__Z11cump_addsubi6cump_sS_S____val_paramsrc21648+0x0
    st.local.u32    [%rd20+12], %r31;       // id:139 __cuda___cuda__temp_mcselect14880+0x0

Mind you this is from:
device void cump_test(uint4 *param)
uint4 lcl;
for(i = 0; i < loopmax; i++)
lcl = param[i];


I thought about this problem as well. There’s one major problem: thread IDs are immutable. If you “rearrange” threads, the thread IDs will change, yet very very often you have work or indexing or assignments which are based on a (now changed) thread ID. Every user solution would be painful and confusing.

This means that any regrouping has to be done at a very low level to abstract it away from the user, adding an indirection so that tid is based on a lookup and not simple offset from some base. Ugh!

And remember, there IS already free regrouping at the half-warp level, in the sense that half warps that are all disabled are not scheduled, so your divergence splits are only penalized at the 16 thread half-thread level.

I thought about some kind of opcode which says “swap all of my registers with thread #X”. That’s effectively a thread swap since a thread is defined by its register state. But that’s tricky too… there is local RAM, and you have to manually deal with tid changes, etc. And, for that matter, you can do this YOURSELF with some explicit shared memory swaps if it were really crucial.

So, what I concluded was that even with hypothetical CUDA support, “thread swapping” is actually inelegant. The GPU scheduler is so efficiently lightweight, it’s counterproductive to try to rearrange things to reduce divergence. Sure there’s tons of counterexamples like “if (!(threadIdx.x & 15)) { do something…}” but for the most common case where some random threads, even a majority, disable for a dozen clocks or something, it’s a lot more more work than it’s worth. And for the worst cases, where a single thread is doing significant work all by itself, that’s more of your algorithm’s poor choice than it is a repacking issue.

I think the current 16-thread half-warp size is a pretty good sweet spot for scheduling, too. Though I wonder how much GPU die space it’d cost to switch the scheduler to warps of 16 threads (and removing any special meaning of a half-warp), and also how much performance gain apps would get from the more nimble execution paths.

Well, I think something like this has already been done for compute 1.3 devices. If you look in the programming guide there is some explanation, but GT200 only has coalesced loads & stores. I have not fully grasped exactly what it does, but then again, I am coalescing on earlier hardware, so I don’t have to worry yet.

More of a reorganization of warp to thread mapping not so much of a block to thread mapping. For example on the GTX 280 and Tesla devices, and what seems to be the direction of future devices, block dimensions are growing. If I launch 512 threads in a block, they can be reorganized into different warp groups with respect to branches and not affect the thread id.

maximum block size did not change on GT200, it is & was 512 threads per block. I was a little disappointed, since a MP can handle 1024 threads, so it would have been nice if the maximum block size would have been 1024 also, I could have used that because I have a algorithm where I have to scan arrays of size 2048 ;)

Also rearranging threads in other warps will break a lot of software, e.g. reduction.

Sorry, I went from an 8800 GTX to a GT200 device. Didn’t know 512 wasn’t new to GT200.

512 was possible before, but it often was not the best option, since it would give only 66% occupancy. using 256 threads gave the option of 100% occupancy.