I have just finished porting a heavy algorithm from D3D9 HLSL to CUDA. The D3D9 version was very inefficient for many reasons. The CUDA version uses 1/2 the arithmetic instructions (not a ballpark estimate, exactly 1/2) and probably 1/5th of the global memory bandwidth: I replaced 28 passes to/from render targets/textures with 4 kernel launches, and the rest of the passes take place in shared memory.
Yet, the CUDA version runs less than half as fast as the D3D9 version.
So I ran cudaProf to try to understand what is happening:
- The summary table’s ‘glob mem overall throughput (GB/s)’ column sums to 327 GB/s, but my card is only capable of 141 GB/s, less than half as much. It also looks like texture memory reads are not counted in the global memory reads, which means my real global memory traffic is even higher. What does this mean?
- The shared memory reported for my kernel is 4124 bytes, not the 4096 bytes I allocate. This is hurting my occupancy. What else is allocating shared memory?
- A kernel that I know to be very short claims to execute 169,000 instructions, but the problematic kernel in this case, which is very long, executes 200,000 instructions, only slightly more. Why?
- Is the optimal gld/gst efficiency 1? Its name suggests so, but a kernel that I am quite confident should be completely coalesced and aligned is getting only 0.04 gld/gst efficiency. Is reading/writing a float4 not coalesced? It is the size of a transaction (128 bits), correct? Other kernels that I expected to be coalesced do not appear to be either. One of my kernels reads/writes 4-byte words, in order, in 16x16 thread blocks, into memory allocated with cudaMallocPitch. This kernel is reported to have only 0.16 gld/gst efficiency. What else could I be doing wrong to not get coalesced transactions?
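To make the last bullet concrete, here is a minimal sketch of that access pattern, assuming a plain copy kernel. The names (`copyPitched`, `srcPitch`, `dstPitch`) and the 1024x1024 size are hypothetical placeholders and are not taken from the actual code. Each thread touches one 4-byte word, and consecutive `threadIdx.x` values map to consecutive addresses within a pitched row.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: one 4-byte word per thread, rows addressed via the pitch.
__global__ void copyPitched(const float* src, size_t srcPitch,
                            float* dst, size_t dstPitch,
                            int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // The pitch is in bytes, so step through rows with byte pointers.
    const float* srcRow = (const float*)((const char*)src + y * srcPitch);
    float* dstRow = (float*)((char*)dst + y * dstPitch);
    dstRow[x] = srcRow[x];
}

int main()
{
    const int width = 1024, height = 1024;  // assumed size, for illustration only
    float *src = 0, *dst = 0;
    size_t srcPitch = 0, dstPitch = 0;

    cudaMallocPitch((void**)&src, &srcPitch, width * sizeof(float), height);
    cudaMallocPitch((void**)&dst, &dstPitch, width * sizeof(float), height);
    cudaMemset2D(src, srcPitch, 0, width * sizeof(float), height);

    dim3 block(16, 16);  // 16x16 thread blocks, as in the bullet above
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    copyPitched<<<grid, block>>>(src, srcPitch, dst, dstPitch, width, height);

    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(src);
    cudaFree(dst);
    return err == cudaSuccess ? 0 : 1;
}
```

With this layout, the 16 threads of a half-warp cover one 64-byte row segment of the block, which is the pattern I would expect to coalesce.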
That is a good start ;)
I am on a compute capability 1.3 card (GTX 280).