index calculation overhead and seek suggestions

FangQ · February 4, 2009, 7:23pm

In the past, I spend some time to program computational codes using BrookGPU (BrookGPU), unfortunately, Brook is not active anymore, so I recently decided to convert one of my codes to CUDA. The code is quite simple, only a few simple kernels to perform finite-difference calculation on a 2D texture. the skeleton can be found in my previous post at [url=“http://forums.nvidia.com/index.php?showtopic=88623”]http://forums.nvidia.com/index.php?showtopic=88623[/url]

With Brook, I can get 40x acceleration with an 8800GT card on a Xeon machine. However, I can only get 10x acceleration with CUDA. Reviewing my changes, I realized that the indexing calculations needed for CUDA kernels almost doubled the FLOPs for each call. For example, the following line compute the index on a 2D texture:
[font=“Courier New”] int idx= (blockDim.x * blockIdx.x + threadIdx.x)gridDim.yblockDim.y+blockDim.y * blockIdx.y + threadIdx.y;[/font]
that is 7 FLOPs! my total FLOP count for each kernel is only about 10ish. I felt somehow this is responsible for my speed reduction I mentioned earlier.

In Brook, one dose not need to explicitly compute the index. I believe it will be populated during the scheduling phase (I am not sure). but in CUDA, you have to do this for each call. I was hoping that CUDA should be smarter than brook, and suspected I missed something.

so, here are my questions:

did you see a similar overhead introduced by indexing calculations for small kernels?
does CUDA have indexof operator similar to brook?
any suggestions to reduce this overhead? (passing an index array may save some FLOPS, but creating the indices to the index array is still expensive!)

RussAtkinson · February 4, 2009, 7:52pm

I rarely see any significant overhead for indexing. 9 times out of 10 I am limited by access time to global memory.

Also, it does not seem right that by having a 1.7 x increase in operations leads to a 4x decrease in speed. I’d look elsewhere for major speedups.

You can reduce the multiply cost in many cases by using __mul24 instead of full multiply. Just be careful with your ranges to keep the calculation valid.

You can reduce indexing overhead by having each block do several iterations, stepping the index instead of recalculating it. For example, if each block does 4 elements instead of 1 then you get a 4x decrease in index multiplications.

If you really have a “2D texture” maybe you should be using tex2D to fetch the element?

Finally, you can remove one multiply from

int idx= (blockDim.x * blockIdx.x + threadIdx.x)*gridDim.y*blockDim.y+blockDim.y * blockIdx.y + threadIdx.y;

by using

int idx= ((blockDim.x * blockIdx.x + threadIdx.x)*gridDim.y+blockIdx.y)*blockDim.y + threadIdx.y;

_Big_Mac · February 4, 2009, 10:52pm

Those aren’t 7 FLOPs, those are 7 integer ops. Those are much slower than floating point ops!

Still, I too have a feeling that’s somehow not your main problem…

FangQ · February 4, 2009, 10:55pm

thanks for the reply. I am not sure if this deceleration is due to indexing or my un-wised use of the global memory (frankly, I have no idea what are the griddim and blockdim that I should use). maybe you can help me determine a reasonable number based on your experience:

my domain size is 512x512, there are a couple arrays of that size, some of them are coefficients (float4), some of them are the field qualities (float2/float4). For each kernel, I do a finite difference calculation, which normally requires to get the neighbor values and do some simple algebra, for example, one of my kernels look like (and you can see my ugly indexing calculation)

__global__ void Step2(float4 E[], float2 H[], float4 cH[], float2 OH[] ) {

		int CidX =blockDim.x * blockIdx.x + threadIdx.x;

		int CidY =blockDim.y * blockIdx.y + threadIdx.y;

		int dimx=gridDim.x*blockDim.x;

		int dimy=gridDim.y*blockDim.y;

		int CidG = CidX*dimy+CidY;

		int CidT = (CidX%DIMX)*gridDim.y+(CidY%DIMY);

	if(CidX+1==dimx||CidY+1==dimy)

		return;

	OH[CidG].x=H[CidG].x + cH[CidT].x*(E[CidG].x+E[CidG].y-E[CidG+1].x-E[CidG+1].y);

}

now I am using <<<dim3(128,128),dim3(4,4)>>> to launch my kernel. I tried other combinations, mostly failed, I don’t know why.

so, in your experience, for this type of computation, what would be recommended as the block and grid dimensions? other than the indexing, anything else can cause the slow-down?

FangQ · February 5, 2009, 12:01am

I attached the output from CUDA visual profiler (as a figure), I don’t know if these numbers mean anything to you.

RussAtkinson · February 5, 2009, 1:28am

The biggest problem I see with your code (in Step2) is the use of a lot of uncoalesced global memory accesses (look at the gld uncoalesced column in your visual profile, especially for UpdateE and UpdateH)). The best thing to do is to access swaths of contiguous memory, even if you don’t use all of the values! Gaps in the fetches or stores give the GPU a severe case of the hiccups.

Given that Step2 performs a rather sparse use of global memory, I assume that the declarations are the way they are for some purpose. But you might consider making some of your float2 or float4 arrays into pairs or quads of arrays so you can avoid gaps in the access patterns.

In some (but not all) cases when doing neighbor calculations you can use shared memory (with appropriate syncthreads) to avoid the neighbor fetch. If the boundary issues prevent this you can still use 1D or 2D textures to get the benefit of caching, so adjacent fetches use the cache instead of doing a second global fetch.

The above hints are much more likely to be useful than tweaking your grid and block dimensions.

_Big_Mac · February 5, 2009, 11:03am

Indeed, what kills your performance is uncoalesced memory accesses (both loads and stores). Coalescing mem accesses should be #1 when you optimize. I’d wager that #2 would be getting rid of divergent branches (if you can, maybe they’re caused by border conditions and have to be there?). Read up on that, it can be quite a difficult topic.

It’s not always possible to completely do both, sometimes the algorithm simply requires a divergence or a random memory access here and there, but your code can definitely be done better.

Also, 4x4 blocksize will probably cut your performance significantly right on the start. Threads are executed in so called ‘warps’ - batches of 32. If your blocks have 16 threads, they won’t fill up warps - it’s sort of an internal fragmentation thing if you know what I mean. Try to have 32 * N threads in a block. But that would be #3 on your optimization checklist :)