Register usage too high: how to reduce register usage?

Lev: I forgot to mention that the occupancy of 0.25 was for the 43-register case. According to my occupancy calculator it should be 0.50 on compute capability 1.3 with 32 registers.

With those 4 lines of math code (one of them the one you mentioned, Lev), I have been playing with that recently. That value used to be pre-computed for each pixel and passed into the kernel, and now I have changed it to be computed on the fly. But yes, you are right, it only depends on the outer loop and not the inner loop; I assumed the compiler would take care of it. I have moved those 4 lines out of the inner loop into the outer loop and am running a test now.
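For reference, a minimal sketch of that kind of hoisting; the names and the math are placeholders, not the actual kernel (which processes a patch of pixels per thread):

__global__ void patchKernel(const float *in, float *out,
                            int width, int height, float a, float b)
{
    for (int y = 0; y < height; ++y) {
        float coeff = a * (float)y + b;   // depends only on y, so hoisted out of the inner loop
        for (int x = 0; x < width; ++x) {
            int idx = y * width + x;
            out[idx] = coeff * in[idx];   // inner loop now does only per-pixel work
        }
    }
}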

Note that your reads from pR_table are not coalesced, so you are wasting a significant amount of your device memory bandwidth. Because of this, I would assume that at the moment you are limited by memory bandwidth, even though only a fraction of the peak bandwidth is actually used.

I haven't used the profiler for a while, but I'd expect the instruction throughput for a computationally limited kernel to be (close to) 1 or higher. This would again hint that memory bandwidth (or its efficient use) is the limiting factor at the moment.

Note that even if you want to keep pR_table in global memory, you could at least reorder the indices to achieve coalesced access.
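To make "reorder the indices" concrete, here is a sketch with made-up names (pR_tableT, NUM_ENTRIES and NUM_THREADS are assumptions, not from the actual code; k is the table-lookup loop index):

// Uncoalesced: each thread walks its own row, so at a given k the threads
// of a half-warp read addresses that are NUM_ENTRIES floats apart.
float r1 = pR_table[threadIdx.x * NUM_ENTRIES + k];

// Coalesced: store the table transposed, so at a given k consecutive threads
// read consecutive floats; on compute capability 1.3 the half-warp's reads
// then combine into a single memory transaction (assuming proper alignment).
float r2 = pR_tableT[k * NUM_THREADS + threadIdx.x];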

Moving r_table to be recomputed further increased our speed; we are up to 6.8x now. Does this mean we are probably bandwidth limited? It's time to coalesce the pImage accesses. So the best way to coalesce is to set up my memory accesses so that the largest number of threads is "fed" data by the smallest load from global memory, I guess.

thanks!

Have you created a short test case? 30 seconds is usually enough, though it should be more or less representative of the program's code path and data usage.

Yeah, I compute an 8192x8192 image, which takes about 5 minutes to run for my short test case, and I put it through the profiler. I just need to learn what all the numbers mean and what they hint at about where the bottleneck is.

Should I post my results?

Attached are my CUDA profiler results. I hope I have saved them correctly. The "r_table on the fly" run is the most recent one, where I compute it on the fly.

If any of you want to bother looking through the results and give me some tips, it would be much appreciated, but not expected.

Can you replace

pImage[ bigYOffset + bigXOffset ] += storedValue;

inside the loop with

temp += storedValue;

and write

pImage[ bigYOffset + bigXOffset ] = temp;

once after the loop?

And see if there is any time difference. Also check the time difference between 32 and 48 registers with the square root moved out of the loop; the compiler may precompute it, so it would be interesting to check with a small test case.
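As a sketch of the suggested change (N and pTable are placeholders for whatever the real loop does):

float temp = pImage[ bigYOffset + bigXOffset ];  // read the pixel once (or start at 0.0f if it is known to be empty)
for (int k = 0; k < N; ++k) {
    float storedValue = pTable[k];               // stands in for the real per-iteration value
    temp += storedValue;                         // accumulate in a register
}
pImage[ bigYOffset + bigXOffset ] = temp;        // single global write after the loop

This turns one global read-modify-write per iteration into one read and one write per pixel.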

Also, I suggest changing all numbers and variables to floats, just to see what performance gain you can get by getting rid of the doubles. It should be simple, I think. Do not forget to change 1.0 to 1.0f.
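A contrived example of the kind of change meant (the names here are made up): any bare double literal like 1.0 promotes the whole expression to double precision when compiling for sm_13, so the 'f' suffix has to go on every literal:

__device__ double weight_d(double dist)
{
    return 1.0 / (0.5 + dist);    // double literals: double-precision math on sm_13
}

__device__ float weight_f(float dist)
{
    return 1.0f / (0.5f + dist);  // 'f' suffix everywhere: stays in single precision
}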

I think my instruction throughput increased when I moved r_table to being computed on the fly, and maybe that gave me the speed increase. I'm not sure exactly how to increase instruction throughput, though; there seem to be a few ways... Still not sure what I'm limited by.

I also tried computing the cos values on the fly, but it almost tripled my running time. I'm not sure how I could use the cospi function directly; I think I'll have to change my calculation a little to be able to use it.

Hmm, not exactly, since this loop computes many pixels; at the moment it is doing a 5x5 patch of pixels per thread. That's 25 pixels, so I would need 25 local temp variables. I was thinking of making it do only, say, a 2x2 pixel grid, which is 4 pixels per thread, and I'm using 256 threads per block, which means I'll use up 4K of shared memory. This comes at the cost of more memory transfers (device to host), because I have to compute smaller sub-images.
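For what it's worth, the 4K figure works out as 4 pixels per thread x 256 threads x 4 bytes per float. As a sketch, assuming the accumulators are floats:

#define THREADS_PER_BLOCK 256
#define PIXELS_PER_THREAD 4       // a 2x2 patch per thread

__global__ void accumulatePatches(float *pImage)
{
    // 256 * 4 * sizeof(float) = 4096 bytes of shared memory per block
    __shared__ float acc[THREADS_PER_BLOCK * PIXELS_PER_THREAD];
    // ... each thread accumulates into acc[threadIdx.x * PIXELS_PER_THREAD + p]
    // and writes its 4 pixels back to pImage once at the end ...
}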

Yeah, it's time to fiddle with these doubles as well, I think.

I am also going to coalesce the image pixel accesses, which will require lots of code changes, so I am delaying it until I figure out whether it is needed or not. By the looks of it, coalescing is always needed wherever it's possible.

Actually, only one xoffset is changed, so you need only 5 temp variables. BTW, a good idea is to unroll that innermost loop and hard-code the number 5.
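A sketch of that suggestion; computeValue() and the surrounding loop are guesses standing in for the real kernel code:

__device__ float computeValue(int k, int x);   // hypothetical: the real per-pixel math goes here

// five register accumulators, one per pixel along the unrolled x direction
float t0 = 0.0f, t1 = 0.0f, t2 = 0.0f, t3 = 0.0f, t4 = 0.0f;

for (int k = 0; k < N; ++k) {      // the surrounding work loop
    // innermost loop unrolled by hand, with the 5 hard-coded
    t0 += computeValue(k, 0);
    t1 += computeValue(k, 1);
    t2 += computeValue(k, 2);
    t3 += computeValue(k, 3);
    t4 += computeValue(k, 4);
}

// one global write per pixel, after the loop
pImage[base + 0] += t0;
pImage[base + 1] += t1;
pImage[base + 2] += t2;
pImage[base + 3] += t3;
pImage[base + 4] += t4;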

Your low instruction throughput means you are limited by something else - global memory throughput probably.

Using cospi() is easy. Just replace cos(M_PI*x) with cospi(x). In your case the code reduces nicely; instead of

pCOStbl[ indx ] = (float)cos( (M_PI + M_PI) * (float)(indx + indx -1) / (float)(2 * NUMTBL) );

you just get

pCOStbl[ indx ] = cospif( (float)(indx + indx -1) / (float)(NUMTBL) );

if the float version is used (at the expense of 2 ULP).

I think so too - in combination with recomputing table data on the fly.

I took out the compile flag -arch sm_13 so no doubles are used, and my small computation goes from 6 minutes down to 1 minute.

Actually, only one xoffset is changed, so you need only 5 temp variables. BTW, a good idea is to unroll that innermost loop and hard-code the number 5.

Hmm true, I could try that, thanks.

Oh, is that it? I was confused, since I'm using 2*PI. Anyway, thanks for that, I'll do it now.

The original computation is cos( (M_PI + M_PI) * (float)(indx + indx - 1) / (float)(2 * NUMTBL) ). Since the factor of two occurs in both the numerator and the denominator, it cancels and can be removed. The use of cospi() instead of cos() takes care of the multiplication by M_PI, leaving

cospi ((float)(indx + indx -1) / (float)(NUMTBL));

Since the result is stored to an array of floats, you might as well use cospif() to gain more performance, as numerical differences will be minimal.
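Spelling out the cancellation, with indx written as $i$ and NUMTBL as $N$:

$\cos\left(\frac{2\pi(2i-1)}{2N}\right) = \cos\left(\pi \cdot \frac{2i-1}{N}\right) = \operatorname{cospi}\left(\frac{2i-1}{N}\right)$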