Register usage too high: how to reduce register usage?

Lev: I forgot to mention that the occupancy of 0.25 was for the 43-register case. According to my occupancy calculator it should be 0.50 on compute capability 1.3 with 32 registers.

With those 4 lines of math code (one of them the one you mentioned, Lev), I have been playing with that recently. That value used to be pre-computed for each pixel and passed into the kernel, and now I have changed it to be computed on the fly. But yes, you are right, it only depends on the outer loop and not the inner loop; I assumed the compiler would take care of it. I have moved those 4 lines out of the inner loop into the outer loop and am running a test now.
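For reference, a minimal sketch of that kind of hoisting; the names and the math are placeholders, not the actual kernel (which processes a patch of pixels per thread):

__global__ void patchKernel(const float *in, float *out,
                            int width, int height, float a, float b)
{
    for (int y = 0; y < height; ++y) {
        float coeff = a * (float)y + b;   // depends only on y, so hoisted out of the inner loop
        for (int x = 0; x < width; ++x) {
            int idx = y * width + x;
            out[idx] = coeff * in[idx];   // inner loop now does only per-pixel work
        }
    }
}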

Note that your reads from pR_table are not coalesced, so you are wasting a significant amount of your device memory bandwidth. Because of this, I would assume that at the moment you are limited by memory bandwidth, even though only a fraction of the peak bandwidth is actually used.

I haven't used the profiler for a while, but I'd expect the instruction throughput for a computationally limited kernel to be (close to) 1 or higher. This would again hint that memory bandwidth (or its efficient use) is the limiting factor at the moment.

Note that even if you want to keep pR_table in global memory, you could at least reorder the indices to achieve coalesced access.
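To make "reorder the indices" concrete, here is a sketch with made-up names (pR_tableT, NUM_ENTRIES and NUM_THREADS are assumptions, not from the actual code; k is the table-lookup loop index):

// Uncoalesced: each thread walks its own row, so at a given k the threads
// of a half-warp read addresses that are NUM_ENTRIES floats apart.
float r1 = pR_table[threadIdx.x * NUM_ENTRIES + k];

// Coalesced: store the table transposed, so at a given k consecutive threads
// read consecutive floats; on compute capability 1.3 the half-warp's reads
// then combine into a single memory transaction (assuming proper alignment).
float r2 = pR_tableT[k * NUM_THREADS + threadIdx.x];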

Moving r_table to be recomputed further increased our speed; we are up to 6.8x now. Does this mean we are probably bandwidth limited? It's time to coalesce the pImage accesses. So the best way to coalesce is to set up my memory accesses so that the largest number of threads is "fed" data by the smallest load from global memory, I guess.

thanks!

Have you created a short test case? 30 seconds is usually enough, though it should be more or less representative of the program's code path and data usage.

Yeah, I compute an 8192x8192 image, which takes about 5 minutes to run for my short test case, and I put it through the profiler. I just need to learn what all the numbers mean and what they hint at about where the bottleneck is.

Should I post my results?

Attached are my CUDA profiler results. I hope I have saved them correctly. The "r_table on the fly" run is the most recent one, where I compute it on the fly.

If any of you want to bother looking through the results and give me some tips, it would be much appreciated, but not expected.

Can you replace

pImage[ bigYOffset + bigXOffset ] += storedValue;

inside the loop with

temp += storedValue;

and write

pImage[ bigYOffset + bigXOffset ] = temp;

once after the loop?

And see if there is any time difference. Also check the time difference between 32 and 48 registers with the square root moved out of the loop; the compiler may precompute it, so it would be interesting to check with a small test case.
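As a sketch of the suggested change (N and pTable are placeholders for whatever the real loop does):

float temp = pImage[ bigYOffset + bigXOffset ];  // read the pixel once (or start at 0.0f if it is known to be empty)
for (int k = 0; k < N; ++k) {
    float storedValue = pTable[k];               // stands in for the real per-iteration value
    temp += storedValue;                         // accumulate in a register
}
pImage[ bigYOffset + bigXOffset ] = temp;        // single global write after the loop

This turns one global read-modify-write per iteration into one read and one write per pixel.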

Also, I suggest changing all numbers and variables to floats, just to see what performance gain you can get by getting rid of the doubles. It should be simple, I think. Do not forget to change 1.0 to 1.0f.
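A contrived example of the kind of change meant (the names here are made up): any bare double literal like 1.0 promotes the whole expression to double precision when compiling for sm_13, so the 'f' suffix has to go on every literal:

__device__ double weight_d(double dist)
{
    return 1.0 / (0.5 + dist);    // double literals: double-precision math on sm_13
}

__device__ float weight_f(float dist)
{
    return 1.0f / (0.5f + dist);  // 'f' suffix everywhere: stays in single precision
}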

I think my instruction throughput increased when I moved r_table to being computed on the fly, and maybe that gave me the speed increase. I'm not sure exactly how to increase instruction throughput, though; there seem to be a few ways... Still not sure what I'm limited by.

I also tried computing the cos values on the fly, but it almost tripled my running time. I'm not sure how I could use the cospi function directly; I think I'll have to change my calculation a little to be able to use it.

Hmm, not exactly, since this loop computes many pixels; at the moment it is doing a 5x5 patch of pixels per thread. That's 25 pixels, so I would need 25 local temp variables. I was thinking of making it do only, say, a 2x2 pixel grid, which is 4 pixels per thread, and I'm using 256 threads per block, which means I'll use up 4K of shared memory. This comes at the cost of more memory transfers (device to host), because I have to compute smaller sub-images.
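For what it's worth, the 4K figure works out as 4 pixels per thread x 256 threads x 4 bytes per float. As a sketch, assuming the accumulators are floats:

#define THREADS_PER_BLOCK 256
#define PIXELS_PER_THREAD 4       // a 2x2 patch per thread

__global__ void accumulatePatches(float *pImage)
{
    // 256 * 4 * sizeof(float) = 4096 bytes of shared memory per block
    __shared__ float acc[THREADS_PER_BLOCK * PIXELS_PER_THREAD];
    // ... each thread accumulates into acc[threadIdx.x * PIXELS_PER_THREAD + p]
    // and writes its 4 pixels back to pImage once at the end ...
}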

Yeah, it's time to fiddle with these doubles as well, I think.

I am also going to coalesce the image pixel accesses, which will require lots of code changes, so I am delaying it until I figure out whether it is needed or not. By the looks of it, coalescing is always needed wherever it's possible.

Actually, only one xoffset is changed, so you need only 5 temp variables. BTW, a good idea is to unroll that innermost loop and hard-code the number 5.
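A sketch of that suggestion; computeValue() and the surrounding loop are guesses standing in for the real kernel code:

__device__ float computeValue(int k, int x);   // hypothetical: the real per-pixel math goes here

// five register accumulators, one per pixel along the unrolled x direction
float t0 = 0.0f, t1 = 0.0f, t2 = 0.0f, t3 = 0.0f, t4 = 0.0f;

for (int k = 0; k < N; ++k) {      // the surrounding work loop
    // innermost loop unrolled by hand, with the 5 hard-coded
    t0 += computeValue(k, 0);
    t1 += computeValue(k, 1);
    t2 += computeValue(k, 2);
    t3 += computeValue(k, 3);
    t4 += computeValue(k, 4);
}

// one global write per pixel, after the loop
pImage[base + 0] += t0;
pImage[base + 1] += t1;
pImage[base + 2] += t2;
pImage[base + 3] += t3;
pImage[base + 4] += t4;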

Your low instruction throughput means you are limited by something else - global memory throughput probably.

Using cospi() is easy. Just replace cos(M_PI*x) with cospi(x). In your case the code reduces nicely; instead of

pCOStbl[ indx ] = (float)cos( (M_PI + M_PI) * (float)(indx + indx -1) / (float)(2 * NUMTBL) );

you just get

pCOStbl[ indx ] = cospif( (float)(indx + indx -1) / (float)(NUMTBL) );

if the float version is used (at the expense of 2 ULP).

I think so too - in combination with recomputing table data on the fly.

I took out the compile flag -arch sm_13 so no doubles are used, and my small computation goes from 6 minutes down to 1 minute.

Actually, only one xoffset is changed, so you need only 5 temp variables. BTW, a good idea is to unroll that innermost loop and hard-code the number 5.

Hmm true, I could try that, thanks.

Oh, is that it? I was confused, since I'm using 2*PI. Anyway, thanks for that, I'll do it now.

The original computation is cos( (M_PI + M_PI) * (float)(indx + indx - 1) / (float)(2 * NUMTBL) ). Since the factor of two occurs in both the numerator and the denominator, it cancels and can be removed. The use of cospi() instead of cos() takes care of the multiplication by M_PI, leaving

cospi ((float)(indx + indx -1) / (float)(NUMTBL));

Since the result is stored to an array of floats, you might as well use cospif() to gain more performance, as numerical differences will be minimal.
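Spelling out the cancellation, with indx written as $i$ and NUMTBL as $N$:

$\cos\left(\frac{2\pi(2i-1)}{2N}\right) = \cos\left(\pi \cdot \frac{2i-1}{N}\right) = \operatorname{cospi}\left(\frac{2i-1}{N}\right)$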