Help with optimizing CUDA function if optimizing is possible..


I’m currently working on integrating CUDA for the statistical language R. For a proof-of-concept I’m porting one standard R function to CUDA, namely the density normal distrubution (called dnorm() in R). The density is calculated with this formula (x = input, mu = mean, sigma = standard deviation):

f(x) = 1/(sqrt(2 pi) sigma) e^-((x - mu)^2/(2 sigma^2))

I have the function perfectly working (and callable from within R) and getting an average of 5x boost comparing CPU vs. GPU (quad-core 2.4Ghz Core2 [R only uses one core] with 8800GTX). The biggest bottleneck is that I need to cast all input/output from double-precision to single-precision since R only uses double-precision. So for optimizing this algorithm, I have a few questions:

  • Is casting the input/output using a for() loop (on the CPU) to iterate on all values the fastest method? eg.:
for (i = 0; i < N; i++)


	  output_float[i] = (float) input_double[i];

  • Since this CUDA function only uses the input value once, do I need to use shared memory?

I’ve read somewhere that even when using the input only once, using shared memory might speed it up (couldn’t notice it myself though).

  • Do I need to use specialized CUDA functions for multiplying, division, power and exponent for the best performance?

  • Can I keep the input in one vector for all input sizes? If I use a vector of say 10million elements, is it still the most optimal data-structure to use or should it be changed or split up?

I’ve used the CUDA GPU Occupancy Calculator with the following numbers (when using shared memory):

  • Threads Per Block: 256

  • Registers Per Thread: 4

  • Shared Memory Per Block: 1064

(information is from the .cubin file)

The Multiprocessor Warp Occupancy is 24 according to this calculator, which is the Max Occupancy.

The kernel code I currently use (without shared mem usage) is this one:

__global__ void deviceDnorm(float* a, float* o, float m, float s) {

  int tid = threadIdx.x;

  int bid = blockIdx.x;

  int index = bid * blockDim.x + tid;

o[index] = (1 / (sqrtf(2 * M_PI))) * exp( -( (__powf(( a[index] - m), 2)) / (2 * __powf(s, 2)) ));


Any advice on (further) optimizations are very much appreciated and welcome!

  • Marcel.

I have a similar problem. I need to apply a function F on each cell of a matrix, and then save F(x) back to the original cell. This function will be called 10 millions times in my program. It would be great if I could get a speedup than my naive implementation (which is similar to your code).

Id be surprised if shared memory sped up anything, youre only doing two memory accesses (read write) instead of gmem straight to a register. Pretty sure it wont actually! But since i have no tested it myself, i wont say anything definitive.

There are low(er) accurary functions but you seem to be using them already.
Watch out for the exp function, as written, if you execute it on a gt200, itll use double precision, you need to use expf, or, __expf for the faster one.
Instead of me copying it all here, read section 5.1 of the programming guide.

If you are reading all your values coalesced, as you seem to be doing in the snipped, then a 1d array will do just fine.
Youre using a 1d block and a 1d grid for now, so beware that you wont be able to scale to 10 millions thread this way.
I personnaly use a 2d grid and 1d blocks. 6555565555256 should be plenty.

I’m not familiar with R, so I guess my question is: how does your kernel work? Are you using a Monte Carlo approximation in your kernel to find the area under the normal curve between two z-scores? If so, you might do this independently in separate threads then do a reduction on the results from the threads to compute a single value, which you could copy back to the host and cast to a double-precision float.

I don’t see why there would be any reason to cast them all in a loop on the host; even if your kernel is working on a vector of inputs and computing a vector of outputs (which, I think, is what you’re trying to cast?), I believe you should still be able to handle this on the GPU, which would also save you 50% on your bandwidth usage when transferring back to the host.

I just read the example “transpose” in cuda sdk. It reads a matrix A and puts its transpose in a matrix B. Each thread only does one read (a cell in A), and only one write (a cell in B ). However, the naive implementation without shared memory is 11 times slower than their optimized version using shared memory. Hence I am expecting that Marcel’s problem and my problem could be optimized by utilizing shared memory.

Ah yes but this is to coalesce the writes to global memory since we are reading in order but writing out of order. Im guessing the OP’s approach will read value [index] and write at the same position in the output global memory array. Or that is what i assumed anyway!

Thank you everyone for your replies :thumbup: .

That is indeed exactly what I’m doing, so I do not think using shared memory will help speeding it up further, although I want to check this out to make sure that it does/doesn’t.

That is not what I’m doing, I’m calculating the density of each value independently, so I need to return a vector of the same length as the input vector. This topic us just ment for me to get better insight in to optimizing functions for CUDA, the actual calculations arent that important (yet).

I am however curious to your other suggestion, how should I be able to handle double-precision data on the GPU when it only supports single-precision calculations? As far as I know each value needs to be cast before the GPU can do any calculations with it.

Thanks again!

that’s how i would do it:

__global__ void deviceDnorm(float* a, float* o, float m, float const1) {

  int index=blockIdx.x*blockDim.x+threadIdx.x;

  float am=a[index]-m;



on the host:

float const1=-1.f/(2.f*s*s);

eliminated tid and bid, just in case… you don’t have to make the compiler’s life more difficult than it already is. :-p

calculating 1.f/sqrtf(2.f*M_PI):

another case, where the compiler could optimize, if it is clever…

if not, this would be rather costly, as sqrtf is calculated as the reciprocal of rsqrtf and therefore takes 32 cycles plus another reciprocal (16) makes 48 cycles in total.

i would just type in that constant.

__powf(x,y) is implemented as exp2f(y*__log2f(x)), so it’s something like 32+4+16=52 cycles. if you just want to square a value, simply use x*x, that’s just 4 cycles.

unfortunately, to apply this to a[index]-m, we have to use a register (to avoid a second global read). anyway, i think, it’s worth it as this thing isn’t going to be register bound.

this leaves us with -1.f/(2.fss). another reciprocal, an invert and 2 multiplies for a constant value? why would you want to calculate this 10 million times? ;-) simply exchange the parameter s for this constant.

finally to exp:

as has been pointed out before, use expf, or if accuracy isn’t an issue, use __expf. __expf is 32 cycles, expf should be “a magnitude slower” (according to the developer’s guide)

ps: the performance guidelines (chapter 5 of the developer’s guide) are worth reading. ;-) further down in that guide you’ll also find the error bounds for e.g. expf and __expf.

pps: i can’t test the effect of the optimizations at the moment, so this is up to you. :-p

regarding the double precision question: nvidia cards with compute capability 1.3 or higher (meaning gtx2xx and tesla x10xx) also support double precision, but at a much slower speed than single precision.


i just saw a discrepancy between the formula and your implementation given in the first post. the formula says 1/(sqrt(2*PI)sigma) but you implemented 1/sqrt(2PI). if the formula is correct and your implementation wrong, you should replace 0.3989423f in my code with const2 (set up float const2 as a new parameter for the device function) and compute this parameter on the cpu.