How to get faster than 5-8 seconds

I am looking for fresh ideas how I can get this code quicker. It now takes 5 - 8 seconds, which is 100 times faster than doing it in matlab on a quadcore, so I should be already happy, but I don’t like kernel calls of more than 5 seconds. I would like people to be able to run this on their normal desktop when it has a GTX280 in it. It now runs on a C1060 sample.

normal C:

for (int index = 0; index < 20402; index++) {

 Â for(int ix = 0; ix < 1608; ix++) {

 Â  Â for(int iy=0; iy < 1608; iy++) {

 Â  Â  Â out[ix,iy] += input1[index, ix] * input2[index, iy] * function(A[index], B[index]);

 Â  Â }

 Â }


I currently have the following CUDA code which takes 5 - 8 seconds


calc_relatively_slow<<<dim3(1608, 1608,1), NUM_THREADS>>>(20402, input1, input2, A,B,out);


function calc_relatively_slow(num_index, input1, input2, A,B, out) {

__shared__ complex s_out[NUM_THREADS];

s_out[threadIdx.x] = 0.0f;

unsigned index = threadIdx.x;

while (index < num_index) {

 Â s_out[threadIdx.x] += input1[index + blockIdx.x*num_index] * input2[index + blockIdx.y*num_index] * function(A[index], B[index]);




//here comes a standard reduction

out[blockIdx.x + blockIdx.y * gridDim.x] = s_out[0];


So each block calculates one output-value.

I thought of letting each block calculate let’s say a 16x16 block of the output-array, to have 2 half warp read in input1 and input2, but I am afraid that will lead to impossible shared memory usage (1616256*8>16k), and still I have the idea I can make better use of shared memory.

Anybody have any ideas?

Hard to give much advice since your function “function” is a black box.
If that function is very fast, then you’re likely memory speed limited.
It looks like your memory access pattern may already be completely coalesced though, which is good.
Your A and B arrays are relatively small, they may fit into constant memory and therefore may be cached.
Similarly, your input arrays might be faster as texture lookups too.

You could precompute some of the pointer math in your inner loop too, the stuff like “blockIdx.x*num_index” Multiples aren’t THAT expensive (even without using _mul24) but if your inner function is super-fast, it’d start to show.

Hard to judge exactly what to focus on since it’s unclear where your bottleneck is, it depends on that undefined function.

There’s a bug in your CUDA code. You’re accessing input1 and input1 and I think you want input2 the second time.

There’s a bug in your plain “C” example code too, I think you mean += and not = in the inner loop assignment. That’s why you mention using a reduction in CUDA.


And why not first compute
C[index]=function(A[index], B[index]);

then use that in your kernel?

That will save you a lot of compute if it’s slow! And even if it’s trivial, it’ll save memory bandwidth and latency.

I fixed the two bugs above, it’s always fun to leave your memory stick with the real code at work…

The function is very fast (not big enough to be calculate bound), and the first thing tomorrow will indeed be to pre-calculate that one, that will save already quite some memory bandwidth [slaps his had in despair for not seeing that one]). I’ll see how far that will get me.

I think I can make the blocked version work too if I give up on the reduction (and just adding 16x16 values in shared memory, that will also take 256 shared values like before). I will have to transpose input1 & input2 to keep it coalesced, but that should be also possible (they are calculated in an earlier kernel and otherwise I will bind those to a 2D-texture)

About constant memory, maybe I will try that just out of curiosity. I don’t really have control on the input-sizes (these are typical for what I see up to now), so users might make the arrays too large for constant memory, and the total code is already complicated enough to not make another special case ;)

Thanks a lot for the ideas, 15 minutes ago I was having no idea what to try next, and now I cannot wait to try a few!

Another poor man’s solution is to simply call the kernel multiple times to compute different chunks of the output. That will help keep you under 5 seconds per call with little extra overhead.

I was about to suggest that, since i had to do it to be sure to stay below the watch dog timer.

This is what i did:

for(int i=-15;i<15;i++)



  CUDA_SAFE_CALL( cudaMemcpyToSymbol(params, &param, sizeof(Params)) );

  convolve<<<dimGrid, dimBlock,0>>>(d_Result);

  CUT_CHECK_ERROR("Kernel execution failed [ ]");

int i = x+params.currentI;

	//for(int i=x-ten;i<x+ten;i++)


  for(int j= y-ten; j<y+ten;j++)


  	for(int k=z-ten;k<z+ten;k++)

where “ten” is actually 15… hey, its not in production yet:)

Yeah, that is also possible, but ‘ugly’, then I will just let everybody who wants to run it buy a C1060 ;) (or better setup a calculation machine for people to use with a S1070 attached)

I actually dreamt last night that I can indeed get the time down a lot when doing the suggestion of SPWorley and borrowing some ideas from the matrixmul example. I was calculating the output of the function 1608x1608 times too often…

You need a vacation.

I am actually only 2 weeks back from vacation :lol:

The first idea (calculating the function evaluation in another kernel) gained me 20% reduction in time.

I am now busy with the second idea to borrow from matrixmul. I am not getting the same output for the moment, but also I keep about the same time, so I’ll try to fix things so I can post the code, along with if it was actually worth it…

I don’t get it, why can’t you just calculate up to, eg, num_index/2 and then pick up where you left off in a 2nd kernel call? You’d barely even have to change your CUDA function, just pass in the old ‘index’, the stopping point, and also replace the ‘=’ in the last line with ‘+=’. Then in your host code coordinate it like this:

int i;

int N = 1000;

//zero out 'out'

for(i= 0; i< 20402-N; i+= N)


calc_relatively_slow<<<dim3(1608, 1608,1), NUM_THREADS>>>(i, i+N, 20402, input1, input2, A,B,out);


calc_relatively_slow<<<dim3(1608, 1608,1), NUM_THREADS>>>(i, 20402, 20402 input1, input2, A,B,out);

__global__ calc_relatively_slow(start_index, stop_index, stride (aka num_index), input1, input2, A,B, out);
function calc_relatively_slow(start_index, stop_index, stride, input1, input2, A,B, out) {

__shared__ complex s_out[NUM_THREADS];

s_out[threadIdx.x] = 0.0f;

index = start_index + threadIdx.x;

while (index < stop_index) {

 s_out[threadIdx.x] += input1[index + blockIdx.x*stride] * input2[index + blockIdx.y*stride] * function(A[index], B[index]);




//reduce s_out

out[blockIdx.x + blockIdx.y * gridDim.x] += s_out[0];


Oh, and change N to be a multiple of NUM_THREADS.

I have to check my actual code monday when back at work, but I think that what I am trying to do can actually almost be done by CGEMM… Then I will be taking my matrix multiplication kernel (elsewhere in my program, also modified a bit from standard CGEMM) and modify it a little bit again.

If really true, I cannot believe my blindness. It is really frustrating that when you look at matlab code from someone else, and convert it to CUDA code, you sometimes still don’t ‘see’ what the code doing…

I hope this will be quite a bit faster, otherwise I will indeed have to split up the calculation.

So indeed, it is almost like a matrix multiply

A = L1xN
B = NxL2
C = Nx1

result(x,y) = sum over i=0…N C[i] * A[x,i] * conj(B[i,y]);

So after modifying a matrix multiplication code I get 0.6 - 1.1 seconds instead of 5.4 - 7.7 seconds…