CUDA latency problems: slow CUDA

Hi all,

I’m trying to speed up a function in C by replacing all of the for loops with CUDA kernels.
I’m finding, though, that the speed of the code drops dramatically :-( compared with using Intel’s compiler, which vectorizes some of the loops.

Could the decrease in performance be due to calls to cudaMemcpy??

Typically I’m copying 256 or 512 complex floats.

I’ve included an example of my code below.

Any help would be much appreciated!

Cheers,

Wes.

Below, n=256, x1 and p are arrays of complex floats, and bmass2 is a float…

for(i = 1; i <= n; i++) {
x1[i] = bmass2 * p[i];
}

becomes…

__global__ void cudaVectorMultiply(int n, float bmass2, cuFloatComplex *p, cuFloatComplex *x1) {

    int i = blockIdx.x*blockDim.x+threadIdx.x;

    if(i <= n) {
        x1[i] = cuCmulf(make_cuFloatComplex(bmass2, 0), p[i]);
    }

    return;
}

and in my cu source code I have…

cudaMemcpy(p_d, p, (kvol+1)*sizeof(float complex), cudaMemcpyHostToDevice);
cudaVectorMultiply<<<dimGrid,dimBlock>>>(n, bmass2, p_d ,x1_d);
cudaMemcpy(x1, x1_d, (kvol+1)*sizeof(float complex), cudaMemcpyDeviceToHost);

The GPU is going to compute that incredibly quickly, but your net speedup will be completely overwhelmed by overhead, especially just sending the data to and from the card. Move as much of your compute to the card as you can and keep it there for as long as possible.

The strategy of taking current CPU code and “replacing the for loops with CUDA kernels” gives the wrong kind of workload for a GPU.
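For example, a rough sketch of that pattern (the kernel name and iteration count here are placeholders, not your actual code):

/* Sketch: allocate device buffers once, copy in once, launch many kernels, copy out once. */
cuFloatComplex *p_d, *x1_d;
cudaMalloc((void **)&p_d,  (kvol+1) * sizeof(cuFloatComplex));
cudaMalloc((void **)&x1_d, (kvol+1) * sizeof(cuFloatComplex));

/* One transfer in... */
cudaMemcpy(p_d, p, (kvol+1) * sizeof(cuFloatComplex), cudaMemcpyHostToDevice);

/* ...many kernel launches that touch only device memory... */
for (int iter = 0; iter < niter; iter++) {
    stepKernel<<<dimGrid, dimBlock>>>(n, bmass2, p_d, x1_d);  /* placeholder kernel */
}

/* ...and one transfer out at the end. */
cudaMemcpy(x1, x1_d, (kvol+1) * sizeof(cuFloatComplex), cudaMemcpyDeviceToHost);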

Thanks for the reply. I thought I might need to try to keep everything on the GPU.

Could you point me in the direction of a more detailed programmer’s guide? I have the NVIDIA programming guide and reference manual, but I can’t seem to find things like…

  1. If I put a for loop on the GPU (e.g. in a kernel), how do I make it execute in parallel?

I guess it’s like a pragma in OpenMP???

  2. Should I be incorporating everything into a .cu file? (I have a few thousand lines of C and a few hundred that I’d want to execute on the GPU in CUDA.)

  3. How should I compile for optimum performance? I’m currently just doing nvcc -O3 program.cu -o program. Is there a better method, e.g. linking with gcc or icc???

  4. What should I set my block size to be???

Thanks again for your help!

Wes.

First, if all you have to do is calculate 256 or 512 elements, then it’s simply not enough work to saturate the GPU. It’s most likely not worthwhile.

Or do you have many batches of such computations that can be parallelized?

As for your further questions:

  1. CUDA works differently than OpenMP. You don’t use pragmas to unroll ‘for’ loops into multiple threads.

I’m not sure what you want to do in your algorithm, could you provide some pseudocode (in serial form)?

Do you want a ‘for’ loop in the kernel (meaning each thread will go through that loop) or a loop of kernels?
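To illustrate the usual CUDA pattern (a sketch only, not your code; in_d and out_d are made-up device pointers): the serial loop disappears entirely, each thread computes one element, and the launch configuration plays the role of the loop.

/* Each thread handles one element; the grid of threads replaces the for loop. */
__global__ void scale(int n, float a, const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  /* this thread's global index */
    if (i < n)                                      /* guard the last, partially filled block */
        out[i] = a * in[i];
}

/* Host side: pick a block size and launch enough blocks to cover all n elements. */
int block = 256;
int grid  = (n + block - 1) / block;
scale<<<grid, block>>>(n, 2.0f, in_d, out_d);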

  2. You don’t have to. You can have all your non-CUDA code in other files (be it .c or .cpp). If you’re combining CUDA and C++, be sure to create a wrapper for the kernel call with ‘extern “C”’. Like this:
// .cu file

__global__ void cudaVectorMultiply(...) {
    ...
}

// You need this wrapper because the kernel call below uses the CUDA-specific
// <<< >>> syntax, which should be hidden from your C/C++ compiler.
extern "C" void launchKernel(...) {
    cudaVectorMultiply<<<dimGrid,dimBlock>>>(n, bmass2, p_d, x1_d);
}

// .cpp file

// Forward declaration: this function is defined somewhere else in the project
// and should be linked as a plain C function.
extern "C" void launchKernel(...);

// All your non-CUDA code can go here.
int main() {
    ...
    launchKernel(...);  // just a C function call; the compiler has no idea it launches GPU work
    ...
}
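To actually build a setup like this (just the mechanics, not a performance recommendation, and the file names here are made up): compile the .cu part with nvcc -c kernels.cu, compile the rest with your usual compiler, e.g. g++ -c main.cpp, and then link the object files together against the CUDA runtime, e.g. g++ kernels.o main.o -o program -lcudart (you may need a -L flag pointing at your CUDA lib directory).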

I can’t help you with 3)

  4. This is non-trivial, and usually some experimentation has to be done to get optimum performance. A good starting point is somewhere between 128 and 256 (inclusive) threads per block, working your way up or down from there. Check out the occupancy calculator tool that came with the SDK.
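If you want to compare candidate block sizes, one rough way to do that (a sketch, reusing your kernel and variable names from earlier in the thread) is to time a launch with CUDA events:

// Sketch: time one launch for a candidate block size with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

int block = 192;                          // candidate value to test
int grid  = (n + block - 1) / block;      // enough blocks to cover n elements

cudaEventRecord(start, 0);
cudaVectorMultiply<<<grid, block>>>(n, bmass2, p_d, x1_d);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("block size %d: %f ms\n", block, ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);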

Thanks! That was really helpful.

The algorithm I’m trying to implement in CUDA is a conjugate gradient algorithm.

I’ve tried to write some pseudocode below. I think there is lots to parallelise, because there are so many loops that could be executed in parallel.

Again any help would be much appreciated!

It would look something like this…

for(1 to 1000) {

    for(1 to 512) {
        p[i] = x[i];
        r[i] = Phi[i];
    }

    for(1 to 512) {
        x1[i] = bmass*bmass * p[i];
    }

    for(1 to 3) {
        for(1 to 3) {

            for(1 to 512) {
                ish1[i] = ishf[i];
                ish2[i] = ishb[i];
                ish3[i] = ishf[i];
                ish4[i] = ishb[i];
            }

            for(1 to 512) {
                x1[i] = x1[i] - some other stuff
            }

            for(1 to 512) {
                alphad = alphad + conj(p[i])*x1[i];
            }

            alpha = alphan/alphad;

            for(1 to 512) {
                x[i] = x[i] + alpha*p[i];
            }

            for(1 to 512) {
                r[i] = r[i] - alpha*x1[i];
            }

            betan = 0.0;

            for(1 to 512) {
                betan = betan + conj(r[i])*r[i];
            }

            betadum = betan/betad;
            betad = betan;
            alphan = betan;

            for(1 to 512) {
                p[i] = r[i] + betadum*p[i];
            }

            if(betan < resid) {
                fprintf(fp98, "end of congr");
                break;
            }
        }
    }

    if(algorithm hasn't converged) {
        fprintf(fp6, " # iterations of congrad exceeds niterc");
    }
}

A kernel call, without doing any computation at all, takes 20 us (same for copying data).

In 20 us, a 2 GHz CPU can execute 40000 cycles. If the data is already in cache and the code reasonably optimized, you should be able to execute about as many operations. Even if the data is not in cache, about 5000 operations should be feasible.

So except in unusual circumstances (e.g. when it allows you to avoid a CPU<->GPU data transfer), starting a GPU kernel that executes fewer than several thousand operations is completely pointless.

From the pseudocode above I take it there are thousands of iterations of the outermost loop, but they cannot be parallelized. The only thing that can be easily parallelized is the content of each iteration, but there are only about 512 elements to go through in the internal loops. That’s enough to saturate a single block, maybe two. We need hundreds, or better yet hundreds of thousands :)

Unless you can run the outermost loop (for 1:1000) in parallel, but I see a convergence condition check at the bottom and I assume it’s a kind of iterative solver with a sequential dependency.

Thanks for the replies Big_Mac and Reimar, both are very helpful and informative.

The 512 comes from the fact that I’m working on an 8x8x8 lattice.
I’m trying to improve the speed of the code so we can increase the lattice size. I’d really like to work on a 32x32x64 lattice, which would give 65536 as my internal loop size, so my feeling is that using the GPU is a good way to go.

You are right, Big_Mac, it is an iterative solver and so the external loop can’t be parallelized (as far as I can see). There is lots more to the code, but when I profile it I find that 99% of the time is spent in this function, so I guess it’s pointless looking for parallelism anywhere else.

I’m currently rewriting the code so that all of the computation is done on the GPU and only the external loop is done on the CPU. I think this will mean only one kernel call per iteration of the external loop, and I can leave lots in the shared memory of the GPU. I’ve got a couple more questions…

  1. How do I allocate and copy a 2D array to the GPU shared memory area? I’m trying cudaMemcpyToArray but can’t seem to make it work.

  2. I have a loop over three variables. Could this be performed in one go? For example, could I change…

int i = blockIdx.x*blockDim.x+threadIdx.x;

for(id1 = 1; id1 <= Ndim; id1++) {
    for(id2 = 1; id2 <= Ndim; id2++) {
        if(i < n) {
            x1[i] = cuCsubf(x1[i], some stuff[i][id1][id2]);
        }
        __syncthreads();
    }
}

for…

int i = blockIdx.x*blockDim.x+threadIdx.x;
int j = blockIdx.y*blockDim.y+threadIdx.y;
int k = blockIdx.z*blockDim.z+threadIdx.z;

if(k < Ndim) {
    if(j < Ndim) {
        if(i < n) {
            x1[i] = cuCsubf(x1[i], some stuff[i][j][k]);
        }
        __syncthreads();
    }
}

Thanks again for all of the help!

Wes.

To 1): You can only allocate a static 2D array in smem, or deal with linear smem and calculate the addresses by hand (that’s how I mostly do it). cudaMemcpyToArray is a host function for copying from host to device memory. Shared memory only exists at block scope, so you won’t be able to keep your data in there after your block finishes.

To 2): I haven’t completely looked it through, but one thing to mention: grids are only 2D, so there is no blockIdx.z!
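One possible workaround, sketched under the assumption that the two extra dimensions stay small: keep the grid 1D and put the small dimensions into the block itself, since thread blocks can be 3D even though grids can’t.

// Host side: 32 threads in x for the sites, 3x3 in y and z for the small loops
// (32*3*3 = 288 threads, within the 512-threads-per-block limit).
dim3 block(32, 3, 3);
dim3 grid((n + block.x - 1) / block.x);

// Device side:
int i = blockIdx.x * blockDim.x + threadIdx.x;  // site index
int j = threadIdx.y;                            // first small dimension
int k = threadIdx.z;                            // second small dimension
// Note: several (j,k) threads now share the same i, so updates to x1[i] still need care.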

Thanks for your reply VrahoK.

I think I’ve gotten a little confused about shared memory. I’m trying to allocate a 2D array of ints or complex floats on the device that can be read by any thread.

Basically I want my int ishf[i][j] array to be viewable from within my kernel. I’ve tried to use cudaMallocArray and cudaMemcpyToArray, but I can’t get it to work. I have no problems copying my 1D arrays, so I’m guessing I’m using the functions incorrectly.

Could anyone point me in the direction of an example of copying a 2D array from host to device???

Thanks again,

Wes.

Well… no. Maybe you can run this function repeatedly in parallel. That would, in effect, parallelize the other code that seems to take up 1%, and it could be exactly what you need. Profiling never gives you the final answer on how to optimize.

P.S. Shared memory does sometimes persist from call to call, but you must not rely on that.

P.P.S. Don’t use cudaMallocArray; that’s for using textures. You should use cudaMalloc or possibly cudaMallocPitch. If you use cudaMallocPitch, use cudaMemcpy2D (but both are optional… you should just use cudaMalloc at first).
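For example, a minimal sketch of the plain-cudaMalloc route for a 2D int array (N and M here are placeholders for your actual dimensions):

// Sketch: treat the 2D array as one contiguous block and index it by hand.
int *ishf_d;
cudaMalloc((void **)&ishf_d, N * M * sizeof(int));

// This works as long as the host array is contiguous, e.g. a static int ishf[N][M].
cudaMemcpy(ishf_d, ishf, N * M * sizeof(int), cudaMemcpyHostToDevice);

// Inside a kernel, element (i, j) is then just ishf_d[i * M + j].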

If you want to increase your size from 512 to 65k (or even bigger), then the GPU is, I think, indeed the way to go, as you can easily scale to larger sizes afterwards.

As a general hint, I would read (and understand) some SDK examples before porting this code. Things like reduction and scan are very useful to understand, as they are standard parallel-programming tools, and you might need them eventually. That will also help you to understand the programming model.
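Since a couple of the loops in your pseudocode (the alphad and betan sums) are exactly reductions, here is a bare-bones sketch of the idea for a plain float array (much simpler than the optimized SDK version; it assumes blockDim.x is a power of two):

/* Each block sums blockDim.x input elements into one partial result in shared memory. */
__global__ void blockSum(int n, const float *in, float *partial)
{
    extern __shared__ float s[];              /* one float per thread, size set at launch */
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    /* Tree reduction within the block. */
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        partial[blockIdx.x] = s[0];           /* one partial sum per block */
}

/* Launch: blockSum<<<grid, block, block * sizeof(float)>>>(n, in_d, partial_d);
   The per-block partial sums are then added up on the host or by a second launch. */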

I think it’s not shared memory you’re talking about but global memory. Global memory is the “normal” device memory, your GPU’s DRAM, where you should store the bulk of your data. Shared memory is a per-block “scratchpad” memory, sort of a cache-like thing.

That’s an awful lot of ifs :) but they seem to be just for checking boundaries and the “else” part shouldn’t ever be reached if you calculate the number of threads and blocks correctly, right?

As for looping over three variables in one go: you need to be sure to handle the race condition there, but this is generally how you’d want to do it in CUDA. You exploit parallelism in more dimensions than you would if you parallelized only over “i” and did the rest sequentially.
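If handling that race turns out to be awkward, one simple way to sidestep it (just a sketch; the placeholder below stands in for your actual “some stuff”) is to keep one thread per i and let that thread walk the small 3x3 loops itself, accumulating in a register and writing x1[i] once:

__global__ void updateX1(int n, int Ndim, cuFloatComplex *x1)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cuFloatComplex acc = x1[i];
        // Only this thread ever touches x1[i], so there is no race on it.
        for (int id1 = 0; id1 < Ndim; id1++)
            for (int id2 = 0; id2 < Ndim; id2++)
                acc = cuCsubf(acc, make_cuFloatComplex(0.0f, 0.0f));  // placeholder for "some stuff[i][id1][id2]"
        x1[i] = acc;
    }
}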

Thanks for all of the replies. I’ve ended up declaring the array as a __device__ external at the start of my code to allocate the memory, then using cudaMemcpyToSymbol to fill it. Seems to work well.
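Roughly, what I mean is something like this (simplified, with a made-up second dimension):

// In the .cu file, at file scope: a fixed-size array in device memory, visible to every kernel in the file.
#define NSITES 512
#define NDIRS  4            // made-up placeholder for the second dimension
__device__ int ishf_d[NSITES][NDIRS];

// On the host, fill it from the host copy of the table.
cudaMemcpyToSymbol(ishf_d, ishf, sizeof(int) * NSITES * NDIRS);

// Kernels in this file can then read ishf_d[i][j] directly, with no pointer argument needed.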
I now have to figure out how to do a sum of a variable, say a[i], over the different threads (I’ve posted a new topic for that one).

Thanks again,

Wes.

Why not just do the multiplication as a matrix multiplication (with CUBLAS)? Treat the p array as a 2D array of size 2xn (one column for the real part, one for the imaginary part).

EDIT: Try either cublasCgemm (does actual complex arithmetic), or cublasSgemm if you remember to cast your arrays back to your complex struct (or whatever) after the multiplication is done.