Help me with very, very poor performance

Hi everyone,

I’m pretty new to CUDA, so please be gentle.

I’m benchmarking my OpenMP routine against my CUDA routine, and I seem to have run into a problem. I’ve successfully multithreaded it with CUDA, but my performance stinks.

Here is the kernel

__global__ void
EDCalc( float* genome, float* edspacing, float* rhoarray, float2* dnk, float* distarray, int totalpts, int refllayers, float roughness, float rho, int ptsperthread )
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;

    if(index < ptsperthread)
    {
        float temp = 0;
        float dist = 0;

        if(index == 0)
        {
            for(int i = 0; i < refllayers; i++)
            {
                rhoarray[i] = (genome[i+1]-genome[i])*rho/2.0f;
            }
        }

        __syncthreads();

        temp = 0.0f;
        for(int k = 0; k < refllayers; k++)
        {
            dist = (edspacing[index]-distarray[k])*roughness;

            if(dist > 6.0f)
                temp += (rhoarray[k])*(2.0f);
            else if (dist > -6.0f)
                temp += (rhoarray[k])*(1.0f+erff(dist));

            dnk[index].x = temp;
        }
    }
}

On the CPU, the function takes ~160 microseconds. With CUDA, it takes ~940 microseconds, unless I comment out dnk[index].x = temp; then all of the math takes 48 microseconds (a nice improvement), but I’m not storing anything, and I need to store it for the next step in the calculation. Any ideas on how to speed this up? I’m allocating that memory with

CUDA_SAFE_CALL(cudaMalloc((void**) cudank, npts*sizeof(float2)));

and it does not get copied back to the CPU side. I’ve attached the project if that helps.

I know this isn’t optimized as well as it could be, but I just want to get the proof of concept working. Thanks for any help.
CUDAtest.rar (1.63 MB)

for(int k = 0; k < refllayers; k++)
{
    dist = (edspacing[index]-distarray[k])*roughness;

    if(dist > 6.0f)
        temp += (rhoarray[k])*(2.0f);
    else if (dist > -6.0f)
        temp += (rhoarray[k])*(1.0f+erff(dist));

    dnk[index].x = temp;
}

Looking at this, the two biggest problems are the reads from distarray and rhoarray. Both are in global memory, and every thread is reading the same value at once. Global memory is extremely inefficient at broadcasting data, so you should consider putting distarray into constant memory, where the constant cache can help you out. You write to rhoarray earlier in this kernel, so it can’t go into constant memory. But how big is rhoarray? Will it fit into shared memory? That would help enormously as well.
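For example, here is a minimal sketch of the constant-memory version; c_distarray, MAX_LAYERS, and h_distarray are names I made up for illustration, and the fixed maximum size is an assumption:

// Sketch only: distarray moved to constant memory, so the broadcast read
// of distarray[k] by every thread is served from the constant cache.
#define MAX_LAYERS 100                      // assumed upper bound on refllayers

__constant__ float c_distarray[MAX_LAYERS];

// Host side, once whenever the layer geometry changes:
//   cudaMemcpyToSymbol(c_distarray, h_distarray, refllayers*sizeof(float));

// Kernel side, replacing the global-memory read:
//   dist = (edspacing[index] - c_distarray[k]) * roughness;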

Also, beware of benchmarks where you comment out a memory write! The compiler aggressively optimizes away dead code, and might eliminate a bunch of math if it detects the answer is not written out anywhere.
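As a toy illustration of that pitfall (this is not your kernel, just a made-up example):

// With the store commented out, nothing in the kernel is observable from
// outside, so the compiler is free to delete the whole loop. A timing run
// on such a kernel measures launch overhead, not the math.
__global__ void deadCodeDemo(const float* in, float* out, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += in[i] * in[i];
    // out[blockIdx.x*blockDim.x + threadIdx.x] = acc;  // removing this store removes the work
}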

Hi seibert, thanks for the reply. I’ve read up on shared memory, and this is much faster. Did I go overboard with the shared memory, or does this look right? Also, exactly how much shared memory do I have? I assume there is some sort of limit.

__global__ void
EDCalc( float* genome, float* edspacing, float* rhoarray, float2* dnk, float* distarray, int refllayers, float roughness, float rho, int totalpts, float boxsize, float dz, float leftoffset )
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;

    __shared__ float sroughness;
    __shared__ float srhoarray[100];
    __shared__ float sdrhoarray[100];
    __shared__ float sdistarray[100];
    __shared__ float sgenome[100];
    __shared__ float sboxsize;
    __shared__ float sdz;
    __shared__ float smulti;
    __shared__ float sleftoffset;

    sdz = dz;
    sboxsize = boxsize;
    sroughness = roughness*sdz;
    sleftoffset = leftoffset;
    smulti = rho/2.0f;

    int tid = threadIdx.x;

    if(tid < refllayers+1)
        sgenome[tid] = genome[tid];

    __syncthreads();

    if(tid < refllayers)
    {
        srhoarray[tid] = (sgenome[tid+1]-sgenome[tid])*smulti;
        sdrhoarray[tid] = 2.0f*srhoarray[tid];
        sdistarray[tid] = (-sleftoffset - tid*boxsize)/sdz;
    }

    __syncthreads();

    if(index < totalpts)
    {
        float temp = 0;
        float dist = 0;

        for(int k = 0; k < refllayers; k++)
        {
            dist = (index + sdistarray[k])*sroughness;

            if(dist > 6.0f)
                temp += sdrhoarray[k];
            else if (dist > -6.0f)
                temp += (srhoarray[k])*(1.0f+erff(dist));
        }

        dnk[index].x = temp;
    }

    __syncthreads();
}

16 KB per multiprocessor.

So if you use more than 8 KB (including blockDim, blockIdx, gridDim, and kernel parameters), you can have only 1 block per multiprocessor.

The command-line option --ptxas-options=-v to nvcc will print the amount of shared memory your kernel is actually using.
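If you would rather query the limit at runtime than remember it, a small host-side sketch like this works (error checking omitted; device 0 is assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);      // query device 0
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers per block:     %d\n", prop.regsPerBlock);
    return 0;
}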

Did I go overboard with the shared memory, or does this look right?

You went a little overboard. There’s no point in duplicating parameters in shared memory; they already live there. So sboxsize, sdz, and sleftoffset can go away.

The calculated ones, sroughness and smulti, are a little different: your current code calculates them once for each thread (which is fine), then stores them to shared memory once for each thread (which is not: it results in bank conflicts). Unless you’re running out of registers, either (a) just declare them “float”, (b) put the calculation in the single place each of these is used, e.g.

dist = (index + sdistarray[k])*roughness*sdz;

and count on the friendly compiler to optimize the calculation, or (c) do the multiplication outside the kernel and pass the result in.

A bigger potential overboardification is the four arrays of 100 floats each. If refllayers is typically 90 or 80, this might be fine. But if you’re just reserving the maximum you think you might need, and your actual usage is 20 or 30, consider having the calling code allocate one big shared array, which you then partition:

Host code:

// 4 arrays of refllayers floats
EDCalc<<<grid, threads, refllayers*sizeof(float)*4>>>(blah, blah, refllayers, blah);

Kernel code:

extern __shared__ float combined[];

...

__global__ void EDCalc(..., int refllayers, ...)
{
    float* srhoarray  = combined;
    float* sdrhoarray = combined + refllayers;
    float* sdistarray = combined + 2*refllayers;
    float* sgenome    = combined + 3*refllayers;

    // use the above arrays the way you did before
    ...
}

On another note, where did edspacing go? Also, rhoarray and distarray. If you’re not using them, get rid of them.

Finally,

dnk[index].x = temp;

is cheating you out of the coalesced write you so richly deserve. Consider not interlacing your output: i.e., have a big array of the x components, then a big array of the y components. Or, if you’re not using the y components, get rid of them.
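A minimal sketch of what the un-interlaced (structure-of-arrays) output could look like; dnk_x is an assumed name, not something from your project:

// Sketch: the x components go into their own float array instead of a float2,
// so consecutive threads write consecutive 4-byte words (coalesced).
__global__ void StoreX(float* dnk_x, int totalpts)
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;
    if (index < totalpts)
    {
        float temp = erff((float)index);   // stand-in for the real per-point result
        dnk_x[index] = temp;               // thread index and address move together
    }
}

// Host side:
//   float* dnk_x;
//   cudaMalloc((void**)&dnk_x, npts*sizeof(float));
//   StoreX<<<grid, threads>>>(dnk_x, npts);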

Hope this was helpful.

–redpill

Thank you very much redpill! That answered many of my questions. However, I don’t quite understand exactly how to coalesce the output, and I couldn’t find anything on Google about it either. Do you mean that instead of float2, I should just use float? Thanks for all of the help; this is running pretty smoothly now, and I haven’t even figured out how to take the inner loop out! It’s currently float2 because I will (in the future) use y. However, I don’t need it right now.

EDIT: So I read up on coalescing, and I didn’t see any speedup. All I did was change the float2 to a float. Is there something else I should do?

New kernel

extern __shared__ float combined[];

__global__ void
EDCalc( float* genome, float2* dnk, float* distarray, int refllayers, float roughness, float rho, int totalpts, float boxsize, float dz, float leftoffset )
{
    int index = blockIdx.x*blockDim.x + threadIdx.x;

    float* srhoarray = combined;
    float* sdrhoarray = combined + refllayers;
    float* sdistarray = combined + 2*refllayers;
    float* sgenome = combined + 3*refllayers;

    float smulti, sroughness;
    sroughness = roughness*dz;
    smulti = rho/2.0f;

    int tid = threadIdx.x;

    if(tid < refllayers+1)
        sgenome[tid] = genome[tid];

    __syncthreads();

    if(tid < refllayers)
    {
        srhoarray[tid] = (sgenome[tid+1]-sgenome[tid])*smulti;
        sdrhoarray[tid] = 2.0f*srhoarray[tid];
        sdistarray[tid] = (-leftoffset - tid*boxsize)/dz;
    }

    __syncthreads();

    if(index < totalpts)
    {
        float temp = 0;
        float dist = 0;

        for(int k = 0; k < refllayers; k++)
        {
            dist = (index + sdistarray[k])*sroughness;

            if(dist > 4.0f)
                temp += sdrhoarray[k];
            else if (dist > -4.0f)
                temp += (srhoarray[k])*(1.0f+erff(dist));
        }

        dnk[index].x = temp;
    }

    __syncthreads();
}

It’s possible then that the global write at the end is a negligible part of the overall kernel runtime, so coalesced vs. uncoalesced does not make an observable difference for you.
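One way to check is to time the kernel with CUDA events instead of a host-side timer; here is a hedged sketch, with the launch arguments left as a placeholder since they depend on your project:

// Sketch: timing a kernel launch with CUDA events.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
EDCalc<<<grid, threads, refllayers*sizeof(float)*4>>>(/* arguments as before */);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("EDCalc took %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);

Comparing that number with and without the final store tells you whether the write is actually the bottleneck.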