Question about performance of accessing a read-only array from a kernel

Hi, I need some help improving the performance of my application.

Removing one read from memory makes my code an order of magnitude faster, so I have to improve that line somehow. I’ve read about coalesced memory access, but I’m not sure my code does it correctly.

Host:

float *hold1V = NULL, *hold1V_GPU = NULL;
  hold1V = (float *)calloc(length2*length1, sizeof(float));
  for (int kk=0; kk<length1; kk++) {
    for (int k=0; k<length2; k++) {
      /* Store transposed, [k][kk] with stride length1, so that the kernel's
         index k*length1 + templateCounter reads back the matching element.
         cosf keeps the computation in single precision. */
      hold1V[k*length1 + kk] = cosf(var1[k]*var2[kk] - var1[k]*var3[k]);
    }
  }
  error = cudaMalloc(&hold1V_GPU, length2*length1*sizeof(float));
  if (error != cudaSuccess) LogPrintf(LOG_NORMAL, "%s %i %d\n", cudaGetErrorString(error),error,__LINE__);
  error = cudaMemcpy(hold1V_GPU, hold1V, length2*length1*sizeof(float), cudaMemcpyHostToDevice);
  if (error != cudaSuccess) LogPrintf(LOG_NORMAL, "%s %i %d\n", cudaGetErrorString(error),error,__LINE__);

GPU:

__global__ void function1( ... , const float * __restrict__ hold1V_GPU, ...)
{
  ...
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for ( templateCounter=index; templateCounter<length1; templateCounter+=stride) {
    ...
    for ( k=0; k<length2; k++ ) {
      ...
      coss = __ldg(&hold1V_GPU[k*length1 + templateCounter]);
      ...
    }
    ...
  }
  ...
}

The load on the coss line is what takes most of the time. Are the threads doing coalesced accesses? If not, how can I arrange them so they do?

I thought that all the threads in a warp, having contiguous values of index and being at the same step of the inner loop over length2, would be accessing memory in a coalesced way! But replacing the coss line with e.g. coss = 1 drops the execution time by an order of magnitude.

I’ve also tried adding __syncthreads(), but it doesn’t help:

__global__ void function1( ... , const float * __restrict__ hold1V_GPU, ...)
{
  ...
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for ( templateCounter=index; templateCounter<length1; templateCounter+=stride) {
    ...
    for ( k=0; k<length2; k++ ) {
      ...
      coss = __ldg(&hold1V_GPU[k*length1 + templateCounter]);
      ...
      __syncthreads();  /* no effect on coalescing; inside a loop that not all
                           threads of the block execute it can even deadlock */
    }
    ...
  }
  ...
}

Many thanks!

It appears that your load should coalesce properly.

If all pointer arguments to the kernel are declared __restrict__ and the kernel is not overly complicated, the compiler is highly likely to “do the right thing” and emit read-only (LDG) loads without the explicit __ldg().

Your code (to the extent shown) looks completely bound by memory throughput, which is why performance improves so dramatically when you remove the loads. If so, the fix is to select a GPU with higher memory throughput. What GPU are you currently using?

I don’t see anything improvable in the snippet as posted, but I would suggest running your code under the CUDA profiler (nvprof or Nsight Compute), which can point you at inefficiencies that may be in parts of the code you haven’t shown (such as the kernel launch configuration).