Hi, I need some help to improve the performance of my application.
By removing one read from memory, my code improves an order of magnitude, so I have to improve that line somehow. I’ve read about memory coalescing access, but I’m not sure if my code does it correctly.
Host:
float *hold1V = NULL, *hold1V_GPU = NULL;
int co = 0;
hold1V = (float *)calloc(length2*length1, sizeof(float));
for (int kk=0; kk<length1; kk++) {
for (int k=0; k<length2; k++) {
hold1V[co] = cos(var1[k]*var2[kk] - var1[k]*var3[k]);
co++;
}
}
error = cudaMalloc(&hold1V_GPU, length2*length1*sizeof(float));
if (error != cudaSuccess) LogPrintf(LOG_NORMAL, "%s %i %d\n", cudaGetErrorString(error),error,__LINE__);
error = cudaMemcpy(hold1V_GPU, hold1V, length2*length1*sizeof(float), cudaMemcpyHostToDevice);
if (error != cudaSuccess) LogPrintf(LOG_NORMAL, "%s %i %d\n", cudaGetErrorString(error),error,__LINE__);
GPU:
__global__ void function1( ... , const float * __restrict__ hold1V_GPU, ...)
{
...
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for ( templateCounter=index; templateCounter<length1; templateCounter+=stride) {
...
for ( k=0; k<length2; k++ ) {
...
coss = __ldg(&hold1V_GPU[k*length1 + templateCounter]);
...
}
...
}
...
}
The load from the coss line is the line which takes a lot of time. Are the threads doing coalesced access? If not, how can I arrange it?
I though that for all the threads in a warp, with contiguous indexes of the variable index, and at the same step of the innter loop over length2, they would be accessing in a coalesced way! But changing the coss line by e.g. coss=1 drops the execution time by an order of magnitude.
I’ve also tried adding __syncthreads(); but it doesn’t improve:
__global__ void function1( ... , const float * __restrict__ hold1V_GPU, ...)
{
...
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for ( templateCounter=index; templateCounter<length1; templateCounter+=stride) {
...
for ( k=0; k<length2; k++ ) {
...
coss = __ldg(&hold1V_GPU[k*length1 + templateCounter]);
...
__syncthreads();
}
...
}
...
}
Many thanks!