Why is the performance better? Referring to a Dr. Dobb's article

The following is taken from the Dr. Dobb's website:

The two programs reverse an array on the GPU; one uses shared memory and the other does not.

Can someone help me understand the answer to this question: since global memory is roughly 100x-150x slower than shared memory, shouldn't the much slower global memory accesses dominate the runtime of both examples? Why is the shared memory version faster?

The two programs cited above are given below for easy reference:

reverseArray_multiblock.cu

// includes, system
#include <stdio.h>
#include <assert.h>

// Simple utility function to check for CUDA runtime errors
void checkCUDAError(const char* msg);

// Part3: implement the kernel
__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    int inOffset  = blockDim.x * blockIdx.x;
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int in  = inOffset + threadIdx.x;
    int out = outOffset + (blockDim.x - 1 - threadIdx.x);
    d_out[out] = d_in[in];
}

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main( int argc, char** argv)
{
    // pointer for host memory and size
    int *h_a;
    int dimA = 256 * 1024; // 256K elements (1MB total)

    // pointer for device memory
    int *d_b, *d_a;

    // define grid and block size
    int numThreadsPerBlock = 256;

    // Part 1: compute number of blocks needed based on array size and desired block size
    int numBlocks = dimA / numThreadsPerBlock;

    // allocate host and device memory
    size_t memSize = numBlocks * numThreadsPerBlock * sizeof(int);
    h_a = (int *) malloc(memSize);
    cudaMalloc( (void **) &d_a, memSize );
    cudaMalloc( (void **) &d_b, memSize );

    // Initialize input array on host
    for (int i = 0; i < dimA; ++i)
    {
        h_a[i] = i;
    }

    // Copy host array to device array
    cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );

    // launch kernel
    dim3 dimGrid(numBlocks);
    dim3 dimBlock(numThreadsPerBlock);
    reverseArrayBlock<<< dimGrid, dimBlock >>>( d_b, d_a );

    // block until the device has completed
    cudaThreadSynchronize();

    // check if kernel execution generated an error
    checkCUDAError("kernel invocation");

    // device to host copy
    cudaMemcpy( h_a, d_b, memSize, cudaMemcpyDeviceToHost );

    // Check for any CUDA errors
    checkCUDAError("memcpy");

    // verify the data returned to the host is correct
    for (int i = 0; i < dimA; i++)
    {
        assert(h_a[i] == dimA - 1 - i );
    }

    // free device memory
    cudaFree(d_a);
    cudaFree(d_b);

    // free host memory
    free(h_a);

    // If the program makes it this far, then the results are correct and
    // there are no run-time errors. Good work!
    printf("Correct!\n");
    return 0;
}

void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
        exit(EXIT_FAILURE);
    }
}

reverseArray_multiblock_fast.cu (using shared memory)

// includes, system
#include <stdio.h>
#include <assert.h>

// Simple utility function to check for CUDA runtime errors
void checkCUDAError(const char* msg);

// Part 2 of 2: implement the fast kernel using shared memory
__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    extern __shared__ int s_data[];

    int inOffset = blockDim.x * blockIdx.x;
    int in = inOffset + threadIdx.x;

    // Load one element per thread from device memory and store it
    // *in reversed order* into temporary shared memory
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];

    // Block until all threads in the block have written their data to shared mem
    __syncthreads();

    // write the data from shared memory in forward order,
    // but to the reversed block offset as before
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int out = outOffset + threadIdx.x;
    d_out[out] = s_data[threadIdx.x];
}

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main( int argc, char** argv)
{
    // pointer for host memory and size
    int *h_a;
    int dimA = 256 * 1024; // 256K elements (1MB total)

    // pointer for device memory
    int *d_b, *d_a;

    // define grid and block size
    int numThreadsPerBlock = 256;

    // Compute number of blocks needed based on array size and desired block size
    int numBlocks = dimA / numThreadsPerBlock;

    // Part 1 of 2: Compute the number of bytes of shared memory needed
    // This is used in the kernel invocation below
    int sharedMemSize = numThreadsPerBlock * sizeof(int);

    // allocate host and device memory
    size_t memSize = numBlocks * numThreadsPerBlock * sizeof(int);
    h_a = (int *) malloc(memSize);
    cudaMalloc( (void **) &d_a, memSize );
    cudaMalloc( (void **) &d_b, memSize );

    // Initialize input array on host
    for (int i = 0; i < dimA; ++i)
    {
        h_a[i] = i;
    }

    // Copy host array to device array
    cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );

    // launch kernel
    dim3 dimGrid(numBlocks);
    dim3 dimBlock(numThreadsPerBlock);
    reverseArrayBlock<<< dimGrid, dimBlock, sharedMemSize >>>( d_b, d_a );

    // block until the device has completed
    cudaThreadSynchronize();

    // check if kernel execution generated an error
    checkCUDAError("kernel invocation");

    // device to host copy
    cudaMemcpy( h_a, d_b, memSize, cudaMemcpyDeviceToHost );

    // Check for any CUDA errors
    checkCUDAError("memcpy");

    // verify the data returned to the host is correct
    for (int i = 0; i < dimA; i++)
    {
        assert(h_a[i] == dimA - 1 - i );
    }

    // free device memory
    cudaFree(d_a);
    cudaFree(d_b);

    // free host memory
    free(h_a);

    // If the program makes it this far, then the results are correct and
    // there are no run-time errors. Good work!
    printf("Correct!\n");
    return 0;
}

void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
        exit(EXIT_FAILURE);
    }
}

The benefit of using shared memory is that you can re-use data. This reduces the total number of accesses to global memory and will therefore result in a speed-up. However, as far as I can see, this code example does not re-use any data (and there is no reason for it in this application anyway). Therefore, the speed-up is not due to the fact that you are using shared memory, but due to the way you are accessing global memory.
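To make the re-use point concrete, here is a minimal sketch (my own illustration, not from the article) of a kernel where shared memory does pay off: a 3-point moving average in which each element is fetched from global memory once but read by up to three threads. The kernel name movingAverage3 and the halo handling are assumptions for this example, and it assumes, like the article's code, that n is a multiple of blockDim.x.

// Illustrative only: 3-point moving average with one block-sized tile per block.
__global__ void movingAverage3(float *d_out, const float *d_in, int n)
{
    // assumes n is a multiple of blockDim.x, as in the reverse-array example
    extern __shared__ float tile[];     // blockDim.x + 2 elements (halo on each side)
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;          // slot 0 is the left halo

    tile[lid] = d_in[gid];              // one global load per element
    if (threadIdx.x == 0)               // left halo (0 at the array edge)
        tile[0] = (gid > 0) ? d_in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)  // right halo (0 at the array edge)
        tile[lid + 1] = (gid < n - 1) ? d_in[gid + 1] : 0.0f;
    __syncthreads();

    // each shared element is read by up to three threads -- that is the re-use
    d_out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}

Launched as movingAverage3<<< n / threads, threads, (threads + 2) * sizeof(float) >>>( d_out, d_in, n ), each output needs three inputs but causes only about one global load per element, which is exactly the saving the reverse-array kernels never get.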

Global memory accesses can be grouped together, or 'coalesced' (see the description of memory coalescing), to decrease global memory access times substantially. It appears that your original kernel has un-coalesced accesses, while your shared memory kernel has coalesced accesses. This could be one of the reasons. Another could be an effect known as 'partition camping'.
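To see where the coalescing difference comes from, it helps to write out which global addresses one half-warp touches in each kernel. The walk-through below is my reading of the two listings above (blockDim.x = 256, half-warp of 16 threads, pre-1.2 coalescing rules), not something spelled out in the article:

// Plain kernel, first half-warp (threads 0..15) of a block:
//   reads  d_in[inOffset + 0 .. inOffset + 15]        -> ascending, coalesces
//   writes d_out[outOffset + 255 .. outOffset + 240]  -> descending
d_out[outOffset + (blockDim.x - 1 - threadIdx.x)] = d_in[inOffset + threadIdx.x];
// On compute capability 1.0/1.1, thread k of a half-warp had to touch word k of
// a contiguous segment, so the descending write is split into 16 transactions.

// Shared-memory kernel, same half-warp (reversal already done inside s_data[]):
//   reads  d_in[inOffset + 0 .. inOffset + 15]        -> ascending, coalesces
//   writes d_out[outOffset + 0 .. outOffset + 15]     -> ascending, coalesces
d_out[outOffset + threadIdx.x] = s_data[threadIdx.x];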

I suggest you read some documentation on memory access patterns; that should explain your speed-up.

Any good reference on partition camping?

There’s a white paper named MatrixTranspose in your CUDA SDK folder (e.g. NVIDIA_GPU_Computing_SDK/C/src/transposeNew/doc) with a chapter about partition camping. I think it’s explained quite well in there.

ONeill: Thanks! But are you sure that the performance gain claimed by the author is a result of partition camping?

Nope, probably not. Partition camping could have some effect, but not so much that it would give you an order-of-magnitude difference.

Isn’t the problem with the first implementation that the reads can be coalesced but the writes cannot? I don’t think even cc 1.3 supports that kind of wrap-around reads/writes…

In the smem version (let's say we're working with a vector of length 32), thread #32 reads element #32 from global memory and places it in shared memory bank #1, and so on for each thread. After that, the block can write this vector back to global memory with a fully coalesced write. Is this clear?
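Spelled out with 0-based thread IDs for a 32-thread block b out of N blocks (the concrete numbers are mine, not the poster's), the fast kernel's index math looks like this:

// Fast kernel, blockDim.x = 32, block b of N:
//   coalesced read:          thread t loads  d_in[32*b + t]
//   reversed store to smem:  thread t stores s_data[31 - t]
//   __syncthreads();
//   coalesced write:         thread t writes d_out[32*(N - 1 - b) + t] = s_data[t]
// Thread 0 therefore writes the element thread 31 read: the block comes out
// reversed, yet both the global read and the global write proceed in plain
// ascending thread order, so both coalesce.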

The issue is coalescing on compute capability < 1.2 devices (which were the only devices available when the Dr. Dobbs article was written). On those older devices, threads had to read a contiguous block of memory in threadID order to get coalesced memory transactions. However, these rules were loosened significantly on compute capability >= 1.2. Both versions of this code should run at approximately the same speed on these newer CUDA devices.

Oh nice, I wasn't sure that was possible even with 1.3 devices…

BTW, there still aren't any 1.2 devices, right?

GT21x is Compute 1.2.

The GT240 card that everyone loves is 1.2.

Two questions here:

1- How do you know that the first implementation has coalesced reads but not coalesced writes?

2- Even in the second implementation, we ultimately have to copy the data back from shared memory to global memory. Is this copy coalesced?