Why is the performance better? Referring to a Dr. Dobb's article

The following is taken from the Dr. Dobb's website:

The two programs reverse an array on the GPU; one uses shared memory and the other does not.

Can someone help me understand the answer to this question: since global memory is roughly 100x-150x slower than shared memory, shouldn't the much slower global memory accesses dominate the runtime of both examples? Why is the shared memory version faster?

The two programs cited above are given below for easy reference:

reverseArray_multiblock.cu

// includes, system
#include <stdio.h>
#include <assert.h>

// Simple utility function to check for CUDA runtime errors
void checkCUDAError(const char* msg);

// Part3: implement the kernel
__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    int inOffset  = blockDim.x * blockIdx.x;
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int in  = inOffset + threadIdx.x;
    int out = outOffset + (blockDim.x - 1 - threadIdx.x);
    d_out[out] = d_in[in];
}

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main( int argc, char** argv)
{
    // pointer for host memory and size
    int *h_a;
    int dimA = 256 * 1024; // 256K elements (1MB total)

    // pointer for device memory
    int *d_b, *d_a;

    // define grid and block size
    int numThreadsPerBlock = 256;

    // Part 1: compute number of blocks needed based on array size and desired block size
    int numBlocks = dimA / numThreadsPerBlock;

    // allocate host and device memory
    size_t memSize = numBlocks * numThreadsPerBlock * sizeof(int);
    h_a = (int *) malloc(memSize);
    cudaMalloc( (void **) &d_a, memSize );
    cudaMalloc( (void **) &d_b, memSize );

    // Initialize input array on host
    for (int i = 0; i < dimA; ++i)
    {
        h_a[i] = i;
    }

    // Copy host array to device array
    cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );

    // launch kernel
    dim3 dimGrid(numBlocks);
    dim3 dimBlock(numThreadsPerBlock);
    reverseArrayBlock<<< dimGrid, dimBlock >>>( d_b, d_a );

    // block until the device has completed
    cudaThreadSynchronize();

    // check if kernel execution generated an error
    checkCUDAError("kernel invocation");

    // device to host copy
    cudaMemcpy( h_a, d_b, memSize, cudaMemcpyDeviceToHost );

    // Check for any CUDA errors
    checkCUDAError("memcpy");

    // verify the data returned to the host is correct
    for (int i = 0; i < dimA; i++)
    {
        assert(h_a[i] == dimA - 1 - i );
    }

    // free device memory
    cudaFree(d_a);
    cudaFree(d_b);

    // free host memory
    free(h_a);

    // If the program makes it this far, then the results are correct and
    // there are no run-time errors. Good work!
    printf("Correct!\n");
    return 0;
}

void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
        exit(EXIT_FAILURE);
    }
}

reverseArray_multiblock_fast.cu (using shared memory)

// includes, system
#include <stdio.h>
#include <assert.h>

// Simple utility function to check for CUDA runtime errors
void checkCUDAError(const char* msg);

// Part 2 of 2: implement the fast kernel using shared memory
__global__ void reverseArrayBlock(int *d_out, int *d_in)
{
    extern __shared__ int s_data[];

    int inOffset = blockDim.x * blockIdx.x;
    int in = inOffset + threadIdx.x;

    // Load one element per thread from device memory and store it
    // *in reversed order* into temporary shared memory
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];

    // Block until all threads in the block have written their data to shared mem
    __syncthreads();

    // write the data from shared memory in forward order,
    // but to the reversed block offset as before
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int out = outOffset + threadIdx.x;
    d_out[out] = s_data[threadIdx.x];
}

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main( int argc, char** argv)
{
    // pointer for host memory and size
    int *h_a;
    int dimA = 256 * 1024; // 256K elements (1MB total)

    // pointer for device memory
    int *d_b, *d_a;

    // define grid and block size
    int numThreadsPerBlock = 256;

    // Compute number of blocks needed based on array size and desired block size
    int numBlocks = dimA / numThreadsPerBlock;

    // Part 1 of 2: Compute the number of bytes of shared memory needed
    // This is used in the kernel invocation below
    int sharedMemSize = numThreadsPerBlock * sizeof(int);

    // allocate host and device memory
    size_t memSize = numBlocks * numThreadsPerBlock * sizeof(int);
    h_a = (int *) malloc(memSize);
    cudaMalloc( (void **) &d_a, memSize );
    cudaMalloc( (void **) &d_b, memSize );

    // Initialize input array on host
    for (int i = 0; i < dimA; ++i)
    {
        h_a[i] = i;
    }

    // Copy host array to device array
    cudaMemcpy( d_a, h_a, memSize, cudaMemcpyHostToDevice );

    // launch kernel
    dim3 dimGrid(numBlocks);
    dim3 dimBlock(numThreadsPerBlock);
    reverseArrayBlock<<< dimGrid, dimBlock, sharedMemSize >>>( d_b, d_a );

    // block until the device has completed
    cudaThreadSynchronize();

    // check if kernel execution generated an error
    checkCUDAError("kernel invocation");

    // device to host copy
    cudaMemcpy( h_a, d_b, memSize, cudaMemcpyDeviceToHost );

    // Check for any CUDA errors
    checkCUDAError("memcpy");

    // verify the data returned to the host is correct
    for (int i = 0; i < dimA; i++)
    {
        assert(h_a[i] == dimA - 1 - i );
    }

    // free device memory
    cudaFree(d_a);
    cudaFree(d_b);

    // free host memory
    free(h_a);

    // If the program makes it this far, then the results are correct and
    // there are no run-time errors. Good work!
    printf("Correct!\n");
    return 0;
}

void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
        exit(EXIT_FAILURE);
    }
}

The benefit of using shared memory is that you can re-use data. This reduces the total number of accesses to global memory and will therefore result in a speed-up. However, as far as I can see, this code example does not re-use any data (and there is no reason for it in this application anyway). Therefore, the speed-up is not due to the fact that you are using shared memory, but due to the way you are accessing global memory.
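To make the re-use point concrete, here is a minimal sketch (my own illustration, not from the article) of a kernel where shared memory does pay off: a 3-point moving average in which each element is fetched from global memory once but read by up to three threads. The kernel name movingAverage3 and the halo handling are assumptions for this example, and it assumes, like the article's code, that n is a multiple of blockDim.x.

// Illustrative only: 3-point moving average with one block-sized tile per block.
__global__ void movingAverage3(float *d_out, const float *d_in, int n)
{
    // assumes n is a multiple of blockDim.x, as in the reverse-array example
    extern __shared__ float tile[];     // blockDim.x + 2 elements (halo on each side)
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;          // slot 0 is the left halo

    tile[lid] = d_in[gid];              // one global load per element
    if (threadIdx.x == 0)               // left halo (0 at the array edge)
        tile[0] = (gid > 0) ? d_in[gid - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)  // right halo (0 at the array edge)
        tile[lid + 1] = (gid < n - 1) ? d_in[gid + 1] : 0.0f;
    __syncthreads();

    // each shared element is read by up to three threads -- that is the re-use
    d_out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}

Launched as movingAverage3<<< n / threads, threads, (threads + 2) * sizeof(float) >>>( d_out, d_in, n ), each output needs three inputs but causes only about one global load per element, which is exactly the saving the reverse-array kernels never get.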

Global memory accesses can be grouped together, or 'coalesced' (see the description of memory coalescing), to decrease global memory access times substantially. It appears that your original kernel has un-coalesced accesses, while your shared memory kernel has coalesced accesses. This could be one of the reasons. Another could be an effect known as 'partition camping'.
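To see where the coalescing difference comes from, it helps to write out which global addresses one half-warp touches in each kernel. The walk-through below is my reading of the two listings above (blockDim.x = 256, half-warp of 16 threads, pre-1.2 coalescing rules), not something spelled out in the article:

// Plain kernel, first half-warp (threads 0..15) of a block:
//   reads  d_in[inOffset + 0 .. inOffset + 15]        -> ascending, coalesces
//   writes d_out[outOffset + 255 .. outOffset + 240]  -> descending
d_out[outOffset + (blockDim.x - 1 - threadIdx.x)] = d_in[inOffset + threadIdx.x];
// On compute capability 1.0/1.1, thread k of a half-warp had to touch word k of
// a contiguous segment, so the descending write is split into 16 transactions.

// Shared-memory kernel, same half-warp (reversal already done inside s_data[]):
//   reads  d_in[inOffset + 0 .. inOffset + 15]        -> ascending, coalesces
//   writes d_out[outOffset + 0 .. outOffset + 15]     -> ascending, coalesces
d_out[outOffset + threadIdx.x] = s_data[threadIdx.x];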

I suggest you read some documentation on memory access patterns; that should explain your speed-up.

Any good reference on partition camping?

There’s a white paper named MatrixTranspose in your CUDA SDK folder (e.g. NVIDIA_GPU_Computing_SDK/C/src/transposeNew/doc) with a chapter about partition camping. I think it’s explained quite well in there.

ONeill: Thanks! But are you sure that the performance gain claimed by the author is a result of partition camping?

Nope, probably not. Partition camping could have some effect, but not so much that it would give you an order-of-magnitude difference.

Isn’t the problem with the first implementation that the reads can be coalesced but the writes cannot? I don’t think even cc 1.3 supports that kind of wrap-around reads/writes…

In the smem version (let's say we're working with a vector of length 32), thread #32 reads element #32 from global memory and places it in shared memory bank #1, and so on for each thread. After that, the block can write this vector back to global memory with a fully coalesced write. Is this clear?
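Spelled out with 0-based thread IDs for a 32-thread block b out of N blocks (the concrete numbers are mine, not the poster's), the fast kernel's index math looks like this:

// Fast kernel, blockDim.x = 32, block b of N:
//   coalesced read:          thread t loads  d_in[32*b + t]
//   reversed store to smem:  thread t stores s_data[31 - t]
//   __syncthreads();
//   coalesced write:         thread t writes d_out[32*(N - 1 - b) + t] = s_data[t]
// Thread 0 therefore writes the element thread 31 read: the block comes out
// reversed, yet both the global read and the global write proceed in plain
// ascending thread order, so both coalesce.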

The issue is coalescing on compute capability < 1.2 devices (which were the only devices available when the Dr. Dobbs article was written). On those older devices, threads had to read a contiguous block of memory in threadID order to get coalesced memory transactions. However, these rules were loosened significantly on compute capability >= 1.2. Both versions of this code should run at approximately the same speed on these newer CUDA devices.

Oh nice, I wasn't sure that was possible even with 1.3 devices…

BTW, there still aren't any 1.2 devices, right?

GT21x is Compute 1.2.

The GT240 card that everyone loves is 1.2.

Two questions here:

1- How do you know that the first implementation has coalesced reads but not coalesced writes?

2- Even in the second implementation, we ultimately have to copy the data back from shared memory to global memory. Is this copy coalesced?