I’m new to CUDA. I started writing my first test program (a sort), but when trying to optimise it I have hit a problem: when I tried using shared memory, it took longer to execute than before. I have reduced the code down to these essentials (the comment at the top shows the recorded times in ms):
[codebox]
/**
         Release                Debug
       1.0  1.1  1.2  1.3    1.0  1.1  1.2  1.3
Global 548  548  535  771    548  540  540  771
Shared 655  655  655  901    655  655  655  901
*/
__global__ extern void Test1(char* d_sortArray)
{
    // arraySize and sortSize are compile-time constants (definitions omitted)
    __shared__ char s_saBackup[arraySize];
    __shared__ char s_sortArray[arraySize];
    //char* s_sortArray = d_sortArray; // "Global" variant: use global memory instead of the __shared__ array above

    // Make a local copy, to be restored each iteration
    for(int i = 0; i < sortSize; i += blockDim.x)
    {
        s_saBackup[i + threadIdx.x] = s_sortArray[i + threadIdx.x];
    }
    __syncthreads();

    for(int j = 0; j < 1e6; j++) // repeat so the copy dominates the timing
    {
        // Restore the copy
        for(int i = 0; i < sortSize; i += blockDim.x)
        {
            s_sortArray[i + threadIdx.x] = s_saBackup[i + threadIdx.x];
        }
        __syncthreads();
    }
}[/codebox]
As you can see, copying from shared to shared memory takes longer than copying from shared to global memory. More worryingly, building for the 1.3 architecture takes longer too. Why is this? What am I doing wrong?
Thanks in advance for help.
Raffles
PS If I make it a device-to-device copy (instead of shared-to-device), the 1.3 time goes up to 1219 for Release and 1009 for Debug!
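For reference, a minimal cudaEvent-based harness along these lines produces timings like the ones quoted (a sketch only: the sizes and launch configuration below are placeholders, not necessarily the exact ones used):
[codebox]
#include <cstdio>
#include <cuda_runtime.h>

const int arraySize = 4096; // placeholder value
const int sortSize  = 4096; // placeholder value

// ... Test1 kernel from above goes here ...

int main()
{
    char* d_sortArray = 0;
    cudaMalloc((void**)&d_sortArray, arraySize);
    cudaMemset(d_sortArray, 0, arraySize);

    // Time the kernel with CUDA events
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    Test1<<<1, 256>>>(d_sortArray); // one block of 256 threads (placeholder)
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.0f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_sortArray);
    return 0;
}
[/codebox]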
It is probably shared memory bank conflicts. Shared memory is “striped” across 16 banks, with each bank holding a set of 32-bit words. If multiple active threads try to access data stored in the same bank simultaneously, a “bank conflict” occurs and the shared memory accesses are serialized. It is particularly easy to generate bank conflicts with chars, because four chars fit in each 32-bit word, and typical access patterns mean multiple active threads reading sequentially stored bytes wind up reading from the same shared memory bank. Double-word types can have the same problem.
You can read more about bank conflicts, and what you can do about them, in Section 5.1.2.5 of the programming guide.
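To make that concrete, here is a minimal sketch (my own illustration, not your code; the kernel name and sizes are made up) of the two access patterns on 1.x hardware:
[codebox]
// Illustrative only: on compute 1.x, shared memory has 16 banks of
// 32-bit words, and requests are serviced one half-warp (16 threads)
// at a time. Launch with one block of up to 256 threads.
__global__ void BankConflictDemo(int* d_out)
{
    __shared__ char s_bytes[4 * 256]; // contents irrelevant, only the access pattern matters

    // Byte-wise access: threads 0..3 of a half-warp read bytes that all
    // live in the same 32-bit word, i.e. the same bank -> 4-way bank
    // conflict, so the accesses are serialized.
    char c = s_bytes[threadIdx.x];

    // Word-wise access: each thread of the half-warp reads a different
    // 32-bit word, so all 16 threads hit different banks -> conflict-free.
    int w = ((int*)s_bytes)[threadIdx.x];

    d_out[threadIdx.x] = c + w; // keep the reads from being optimized away
}
[/codebox]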
Thanks, that’s helped a bit, but I’m still not seeing the “400 to 600 clock cycles of memory latency” difference between global and shared memory accesses. New figures:
[codebox]
/**
char array, 1 thread block
         Release                Debug
       1.0  1.1  1.2  1.3    1.0  1.1  1.2  1.3
Global 548  548  535  771    548  540  540  771
Shared 655  655  655  901    655  655  655  901

int array, 1 thread block
Release Debug
1.0 1.1 1.2 1.3 1.0 1.1 1.2 1.3
Global 1008 1024 1220 1008 1013 1008
Shared 426 426 653 426 426 428

int array, 10 thread blocks
Release Debug
1.0 1.1 1.2 1.3 1.0 1.1 1.2 1.3
Global 2150 2279 2150 2150
Shared 428 653 428 428
*/[/codebox]
Any ideas? (Also, why is the Release build slower than the Debug build, and 1.3 slower than the older architectures?)