Alright, so it looks like I was wrong.
I put together a simple test to try to determine the effect of transaction size on memory transfer bandwidth. It turns out that when only one thread per warp is active, memory bandwidth increases with larger data types, which disproves my previous assertion. However, with 32 active threads per warp, bandwidth peaks at ~60GB/s for 32-bit data types and tapers off for larger ones, as Uncle Joe noticed.
Now I looked into this in a little more depth, because I know that I have hit over 100GB/s with a previous implementation of memcpy on a GTX 280. Moving from 6 warps per block to 1 warp per block raises the upper bound to 77GB/s on the GTX 280. What is interesting is that in this case, 128-bit elements (512 bytes requested per warp) achieve the highest bandwidth. Increasing the total number of blocks from 64 to 384 raises the upper bound again to ~100GB/s, but in this case 64-bit elements (256 bytes requested per warp) are the fastest. This makes me think that all transaction sizes are equally efficient from the SM's perspective, and that the achieved bandwidth depends more on the interleaved memory access pattern generated by the combination of all active blocks on all SMs. This paper talks about a similar issue: http://www.ece.ubc.ca/~aamodt/papers/gyuan.micro2009.pdf
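As a quick sanity check on those per-warp numbers (this is just my own arithmetic, not part of the test program): a warp has 32 lanes, so a fully active warp requesting one element per lane asks for 32 * sizeof(T) bytes, which is where the 256-byte (64-bit) and 512-byte (128-bit) figures above come from.

#include <stdio.h>

int main()
{
    // Bytes requested by one fully active warp (32 lanes), one element per lane
    const unsigned int elementBytes[5] = { 1, 2, 4, 8, 16 }; // uchar ... uint4
    for( unsigned int i = 0; i < 5; ++i )
    {
        printf( "%3u-bit elements -> %u bytes per warp\n",
            elementBytes[i] * 8, 32 * elementBytes[i] );
    }
    return 0;
}

Here is the actual test program: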
#include <stdio.h>

template< typename T, unsigned int size, unsigned int threads,
    unsigned int warps, unsigned int blocks >
__global__ void testCoalescing( T* out, const T* in )
{
    // Deactivate all but the first 'threads' lanes of each warp
    if( ( threadIdx.x & 0x1F ) >= threads ) return;

    // Index over the active threads only: warp id * active lanes + lane id
    unsigned int tid = ( threadIdx.x >> 5 ) * threads + ( threadIdx.x & 0x1F );
    unsigned int bid = blockIdx.x;
    unsigned int gid = tid + bid * warps * threads;

    const unsigned int totalThreads = threads * warps * blocks;
    const unsigned int iterations = size / totalThreads;

    // Strided copy: each active thread moves one element per iteration
    unsigned int index = gid;
    #pragma unroll 32
    for( unsigned int i = 0; i < iterations; ++i, index += totalThreads )
    {
        out[index] = in[index];
    }
    // Pick up the remainder when size is not a multiple of totalThreads
    for( ; index < size; index += totalThreads )
    {
        out[index] = in[index];
    }
}
template< typename T, unsigned int size, unsigned int threads,
    unsigned int warps, unsigned int blocks >
void test()
{
    T* in;
    T* out;
    // 'size' is in bytes; the contents are irrelevant for a bandwidth test
    cudaMalloc( (void**) &in, size );
    cudaMalloc( (void**) &out, size );

    cudaEvent_t start;
    cudaEvent_t end;
    cudaEventCreate( &start );
    cudaEventCreate( &end );

    // Time 100 launches and average to amortize launch overhead
    cudaEventRecord( start, 0 );
    for( unsigned int i = 0; i < 100; ++i )
    {
        testCoalescing< T, size / sizeof( T ), threads,
            warps, blocks ><<< blocks, warps * 32 >>>( out, in );
    }
    cudaEventRecord( end, 0 );
    cudaEventSynchronize( end );

    float time = 0.0f;
    cudaEventElapsedTime( &time, start, end ); // milliseconds for all 100 launches
    time /= 100.0f;                            // milliseconds per launch

    cudaEventDestroy( start );
    cudaEventDestroy( end );
    cudaFree( in );
    cudaFree( out );

    // Each launch reads and writes 'size' bytes; convert ms to s and bytes to GiB
    printf( " %u-bit transfer bandwidth: %fGB/s\n",
        (unsigned int)( sizeof( T ) * 8 ),
        ( ( 2 * size * 1000.0f ) / 1073741824.0f ) / time );
}
int main( int argc, char** argv )
{
    const unsigned int size    = 2 << 24; // bytes to copy (32MB)
    const unsigned int warps   = 6;       // cover the 24-stage pipeline
    const unsigned int blocks  = 64;
    const unsigned int threads = 32;      // active threads per warp

    int device;
    cudaGetDevice( &device );
    cudaDeviceProp prop;
    cudaGetDeviceProperties( &prop, device );
    printf( "For device: %s\n", prop.name );

    test< unsigned char, size, threads, warps, blocks >();
    test< unsigned short, size, threads, warps, blocks >();
    test< unsigned int, size, threads, warps, blocks >();
    test< long long unsigned int, size, threads, warps, blocks >();
    test< uint4, size, threads, warps, blocks >();
    return 0;
}
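The other configurations reported below only differ in the constants at the top of main() (recompiled each time, since they are template parameters). For example, a sketch of the last configuration (32 active threads per warp, one warp per block, 384 total blocks) would just be:

    const unsigned int size    = 2 << 24; // bytes to copy (32MB)
    const unsigned int warps   = 1;       // one warp per block
    const unsigned int blocks  = 384;
    const unsigned int threads = 32;      // active threads per warp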
Here are the results for one active thread per warp, 6 warps per block, 64 blocks:
normal@phenom:~/temp/coalescing$ ./coalescing
For device: GeForce GTX 280
8-bit transfer bandwidth: 1.107381GB/s
16-bit transfer bandwidth: 3.196410GB/s
32-bit transfer bandwidth: 5.778513GB/s
64-bit transfer bandwidth: 7.753606GB/s
128-bit transfer bandwidth: 8.331792GB/s
normal@phenom:~/temp/coalescing$ ./coalescing
For device: Tesla C1060
8-bit transfer bandwidth: 1.036921GB/s
16-bit transfer bandwidth: 2.853252GB/s
32-bit transfer bandwidth: 5.173034GB/s
64-bit transfer bandwidth: 7.878699GB/s
128-bit transfer bandwidth: 9.468297GB/s
For 32 active threads per warp, 6 warps per block, 64 blocks:
normal@phenom:~/temp/coalescing$ ./coalescing
For device: Tesla C1060
8-bit transfer bandwidth: 13.369940GB/s
16-bit transfer bandwidth: 36.440418GB/s
32-bit transfer bandwidth: 48.894238GB/s
64-bit transfer bandwidth: 34.927246GB/s
128-bit transfer bandwidth: 34.616203GB/s
normal@phenom:~/temp/coalescing$ ./coalescing
For device: GeForce GTX 280
8-bit transfer bandwidth: 15.327665GB/s
16-bit transfer bandwidth: 44.629219GB/s
32-bit transfer bandwidth: 59.287403GB/s
64-bit transfer bandwidth: 46.184879GB/s
128-bit transfer bandwidth: 44.705101GB/s
For 32 active threads per warp, one warp per block and 64 total blocks:
For device: GeForce GTX 280
8-bit transfer bandwidth: 7.316723GB/s
16-bit transfer bandwidth: 17.354494GB/s
32-bit transfer bandwidth: 30.239174GB/s
64-bit transfer bandwidth: 54.181236GB/s
128-bit transfer bandwidth: 76.944344GB/s
normal@phenom:~/temp/coalescing$ ./coalescing
For device: Tesla C1060
8-bit transfer bandwidth: 6.486255GB/s
16-bit transfer bandwidth: 15.308207GB/s
32-bit transfer bandwidth: 29.604433GB/s
64-bit transfer bandwidth: 50.725243GB/s
128-bit transfer bandwidth: 68.840340GB/s
For 32 active threads per warp, but only one warp per block and 384 total blocks:
For device: GeForce GTX 280
8-bit transfer bandwidth: 14.599340GB/s
16-bit transfer bandwidth: 42.640224GB/s
32-bit transfer bandwidth: 75.512741GB/s
64-bit transfer bandwidth: 101.169762GB/s
128-bit transfer bandwidth: 85.896751GB/s
normal@phenom:~/temp/coalescing$ ./coalescing
For device: Tesla C1060
8-bit transfer bandwidth: 13.462626GB/s
16-bit transfer bandwidth: 39.852840GB/s
32-bit transfer bandwidth: 63.858753GB/s
64-bit transfer bandwidth: 74.491768GB/s
128-bit transfer bandwidth: 74.745811GB/s
(Note that this is running on Ocelot, so the results are slightly lower than they would be with the NVIDIA toolchain, since the kernels go through the device JIT compiler rather than the native instructions generated by NVCC; the slowdown should apply roughly equally across all data types, though.)