Why are 256-byte loads slower than 128-byte loads?

I know the CUDA programming guide says 256-byte loads are 2x slower than 128-byte loads for compute capability 1.0. I've measured it on a Tesla C1060, and I get 40GiB/s for coalesced 256-byte loads and 68.5GiB/s for coalesced 128-byte loads.

Why the difference? From my understanding, each memory transaction maps to a single memory bank (the global address space is divided into 256-byte partitions, matching the maximum memory transaction size?). In that case, wouldn't 256-byte transactions be more efficient than 128-byte ones - just increase the DRAM burst-mode access size?
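For what it's worth, here is a minimal sketch of the partition model I have in mind. The 256-byte interleave granularity and the count of 8 partitions are my assumptions for GT200-class parts, not documented values:

#include <stdio.h>

// Hypothetical: map a global byte address to a memory partition, assuming
// round-robin interleaving of 256-byte chunks across 8 partitions (both
// numbers are assumptions, not documented values).
unsigned int partitionOf( unsigned long long address )
{
	const unsigned int partitionSize = 256; // assumed interleave granularity
	const unsigned int numPartitions = 8;   // assumed partition count
	return (unsigned int)( ( address / partitionSize ) % numPartitions );
}

int main()
{
	// Under this model, an aligned 256-byte transaction stays within one
	// partition, and its two 128-byte halves hit that same partition, so
	// a single larger burst would seem at least as efficient.
	for( unsigned long long a = 0; a < 2048; a += 256 )
		printf( "address %4llu -> partition %u\n", a, partitionOf( a ) );
	return 0;
}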

I really want to use 256-byte loads because I want each thread to process as large a block as possible at the finest grid level, to reduce the merge costs. Even given the penalty, it probably is still worth it.

This is more of an educated guess, but coalescing requirements apply per half-warp (16 threads). If we assume that the hardware only supports 64-bit ld/st operations (anything bigger requires multiple operations), then 16*64/8 = 64 bytes per half-warp is the maximum transaction size supported by hardware (or equivalently, 128 bytes for 32 threads). If each thread tries to do a 128-bit ld/st, the hardware (or compiler) may need to translate this into two consecutive 64-bit operations. This would result in interleaved/uncoalesced memory accesses.

Again, this is just a guess, as I do not know how the hardware actually works. You could test this with the visual profiler by counting the number of uncoalesced ld/st operations when using 256-byte transactions. Sylvain might have a better idea.

16 * 64 / 8 == 128 and not 64…

am I missing something here ??? :">

Sadly, one would think that I could do basic math. Evidently that isn’t the case.

I went back and read the programming guide again. Based on the relevant passages on coalescing, I think that my original post still might be right. Let me try to phrase it correctly:

A coalesced load is issued per half-warp (16 threads). If we assume that the hardware only supports 32-bit ld/st operations (anything bigger requires multiple operations), then 16*32/8 = 64 bytes per half-warp is the maximum transaction size supported by hardware (or equivalently, 128 bytes for 32 threads). If each thread tries to do a 64-bit ld/st, the hardware (or compiler) may need to translate this into two consecutive 32-bit operations. This would result in interleaved/uncoalesced memory accesses.
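To make the hypothesis concrete, here is a purely illustrative kernel (hypothetical behavior, not how the hardware is documented to work) showing what a per-thread 64-bit load would look like if it were decomposed into two 32-bit loads:

// Hypothetical illustration only: if the hardware split each 64-bit load
// into two 32-bit loads, the half-warp's accesses would look like this.
__global__ void split64( unsigned int* out, const unsigned int* in )
{
	unsigned int t = threadIdx.x;

	// Logical operation: one contiguous 64-bit load per thread, i.e.
	//   unsigned long long v = ( (const unsigned long long*) in )[t];
	// Hypothetical decomposition into two strided 32-bit loads:
	unsigned int lo = in[2 * t];     // threads hit words 0, 2, 4, ... (stride 2)
	unsigned int hi = in[2 * t + 1]; // threads hit words 1, 3, 5, ... (stride 2)

	out[2 * t]     = lo;
	out[2 * t + 1] = hi;
}

Each 32-bit sub-access has a stride of two words across the half-warp, which breaks the contiguity requirement for coalescing on compute capability 1.0.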

I am still leaning towards this explanation because the uncoalesced penalty decreases as the size of an individual operation increases.

OK, so you're suggesting that a contiguous 8-byte load by 16 threads will take four 64-byte transactions instead of two:

transaction 0: read lower 4 bytes for threads 0 1 2 3 4 5 6 7

transaction 1: read lower 4 bytes for threads 8 9 10 11 12 13 14 15

transaction 2: read upper 4 bytes for threads 0 1 2 3 4 5 6 7

transaction 3: read upper 4 bytes for threads 8 9 10 11 12 13 14 15

Obviously two of the transactions are redundant, so I think NVIDIA would have made the memory controller smart enough not to issue them. GT200 has better coalescing than G80, so why would they not implement this optimization?

The access pattern is completely contiguous - one 256-byte transaction should be convertible into multiple 128-byte transactions. Any reduction in bandwidth must therefore come from an inefficiency in either the memory controller or the memory interface of an MP.

I offered a possible explanation, but without knowing exactly what the hardware is doing it is hard to give a satisfactory answer. One way to test my theory would be to have a single thread per warp issue 32-, 64-, and 128-bit memory operations. If my theory is correct, the 128-bit operations should take longer because they issue multiple transactions.

Alright, so it looks like I was wrong.

I put together a simple test to try to determine the effect of transaction size on memory transfer bandwidth. It turns out that when only one thread per warp is active, memory bandwidth increases with larger data types, which disproves my previous assertion. With 32 active threads, however, bandwidth tapers off after the 32-bit data type at ~60GB/s, as Uncle Joe noticed.

Now I have looked into this a bit more in depth, because I know that I have hit over 100GB/s with a previous implementation of memcpy on a GTX 280. Moving from 6 warps per block to 1 warp per block increases the upper bound to 77GB/s on the GTX 280. What is interesting is that in this case, 128-bit elements (or 512-byte transactions) achieve the highest bandwidth. Increasing the total number of blocks from 64 to 384 increases the upper bound again to ~100GB/s, but in that case, 64-bit elements (256-byte transactions) are the fastest. This makes me think that all transaction sizes are equally efficient from the SM's perspective, and that the bandwidth depends more on the interleaved memory access pattern generated by the combination of all active blocks on all SMs. This paper discusses a similar issue: http://www.ece.ubc.ca/~aamodt/papers/gyuan.micro2009.pdf
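For reference, the alternate configurations discussed above correspond to editing the constants in main() of the code below, along these lines:

// One active thread per warp, 6 warps per block, 64 blocks:
//   const unsigned int threads = 1;
// One warp per block, 64 total blocks (the 77GB/s case):
//   const unsigned int warps = 1;
// One warp per block, 384 total blocks (the ~100GB/s case):
//   const unsigned int warps = 1;
//   const unsigned int blocks = 384;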

#include <stdio.h>

template< typename T, unsigned int size, unsigned int threads, unsigned int warps, unsigned int blocks >
__global__ void testCoalescing( T* out, const T* in )
{
	if( ( threadIdx.x & 0x1F ) >= threads ) return; // bail out inactive threads

	// Index of this active thread within the block: warp id times active
	// threads per warp, plus the lane id within the warp.
	unsigned int tid = ( threadIdx.x >> 5 ) * threads + ( threadIdx.x & 0x1F );
	unsigned int bid = blockIdx.x;
	unsigned int gid = tid + bid * warps * threads;

	const unsigned int totalThreads = threads * warps * blocks;
	const unsigned int iterations = size / totalThreads;

	unsigned int index = gid;

	#pragma unroll 32
	for( unsigned int i = 0; i < iterations; ++i, index += totalThreads )
	{
		out[index] = in[index];
	}

	// Handle the remainder when size is not a multiple of totalThreads.
	for( ; index < size; index += totalThreads )
	{
		out[index] = in[index];
	}
}

template< typename T, unsigned int size, unsigned int threads, unsigned int warps, unsigned int blocks >
void test()
{
	T* in;
	T* out;

	cudaMalloc( (void**) &in, size );
	cudaMalloc( (void**) &out, size );

	cudaEvent_t start;
	cudaEvent_t end;
	cudaEventCreate( &start );
	cudaEventCreate( &end );
	cudaEventRecord( start, 0 );

	for( unsigned int i = 0; i < 100; ++i )
	{
		testCoalescing< T, size / sizeof( T ), threads,
			warps, blocks ><<< blocks, warps * 32 >>>( out, in );
	}

	cudaEventRecord( end, 0 );
	cudaEventSynchronize( end );
	float time = 0.0f;
	cudaEventElapsedTime( &time, start, end );
	time /= 100.0f; // average milliseconds per kernel launch

	cudaEventDestroy( start );
	cudaEventDestroy( end );
	cudaFree( in );
	cudaFree( out );

	// Each byte is read once and written once, hence the factor of 2;
	// 1073741824 bytes = 1 GiB.
	printf( " %lu-bit transfer bandwidth: %fGB/s\n",
		(unsigned long) ( sizeof( T ) * 8 ),
		( ( 2 * size * 1000.0f ) / 1073741824.0f ) / time );
}

int main( int argc, char** argv )
{
	const unsigned int size = 2 << 24;   // 32 MiB per buffer
	const unsigned int warps = 6;        // cover the 24-stage pipeline
	const unsigned int blocks = 64;
	const unsigned int threads = 32;     // active threads per warp

	int device;
	cudaGetDevice( &device );
	cudaDeviceProp prop;
	cudaGetDeviceProperties( &prop, device );
	printf( "For device: %s\n", prop.name );

	test< unsigned char, size, threads, warps, blocks >();
	test< unsigned short, size, threads, warps, blocks >();
	test< unsigned int, size, threads, warps, blocks >();
	test< long long unsigned int, size, threads, warps, blocks >();
	test< uint4, size, threads, warps, blocks >();

	return 0;
}
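In case anyone wants to reproduce this, a build line along these lines should work (the exact -arch flag is my assumption for GT200-class parts):

nvcc -arch=sm_13 -o coalescing coalescing.cu
./coalescing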

Here are the results for one active thread per warp, 6 warps per block, 64 blocks:

normal@phenom:~/temp/coalescing$ ./coalescing
For device: GeForce GTX 280
8-bit transfer bandwidth: 1.107381GB/s
16-bit transfer bandwidth: 3.196410GB/s
32-bit transfer bandwidth: 5.778513GB/s
64-bit transfer bandwidth: 7.753606GB/s
128-bit transfer bandwidth: 8.331792GB/s

normal@phenom:~/temp/coalescing$ ./coalescing
For device: Tesla C1060
8-bit transfer bandwidth: 1.036921GB/s
16-bit transfer bandwidth: 2.853252GB/s
32-bit transfer bandwidth: 5.173034GB/s
64-bit transfer bandwidth: 7.878699GB/s
128-bit transfer bandwidth: 9.468297GB/s

For 32 active threads per warp, 6 warps per block, 64 blocks:

normal@phenom:~/temp/coalescing$ ./coalescing
For device: Tesla C1060
8-bit transfer bandwidth: 13.369940GB/s
16-bit transfer bandwidth: 36.440418GB/s
32-bit transfer bandwidth: 48.894238GB/s
64-bit transfer bandwidth: 34.927246GB/s
128-bit transfer bandwidth: 34.616203GB/s

normal@phenom:~/temp/coalescing$ ./coalescing
For device: GeForce GTX 280
8-bit transfer bandwidth: 15.327665GB/s
16-bit transfer bandwidth: 44.629219GB/s
32-bit transfer bandwidth: 59.287403GB/s
64-bit transfer bandwidth: 46.184879GB/s
128-bit transfer bandwidth: 44.705101GB/s

For 32 active threads per warp, one warp per block, and 64 total blocks:

For device: GeForce GTX 280
8-bit transfer bandwidth: 7.316723GB/s
16-bit transfer bandwidth: 17.354494GB/s
32-bit transfer bandwidth: 30.239174GB/s
64-bit transfer bandwidth: 54.181236GB/s
128-bit transfer bandwidth: 76.944344GB/s

normal@phenom:~/temp/coalescing$ ./coalescing
For device: Tesla C1060
8-bit transfer bandwidth: 6.486255GB/s
16-bit transfer bandwidth: 15.308207GB/s
32-bit transfer bandwidth: 29.604433GB/s
64-bit transfer bandwidth: 50.725243GB/s
128-bit transfer bandwidth: 68.840340GB/s

For 32 active threads per warp, one warp per block, and 384 total blocks:

For device: GeForce GTX 280
8-bit transfer bandwidth: 14.599340GB/s
16-bit transfer bandwidth: 42.640224GB/s
32-bit transfer bandwidth: 75.512741GB/s
64-bit transfer bandwidth: 101.169762GB/s
128-bit transfer bandwidth: 85.896751GB/s

normal@phenom:~/temp/coalescing$ ./coalescing
For device: Tesla C1060
8-bit transfer bandwidth: 13.462626GB/s
16-bit transfer bandwidth: 39.852840GB/s
32-bit transfer bandwidth: 63.858753GB/s
64-bit transfer bandwidth: 74.491768GB/s
128-bit transfer bandwidth: 74.745811GB/s

(Note that this is running under Ocelot, so the results are slightly lower than with the NVIDIA toolchain, since it uses the device JIT compiler rather than the native instructions generated by NVCC; the slowdown should be uniform across all data types, though.)