Why are 256-byte loads slower than 128-byte loads?

I know the CUDA programming guide says 256-byte loads are 2x slower than 128-byte loads for compute capability 1.0. I've measured it on a Tesla C1060, and I get 40GiB/s for coalesced 256-byte loads and 68.5GiB/s for coalesced 128-byte loads.

Why the difference? From my understanding, each memory transaction maps to a single memory bank (the global address space is divided into 256-byte partitions, matching the maximum memory transaction size?). In that case, wouldn't 256-byte transactions be more efficient than 128-byte ones - just increase the DRAM burst-mode access size?
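For what it's worth, here is a minimal sketch of the partition model I have in mind. The 256-byte interleave granularity and the count of 8 partitions are my assumptions for GT200-class parts, not documented values:

#include <stdio.h>

// Hypothetical: map a global byte address to a memory partition, assuming
// round-robin interleaving of 256-byte chunks across 8 partitions (both
// numbers are assumptions, not documented values).
unsigned int partitionOf( unsigned long long address )
{
	const unsigned int partitionSize = 256; // assumed interleave granularity
	const unsigned int numPartitions = 8;   // assumed partition count
	return (unsigned int)( ( address / partitionSize ) % numPartitions );
}

int main()
{
	// Under this model, an aligned 256-byte transaction stays within one
	// partition, and its two 128-byte halves hit that same partition, so
	// a single larger burst would seem at least as efficient.
	for( unsigned long long a = 0; a < 2048; a += 256 )
		printf( "address %4llu -> partition %u\n", a, partitionOf( a ) );
	return 0;
}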

I really want to use 256-byte loads because I want each thread to process as large a block as possible at the finest grid level, to reduce the merge costs. Even given the penalty, it probably is still worth it.

This is more of an educated guess, but coalescing requirements apply per half-warp (16 threads). If we assume that the hardware only supports 64-bit ld/st operations (anything bigger requires multiple operations), then 16*64/8 = 64 bytes per half-warp is the maximum transaction size supported by hardware (or equivalently, 128 bytes for 32 threads). If each thread tries to do a 128-bit ld/st, the hardware (or compiler) may need to translate this into two consecutive 64-bit operations. This would result in interleaved/uncoalesced memory accesses.

Again, this is just a guess, as I do not know how the hardware actually works. You could test this with the visual profiler by counting the number of uncoalesced ld/st operations when using 256-byte transactions. Sylvain might have a better idea.

16 * 64 / 8 == 128 and not 64…

am I missing something here ??? :">

Sadly, one would think that I could do basic math. Evidently that isn’t the case.

I went back and read the programming guide again. Based on the relevant passages on coalescing, I think that my original post still might be right. Let me try to phrase it correctly:

A coalesced load is issued per half-warp (16 threads). If we assume that the hardware only supports 32-bit ld/st operations (anything bigger requires multiple operations), then 16*32/8 = 64 bytes per half-warp is the maximum transaction size supported by hardware (or equivalently, 128 bytes for 32 threads). If each thread tries to do a 64-bit ld/st, the hardware (or compiler) may need to translate this into two consecutive 32-bit operations. This would result in interleaved/uncoalesced memory accesses.
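To make the hypothesis concrete, here is a purely illustrative kernel (hypothetical behavior, not how the hardware is documented to work) showing what a per-thread 64-bit load would look like if it were decomposed into two 32-bit loads:

// Hypothetical illustration only: if the hardware split each 64-bit load
// into two 32-bit loads, the half-warp's accesses would look like this.
__global__ void split64( unsigned int* out, const unsigned int* in )
{
	unsigned int t = threadIdx.x;

	// Logical operation: one contiguous 64-bit load per thread, i.e.
	//   unsigned long long v = ( (const unsigned long long*) in )[t];
	// Hypothetical decomposition into two strided 32-bit loads:
	unsigned int lo = in[2 * t];     // threads hit words 0, 2, 4, ... (stride 2)
	unsigned int hi = in[2 * t + 1]; // threads hit words 1, 3, 5, ... (stride 2)

	out[2 * t]     = lo;
	out[2 * t + 1] = hi;
}

Each 32-bit sub-access has a stride of two words across the half-warp, which breaks the contiguity requirement for coalescing on compute capability 1.0.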

I am still leaning towards this explanation because the uncoalesced penalty decreases as the size of an individual operation increases.

OK, so you're suggesting that a contiguous 8-byte load by 16 threads will take four 64-byte transactions instead of two:

transaction 0: read lower 4 bytes for threads 0 1 2 3 4 5 6 7

transaction 1: read lower 4 bytes for threads 8 9 10 11 12 13 14 15

transaction 2: read upper 4 bytes for threads 0 1 2 3 4 5 6 7

transaction 3: read upper 4 bytes for threads 8 9 10 11 12 13 14 15

Obviously two of the transactions are redundant, so I think NVIDIA would have made the memory controller smart enough not to issue them. GT200 has better coalescing than G80, so why would they not implement this optimization?

The access pattern is completely contiguous - one 256-byte transaction should be convertible into multiple 128-byte transactions. Any reduction in bandwidth must therefore come from an inefficiency in either the memory controller or the memory interface of an MP.

I offered a possible explanation, but without knowing exactly what the hardware is doing it is hard to give a satisfactory answer. One way to test my theory would be to have a single thread per warp issue 32-, 64-, and 128-bit memory operations. If my theory is correct, the 128-bit operations should take longer because they issue multiple transactions.

Alright, so it looks like I was wrong.

I put together a simple test to try to determine the effect of transaction size on memory transfer bandwidth. It turns out that when only one thread per warp is active, memory bandwidth increases with larger data types, which disproves my previous assertion. With 32 active threads, however, bandwidth tapers off after the 32-bit data type at ~60GB/s, as Uncle Joe noticed.

Now I have looked into this a bit more in depth, because I know that I have hit over 100GB/s with a previous implementation of memcpy on a GTX 280. Moving from 6 warps per block to 1 warp per block increases the upper bound to 77GB/s on the GTX 280. What is interesting is that in this case, 128-bit elements (or 512-byte transactions) achieve the highest bandwidth. Increasing the total number of blocks from 64 to 384 increases the upper bound again to ~100GB/s, but in that case, 64-bit elements (256-byte transactions) are the fastest. This makes me think that all transaction sizes are equally efficient from the SM's perspective, and that the bandwidth depends more on the interleaved memory access pattern generated by the combination of all active blocks on all SMs. This paper discusses a similar issue: http://www.ece.ubc.ca/~aamodt/papers/gyuan.micro2009.pdf
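For reference, the alternate configurations discussed above correspond to editing the constants in main() of the code below, along these lines:

// One active thread per warp, 6 warps per block, 64 blocks:
//   const unsigned int threads = 1;
// One warp per block, 64 total blocks (the 77GB/s case):
//   const unsigned int warps = 1;
// One warp per block, 384 total blocks (the ~100GB/s case):
//   const unsigned int warps = 1;
//   const unsigned int blocks = 384;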

#include <stdio.h>

template< typename T, unsigned int size, unsigned int threads, unsigned int warps, unsigned int blocks >
__global__ void testCoalescing( T* out, const T* in )
{
	if( ( threadIdx.x & 0x1F ) >= threads ) return; // bail out inactive threads

	// Index of this active thread within the block: warp id times active
	// threads per warp, plus the lane id within the warp.
	unsigned int tid = ( threadIdx.x >> 5 ) * threads + ( threadIdx.x & 0x1F );
	unsigned int bid = blockIdx.x;
	unsigned int gid = tid + bid * warps * threads;

	const unsigned int totalThreads = threads * warps * blocks;
	const unsigned int iterations = size / totalThreads;

	unsigned int index = gid;

	#pragma unroll 32
	for( unsigned int i = 0; i < iterations; ++i, index += totalThreads )
	{
		out[index] = in[index];
	}

	// Handle the remainder when size is not a multiple of totalThreads.
	for( ; index < size; index += totalThreads )
	{
		out[index] = in[index];
	}
}

template< typename T, unsigned int size, unsigned int threads, unsigned int warps, unsigned int blocks >
void test()
{
	T* in;
	T* out;

	cudaMalloc( (void**) &in, size );
	cudaMalloc( (void**) &out, size );

	cudaEvent_t start;
	cudaEvent_t end;
	cudaEventCreate( &start );
	cudaEventCreate( &end );
	cudaEventRecord( start, 0 );

	for( unsigned int i = 0; i < 100; ++i )
	{
		testCoalescing< T, size / sizeof( T ), threads,
			warps, blocks ><<< blocks, warps * 32 >>>( out, in );
	}

	cudaEventRecord( end, 0 );
	cudaEventSynchronize( end );
	float time = 0.0f;
	cudaEventElapsedTime( &time, start, end );
	time /= 100.0f; // average milliseconds per kernel launch

	cudaEventDestroy( start );
	cudaEventDestroy( end );
	cudaFree( in );
	cudaFree( out );

	// Each byte is read once and written once, hence the factor of 2;
	// 1073741824 bytes = 1 GiB.
	printf( " %lu-bit transfer bandwidth: %fGB/s\n",
		(unsigned long) ( sizeof( T ) * 8 ),
		( ( 2 * size * 1000.0f ) / 1073741824.0f ) / time );
}

int main( int argc, char** argv )
{
	const unsigned int size = 2 << 24;   // 32 MiB per buffer
	const unsigned int warps = 6;        // cover the 24-stage pipeline
	const unsigned int blocks = 64;
	const unsigned int threads = 32;     // active threads per warp

	int device;
	cudaGetDevice( &device );
	cudaDeviceProp prop;
	cudaGetDeviceProperties( &prop, device );
	printf( "For device: %s\n", prop.name );

	test< unsigned char, size, threads, warps, blocks >();
	test< unsigned short, size, threads, warps, blocks >();
	test< unsigned int, size, threads, warps, blocks >();
	test< long long unsigned int, size, threads, warps, blocks >();
	test< uint4, size, threads, warps, blocks >();

	return 0;
}
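In case anyone wants to reproduce this, a build line along these lines should work (the exact -arch flag is my assumption for GT200-class parts):

nvcc -arch=sm_13 -o coalescing coalescing.cu
./coalescing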

Here are the results for one active thread per warp, 6 warps per block, 64 blocks:

normal@phenom:~/temp/coalescing$ ./coalescing
For device: GeForce GTX 280
8-bit transfer bandwidth: 1.107381GB/s
16-bit transfer bandwidth: 3.196410GB/s
32-bit transfer bandwidth: 5.778513GB/s
64-bit transfer bandwidth: 7.753606GB/s
128-bit transfer bandwidth: 8.331792GB/s

normal@phenom:~/temp/coalescing$ ./coalescing
For device: Tesla C1060
8-bit transfer bandwidth: 1.036921GB/s
16-bit transfer bandwidth: 2.853252GB/s
32-bit transfer bandwidth: 5.173034GB/s
64-bit transfer bandwidth: 7.878699GB/s
128-bit transfer bandwidth: 9.468297GB/s

For 32 active threads per warp, 6 warps per block, 64 blocks:

normal@phenom:~/temp/coalescing$ ./coalescing
For device: Tesla C1060
8-bit transfer bandwidth: 13.369940GB/s
16-bit transfer bandwidth: 36.440418GB/s
32-bit transfer bandwidth: 48.894238GB/s
64-bit transfer bandwidth: 34.927246GB/s
128-bit transfer bandwidth: 34.616203GB/s

normal@phenom:~/temp/coalescing$ ./coalescing
For device: GeForce GTX 280
8-bit transfer bandwidth: 15.327665GB/s
16-bit transfer bandwidth: 44.629219GB/s
32-bit transfer bandwidth: 59.287403GB/s
64-bit transfer bandwidth: 46.184879GB/s
128-bit transfer bandwidth: 44.705101GB/s

For 32 active threads per warp, one warp per block, and 64 total blocks:

For device: GeForce GTX 280
8-bit transfer bandwidth: 7.316723GB/s
16-bit transfer bandwidth: 17.354494GB/s
32-bit transfer bandwidth: 30.239174GB/s
64-bit transfer bandwidth: 54.181236GB/s
128-bit transfer bandwidth: 76.944344GB/s

normal@phenom:~/temp/coalescing$ ./coalescing
For device: Tesla C1060
8-bit transfer bandwidth: 6.486255GB/s
16-bit transfer bandwidth: 15.308207GB/s
32-bit transfer bandwidth: 29.604433GB/s
64-bit transfer bandwidth: 50.725243GB/s
128-bit transfer bandwidth: 68.840340GB/s

For 32 active threads per warp, one warp per block, and 384 total blocks:

For device: GeForce GTX 280
8-bit transfer bandwidth: 14.599340GB/s
16-bit transfer bandwidth: 42.640224GB/s
32-bit transfer bandwidth: 75.512741GB/s
64-bit transfer bandwidth: 101.169762GB/s
128-bit transfer bandwidth: 85.896751GB/s

normal@phenom:~/temp/coalescing$ ./coalescing
For device: Tesla C1060
8-bit transfer bandwidth: 13.462626GB/s
16-bit transfer bandwidth: 39.852840GB/s
32-bit transfer bandwidth: 63.858753GB/s
64-bit transfer bandwidth: 74.491768GB/s
128-bit transfer bandwidth: 74.745811GB/s

(Note that this is running under Ocelot, so the results are slightly lower than with the NVIDIA toolchain, since it uses the device JIT compiler rather than the native instructions generated by NVCC; the slowdown should be uniform across all data types, though.)