SM_1.2+ Coalescing & gst_efficiency: What does it mean?

I’m not able to tell whether the kernels I write are coalesced or serialized. In an attempt to understand this, I’ve written a very simple kernel and profiled it with cudaprof:

Kernel Code:

#define TYPE float
#define RESULT 13.37
#define N (1<<22)   // numThreads/numElements ~= 4M

__global__ void simple(TYPE* d){
	int tid = blockDim.x*blockIdx.x + threadIdx.x;
	TYPE R = RESULT;

	if (tid < N)  d[tid] = R;
}
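The host side is not shown above; for reference, a minimal driver along these lines is enough to run the kernel (a sketch only - the choice of 256 threads per block here is an assumption, not taken from the profiler run):

// Minimal host driver (sketch). Any block size that is a multiple of the
// warp size behaves the same way for this experiment; 256 is just an assumed choice.
int main(void){
	TYPE* d = 0;
	cudaMalloc((void**)&d, N*sizeof(TYPE));

	int threads = 256;
	int blocks  = (N + threads - 1) / threads;
	simple<<<blocks, threads>>>(d);
	cudaThreadSynchronize();   // CUDA 2.x/3.0-era synchronization call

	cudaFree(d);
	return 0;
}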

Profiler Output:

cudaprof_simple.bmp (405 KB)

cudaprof_simple2.bmp (297 KB)

So, gst_efficiency = gst_request / SUM(gst transactions)
                   = 4384 / 26240 = 0.167073171  (for this particular example)
                   = 16.7% efficient??
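(Just to spell that arithmetic out - a trivial check. Only the request count and the transaction total come from the profiler output above; the per-size breakdown of the 26240 transactions is not shown in the screenshots, so only the sum is used here:)

#include <stdio.h>

int main(void){
	const double gst_request      = 4384.0;    // store requests (from cudaprof)
	const double gst_transactions = 26240.0;   // sum of 32/64/128-byte store transactions

	printf("gst_efficiency = %f\n", gst_request / gst_transactions);   // ~0.167
	return 0;
}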

This, to me, implies that the lower the gst_efficiency the better! However, that feels like it’d be a misnomer so I have my doubts.

So what’s the deal? Is this how one determines how coalesced one’s reads/writes are - with gst_efficiency? And what is the optimal gst_efficiency? Please feel free to refer me to documentation, but I’ve been unable to find the answer anywhere (very frustrating) - where do people get their information from?

Note: I’m using a TESLA C1060, CC1.3 on Ubuntu 9.04.

Update:

I wrote a new kernel with a similar data access pattern but a varying stride, so that I can compare the stride’s effect on gst_efficiency and instruction_throughput (which, I assume, directly reflect how coalesced the data accesses are).

Here is a snippet of the code along with some results:

#define TYPE float
#define N (1<<23)   // ~8M
#define RESULT ((TYPE)13.37)
#define WIDTH 6

/*
 * FLOAT (4 bytes)
 * width=1  --> 0.1670730 gst efficiency, 0.9125260 instruction throughput [6.70403 ms]
 * width=2  --> 0.1670730 gst efficiency, 0.2488550 instruction throughput [13.6427 ms]
 * width=3  --> 0.0835366 gst efficiency, 0.1401990 instruction throughput [20.9702 ms]
 * width=4  --> 0.0835366 gst efficiency, 0.0789088 instruction throughput [28.4861 ms]
 * width=5  --> 0.0556911 gst efficiency, 0.0605162 instruction throughput [36.5757 ms]
 * width=6  --> 0.0556911 gst efficiency, 0.0466050 instruction throughput [45.6414 ms]
 * width=10 --> 0.033415  gst efficiency, 0.0218390 instruction throughput [84.0510 ms]
 *
 * DOUBLE (8 bytes)
 * width=1  --> 0.1670730 gst efficiency, 0.4957300 instruction throughput [13.4160 ms]
 * width=2  --> 0.0835366 gst efficiency, 0.1355670 instruction throughput [26.6588 ms]
 * width=3  --> 0.0556911 gst efficiency, 0.0734652 instruction throughput [41.1372 ms]
 * width=4  --> 0.0417683 gst efficiency, 0.0423322 instruction throughput [56.3955 ms]
 * width=10 --> 0.016707  gst efficiency, 0.0163741 instruction throughput [154.675 ms]
 */

__global__ void naive(TYPE* d){
	int tid = blockDim.x*blockIdx.x + threadIdx.x;
	TYPE R = RESULT;

	if (tid < N){
		for (int w=0; w<WIDTH; ++w){
			d[tid*WIDTH+w] = R;
		}
	}
}
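(As an aside, purely as a sketch of the alternative layout and not part of the profiler runs above: if each thread strides by the total element count instead of packing its WIDTH writes together, consecutive threads write consecutive addresses in every loop iteration. Still N*WIDTH stores in total:)

// Sketch only: element (w*N + tid) instead of (tid*WIDTH + w), so for a
// fixed w the 16 threads of a half-warp write 16 consecutive words.
__global__ void interleaved(TYPE* d){
	int tid = blockDim.x*blockIdx.x + threadIdx.x;
	TYPE R = RESULT;

	if (tid < N){
		for (int w=0; w<WIDTH; ++w){
			d[w*N + tid] = R;   // stride-1 across the half-warp
		}
	}
}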

I think I’ve made some progress…

A closer inspection of the part of the kernel which writes to global memory:

d[tid*WIDTH+w] = R;

Let’s consider where each thread of a half-warp writes over the course of the loop:

t0:  elements 0 .. W-1          [thread 0 starts at address 0]
t1:  elements W .. 2W-1
...
t15: elements 15W .. 16W-1

tn:  elements nW .. nW+(W-1), so the half-warp spans 16W elements = 64W bytes (float), i.e. a 64W-byte segment.

Now, according to the Programming Guide (p. 91), on CC1.2+ coalescing will occur if all threads of a half-warp access a single segment of size 128 bytes for 4- and 8-byte words.
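(To sanity-check that reasoning, here is a small host-side sketch - written just for this post - that counts how many 128-byte segments the 16 store addresses of a single loop iteration fall into for a given width. It ignores the reduction of transactions to 64/32 bytes, to keep things simple:)

#include <stdio.h>

// Illustration only: in loop iteration w, thread n of a half-warp writes
// element n*WIDTH + w. Count the distinct 128-byte segments those 16 addresses touch.
static int segments_touched(int width, int wordsize, int w){
	int used[64] = {0}, count = 0;
	for (int n = 0; n < 16; ++n){
		int seg = ((n*width + w) * wordsize) / 128;
		if (!used[seg]){ used[seg] = 1; ++count; }
	}
	return count;
}

int main(void){
	for (int width = 1; width <= 6; ++width)
		printf("float, width=%d: %d 128-byte segment(s) per half-warp store\n",
		       width, segments_touched(width, 4, 0));
	return 0;
}

This gives 1 segment for width=1 and 2, 2 segments for width=3 and 4, 3 segments for width=5 and 6, and 5 segments for width=10 - which appears to line up with the table above, where gst_efficiency scales as roughly 0.167 divided by the segment count. The same pattern falls out of the double results if wordsize is set to 8.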

So, if I’m correct, for W=1 and W=2 the writes should coalesce. With that in mind, let’s examine 2 of the results I previously obtained:

  • width=1 → 0.1670730 gst efficiency, 0.9125260 instruction throughput [6.70403 ms]

  • width=2 → 0.1670730 gst efficiency, 0.2488550 instruction throughput [13.6427 ms]

According to my previous calculations I’d have expected the gst_efficiency to remain the same but I was NOT expecting the throughput to drop - especially not to ~1/3 throughput!

What is going on? Can anyone enlighten me?

The only case that can coalesce is width=1. The coalescing rules require the words loaded or stored by a half-warp of threads to be contained within a contiguous, half-warp-sized piece of memory, aligned to the half-warp size times the word size. You are using floats, so 16 threads have to write into the same 64-byte piece of memory, and that piece of memory has to be aligned to a 64-byte boundary. Anything else will require additional memory transactions to service the load/store request; how many depends on the compute capability of the card and on how the load/store request deviates from the coalescing rules.

Well, for W=1 your writes will be to 1, 2, 3,… 16, i.e. coalesced but not perfect because the first address is not a multiple of 16. For W=2 you’re getting 2, 4, 6,… 32. That is strided access and not coalesced.

Ok, but:

  1. What does the gst_efficiency tell us?

  2. If my reads/writes, with floats, do not fit within a 64-byte aligned segment, then how do I know how much slower the reads/writes are? (e.g. 1/2 speed, 1/4 speed, etc.)

It tells you the ratio of requests to transactions. Perfectly coalesced writes should have an efficiency of 100%. Fully serialized writes should have an efficiency of 1/16.

The short answer is you don’t. It will be slower, but how much can only be determined by benchmarking.
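(For the benchmarking, the usual CUDA event timing pattern is enough - a sketch only; blocks, threads and d are assumed to be set up as in the earlier posts:)

// Sketch: timing one launch of the kernel with CUDA events.
cudaEvent_t start, stop;
float ms = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
naive<<<blocks, threads>>>(d);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

cudaEventElapsedTime(&ms, start, stop);
printf("kernel took %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);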

Great! But in your previous post you said that width=1 is the only case that can coalesce - and yet when width==1 I get gst_efficiency ≈ 0.16, not 1.0.

How do you explain that?

The emphasis is on the word “can”. WIDTH=1 is the only possible case where the writes have a stride of one, and hence the only case where coalescing could be possible.

Thanks again avidday!

I see - you were speaking in the general sense…

Well, I’m glad that the profiler can conclusively report the coalescing percentage; however, I’m mystified as to why my particular example, with width=1, is not 100% coalesced. My reasoning:

Kernel:

// FLOAT
for (int w=0; w<WIDTH; ++w){
	d[tid*WIDTH+w] = R;
}

So, with width=1, thread 0 writes at d[0] and thread 15 writes at d[15]; the half-warp therefore spans 16*4 = 64 bytes.

What am I missing?

/bump

I’m dying to find out the answer to this, could somebody please help me out?

I’m STILL stuck on this… I’m really eager to get to the bottom of this whole coalescing business but am unable to. I’ve read all the available documentation (and re-read it several times) and have scoured the forums and the internet in search of a definitive answer or a clear explanation. I know some of you have already tried to help me with this, but it seems we never got to the bottom of it.

I’ve now upgraded to CUDA3.0 and here are the results of one of my applications using the profiler:
[profiler screenshot]

It appears there’s no longer a gst_efficiency statistic? Is this coalesced or not? Why?

Update: CUDA2.3 vs CUDA3.0 timing and cudaprof profiling

/*
 * ---------------------------
 * CUDA 2.3 (cudaprof) RESULTS
 * ---------------------------
 *
 * FLOAT (4 bytes)
 * width=1  --> 0.1670730 gst efficiency, 0.9125260 instruction throughput [6.70403 ms]
 * width=2  --> 0.1670730 gst efficiency, 0.2488550 instruction throughput [13.6427 ms]
 * width=3  --> 0.0835366 gst efficiency, 0.1401990 instruction throughput [20.9702 ms]
 * width=4  --> 0.0835366 gst efficiency, 0.0789088 instruction throughput [28.4861 ms]
 * width=5  --> 0.0556911 gst efficiency, 0.0605162 instruction throughput [36.5757 ms]
 * width=6  --> 0.0556911 gst efficiency, 0.0466050 instruction throughput [45.6414 ms]
 * width=10 --> 0.033415  gst efficiency, 0.0218390 instruction throughput [84.0510 ms]
 *
 * DOUBLE (8 bytes)
 * width=1  --> 0.1670730 gst efficiency, 0.4957300 instruction throughput [13.4160 ms]
 * width=2  --> 0.0835366 gst efficiency, 0.1355670 instruction throughput [26.6588 ms]
 * width=3  --> 0.0556911 gst efficiency, 0.0734652 instruction throughput [41.1372 ms]
 * width=4  --> 0.0417683 gst efficiency, 0.0423322 instruction throughput [56.3955 ms]
 * width=10 --> 0.016707  gst efficiency, 0.0163741 instruction throughput [154.675 ms]
 */

/*
 * ---------------------------
 * CUDA 3.0 (cudaprof) RESULTS
 * ---------------------------
 *
 * FLOAT (4B, 64-bit)
 * width=1  --> Time taken: 0.225888 milliseconds. [9284.03 Million Writes/Second] | 100% gst_64b | 0.518179 instruction_throughput
 * width=2  --> Time taken: 0.652352 milliseconds. [6429.51 Million Writes/Second] | 90.5% gst_64b, rest gst_128b
 * width=3  --> Time taken: 1.23091 milliseconds.  [5111.22 Million Writes/Second] | 93.3% gst_64b, rest gst_128b
 * width=4  --> Time taken: 2.02573 milliseconds.  [4141.03 Million Writes/Second] | 90.0% gst_64b, rest gst_128b
 * width=5  --> Time taken: 3.13894 milliseconds.  [3340.54 Million Writes/Second] | 91.6% gst_64b, rest gst_128b
 * width=6  --> Time taken: 4.41062 milliseconds.  [2852.86 Million Writes/Second] | 89.7% gst_64b, rest gst_128b
 * width=15 --> Time taken: 24.286 milliseconds.   [1295.29 Million Writes/Second] | 88.4% gst_64b, 10% gst_128b, rest gst_32b | 0.0343001 instruction_throughput
 */

#define N (1<<21)   // ~2M

__global__ void naive(TYPE* d){
	int tid = blockDim.x*blockIdx.x + threadIdx.x;
	TYPE R = RESULT;

	if (tid < N){
		for (int w=0; w<WIDTH; ++w){
			d[tid*WIDTH+w] = R;
			// t0:  elements 0 .. W-1
			// t1:  elements W .. 2W-1
			// t15: elements 15W .. 16W-1
			// tn:  elements nW .. nW+(W-1) => 16W elements = 64W-byte segment (float)
			// According to the Programming Guide (p. 91), coalescing will occur if all
			// threads access a segment of size 128 bytes for 4/8-byte words, so
			// coalescing should occur for width=1 or width=2?
		}
	}
}
  1. How do you know when something is perfectly coalesced? And, if it isn’t, how close is it (somewhere between 1 and 1/16)?

  2. If we compare, for example, width=1 with width=15, can we just divide the instruction throughputs to get a relative performance ratio:

    (0.518179 instruction throughput) / (0.0343001 instruction throughput) ≈ 15.1

    thus, w=1 is ~15x faster than w=15?? (see the quick arithmetic check below)
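(A quick arithmetic check on the numbers quoted in the CUDA 3.0 table above. It does not settle which metric is the right basis for a comparison; it just computes the candidate ratios:)

#include <stdio.h>

// Nothing but arithmetic on the width=1 and width=15 rows quoted above.
int main(void){
	double thr_w1  = 0.518179,  thr_w15  = 0.0343001;   // instruction throughput
	double mwps_w1 = 9284.03,   mwps_w15 = 1295.29;     // million writes per second
	double ms_w1   = 0.225888,  ms_w15   = 24.286;      // elapsed time in ms

	printf("instruction throughput ratio: %.1f\n", thr_w1 / thr_w15);    // ~15.1
	printf("writes-per-second ratio:      %.1f\n", mwps_w1 / mwps_w15);  // ~7.2
	printf("elapsed time ratio:           %.1f\n", ms_w15 / ms_w1);      // ~107 (but width=15 issues 15x the writes)
	return 0;
}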