SM_1.2+ Coalescing & gst_efficiency: What does it mean?

I’m not able to tell whether the kernels I write are coalesced or serialized. In an attempt to understand this, I’ve written a very simple kernel and profiled it with cudaprof:

Kernel Code:

#define TYPE float
#define RESULT 13.37
#define N (1<<22)   // numThreads/numElements ~= 4M

__global__ void simple(TYPE* d){
	int tid = blockDim.x*blockIdx.x + threadIdx.x;
	TYPE R = RESULT;

	if (tid < N)  d[tid] = R;
}
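The host side is not shown above; for reference, a minimal driver along these lines is enough to run the kernel (a sketch only - the choice of 256 threads per block here is an assumption, not taken from the profiler run):

// Minimal host driver (sketch). Any block size that is a multiple of the
// warp size behaves the same way for this experiment; 256 is just an assumed choice.
int main(void){
	TYPE* d = 0;
	cudaMalloc((void**)&d, N*sizeof(TYPE));

	int threads = 256;
	int blocks  = (N + threads - 1) / threads;
	simple<<<blocks, threads>>>(d);
	cudaThreadSynchronize();   // CUDA 2.x/3.0-era synchronization call

	cudaFree(d);
	return 0;
}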

Profiler Output:

cudaprof_simple.bmp (405 KB)

cudaprof_simple2.bmp (297 KB)

So, gst_efficiency = gst_request / SUM(gst transactions)
                   = 4384 / 26240 = 0.167073171  (for this particular example)
                   = 16.7% efficient??
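(Just to spell that arithmetic out - a trivial check. Only the request count and the transaction total come from the profiler output above; the per-size breakdown of the 26240 transactions is not shown in the screenshots, so only the sum is used here:)

#include <stdio.h>

int main(void){
	const double gst_request      = 4384.0;    // store requests (from cudaprof)
	const double gst_transactions = 26240.0;   // sum of 32/64/128-byte store transactions

	printf("gst_efficiency = %f\n", gst_request / gst_transactions);   // ~0.167
	return 0;
}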

This, to me, implies that the lower the gst_efficiency the better! However, that feels like it’d be a misnomer so I have my doubts.

So what’s the deal? Is this how one determines how coalesced one’s reads/writes are - with gst_efficiency? And what is the optimal gst_efficiency? Please feel free to refer me to documentation, but I’ve been unable to find the answer anywhere (very frustrating) - where do people get their information from?

Note: I’m using a TESLA C1060, CC1.3 on Ubuntu 9.04.

Update:

I wrote a new kernel with a similar data access pattern but a varying stride, so that I can compare the stride’s effect on gst_efficiency and instruction_throughput (which, I assume, directly reflect how coalesced the data accesses are).

Here is a snippet of the code along with some results:

#define TYPE float
#define N (1<<23)   // ~8M
#define RESULT ((TYPE)13.37)
#define WIDTH 6

/*
 * FLOAT (4 bytes)
 * width=1  --> 0.1670730 gst efficiency, 0.9125260 instruction throughput [6.70403 ms]
 * width=2  --> 0.1670730 gst efficiency, 0.2488550 instruction throughput [13.6427 ms]
 * width=3  --> 0.0835366 gst efficiency, 0.1401990 instruction throughput [20.9702 ms]
 * width=4  --> 0.0835366 gst efficiency, 0.0789088 instruction throughput [28.4861 ms]
 * width=5  --> 0.0556911 gst efficiency, 0.0605162 instruction throughput [36.5757 ms]
 * width=6  --> 0.0556911 gst efficiency, 0.0466050 instruction throughput [45.6414 ms]
 * width=10 --> 0.033415  gst efficiency, 0.0218390 instruction throughput [84.0510 ms]
 *
 * DOUBLE (8 bytes)
 * width=1  --> 0.1670730 gst efficiency, 0.4957300 instruction throughput [13.4160 ms]
 * width=2  --> 0.0835366 gst efficiency, 0.1355670 instruction throughput [26.6588 ms]
 * width=3  --> 0.0556911 gst efficiency, 0.0734652 instruction throughput [41.1372 ms]
 * width=4  --> 0.0417683 gst efficiency, 0.0423322 instruction throughput [56.3955 ms]
 * width=10 --> 0.016707  gst efficiency, 0.0163741 instruction throughput [154.675 ms]
 */

__global__ void naive(TYPE* d){
	int tid = blockDim.x*blockIdx.x + threadIdx.x;
	TYPE R = RESULT;

	if (tid < N){
		for (int w=0; w<WIDTH; ++w){
			d[tid*WIDTH+w] = R;
		}
	}
}
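(As an aside, purely as a sketch of the alternative layout and not part of the profiler runs above: if each thread strides by the total element count instead of packing its WIDTH writes together, consecutive threads write consecutive addresses in every loop iteration. Still N*WIDTH stores in total:)

// Sketch only: element (w*N + tid) instead of (tid*WIDTH + w), so for a
// fixed w the 16 threads of a half-warp write 16 consecutive words.
__global__ void interleaved(TYPE* d){
	int tid = blockDim.x*blockIdx.x + threadIdx.x;
	TYPE R = RESULT;

	if (tid < N){
		for (int w=0; w<WIDTH; ++w){
			d[w*N + tid] = R;   // stride-1 across the half-warp
		}
	}
}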

I think I’ve made some progress…

A closer inspection of the part of the kernel which writes to global memory:

d[tid*WIDTH+w] = R;

Let’s consider where each thread of a half-warp writes over the course of the loop:

t0:  elements 0 .. W-1          [thread 0 starts at address 0]
t1:  elements W .. 2W-1
...
t15: elements 15W .. 16W-1

tn:  elements nW .. nW+(W-1), so the half-warp spans 16W elements = 64W bytes (float), i.e. a 64W-byte segment.

Now, according to the Programming Guide (p. 91), on CC1.2+ coalescing will occur if all threads of a half-warp access a single segment of size 128 bytes for 4- and 8-byte words.
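(To sanity-check that reasoning, here is a small host-side sketch - written just for this post - that counts how many 128-byte segments the 16 store addresses of a single loop iteration fall into for a given width. It ignores the reduction of transactions to 64/32 bytes, to keep things simple:)

#include <stdio.h>

// Illustration only: in loop iteration w, thread n of a half-warp writes
// element n*WIDTH + w. Count the distinct 128-byte segments those 16 addresses touch.
static int segments_touched(int width, int wordsize, int w){
	int used[64] = {0}, count = 0;
	for (int n = 0; n < 16; ++n){
		int seg = ((n*width + w) * wordsize) / 128;
		if (!used[seg]){ used[seg] = 1; ++count; }
	}
	return count;
}

int main(void){
	for (int width = 1; width <= 6; ++width)
		printf("float, width=%d: %d 128-byte segment(s) per half-warp store\n",
		       width, segments_touched(width, 4, 0));
	return 0;
}

This gives 1 segment for width=1 and 2, 2 segments for width=3 and 4, 3 segments for width=5 and 6, and 5 segments for width=10 - which appears to line up with the table above, where gst_efficiency scales as roughly 0.167 divided by the segment count. The same pattern falls out of the double results if wordsize is set to 8.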

So, if I’m correct, for W=1 and W=2 the writes should coalesce. With that in mind, let’s examine 2 of the results I previously obtained:

  • width=1 → 0.1670730 gst efficiency, 0.9125260 instruction throughput [6.70403 ms]

  • width=2 → 0.1670730 gst efficiency, 0.2488550 instruction throughput [13.6427 ms]

According to my previous calculations I’d have expected the gst_efficiency to remain the same but I was NOT expecting the throughput to drop - especially not to ~1/3 throughput!

What is going on? Can anyone enlighten me?

The only case that can coalesce is width=1. The coalescing rules require the words loaded or stored by a half-warp of threads to be contained within a contiguous, half-warp-sized piece of memory, aligned to the half-warp size times the word size. You are using floats, so 16 threads have to write into the same 64-byte piece of memory, and that piece of memory has to be aligned to a 64-byte boundary. Anything else will require additional memory transactions to service the load/store request; how many depends on the compute capability of the card and on how the load/store request deviates from the coalescing rules.

Well, for W=1 your writes will be to 1, 2, 3,… 16, i.e. coalesced but not perfect because the first address is not a multiple of 16. For W=2 you’re getting 2, 4, 6,… 32. That is strided access and not coalesced.

Ok, but:

  1. What does the gst_efficiency tell us?

  2. If my reads/writes, with floats, do not fit within a 64-byte aligned segment, then how do I know how much slower the reads/writes are? (e.g. 1/2 speed, 1/4 speed, etc.)

It tells you the ratio of requests to transactions. Perfectly coalesced writes should have an efficiency of 100%. Fully serialized writes should have an efficiency of 1/16.

The short answer is you don’t. It will be slower, but how much can only be determined by benchmarking.
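(For the benchmarking, the usual CUDA event timing pattern is enough - a sketch only; blocks, threads and d are assumed to be set up as in the earlier posts:)

// Sketch: timing one launch of the kernel with CUDA events.
cudaEvent_t start, stop;
float ms = 0.0f;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
naive<<<blocks, threads>>>(d);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

cudaEventElapsedTime(&ms, start, stop);
printf("kernel took %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);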

Great! But in your previous post you said that width=1 is the only case that can coalesce - and yet when width==1 I get gst_efficiency ≈ 0.16, not 1.0.

How do you explain that?

The emphasis is on the word “can”. WIDTH=1 is the only possible case where the writes have a stride of one, and hence the only case where coalescing could be possible.

Thanks again avidday!

I see - you were speaking in the general sense…

Well, I’m glad that the profiler can conclusively report the coalescing percentage; however, I’m mystified as to why my particular example, with width=1, is not 100% coalesced. My reasoning:

Kernel:

// FLOAT
for (int w=0; w<WIDTH; ++w){
	d[tid*WIDTH+w] = R;
}

So, with width=1, thread 0 writes at d[0] and thread 15 writes at d[15]; the half-warp therefore spans 16*4 = 64 bytes.

What am I missing?

/bump

I’m dying to find out the answer to this, could somebody please help me out?

I’m STILL stuck on this… I’m really eager to get to the bottom of this whole coalescing business but am unable to. I’ve read all the available documentation (and re-read it several times) and have scoured the forums and the internet in search of a definitive answer or a clear explanation. I know some of you have already tried to help me with this, but it seems we never got to the bottom of it.

I’ve now upgraded to CUDA3.0 and here are the results of one of my applications using the profiler:
[profiler screenshot]

It appears there’s no longer a gst_efficiency statistic? Is this coalesced or not? Why?

Update: CUDA2.3 vs CUDA3.0 timing and cudaprof profiling

/*
 * ---------------------------
 * CUDA 2.3 (cudaprof) RESULTS
 * ---------------------------
 *
 * FLOAT (4 bytes)
 * width=1  --> 0.1670730 gst efficiency, 0.9125260 instruction throughput [6.70403 ms]
 * width=2  --> 0.1670730 gst efficiency, 0.2488550 instruction throughput [13.6427 ms]
 * width=3  --> 0.0835366 gst efficiency, 0.1401990 instruction throughput [20.9702 ms]
 * width=4  --> 0.0835366 gst efficiency, 0.0789088 instruction throughput [28.4861 ms]
 * width=5  --> 0.0556911 gst efficiency, 0.0605162 instruction throughput [36.5757 ms]
 * width=6  --> 0.0556911 gst efficiency, 0.0466050 instruction throughput [45.6414 ms]
 * width=10 --> 0.033415  gst efficiency, 0.0218390 instruction throughput [84.0510 ms]
 *
 * DOUBLE (8 bytes)
 * width=1  --> 0.1670730 gst efficiency, 0.4957300 instruction throughput [13.4160 ms]
 * width=2  --> 0.0835366 gst efficiency, 0.1355670 instruction throughput [26.6588 ms]
 * width=3  --> 0.0556911 gst efficiency, 0.0734652 instruction throughput [41.1372 ms]
 * width=4  --> 0.0417683 gst efficiency, 0.0423322 instruction throughput [56.3955 ms]
 * width=10 --> 0.016707  gst efficiency, 0.0163741 instruction throughput [154.675 ms]
 */

/*
 * ---------------------------
 * CUDA 3.0 (cudaprof) RESULTS
 * ---------------------------
 *
 * FLOAT (4B, 64-bit)
 * width=1  --> Time taken: 0.225888 milliseconds. [9284.03 Million Writes/Second] | 100% gst_64b | 0.518179 instruction_throughput
 * width=2  --> Time taken: 0.652352 milliseconds. [6429.51 Million Writes/Second] | 90.5% gst_64b, rest gst_128b
 * width=3  --> Time taken: 1.23091 milliseconds.  [5111.22 Million Writes/Second] | 93.3% gst_64b, rest gst_128b
 * width=4  --> Time taken: 2.02573 milliseconds.  [4141.03 Million Writes/Second] | 90.0% gst_64b, rest gst_128b
 * width=5  --> Time taken: 3.13894 milliseconds.  [3340.54 Million Writes/Second] | 91.6% gst_64b, rest gst_128b
 * width=6  --> Time taken: 4.41062 milliseconds.  [2852.86 Million Writes/Second] | 89.7% gst_64b, rest gst_128b
 * width=15 --> Time taken: 24.286 milliseconds.   [1295.29 Million Writes/Second] | 88.4% gst_64b, 10% gst_128b, rest gst_32b | 0.0343001 instruction_throughput
 */

#define N (1<<21)   // ~2M

__global__ void naive(TYPE* d){
	int tid = blockDim.x*blockIdx.x + threadIdx.x;
	TYPE R = RESULT;

	if (tid < N){
		for (int w=0; w<WIDTH; ++w){
			d[tid*WIDTH+w] = R;
			// t0:  elements 0 .. W-1
			// t1:  elements W .. 2W-1
			// t15: elements 15W .. 16W-1
			// tn:  elements nW .. nW+(W-1) => 16W elements = 64W-byte segment (float)
			// According to the Programming Guide (p. 91), coalescing will occur if all
			// threads access a segment of size 128 bytes for 4/8-byte words, so
			// coalescing should occur for width=1 or width=2?
		}
	}
}
  1. How do you know when something is perfectly coalesced? And, if it isn’t, how close is it (somewhere between 1 and 1/16)?

  2. If we compare, for example, width=1 with width=15, can we just divide the instruction throughputs to get a relative performance ratio:

    (0.518179 instruction throughput) / (0.0343001 instruction throughput) ≈ 15.1

    thus, w=1 is ~15x faster than w=15?? (see the quick arithmetic check below)
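(A quick arithmetic check on the numbers quoted in the CUDA 3.0 table above. It does not settle which metric is the right basis for a comparison; it just computes the candidate ratios:)

#include <stdio.h>

// Nothing but arithmetic on the width=1 and width=15 rows quoted above.
int main(void){
	double thr_w1  = 0.518179,  thr_w15  = 0.0343001;   // instruction throughput
	double mwps_w1 = 9284.03,   mwps_w15 = 1295.29;     // million writes per second
	double ms_w1   = 0.225888,  ms_w15   = 24.286;      // elapsed time in ms

	printf("instruction throughput ratio: %.1f\n", thr_w1 / thr_w15);    // ~15.1
	printf("writes-per-second ratio:      %.1f\n", mwps_w1 / mwps_w15);  // ~7.2
	printf("elapsed time ratio:           %.1f\n", ms_w15 / ms_w1);      // ~107 (but width=15 issues 15x the writes)
	return 0;
}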