How many warps per SM to hide global mem latency?

I’d like to clarify how many active warps should be there per SM to hide global memory latency… for g80 architecture
Is there any rule of thumb for this issue?

I’ve heard 192 threads as a good threshold for hiding memory latency bandied about (so 6 warps).

192 threads hide read-after-write register dependencies, so writes Harris in his slides on optimization. Are you sure you aren’t confusing this number with the number of threads to hide global memory access latency? 192 seems quite low to me.

You’re right, I think I did. Vasily Volkov claims 470-720 cycles for a DRAM access (source, and honestly, everyone interested in CUDA should read that paper). So, let’s say in your kernel, you’ve got 20 instructions you can run (other loads, pointer arithmetic, etc.) before you need the results of the first load. Assume, just for the sake of argument, that it’s four cycles per instruction (shared memory would be longer, which helps us, so we could consider this a worst case scenario). I don’t know how feasible this is, but whatever, hand-waving for the time being.

So, 20 instructions * 4 cycles per instruction = 80 cycles before a warp has to block. So, assuming Volkov’s numbers are correct (and they probably are), you’ll need at least 6 warps per SM (not necessarily per block) to hide that memory latency efficiently and potentially as many as 9. Remember that loads don’t block until the results of that load are needed, so if you’re not hurting for registers moving your loads as early as possible is not a bad idea.

Does that make sense now?

Can I conclude that this statement also holds true for Stores? E.g., writing from smem to global mem does not block at all provided that the written result in gmem is never needed in the same kernel again?

Stores are a fire-and-forget operation; you’ll never block on a store. Now, if you load from the same address, I’m not 100% sure how that’s handled. But don’t do that, it seems like a bad idea anyway.

They never block (only the kernel return will be blocked until all writes have finished I believe). Reading from global memory after writing to it in the same kernel is a race condition waiting to happen and should be avoided.

Wow, excellent paper link, thanks very much for that. It shows real latency measurements and also the fact that there’s several levels of them. There’s still loads of undocumented behavior, especially about queued pending loads, that’s useful but completely undocumented unless you keep up on the forum or do your own experiments. [I’d love an “unofficial” FAQ about queued read behavior and limits…]

Is that true even if you immediately reuse the register that you are saving from?



// … do more work with transformed register value here…

Queued LOADS tie up a register that waits for the data to come in, but maybe there’s a write buffer or something that holds the pending writes and releases the register.

(Some testing (and looking at the compiler’s generated code) would answer this of course…)

Thanks for pointing that out. So how many cycles does this store operation take then?

Depends if you mean “cycles that the store operation is executing on the SM” (four) or “cycles until the result is available in global memory” (using vvolkov’s numbers. about 720 max, don’t do this within a kernel because it’s probably a race condition).

I meant the former (I don’t need the written results in the kernel, so no race condition here). Just wanted to make sure that a code like below does not stall between each loop iteration. If that’s the case, everything is fine.

//inside kernel

for (1..100) {

   // do some calculations in smem

   // write sthg from smem to gmem


Regarding this answer of tmurray to an earlier question of mine, I was hoping that a code sample like below will execute very fast:

//inside kernel

float result;

for (1..100) {

   // do some calculations (read sthg from smem, store temporarily (in reg?))


   idx = ...

   result += smem[idx] * smem[idx];

   // write sthg from smem to gmem


   dst_idx = offset + tid;

   gmem_out[dst_idx] = result;

(note: the posted code might not be reasonable…I just tried to post the general idea, it doesn’t reflect my actual code)

With “very fast” I mean basically as fast, as if this store operation in the last line would not be in there (since it is “fire and forget”, so only 4 cycles). However, my measurements show that WITH enabling the last line (gstore), it takes two orders of magnitude more time than when I comment out that line.

Any explanations on that? Am I missing sthg here?

P.S: all global loads/stores are coalesced, and there are no serialized warps. also, dst_idx is different for each executed thread (so no location is written twice or so).

Very likely when you comment out the write to gmem_out, the dead-code optimizer eliminates most or all of your kernel. The compiler is very aggressive about eliminating code that doesn’t ultimately lead to a result written out to global memory.

As this explains some measurement results, it does not explain why writing to gmem is still that slow. In fact, I didn’t comment that line out, but just wrote something else to gmem…so this statement must have been carried out. Hm, on second thought, the problem might lie somewhere else. Please take a look at the following.

My code looks sthg like this:

// inside kernel:

__shared__ float img_s[SUBIMAGE_WIDTH * SUBIMAGE_HEIGHT];

// load gmem into smem, sync...

float result;

for(i=0; i < 300; i++)


	result = 0;

	for (m1 = 0; m1<HEIGHT; m1++)


		for (m2 = 0; m2<WIDTH; m2++)


			// compute some index based on threadId's

			src_idx = ...;

			// do some calculation

			result += img_s[src_idx] * img_s[src_idx];



	// write data back to global memory


	int dst_idx = i*SUBIMAGE_WIDTH*SUBIMAGE_HEIGHT + tid;


	// very slow case (100 ms):

	imgOut_d[dst_idx] = result;

	// very fast case (1.3 ms):

	//imgOut_d[dst_idx] = dst_idx;


So the difference in execution time merely lies in the last statement, which writes sthg to global memory. If I write the correct result, it takes 100 ms. If I write some other number (in this case dst_idx, which doesn’t really make sense), it takes 1.5 ms.

As to my understanding, both variables (dst_idx and result) should be stored in a register, right? So how comes this huge difference in execution time?

The only thing I could come up with is if “result” somehow would have been offloaded to local memory. But according to nvcc’s output, it is not:

1>ptxas info	: Used 9 registers, 1056+32 bytes smem, 8 bytes cmem[1]	 that's the output when "result" is written to gmem

1>ptxas info	: Used 7 registers, 1056+32 bytes smem, 4 bytes cmem[1]	 that's the output when "dst_idx" is written to gmem

Any hints welcome!

As seibert already said. The dead code optimizer is very aggressive. When the only thing you write out to global memory is dst_idx, it sees that result is unused and optimizes away the entire computation, including very possibly even the global memory reads into img_s.

One way to fool the compiler to generate code to benchmark what you want would be to only write result if dst_idx == 0. That way, the compiler is forced to keep it around and the overhead of writing only one value to memory is tiny compared to writing one value per thread (assuming you have lots of threads).

Now that I look at it side by side - that would mean that the variable “result” is stored in constant memory? Why is that? (I had assumed it’s in a register…)