Global Memoy latencies and NVIDIA cards Latency

Does the global memory latency with respect to GPU clock differ among various NVIDIA cards?

I have a 8800 GTX. I find that the global memory accesses do NOT look that bad.

Can some1 enlighten me here?

In most cases, with a large enough number of threads running, the global memory latency is never a problem because of all the latency hiding. Throughput becomes the real challenge. You need to get the memory access patterns just right (coalesced or data locality within a warp in texture fetches) to max it out. On the 8800 GTX, the max I have seen is around 70 GiB/s.

Yes, I would also like to believe that running more threads would hide the latency. Consider the following 2 snippets.

Multiple WARP version (called with 512 threads per Block):

__global__ void loadKernelMWarp(float *p)


	__shared__ float *items;

	__shared__ float cacheA[512];

	int i;

	items = p + (blockIdx.x * 512);

	cacheA[threadIdx.x] = items[threadIdx.x];



Single WARP version (called with 32 threads per Block):

/* Works only for Single WARP */

__global__ void loadKernelSWarp(float *p)


	__shared__ float *items;

	__shared__ float cacheA[512];

	int i;

	items = p + (blockIdx.x * 512);

	for(i=threadIdx.x; i<512; i += blockDim.x)

  cacheA[i] = items[i];



I verified that the compiler generates the PTX code corresponding to the global loads.

I find the single-WARP version outperforms the multiple-WARP version by around 70 micro-seconds. I run only one block. I understand that __syncthreads on 512 threads take up around 60 microseconds on it. But still, SWARP outperforms MWARP. ANd with bigger block versions, this difference only raises.

I use the CUT_XXX_TIMER method of profiling time. I also cross checked with the CPU ticks time measurement (QueryPerformanceCounter from platform SDK). Its all the same.

THe MWARP version has higher CUDA occupancy than my SWARP version. The MWARP directly takes up 16 WARPS. Whereas SWARPs cannot go beyond 8 WARPS (max 8 active blocks per MP). So, I still dont understand how come the SWARP still beats it.

In fact, I had another kernel version where I do some computation on this data and write back to global memory. I found that the SWARP version is TWICE as fast as the MWARP version for BIG block sizes (like 2048, 4096 blocks etc.)

If the Global Memory requests are executed in queued fashion – then what is the size of this queue? What happens when more global memory requests come than the queue size? OR Does WARP scheduling latency worser than global memory accesses???

I understand the last question might sound like an overkill. But I just want to know what am I missing here.

Why do you call __syncthreads here?

If I write any application based on this code – I need that __syncthreads() there. So, it is a part of overhead of having multiple-warps. So, It has to be there. Otherwise, there is NO point in profiling that code.

Do I make sense?

Okay, I understand. Can we be sure that we do not need a syncthreads in the second case?

That being said, it is looking strange to me that there is so much performance difference between the two, and I would not know why. Maybe the profiler output can help understand the differences?

In the second case, there is only one WARP. A WARP does NOT need __syncthreads(). They always execute together. Even if there is a divergence, one half waits for other half to join. And, in the case above – there is no divergence what so ever (as 512 is divisible by 32).

Profiler output is un-reliable - IMHO for small block sizes. espescially, I find that the “cputime” (total time) that it reports do NOT agree with what I measure with the CUT_XXX_TIMER calls OR the QueryPerformanceCOunter() calls.

THere is always a difference of some 200 Microsecs between what the profiler says and what is actually measured from CPU. So, I dont think the profiler is going to help here. Even at the max, it is going to talk about “coalescing”. We can directly tell from the code that coalescing happens.

The only point that we dont understand is , what happens to the LOAD once it is issued. Is it queued? Whats the maximum size of this queue? What happens if there is NO space in the Queue for a WARP request?

I somehow think that only 2 global memory requests (coalesced half-warp requests) can be serviced at any given time. That would explain why the MWARP thing is as fast if not slower than the SWARP thing.

I think only an NVIDIA hardware engineer can answer this question. OR Probably it is just un-defined and invisible to programmers. But then, I would like this to be documented so that programmers atleast are aware of it.

That is because CUT_XXX_TIMER uses QueryPerformanceCounter. For this low level of a benchmark, you should really look at gputime in the profile log or use the cudaEvent API to see exactly how much time these take on the GPU to minimize the effect of driver overhead. I would say especially when you are timing a single block, where driver overhead is sure to dominate.

There is an overhead for launching kernels for large numbers of blocks, but in my kernels it never seemed that large until over 8000 blocks. I just ran a quick test on an empty kernel and it seems the overhead is pretty linear at 1.0ms/60000 blocks. Perhaps that contributes to the double performance you are seeing? Occupancy could also be a factor, try your MWARP version with varying block sizes. Even in highly tuned kernels, I see performance vary by up to 50% when sweeping through the block sizes.

One thing you may be missing is that 32 and 512 are not the only block sizes. You should really try benchmarking all block sizes in multiples of 32, on decent sized systems with at least nblocks*block_size > 6400 to saturate the device. I often find that a block size of 64 is the optimal in my kernels, though 128 is usually not far behind. And the optimal size changes from kernel to kernel.

Another thing you are missing is what I suggested in my first post to this thread: calculate the actual memory throughput you are achieving (again, with a decent problem size to saturate the device, no single block benchmarks). You should be able to max out the memory bandwidth of the device with either of the two kernels above. My read only memory benchmark tops out at 38.73 GiB/s.

Please see my rant on the “profiler” in my previous post.

Hmm… I launch equal number of blocks for both kernel. So the overhead has to be there for both the kernels. So, I dont think it is playing a factor.

I did check with 256 and I did NOT see any profound improvement. But yes I must actually benchmark it against variety of block sizes.

Memory bandwidth does NOT bother me. It is just the nagging question of why 512 is NOT performing better than 32. Are Global memory load requests queued? If so, what is the size of the queue. I want to know how exactly this happens.

I am pretty much convinced that therez something that is missing here. If we have the detail of how a global memory request is carried out , this thread can be easily answered.

Thanks for your interest.

I did read it. You mentioned nothing about gputime.

Well, for bigger block sizes I found the “gputime” was 2x lesser for SWARP thing than the MWARP thing.

For 1 block size, here is the profiler output I get:


timestamp=[ 2939.759 ] method=[ _Z15loadKernelSWarpPf ] gputime=[ 9.440 ] cputime=[ 291.098 ] occupancy=[ 0.250 ] instructions=[ 124 ] gld_coherent=[ 32 ] cta_launched=[ 1 ]

timestamp=[ 4876.318 ] method=[ _Z15loadKernelMWarpPf ] gputime=[ 3.008 ] cputime=[ 20.952 ] occupancy=[ 0.667 ] instructions=[ 129 ] gld_coherent=[ 32 ] cta_launched=[ 1 ]


As you can see the profiler says 9.440 GPU time and 291 CPUTime for SWARP

And, 3.008 GPU time and 20.952 CPU time for MWARP. This means that the MWARP thing is much much faster than SWARP. However, for the same run, when I measure the CUT_XXX_CALLS, the SWARP takes 609 microsecs and MWARP takes 674 microsecs. This output is consistent across invocations.

Thats why I am NOT counting on profiler output for smaller block sizes.

Here is an output for 1000 blocks.


timestamp=[ 7312.940 ] method=[ _Z15loadKernelSWarpPf ] gputime=[ 82.464 ] cputime=[ 348.368 ] occupancy=[ 0.250 ] instructions=[ 6608 ] gld_coherent=[ 3968 ] cta_launched=[ 124 ]

timestamp=[ 12739.888 ] method=[ _Z15loadKernelMWarpPf ] gputime=[ 121.792 ] cputime=[ 140.241 ] occupancy=[ 0.667 ] instructions=[ 6674 ] gld_coherent=[ 3968 ] cta_launched=[ 124 ]


For 1000 blocks, you can see GPUTIME for SWARP is ahead when compared to MWARP. And, with increasing BLOCK SIZES this only improves. Again the time measured using CUT calls say 665 Microsecs for SWARP and 777 microsecs for MWARP. Again, the profiler output and my time meausrement differ a lot.

For 60,000 blocks:


timestamp=[ 230096.000 ] method=[ _Z15loadKernelSWarpPf ] gputime=[ 4505.056 ] cputime=[ 4774.350 ] occupancy=[ 0.250 ] instructions=[ 380300 ] gld_coherent=[ 239680 ] cta_launched=[ 7490 ]

timestamp=[ 447566.094 ] method=[ _Z15loadKernelMWarpPf ] gputime=[ 7185.376 ] cputime=[ 7205.105 ] occupancy=[ 0.667 ] instructions=[ 400782 ] gld_coherent=[ 240000 ] cta_launched=[ 7500 ]


You can see that SWARP thing is faster by a factor or 1.6x. CUT time measurement says 4916 , 7902 microsecs – which is kindaa close to the profiler output.

I do remember seeing 2X speed-ups for some block sizes.

At this point, I really think that global memory latency is very very less in my hardware – 8800GTX. I dont have any other hardware to check.

If one of you can run this experiment (SWARP and MWARP thing) on your hardware and profile it, it would be interesting.

Thanks for any help.

Some comments:

  1. One shouldn’t count on more than 0.5-1 milliseconds of timer precision under non-realtime operating systems such as Windows and Linux.
  2. Calculating effective bandwidth is much more informative.
  3. Such dummy kernels (not writing anything into global memory) have chances to be turned to void by the PTX-to-cubin translator, though it doesn’t seem to be the case (from looking at the profiler outputs). But I suggest writing back the smem contents to global memory and just multiply the effective bandwidth rate by two.

The results are bit unexpectable, because MWarp has far better occupancy and it’s one of the main factor affecting the hardware ability of gmem latenсy hiding. But another factor is the number of in-flight thread blocks per multiprocessor and SWarp appears to be reaching the limit of 8 thread blocks per multiprocessor, while MWarp is limited by 1 thread block per multiprocessor. The reason is __syncthreads() is a block-wise barrier for any operations (including global memory ones), so the multiprocessor scheduler is allowed more flexibility in the case of multiple blocks in flight.

So not too small (due to the first factor) and not too big (due to the second factor) thread block size is recommended. F.e. 128 threads is one of the best choices: it doesn’t prevent from getting both good occupancy and multiple thread blocks in flight per multiprocessor (max. 6 thread blocks per multiprocessor and 100% occupancy for this case). Similarly, though not taking register count pressure into account, 96, 192, 256, 384 threads per block may also be good choices.


Thanks for answering.

I agree with your views on timer. And also on the PTX to cubin thing. I did NOT know that “cubin” is the final executable. I knew PTX gets translated but I did NOT know that “cubin” is the final form. Thanks for the info.

However, I dont understand the effective bandwidth you are talking about. Is that Number of memops-per-sec??? If so, how is that going to help us here? Anyway, MWARP and SWARP are loading same number of memory. Just that, the way they load is different.

I dont understand how the “__syncthreads()” affects the global memory fetch in MWARP case. In case of MWARP the occupancy is 16 out of 24 warps in a multi-processor. Contrast this with 7 or 8 WARPs (blocks) occupied by the SWARP case.

  1. In case of the MWARP case - the scheduler first schedules the first WARP - this warp issues 2 coalesced global memory fetches and probably goes to sleep. now, the scheduler moves on to 2nd WARP and so on until 16 WARPs. Thus 32 coalesced global memory requests go on immediately without having to wait.

Contrast this to SWARP case – only 16 requests max can go (assuming 8 active blocks)

So, my views are:

Apparently, it looks like MWARP is faster. But what if the global memory request done for WARP 0 finishes with the data when the global memory request for 8th WARP is being made. THat will put MWARP and SWARP on an equal plane. And, if it gets faster than that – it gives an edge to SWARP case because the WARP-scheduling latency is lesser for the SWARP case. So, it all depends on how fast global memory is. That is why I am interested to know the global memory latencies in various NVIDIA cards. If I base my program on this observation – I should NOT be let down when I run on a different NVIDIA card. That is my major concern.

  1. What is the size of global memory request queue? Is this a per multi-processor queue ? What does the WARP scheduler do if a global memory request from a WARP cannot be queued ? Does the local MP provide for a queue of 24 (*2) requests to make sure that WARP requests atleast dont find themselves aborted? hmm… May b.

I would ideally like to go with 1 WARP and schedule more blocks to saturate my multi-processor so that I am neither affected by Register-latenicies NOR by global memory accesses.

This way, I can AVOID __syncthreads() (I would use the VOLATILE keyword instead to synchronize among my single warps) and this would give me better speed too.

If a single WARP and multiple blocks per MP is sufficient to hide my global memory and register latencies in almost all CUDA enabled cards , I would prefer to stick with a Single WARP approach.

I can avoid __syncthreads overhead.

I can avoid worrying about un-necessary race conditions in my program.

I can avoid double-buffering (use volatile keyword instead).

Right now – This approach looks very promising to me. And, This is where I need help from the forum members. If people can run this test on a variety of graphics cards (I would like to see this profiled on a Tesla) – it could help me (us) get a rough idea of how to extract performance.

Our company is planning to buy a few more cards. But it might take sometime before we can lay our hands on it.

Anyway, Thanks for answering.

I request all of those who read this to carry out this simple experiment for me – if you have the bandwidth to do so… THANKS a BUNCH!!!

Best Regards,


In the multiwarp case all the 16 warps belong to the same block and __syncthreads() imposes an execution dependency onto them: each warp is stalled on the barrier point until all other warps from the same thread block reach the point, and such big thread block size prevents multiprocessor from keeping more than one thread block in flight, which should help performance because thread blocks are independent by defintion; thus with multiple in-flight blocks per multiprocessor while one block is waiting at a barrier hardware scheduler can switch to another block without any “logical” glitches and save time.

Thanks to the same multiprocessor architecture (which is the unit that actually executes thread blocks!) GPU performance scales very well across the GeForce 8 / Tesla board families (which is proved by practical observations). In particular, Tesla boards have no performance difference from GeForce 8800 GTX. As your test app is almost 100% memory bound you should be observing close-to-theoretical memory bandwidth on any board.

Yes, True! And, Thanks for the clarification on Tesla hardware. I really think SWARP programming model with multiple blocks would scale well when compared to multiple warps programming model.

Anyway, I think I will find that sooon. And, when I do, I will come back here and post my results.

Thanks for your inputs.

Best Regards,