What will happen in this situation?

If the GPU has 16 SMs
and I launch only 1 block of 512 threads,
what will happen?
Will the 16 warps be distributed across the 16 SMs,
or will only one SM be used for the calculation?

A block is never split among SMs.
So all 512 threads will run on the same single SM, leaving the other 15 SMs idle.
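
The arithmetic behind this answer can be sketched in a few lines. This is a plain illustration rather than CUDA code: the helper name `launch_footprint` is made up for the example, and the only facts it assumes are the 32-thread warp size and the rule that a block is never split among SMs.

```python
import math

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def launch_footprint(num_blocks, threads_per_block, num_sms):
    """Return (warps per block, number of SMs that can receive work).

    Hypothetical helper for illustration: a block is never split,
    so a launch can occupy at most one SM per block.
    """
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    busy_sms = min(num_blocks, num_sms)
    return warps_per_block, busy_sms


warps, busy = launch_footprint(num_blocks=1, threads_per_block=512, num_sms=16)
# warps per block, busy SMs, idle SMs
print(warps, busy, 16 - busy)
```

For the launch in the question this reports 16 warps, 1 busy SM and 15 idle SMs.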

Think of it this way: a block has shared memory that all of its threads can read and write efficiently. If a block ran on several SMs at once, there would be no easy way for them to coordinate that shared memory efficiently.

Note that the answer to the OPPOSITE question, whether an SM can run multiple blocks at once, is yes. One SM can run up to 8 blocks simultaneously. This may even be common in kernels with low register and shared memory use.

Thanks for your reply

But the programming guide says an SM can run only one warp (32 threads) at once, doesn’t it?

What I mean is: suppose the GPU has 16 SMs and nblock*nthread=4096, and we ignore resource limits (shared memory and registers).

Then nblock=32, nthread=128 will run at the same speed as nblock=64, nthread=64, won’t it?

No, an SM can schedule 24 warps at once (Compute 1.0 and 1.1) or 32 at once (Compute 1.2+).

This is all interleaved scheduling, which is the elegant way CUDA uses to hide the large latency of device memory access.

An SM can run those 24 or 32 warps from up to 8 blocks at once. This is also important and useful, since having multiple simultaneous blocks hides the single-block-wide stalls when __syncthreads() is used.
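
As a rough sketch of how these two limits interact, the snippet below computes how many blocks one SM could hold at once. It assumes only the figures quoted above (8 blocks per SM, 24 or 32 warps per SM) and deliberately ignores register and shared memory limits; `resident_blocks` is a made-up name for the example.

```python
import math

WARP_SIZE = 32
MAX_BLOCKS_PER_SM = 8  # block limit quoted in the post above


def resident_blocks(threads_per_block, max_warps_per_sm):
    """Blocks one SM can hold at once, ignoring registers and shared memory."""
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    return min(MAX_BLOCKS_PER_SM, max_warps_per_sm // warps_per_block)


for max_warps in (24, 32):  # Compute 1.0/1.1 versus Compute 1.2+
    for threads in (64, 128, 512):
        print(max_warps, threads, resident_blocks(threads, max_warps))
```

With 24 warps per SM, 64-thread blocks hit the 8-block limit, while 128-thread blocks are capped at 6 by the warp limit and a 512-thread block leaves room for no second block.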

Probably the 64 block version would be faster, since it gives the scheduler more options (finer granularity) in feeding work to each SM.

This is especially important when block workloads are not equally balanced. Imagine if one SM gets two “fast” blocks and another SM gets two “slow” blocks. The “fast” SM will finish its work and have nothing else to do. Using more, smaller (and therefore faster) blocks helps reduce this problem.
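
A toy model makes the granularity argument concrete. The sketch below is an assumption-heavy simplification, not a description of the real hardware scheduler: each SM runs one block at a time, a free SM takes the next block in launch order, and block costs are invented time units. The name `makespan` is made up for the example.

```python
import heapq


def makespan(block_costs, num_sms):
    """Finish time when each free SM takes the next block in launch order.

    Simplified model (an assumption): one block per SM at a time, and
    each block's cost is a fixed number of time units.
    """
    finish_times = [0] * num_sms  # time at which each SM next becomes free
    heapq.heapify(finish_times)
    for cost in block_costs:
        start = heapq.heappop(finish_times)  # earliest free SM
        heapq.heappush(finish_times, start + cost)
    return max(finish_times)


coarse = [12, 2, 2, 12]            # 4 large, unbalanced blocks
fine = [6, 6, 1, 1, 1, 1, 6, 6]    # same total work, each block split in two
print(makespan(coarse, 2), makespan(fine, 2))  # prints "16 14"
```

Both launches contain 28 units of work on 2 SMs, so 14 is the ideal finish time. The coarse launch ends at 16 because one SM is stuck with a slow block while the other runs out of work early.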

It will mostly be the same speed.

However, SPWorley’s idea about asymmetric kernels is interesting, for the kind of kernels it applies to. It probably won’t make a huge difference, though.

Another thing I noticed is that sometimes running 1 block per SM is much faster (by a factor of two). This is down to a poorly documented facet of the memory subsystem, namely DRAM channels. If your kernel is laid out to access these in a perfectly optimal manner (as some of mine are), running 2 blocks causes conflicts in the access pattern and hurts performance severely. I suppose the same effect may appear when going from 2 to 4 blocks.

Also keep in mind the future. Cards will come out (sooner rather than later) with 64+ SMs, in which case the nblock=32 code will obviously underperform.

But resource usage is the overriding concern, by far, in many situations. If you see that your algorithm can exploit the on-die SRAMs well, don’t hesitate to try 1 block per SM or even 128 registers per thread. The gains from on-die SRAMs (registers, shared memory, constant memory) are typically orders of magnitude, while concerns like inter-block overlap and occupancy are typically much more incremental.

That’s an interesting point, one I never measured (and didn’t expect.)

Is this slowdown in the device memory access? Shared memory access?

How big is the effect, really a 2X speed difference?

I could hypothesize some shared memory overhead because multiple blocks require some simple virtualization of the shared memory addresses.

But even there I’d expect it’d be part of the hardware and not have any execution cost.

When you say “faster by a factor of two” I perk up and pay attention… tell us more, Alex! The hidden hardware behavior of GPUs is always interesting once you start tearing off the abstractions.

It was to DRAM. I had made my own local memory implementation, which turned out to be faster than the built-in local memory. Running 128 threads per SM (I needed the registers) on an 8600GT (2 DRAM channels, 4 SMs) resulted in great performance. Running two blocks with 64 threads each halved it.

I should really go back and test this thoroughly. It’s possible the cause was something else. But DRAM is fickle. The channels are interleaved every 256 bytes. If you set up your addresses the wrong way (for example, walking down a matrix column when its stride is a multiple of (channel count)*(256 bytes)) you’ll hammer the same channel from all SMs and kill your bandwidth. DRAM also likes to read contiguous memory, saving itself the many cycles needed to redo addressing. Ideally you want to get the 256 bytes from your channel in one go, which I think can be achieved with a “meta-coalesced” access (i.e., one that spans multiple warps). This might have been what caused the cliff at 2 blocks/SM. There is also the TLB to keep in mind, which can halve performance if you start to address memory all over the place and cause page faults.
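
The channel-interleaving point can be illustrated with a toy address model. It assumes the simplest possible mapping, channel = (address / 256) mod (channel count), which matches the description above but is not documented hardware behavior. The names `channel_of` and `channels_touched` are made up for the sketch.

```python
CHANNEL_WIDTH = 256  # bytes per interleave unit, as described above


def channel_of(byte_address, num_channels):
    """Assumed mapping: consecutive 256-byte chunks rotate through the channels."""
    return (byte_address // CHANNEL_WIDTH) % num_channels


def channels_touched(start, stride_bytes, count, num_channels):
    """Sorted list of channels hit by `count` accesses spaced `stride_bytes` apart."""
    return sorted({channel_of(start + i * stride_bytes, num_channels)
                   for i in range(count)})


num_channels = 2                             # the 8600GT figure from the post above
bad_stride = num_channels * CHANNEL_WIDTH    # 512 bytes: a multiple of channels * 256
print(channels_touched(0, bad_stride, 16, num_channels))                  # [0]
print(channels_touched(0, bad_stride + CHANNEL_WIDTH, 16, num_channels))  # [0, 1]
```

Under this model a column walk with a 512-byte stride lands on channel 0 every time, while a 768-byte stride alternates between both channels.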

Like I said, I should retest what I saw, but there is a lot going on there. A lot more than the documentation reveals.

Dear Alex:

Maybe I have run into the same problem as you. What I want to do is implement a 5*5 average filter on an image (4008*2672, 16 bits).

So neighboring threads will sometimes read the same address at the same time.

My card is a 9800 GTX, which has 16 SMs.

I did not use shared memory to implement it; I used a texture instead. (Textures are cached, right?)

And the performance is very similar whether I launch 1 block of 167 threads or 16 blocks of 167 threads.

I hope to hear your new test results and the root cause.~
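
For reference, the 5*5 average filter described in this post can be written as a plain CPU sketch, which also shows why neighboring outputs read the same input pixels. This is not the poster's kernel: the clamped border handling and the integer division are assumptions made for the example, and `box_filter_5x5` is a made-up name.

```python
def box_filter_5x5(image):
    """Average each pixel over its 5*5 neighborhood.

    Assumptions for this sketch: coordinates are clamped at the borders
    and the mean is rounded down with integer division.
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0
            for dy in range(-2, 3):
                for dx in range(-2, 3):
                    yy = min(max(y + dy, 0), height - 1)  # clamp row
                    xx = min(max(x + dx, 0), width - 1)   # clamp column
                    total += image[yy][xx]
            out[y][x] = total // 25
    return out


# A 5*5 test image that is zero except for a value of 25 in the center.
image = [[0] * 5 for _ in range(5)]
image[2][2] = 25
for row in box_filter_5x5(image):
    print(row)  # every row is [1, 1, 1, 1, 1]
```

The single bright pixel contributes to all 25 outputs, so in a one-thread-per-pixel kernel each input value is fetched by 25 different threads.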

It’s not a real cache.

You mean using 16 MPs did not improve throughput vs 1 MP?

Yes, it’s very strange~

What I do is use global memory for the 5*5 average filter, without using shared memory,

as I mentioned in my previous post.