Basic Cuda Confusion - help

First post – sorry if this is a dumb question:

I’m just trying out Cuda. I have a GT 240, spec’ed with 96 Cuda cores. I wanted to understand the basics of parallel execution before investing in something heavier.

I’m struggling a bit understanding the model of blocks, threads, cores, SMs, streams etc. So I tried a simple test to eliminate memory issues and get down to raw processing measurements. I wanted to understand the optimal configuration of blocks and threads per block, for my (GT 240) hardware, for a simple linear task that I can break into parallel chunks.

I’ve written a kernel that takes no parameters, and basically counts to 1,000,000 (adding the thread id so it has something to do) as a speed test.

global static void MyKernel()
const int tid = threadIdx.x;
int n;
int s=0;

for (n=0; n<=1000000; n++)


If I run this as a single thread, on a single block – it takes about 200ms to run. Fine.

MyKernel<<<1,1>>>(); // = ~ 200 ms

If I keep a single thread per block, and increase the number of blocks, it doesn’t take much longer…

MyKernel<<<50,1>>>(); // = ~ 200 ms

Until I hit 97 blocks, and it takes twice the time

MyKernel<<<97,1>>>(); // = ~ 400 ms

Perfect. I have 96 cores, so it makes perfect sense that it can complete 96 kernel runs in parallel, then needs an extra run to complete number 97. Increasing the blocks further remains at ~ 400 ms until I hit 193 blocks, then it takes ~ 600 ms. All as expected.

At that point I’m patting myself on the back and making coffee.

The problem comes when I tested the opposite. Having a single block and increasing the number of threads per block.

MyKernel<<<1,1>>>(); // = ~ 200 ms
MyKernel<<<1,50>>>(); // = ~ 200 ms
MyKernel<<<1,100>>>(); // = ~ 210 ms
MyKernel<<<1,150>>>(); // = ~ 220 ms
MyKernel<<<1,200>>>(); // = ~ 240 ms
MyKernel<<<1,250>>>(); // = ~ 240 ms
MyKernel<<<1,300>>>(); // = ~ 260 ms
MyKernel<<<1,400>>>(); // = ~ 300 ms
MyKernel<<<1,512>>>(); // = ~ 360 ms

I’m now baffled why I seem to be able to complete 200 parallel runs in not much more time than a single run, and there is no apparent step change in timing as I hit 97 threads.

What’s going on? And how does this help me determine the optimal number of blocks and threads per block for this kind of task?


If you are only running 1 thread and 1 block then you are using a tiny fraction of the GPU’s capability because it is designed to run several lots of 32 threads at a time on each SM, and has several SM

When running 1 thread per block the other 31 threads of the WARP are doing nothing, the hardware is there to do something but not being used. (and all the other SM are doing nothing)
when running 2 threads per block there are 30 threads doing nothing
running 32 threads per block should take about the same time as running 1 thread per block

NB hardware is designed to run multiple WARP’s on each SM. A warp is 32 threads (don’t have to use all 32, but ‘why not’ use them if you can)

When you were running 33…64 threads per block you actually ran 2 warps
This is good because if 1 warp had to pause for some reason the other warp ran instead, until it had to pause, and then hopefully the first warp was ready to run again.

So you got those extra 32 threads nearly for free.
Similarly as you used 100 threads which takes 4 warps (32+32+32+ only 4 threads in the 4th warp)

Aside: All 32 threads in a warp do an instruction in once clock cycle (SIMD), but each SM can only do 1 instruction ( on a single warp of threads) per clock cycle. (sometimes more on recent hardware)

But all of this is being done on a single SM, and without looking it up I think your GT240 has 16 SM, which can all run at the same time.
So you could easily be running 16 blocks in the same time that you ran 1 block.

Aside: Once a block is assigned to a SM it stays on that SM, a block can not be split across SM.

Really a GPU is designed to run (tens of) thousands of threads at once if you want to make full use of its capabilities.

So in your case (which I know was just a test rig) instead of having say 50 threads each doing 1 million adds, split it up so that you have 50 blocks each with 250 threads where each thread does 4000 adds, it should run a lot faster.
( In a real case you would also have to add the partial results of the 250 threads together using some “reduction” method)

How does this help you determine the optimal number of blocks and threads per block…

The optimal number will actually vary with the nature of the problem and your solution to it.
In real cases you will need to read and write data from/to the GPU’s RAM and that takes time, it is usually that IO that causes a warp to pause allowing another warp to run, so the amount of IO your code needs to do will determine the optimum number of warps that will run simultaneously on a SM.
NB Make sure IO is “contiguous” if you can.
Usually a good starting point is about have 128 to 256 threads per block and then vary the number of blocks with the size of the dataset you are processing. (I often start with 196 threads per block)
Once you have the code designed and working correctly you can then change the number of threads per block to see if there is a better block size to run.
NB usually best if the block size is a multiple of 32. And some applications that need to use more registers or more shared data may work better with fewer warps per block.

Thank you kbam, a very comprehensive answer. I appreciate you taking the time to explain that.

It is only a test rig, but I’m interested in exploring what the hardware can do where the threads are likely to execute in near identical time and don’t get blocked waiting for an external resource, like shared memory.

The tip for tuning against each given problem makes sense – after all, it’s real speed that counts, not the theory.

One other (obvious) question- the Nvidia card specs quote different clock speeds. The GTX 690 talks of Base 915Mhz/Boost 1019Mhz, older cards like my 240 talk of a Graphics clock at 550Mhz and a Processor clock at 1340Mhz.

Again, keeping it simple and only worrying about the raw thread speed when there is no shared memory resource, is it the processor clock I’m interested in, and how come the 690 appears to run a lot slower clock wise?

Thanks again.

Just to confirm you do indeed know what you are talking about, I ran the obvious test of trying every combination of block numbers (1…200) and threads per block (1…300), for the test kernel, calculating the total threads per millisecond completed with each combination.

The timing is a bit approximate (I’m using the system clock), but the results are clear.

Right up there as the most efficient are block counts of multiples (and divisors) of 96 (my GT 240 cores), and multiples of 32 threads per block.

Right at the top of the list (most efficient), was MyKernel<<<96,256>>>()

I suspect MyKernel<<<96,32>>>() uses the hardware most efficiently on my GT240, and the higher numbers just reflect the fact that I can do more with more threads without restarting the Kernel from the host.

But at least the real result matches the theory nicely.

I also know my GT240 can count (well add together) about 18 billion numbers a second. Not bad for a £40 card. Roll on my GTX690.


Thanks :)

To avoid making things to complicated I avoided mentioning one difference between the older and newer cards. Your GT 240 actually has 8 cores per SM and 16 SM’s. To do one instruction on a warp takes 4 clock cycles and the whole warp has to complete before the next instruction (or change of warps) can start.
The newer chips have more cores per SM so work faster even at a lower ‘clock speed’, they also do some things faster e.g. double precision, and have different cache.

Anyway you should forget about your card having 96 cores, the important thing is that threads always execute as a warp of 32 threads regardless of the card, and that GPU’s have multiple SM’s (maybe only 2 on a card for a mobile device and 10 or more on a card for a PC/workstation) Its good to design code that will work on a mobile device as well as on a high end card.

There is a rule of thumb that when IO (data transfers between the SM and the GPU’s RAM) is involved, then having 6 or more warps per SM allows the latency for data IO to be hidden by just switching to a different warp.
A block size of 192 threads (6 warps) will satisfy this, so will having 3 blocks of 2 warps each, again which one is better depends on your application.
(The older chips allowed up to 8 blocks and up to 24 or 32 warps to be resident on each SM at a time. Newer chips may allow more. Usually other factors make the optimum number of blocks and warps per SM less so no need to feel you have to be close to those numbers)

NB Kernel parameters of 96,32 on a GPU with 16 SM’s means 3 warps per block and 2 blocks per SM. Any less than this (with IO involved) and the GPU will not be fully utilized, not that that matters.
NB more resident blocks means fewer registers per thread which can often make performance worse.

Kbam, if I may try your patience and ask another question…

I’m now confused about where 96 comes into it.

Going back to my original post.

When I run

MyKernel<<<1,1>>>(); // = ~ 200 ms

I appreciate this is now lunacy of the highest order, and wastes warps, threads and hardware like they are going out of fashion, but that aside…

I guess I’m running 1 thread (in a warp of 32, 31 doing nothing), on 1 core, on 1 SM.

  1. What is happening when I run 2 blocks each with 1 thread?
  2. What is happening when I run 96 blocks each with 1 thread?
  3. And most importantly, what is happening when I run 97 blocks of 1 thread and why only at that point does it take twice as long?

Noting that if my GT 240 has 16 SMs with 8 cores, I’m expecting 128 to be a magic number, not 96.

Sorry, I will get this in a minute, but I’m feeling the confusion of moving from the CPU to GPU world.


Sorry my mistake it will be 12 SM each with 8 cores.

At 96 blocks with 12 SM there will be 8 blocks assigned to each SM, as 8 is the maximum that your hardware can support per SM no more blocks can be assigned until some blocks finish.
“…then needs an extra run to complete number 97”

NB The number of blocks that can be assigned simultaneously to a SM depends on the resources each block needs, e.g. registers and shared arrays. Worth remembering that with complex algorithms it can sometimes be useful to run fewer blocks per SM so that each block can have more registers.

There is something not happening in your results the way I would have expected it to, your algorithm is very simple (no IO ) and I would have expected 1 block to have run faster than 96. I expect it will be different on newer hardware. Hardly matters as usually there will be lots more blocks to run.

A late note of thanks. I’ve been away from the forum.

Guided by your feedback, I’ve done a fair bit of reading and testing and things are now becoming clearer. I’m fascinated by quite how different GPU architecture is to CPU, and I’m also surprised at how significant a re-write of CPU algorithms you sometimes have to do to get the best out of the hardware.

Today’s shocker came from my first attempt at using shared memory (just as a local cache), and discovering just how little ram is available (on my GT 240) per thread.



Some more detailed experiments/kernels confirm your statement “…8 is the maximum [number of blocks] that your hardware can support per SM no more blocks can be assigned until some blocks finish.”

Is it actually conincidence that each SM has 8 cores? (From this forum) I now appreciate that those 8 cores work in unison to execute a single warp at a time (with a scheduler switching context between warps to hide memory latency).

My tests show that my GT 240 SMs are happy to time slice up to 32 warps, provided they are only from up to 8 distinct blocks. Presumably this is where my deviceQuery “Maximum number of threads per multiprocessor 1024” comes in, as 32 warps of 32 threads.

Where is the max blocks per SM defined for each hardware?

Edit: has a good table of compute capability stating this, so I think I’ve got my answer. Although I’d like to know where this is on the Nvidia site, rather than Wiki.


The 8 blocks per SM is a coincidence, as it holds for devices all the way up to compute capability 2.1, which have very different numbers of CUDA cores per SM.

As I mentioned in your other thread (but also putting here for completeness in case someone else finds this thread while searching), CUDA Programming Guide Appendix F is the reference for these numbers.