blocks vs threads

Hi,

It seems that increasing the number of blocks (blockNum) in a grid and the number of threads (threadNum) in a block can both make full use of all multiprocessors (blockNum > 16 and threadNum > 32). My question is: once blockNum > 16 and threadNum > 32, what difference does increasing them further make? For example, in case 1 I use 16 blocks and 512 threads per block, for a total of 16 * 512 threads; in case 2 I use 256 blocks and 32 threads per block, for a total of 256 * 32 (== 16 * 512) threads. How can I judge which case is better? What is the main difference between them? Or are they equivalent? Thanks
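
For concreteness, the two launch configurations being compared would look something like this (the kernel name and arguments are placeholders, not from the original question):

    // Case 1: 16 blocks of 512 threads; Case 2: 256 blocks of 32 threads.
    // Both launch 8192 threads in total; only the grouping differs.
    myKernel<<<16, 512>>>(d_out, d_in);    // case 1
    myKernel<<<256, 32>>>(d_out, d_in);    // case 2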

Your code will probably not execute if you use too many threads per block. This is possibly due to the large number of registers (and possibly shared memory) allocated per thread. Furthermore, I think that increasing the number of blocks gives your code more data parallelism, since blocks don’t communicate with each other.

My guess is that the device scheduler can use this fact to map blocks onto the device efficiently, in contrast to a large number of threads, which all need to be mapped to the same multiprocessor to facilitate communication.

By having more blocks than multiprocessors you

a) ensure that while one block is stalled waiting for device memory accesses / syncthreads, another block can be run on the multiprocessor.

b) ensure that your application will continue to scale on future GPUs, which will have more multiprocessors.

Mark

So would there be any reason to expect the performance of the following two configurations to differ (not worrying about future scalability):

  1. 32 blocks, 512 threads

  2. 64 blocks, 256 threads?

Would the main difference in performance come from the ability to switch to more blocks if the code uses a lot of __syncthreads() calls? Because memory latency can also be hidden by just switching to different threads within the same block, right?

While I can’t answer your other questions, I can say that the performance of my app was constant for >64 threads per block (bad results above 256 tpb). 32 tpb was only 20% slower. So, I would guess that as long as you don’t saturate your shared memory, those two configurations should perform very similarly. I hope this helps.

Remember that registers and shared memory are shared resources within a multiprocessor (SM). So if you use more than half the shared memory per block, you will only get one thread block per multiprocessor. Likewise if you use more than half the registers. In my experience, for most reasonably complex programs 512-thread blocks make it tough to fit more than one block per SM due to register usage (registers are allocated per thread).

So in most apps, eelsen, I would expect 64 blocks of 256 threads to perform better than 32 blocks of 512 on G80. Depending on your app, 128 blocks of 128 threads may perform even better.

There’s no exact formula for this stuff – it’s very app dependent. The best thing I can recommend is to build your app to be self-tuning (as FFTW and Atlas are).
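
A minimal sketch of what such a self-tuning loop might look like on the host side (the kernel name, its arguments, and the candidate block sizes are placeholders, not anything from this thread):

    // Time the same kernel at a few block sizes and keep the fastest one.
    int candidates[] = { 64, 128, 256, 512 };
    int bestThreads = candidates[0];
    float bestMs = 1e30f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < 4; ++i) {
        int threads = candidates[i];
        int blocks  = (n + threads - 1) / threads;

        cudaEventRecord(start, 0);
        myKernel<<<blocks, threads>>>(d_out, d_in, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < bestMs) { bestMs = ms; bestThreads = threads; }
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    // launch the real runs with bestThreads from here on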

Mark

While we’re on this topic, could you disclose to us how many registers are available per multiprocessor? I’m getting a kernel launch failure and I think it’s due to the register count, but I can’t be sure without knowing how many I have available, and I haven’t seen this number published anywhere.

I’m trying to launch 1 block of 32 threads in a particular kernel, and each thread requires 9 regs (according to the .cubin file). I know the shared memory is within limits, so my next guess is that the number of registers is causing it. (Yes, I know that 1 block of 32 is inefficient, but this is a small portion of the code that is not very parallelizable.)

How much shared memory are you actually using? I’ve had problems with the same error, and I’m pretty sure it’s caused by the shared memory. 9 registers is not very many and 32 threads is also fairly small, so I doubt you’re running into a register problem.

A kernel that does basically nothing except allocate more than ~3970 floats of shared memory seems to cause this error for me, even though one should theoretically be able to fit 4096 floats.
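
A minimal repro sketch of that observation (the array size is taken from the ~3970-float figure above; the explanation in the comment is only my guess, not something confirmed here):

    __global__ void almostFull(float *out)
    {
        // 3968 floats = 15872 bytes of shared memory launches fine, while a full
        // 4096-float (16 KB) array does not -- presumably because kernel arguments
        // and built-in variables also occupy part of the 16 KB.
        __shared__ float big[3968];
        big[threadIdx.x] = (float)threadIdx.x;
        out[threadIdx.x] = big[threadIdx.x];
    }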

Thanks for the info, Mark. The current app I’m working on basically loads some stuff into shared memory and from then on almost all memory accesses are from shared memory, so there is actually very little latency to hide. After writing a little self-tuning loop as per your suggestion, it seems that 512 threads is the way to go.

It turns out this is because the major bottleneck in the code is actually the time it takes to load the data into the shared memory, and with more blocks it has to load the exact same data twice.

What this made me wonder is if there is a way to only load the data into the shared memory once on each sm, and then just read from it for the duration of the kernel. (The real app would have thousands of blocks, so the overhead in reloading the same data is significant.)

For example something like:

if (sdata[1000] != FLAG) {
    // load data into sdata
    // set sdata[1000] to FLAG
}

which seems to work, but I’m guessing such behavior is not (and will not be) supported by NVIDIA?

I think it is very unlikely that you will gain anything from a design like that. For example, you have 16 kB of shared memory per multiprocessor. Let’s say you use 1024 blocks that all fully reload the shared memory. This means each multiprocessor’s shared memory is shared by 64 blocks (there are 16 multiprocessors in total) and is thus effectively downsized to 0.25 kB per block.

This amounts to 256 kB of data being reloaded. If you stream this data into your shared memories you will get something like 40 GB/s of bandwidth from device memory, which means you are busy reloading data for only 0.0064 milliseconds. If this is the bottleneck in your application then something else is wrong :P

OK, perhaps you execute multiple kernels with a smaller number of blocks, but it’s still highly unlikely that streaming from device memory to shared memory is the bottleneck.

Thanks for the input WW,

Some testing shows that the runtime decreases by about 40% when the shared data is not loaded for subsequent blocks. There are 2032 extra blocks loading 16 kB of data each, which amounts to about an extra 31 MB; at 40 GB/sec that would take 0.8 msec to load. My actual program runtime goes from about 4 to 2.7 msec with this change, so it would appear the shared memory is being loaded at more like 25 GB/sec for some reason.

My main question is whether or not using the shared memory in this way would be supported or if it is something that could easily be broken by future drivers/hardware etc…

Because the number of reads from shared memory (per fragment) is not extremely large (~32), those reads run at register speed (bandwidth is ~1.4 TB/sec: 16 processors * 16 banks/processor * 4 bytes/bank/clock * 1.35 billion clocks/sec), and the number of math operations is also small, the time spent loading the shared memory is in fact comparable to the time spent doing useful work.

The above numbers also imply that you can load about 34 or so floats from shared memory (at 1.4 TB/sec) in the same amount of time it takes to load one float from main memory (at 40 GB/sec). If you use the 25 GB/sec I seem to be observing, then it’s more like a 56 to 1 ratio.

Not loading the data into shared memory first would be even worse, since the memory accesses are essentially random, and random accesses to global memory perform horribly. Using texture fetches would help because a lot of the data would end up in the cache, but I’m already using the texture cache for something else…

I also thought of another solution, which I hadn’t thought of before because I was still in the “streaming” mindset. Since scatter is supported, I can loop inside the kernel over the output locations. In this way I can have only 16 blocks, which means I can make sure the hardware only loads the shared memory once without having to do anything weird with the shared memory. But then I have a very small number of blocks, which is probably not a great idea.
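
A rough sketch of that alternative (all names and sizes here are illustrative, not from the actual code): a small, fixed number of blocks each load shared memory once and then loop over many output locations.

    __global__ void loopOverOutputs(float *out, const float *in, int nOutputs)
    {
        __shared__ float sdata[1024];            // size is illustrative

        // cooperative one-time load of this block's shared data
        for (int i = threadIdx.x; i < 1024; i += blockDim.x)
            sdata[i] = in[i];
        __syncthreads();

        // each block then walks over many output locations
        for (int o = blockIdx.x * blockDim.x + threadIdx.x;
             o < nOutputs;
             o += gridDim.x * blockDim.x)
        {
            out[o] = sdata[o & 1023];            // placeholder for the real computation
        }
    }

    // launched with a small fixed grid, e.g. loopOverOutputs<<<16, 256>>>(d_out, d_in, n);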


Eelsen,

It sounds like your data are constant. Why not use constant memory for this? Then none of your thread blocks has to load it, and you don’t have to worry about it breaking in the future. The working set size of the constant memory cache is 8 KB. Since you were using shared memory before, we know your working set is no larger than 16 KB, so chances are you will get a very good cache hit rate from constant memory.
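
A minimal sketch of that suggestion (the array size and names are placeholders; 2048 floats = 8 KB, matching the constant-cache working set mentioned above):

    __constant__ float c_data[2048];

    // host side, once, before any kernel launch:
    //     cudaMemcpyToSymbol(c_data, h_data, sizeof(c_data));

    __global__ void useConstant(float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = c_data[i & 2047];    // no per-block shared-memory load needed
    }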

Mark

After gathering some information here, it’s clear that eelsen’s technique is very dangerous. Shared memory is allocated and freed by the hardware as thread blocks are scheduled on the multiprocessors. It could also break with a new compiler version, and is likely to break on future GPU architectures.

Mark

The reason I decided to go with shared memory over constant memory is that the location of the memory access from each thread is basically random (though with some convenient properties that let me structure the data access so there are no bank conflicts). As I understood the documentation, if different threads within a warp access different locations in constant memory, the accesses have to be serialized, which would mean essentially 16x less bandwidth for this particular application.
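
As an illustration of the kind of access pattern being described (the idx[] table and all sizes are hypothetical, not from the real code): within a half-warp, arbitrary indices into shared memory are still conflict-free as long as each thread hits a distinct bank (bank = word index % 16 on G80), whereas the same gather from constant memory would be serialized.

    __global__ void gatherFromShared(float *out, const float *in, const int *idx)
    {
        __shared__ float sdata[2048];

        for (int i = threadIdx.x; i < 2048; i += blockDim.x)
            sdata[i] = in[i];
        __syncthreads();

        // idx[] holds "random" element indices arranged so that idx[t] % 16 == t % 16:
        // each thread in a half-warp reads from a different bank, so there are no conflicts.
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        out[t] = sdata[idx[t]];
    }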

Thanks for that info, that was my real concern. I may be getting lucky right now because I’m allocating essentially the entire available shared memory and using 512 threads (so only one block at a time per processor), so I think the hardware probably doesn’t have much choice about how it is allocated. I guess I have to try a couple alternative approaches and see what works best.

Thanks for the help,

Erich