The answer is in Appendix A of the programming guide: the magic number is 512 threads per block. Be aware that there are also register and shared memory limits, which can prevent even a block of 512 or fewer threads from launching.
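For reference, those limits can be queried at runtime. A minimal sketch (my addition, not from the original post) that prints the relevant cudaDeviceProp fields:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);           // 512 on compute 1.x
    printf("regsPerBlock:       %d\n", prop.regsPerBlock);                 // register file per SM
    printf("sharedMemPerBlock:  %u\n", (unsigned)prop.sharedMemPerBlock); // 16 KB on compute 1.x
    return 0;
}

A launch that exceeds the register or shared memory budget fails with cudaErrorLaunchOutOfResources, even at 512 or fewer threads per block.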
In my experiments with a GTX 285 (30 SMs) and a GT 120 (4 SMs), 512 was consistently the optimal number of threads per block.
I was also trying to find the optimal number of blocks, which has a dramatic impact on the performance of my test application. A multiple of the number of SMs seemed like it would be ideal, but on both of these devices 20 blocks appears to be optimal.
Attached is the source to a program that adds 1,000,000 floating-point numbers and tries various block configurations. I've not had much luck getting anyone on a non-Mac platform to run it for me, but I'd love to find out whether other 30-SM cards exhibit the same performance characteristics.
There is a spreadsheet in the file that shows the performance at block counts from 4 to 200. Below 4 blocks it was far slower, so I didn't include those points in the graph.
Each block executes on a single SM, so if you have 30 SMs and only 20 blocks, 10 SMs sit idle. On top of that you never have more than one block per SM, which is bad for hiding latency.
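For example, a sketch of sizing the grid from the SM count (my illustration; blocksPerSM is just a heuristic knob, and sum_kernel and its arguments are placeholder names, not from the attached code):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int blocksPerSM = 2;  // at least 2 resident blocks per SM helps hide latency
int numBlocks = prop.multiProcessorCount * blocksPerSM;  // 60 on a 30-SM GTX 285
sum_kernel<<<numBlocks, 256>>>(d_data, d_partial, num);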
cnicotra, I must say your code and your results look weird. I don't fully understand what happens there, but you seem to be adding numbers in loops in a serial fashion. You actually penalize smaller blocks by doing:
n = num / blockDim.x; //n is higher if blocks are smaller
and then
n2 = n >> 2;              // unroll by 4: n2 iterations of 4 loads each
for (i = 0; i < n2; i++)
{
    sum += *d_data;       // four dependent, strided global loads
    d_data += blockDim.x; // per iteration; stride = blockDim.x elements
    sum += *d_data;
    d_data += blockDim.x;
    sum += *d_data;
    d_data += blockDim.x;
    sum += *d_data;
    d_data += blockDim.x;
}
This causes loads of serial global memory accesses. Evil.
I.e. the smaller the blocks you use, the longer the loops get and the more work each thread has, and all of that work is serial.
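The usual fix is a grid-stride loop, where the whole grid, not a single block, covers the array, so adding blocks shortens each thread's loop instead of lengthening it. A rough sketch (placeholder names, not the attached code):

__global__ void sum_kernel(const float *d_data, float *d_partial, int num)
{
    int idx    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;   // total threads in the grid
    float sum  = 0.0f;
    // Adjacent threads read adjacent addresses (coalesced), and the
    // per-thread loop shrinks as the grid grows.
    for (int i = idx; i < num; i += stride)
        sum += d_data[i];
    // ... combine the per-thread sums within the block as before ...
}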
It’s definitely NOT optimal to have 20 blocks and 512 threads per block all the time; IIRC you’ve posted this in a different thread before. If you have fewer blocks than multiprocessors, as you do on a GTX 285, some MPs are idling! If your app really gets slower with other configurations, that’s a good indication you haven’t understood how CUDA parallelism works and have structured it the wrong way.
512 threads per block can be wasteful on compute 1.0 and 1.1 cards, which can run up to 768 threads per MP. One block of 512 threads occupying an MP leaves the remaining 256 thread slots unused, which means low occupancy (not necessarily low performance, but still). Remember, blocks can’t be split between MPs: multiple blocks can be resident on one MP, but a single block cannot be distributed across several MPs.
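As a worked example of that arithmetic (my addition; both counts are under the 8-blocks-per-MP residency cap of compute 1.x):

// Compute 1.0/1.1: up to 768 threads resident per MP.
int blocks512 = 768 / 512;  // = 1 resident block  -> 512 active threads (67% occupancy)
int blocks256 = 768 / 256;  // = 3 resident blocks -> 768 active threads (100% occupancy)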
By the way, those are two different questions.
Besides, there’s no overhead in having more threads than the GPU has processors. In fact, it’s recommended. Use as many as you like; the more the better (in most cases).
Since CUDA has a limit of 512 threads per block, I assume that means there is no way to use the hardware’s capability of running up to 768 threads per MP?
Interesting that the ratio between 768 and 512 is 1.5, so maybe that explains why 20 is the optimal number of blocks (20 × 1.5 = 30 SMs)?
Unfortunately, I don’t understand the comment about the main loop causing serialization. What would be the alternative? I tried approaches like folding in a binary fashion, as I did to add up the results from the threads, but after tweaking the code to use the partially unrolled loop above, the unrolled version ended up being much faster.
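For what it’s worth, the standard pattern combines both ideas: a coalesced grid-stride accumulation phase followed by a binary fold in shared memory. A minimal sketch (placeholder names; assumes blockDim.x is a power of two):

__global__ void reduce_kernel(const float *d_data, float *d_block_sums, int num)
{
    extern __shared__ float s[];

    // Phase 1: each thread accumulates a coalesced, grid-strided slice.
    int idx    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float sum  = 0.0f;
    for (int i = idx; i < num; i += stride)
        sum += d_data[i];
    s[threadIdx.x] = sum;
    __syncthreads();

    // Phase 2: binary fold in shared memory; log2(blockDim.x) steps.
    for (int half = blockDim.x / 2; half > 0; half /= 2) {
        if (threadIdx.x < half)
            s[threadIdx.x] += s[threadIdx.x + half];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        d_block_sums[blockIdx.x] = s[0];  // one partial sum per block
}

// launch: reduce_kernel<<<numBlocks, 256, 256 * sizeof(float)>>>(d_data, d_block_sums, num);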
Interestingly, 50 blocks, which means even smaller per-block workloads, performs almost as well as 20.
I really appreciate your help and feedback. Thanks!!! :-)
Yes, there is. For example, if you use 256 threads per block and have enough blocks, an SM can have 3 active blocks simultaneously, resulting in 768 concurrently active threads.
No. There’s no way a block is split between two SMs, so an SM can’t process 1.5 blocks.
I bet you do :) But that’s because the code you wrote is not suited for the speed test you’re trying to perform.
If you run your program through the visual profiler, you will see that the number of uncoalesced loads/stores as well as the number of divergent branches are dependent on the grid size.
In my ignorance, I think I’ve dealt with the uncoalesced issue. Originally I had each thread working on a different chunk of memory; when I moved to interleaving them, things got an order of magnitude faster.
I have the project loaded in the visual profiler, but I can’t figure out how to view the uncoalesced loads/stores.
A maximum of 1024 threads (768 on older GPUs) can be scheduled per multiprocessor at any given time; that’s over 30,000 threads (30 × 1024 = 30,720) on a Tesla 10/GTX 280. You should treat this number as the MINIMUM number of threads to aim for, as that only just covers all the pipelines.
The reason the number of blocks had such a huge impact on performance, and the times swung so wildly, turned out to be the alignment of the data each block started with. By forcing each block to start reading from a 64-byte boundary, performance became much more consistent across different numbers of blocks. 20 blocks is still optimal for this case, but 30 blocks is very close. At low block counts above 10 there is some minor variation in performance, and as the number of blocks increases above 90, performance becomes fairly constant.
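A host-side sketch of that alignment trick (my reconstruction, not the actual attached code):

// Round each block's chunk up to a multiple of 16 floats (64 bytes), so
// every block's first load starts on a 64-byte line; pointers returned by
// cudaMalloc are themselves at least 64-byte aligned.
int chunk        = (num + numBlocks - 1) / numBlocks;  // elements per block
int alignedChunk = (chunk + 15) & ~15;                 // round up to 16 floats
sum_kernel<<<numBlocks, threadsPerBlock>>>(d_data, d_partial, num, alignedChunk);
// Inside the kernel, block b reads from d_data + b * alignedChunk and
// clamps its element count so the last block doesn't run past num.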
The Blocks.gif image is an Excel plot of the performance at various block counts.