how many threads to hide latency on Fermi? the number in the NVIDIA manual is 2x off (?)

On G80/GT200 it was recommended to run at least 192 threads per multiprocessor to hide the 24-cycle arithmetic pipeline latency. The new Best Practices Guide recommends running at least 2x more threads on Fermi, i.e. 384.

I think this is a mistake. It is not about dual issue, but about the number of thread processors: if you have 4x more thread processors and the same latency, you need 4x more threads.

8 thread processors and 24 pipeline stages on G80/GT200 required 8×24 = 192 threads. 32 thread processors and 24 pipeline stages on GF100 require 32×24 = 768 threads.

Experimentally, I find that 576 threads is enough to get the single-precision multiply-add peak on GTX480. (Less if using in-thread parallelism.) So, I’d speculate that the latency is only 18 cycles.
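
For reference, here is a minimal sketch of this kind of microbenchmark (not the exact code behind the numbers above; the kernel name, the iteration count and the hard-coded 15 SMs of a GTX480 are assumptions of the sketch). Each thread runs a long chain of dependent single-precision multiply-adds, and the sweep varies the number of resident threads per SM:

// Latency-hiding sketch: dependent FMA chain per thread, sweep threads per SM.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fma_chain(float *out, float a, float b, int iters)
{
    float x = threadIdx.x * 0.001f;
    for (int i = 0; i < iters; ++i)
        x = x * a + b;                                  // dependent multiply-add chain
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;     // keep the result live
}

int main()
{
    const int iters = 1 << 16;
    const int block = 192;                  // threads per block
    const int smCount = 15;                 // GTX480 has 15 SMs (assumption of the sketch)
    for (int blocksPerSM = 1; blocksPerSM <= 6; ++blocksPerSM) {
        int blocks = smCount * blocksPerSM;
        float *out;
        cudaMalloc(&out, blocks * block * sizeof(float));

        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);
        cudaEventRecord(t0);
        fma_chain<<<blocks, block>>>(out, 1.0001f, 0.5f, iters);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);

        float ms;
        cudaEventElapsedTime(&ms, t0, t1);
        double flops = 2.0 * iters * (double)blocks * block;   // 1 MAD = 2 flops
        printf("%4d threads/SM: %.1f GFLOP/s\n",
               blocksPerSM * block, flops / (ms * 1e6));

        cudaEventDestroy(t0);
        cudaEventDestroy(t1);
        cudaFree(out);
    }
    return 0;
}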

Please correct me if I am wrong.

Vasily

Hi,

Do you mean 576 threads per block? I never tried it with Fermi, since the Occupancy Calculator gives an error when I enter such a value for CC 2.0.

Is that a mistake in the Occupancy calculator?

thanks

eyal

But because there are 4 times the number of cores and 2 warp schedulers instead of one, won't this result in twice the number of threads per block, but with 2 blocks per MP to hide latency? I think that because of the 2 schedulers only half the cores of an MP are used for one block, while the other half executes the threads of another block under the other scheduler. This would result in 384 threads, so could this be the right way to go?

Yes, I mean 576 threads per block. There must be a bug in the Occupancy Calculator: the CUDA Programming Guide 3.1 says that “On current GPUs, a thread block may contain up to 1024 threads”.

You would still get 2×384 = 768 threads per multiprocessor in total, i.e. 50% occupancy, not the 25% that the NVIDIA manual cites. But my understanding is that warps within a single thread block are distributed across the two warp schedulers. This would explain the wiggling in the graph, for example.
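
To spell out the arithmetic, here is a tiny host-side sketch (assuming a toolkit whose cudaDeviceProp exposes maxThreadsPerMultiProcessor, which is 1536 on compute capability 2.0):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int threadsPerBlock = 384;
    const int blocksPerSM     = 2;      // two resident blocks per multiprocessor
    const int residentThreads = threadsPerBlock * blocksPerSM;   // 768

    // 768 / 1536 = 50% on compute capability 2.0
    printf("occupancy = %d / %d = %.0f%%\n",
           residentThreads, prop.maxThreadsPerMultiProcessor,
           100.0 * residentThreads / prop.maxThreadsPerMultiProcessor);
    return 0;
}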

According to the manual, one scheduler handles even-numbered warps and the other handles odd-numbered warps, so an odd number of warps is not efficient.

My own test on GF104 shows that for both a single MAD and MAD+MAD (since GF104 can do “superscalar” issue), the best performance comes at around 24 warps, and it is already very near the peak at 20 warps.
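
A sketch of what such a MAD+MAD kernel might look like (not the poster’s actual code; the kernel name and initial values are illustrative): two independent multiply-add chains per thread, so each warp exposes a second instruction that the GF104 schedulers can dual-issue.

__global__ void fma_chain2(float *out, float a, float b, int iters)
{
    float x = threadIdx.x * 0.001f;
    float y = x + 1.0f;
    for (int i = 0; i < iters; ++i) {
        x = x * a + b;                                  // chain 1
        y = y * a + b;                                  // chain 2, independent of chain 1
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x + y; // keep both chains live
}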

Are these MADs in MAD+MAD independent? Try MAD+MAD+MAD+MAD, all independent - you might be surprised to see that fewer warps are needed to get the best performance. At least, this happens on my GTX480, GTX280 and 8800GTX.
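
A sketch of the suggested variant (again illustrative, not anyone’s actual benchmark code): four independent multiply-add chains per thread, so each warp exposes 4-way instruction-level parallelism and fewer warps are needed to cover the pipeline latency.

__global__ void fma_chain4(float *out, float a, float b, int iters)
{
    float x0 = threadIdx.x * 0.001f, x1 = x0 + 1.0f;
    float x2 = x0 + 2.0f,            x3 = x0 + 3.0f;
    for (int i = 0; i < iters; ++i) {
        x0 = x0 * a + b;                                // four independent chains:
        x1 = x1 * a + b;                                // the hardware can overlap
        x2 = x2 * a + b;                                // them within a single warp
        x3 = x3 * a + b;
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = (x0 + x1) + (x2 + x3);
}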

The two MADs in MAD+MAD are independent, but they are there to test the superscalar issue of GF104 (which can, in theory, issue up to three MADs from two warps per cycle). So they can’t reduce the number of warps needed to hide ALU latency (at least not by much).

Hi Vasily,

You are correct… 384 threads is too low. 576 to 768 threads is more reasonable for exactly the reasons you all have pointed out. Thanks for calling our attention to this! I have corrected the text in the Best Practices Guide.

Note: The actual latency varies somewhat across different types of instructions, so having more active threads is better as a general rule.

Regards,
Cliff
