GTX460 number of multiprocessors

Hello everyone,

There are two things I would like to discuss:

  1. I would like to ask why the GTX460 is not mentioned in Appendix A of the CUDA C Programming Guide v3.1.1. Furthermore, I have read that in the 400 series each SM has 32 cores (e.g. the GTX480 has 15 SMs with 32 CUDA cores/SM, thus 15 × 32 = 480 CUDA cores). However, if I apply the same calculation to the GTX460, which has 7 SMs (an 8th one is disabled, as far as I have read), then there should be 7 × 32 = 224 CUDA cores. But according to the specifications there are 336 cores, so each SM should have 336 / 7 = 48 cores instead of 32. Where am I going wrong? :unsure:

  2. In the GTX200 series, the GPUs have many SMs with few CUDA cores in each SM. In the GTX400 series the opposite holds, i.e. there are few SMs with many CUDA cores in each SM. I think this has an impact on the performance of the same code when it is executed on GPUs of these two series. For example, I had already developed a CUDA C code and run it on a GTX275. Here are the occupancy results based on Visual Profiler 3.1.1:

Kernel details : Grid size: 24576 x 22, Block size: 64 x 1 x 1
Register Ratio = 0.25 ( 4096 / 16384 ) [4 registers per thread]
Shared Memory Ratio = 0.5 ( 8192 / 16384 ) [796 bytes per Block]
Active Blocks per SM = 8 : 8
Active threads per SM = 512 : 1024
Occupancy = 0.5 ( 16 / 32 )
Achieved occupancy = 0.5 (on 30 SMs)
Occupancy limiting factor = Block-Size
Warning: Grid Size (540672) is not a multiple of available SMs (30).

I ran the same kernel on a GTX460, and here are the occupancy results:

Kernel details : Grid size: 24576 x 22, Block size: 64 x 1 x 1
Register Ratio = 0.25 ( 8192 / 32768 ) [12 registers per thread]
Shared Memory Ratio = 0.166667 ( 8192 / 49152 ) [772 bytes per Block]
Active Blocks per SM = 8 : 8
Active threads per SM = 512 : 1024
Occupancy = 0.5 ( 16 / 32 )
Achieved occupancy = 0.5 (on 7 SMs)
Occupancy limiting factor = Block-Size
Warning: Grid Size (540672) is not a multiple of available SMs (7).

First, let me clarify that I am certainly not an expert in CUDA programming, but what I observed when timing the same kernel on both GPUs is that the GTX275 processes the data approximately two times faster. :blink: I know that my kernel accesses external memory heavily and that the GTX275 has higher memory throughput than the GTX460 (127 GB/s versus 115.2 GB/s respectively), but I don’t think that is the only reason for this strange performance.

Any suggestions are great. Thank you very much and sorry for my long post.

Regards,
dtheodor

For your first question, what results do you see if you run deviceQuery from the SDK?
For the second one, what results do you see if you compile the code with the option -arch sm_13 and run the program on the GTX460?
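
The SM count that deviceQuery reports can also be read directly with the runtime API; here is a minimal sketch (note that the runtime reports the number of SMs, not the cores per SM, which vary by architecture):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    // multiProcessorCount is the number of SMs; the cores per SM depend
    // on the architecture (e.g. 32 on GF100, 48 on GF104) and are not
    // reported directly by the runtime.
    printf("Device: %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Number of SMs: %d\n", prop.multiProcessorCount);
    return 0;
}
```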

Simple. CUDA 3.1 came out before the GTX 460.

Check the hardware review sites (like AnandTech): they discuss the architecture changes from GF100 to GF104, one of which is the jump to 48 CUDA cores per SM. That accounts for the GTX460’s 7 × 48 = 336 cores.

Hard to tell without specifics on your code. I suggest that you first experiment with the block size and see how your performance changes. In my application, getting the wrong block size leads to a 50% performance degradation. And no, the “right” block size is not the one with the highest occupancy - it must be empirically determined.
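
A block-size sweep like the one suggested above could be sketched as follows, timing each configuration with CUDA events (the kernel and problem size here are placeholders, not the poster’s actual code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real workload.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Try several block sizes and time each launch.
    for (int block = 32; block <= 512; block *= 2) {
        int grid = (n + block - 1) / block;  // round up to cover all elements
        cudaEventRecord(start);
        myKernel<<<grid, block>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block size %3d: %.3f ms\n", block, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

The fastest block size found this way is often not the one with the highest theoretical occupancy, which is exactly the point made above.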

Hello,

First of all I would like to thank everyone for replying. Unfortunately I didn’t have access to the GTX460 until today, that’s why I couldn’t reply before. :)

Indeed the GTX460 has 48 cores per SM instead of the 32 that the other, more powerful 400-series GPUs have. In order to test my system, I ran the official Final Fantasy XIV benchmark from this link. Strangely, with the GTX275 I got a score of 3817, while with the GTX460 I got 3716. :blink:

It seems that the GTX275 is faster because of its better memory bandwidth, although it has fewer CUDA cores than the GTX460 (240 against 336 cores). Which means, and correct me if I am wrong, that newer GPUs with more CUDA cores are not always faster than older cards with fewer cores…

Once again thank you very much, and if anyone has any additional comments on that, they are most welcome.

Regards,

dtheodor

The GTX275 has 80 TMUs, while the GTX460 has 56…
Having fewer texture units is often a weak point of the Fermi architecture in video games.

Hello ED1980,

So this means that the lower score for the GTX460 could be due only to the fewer TMUs, or also to something else, like the lower memory bandwidth? The reason I ask is that, as mentioned above, I find it strange that my CUDA application performs better on a GTX275 than on a GTX460. So I decided to run the Final Fantasy benchmark to test how a real program behaves on these GPUs. And it seems that, just like my program, the benchmark behaves the same way, i.e. the GTX275 is faster than the GTX460.

Regards,

dtheodor

Hello dtheodor

The Final Fantasy benchmark, being a gaming application, can depend greatly on the number of TMUs, especially at high resolutions. A CUDA program, however, depends more on how it is tuned: code written for the GTX275 architecture (compute capability 1.x) needs to be adapted to the Fermi architecture (compute capability 2.x) for best performance. On top of that, your program depends on memory bandwidth, which is smaller on the GTX460 than on the GTX275…

Hello ED1980,

First of all I would like to thank you and everyone else taking part in this topic.

What I would expect (as a rather beginner CUDA user :) ) is that the GPU with more CUDA cores performs better than the one with fewer cores. I have read the Fermi Tuning and Compatibility Guides to see if any significant code changes are required. In order to adapt my program to the Fermi architecture, I did the following:

  1. added the following option to the command line in Visual Studio, in order to target compute capability 2.x during compilation:

-gencode=arch=compute_20,code=sm_20 -gencode=arch=compute_20,code=compute_20

  2. performed tests configuring the on-chip memory as either “cache preferred” or “shared memory preferred”, using the cudaFuncSetCacheConfig function.
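
For reference, step 2 could look roughly like this (the kernel name and launch configuration are placeholders, not the poster’s actual code):

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real workload.
__global__ void myKernel(float *data) {
    data[threadIdx.x] *= 2.0f;
}

int main() {
    // On Fermi, each SM has 64 KB of on-chip memory that can be split as
    // 48 KB L1 / 16 KB shared ("cache preferred") or
    // 16 KB L1 / 48 KB shared ("shared memory preferred").
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
    // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    float *d_data;
    cudaMalloc(&d_data, 64 * sizeof(float));
    myKernel<<<1, 64>>>(d_data);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

Timing the kernel under both settings shows which split suits the workload; kernels that use little shared memory usually benefit from the larger L1.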

I don’t use any double precision calculations, and I hope I use the correct compilation options, as shown above. Are there any other program changes required to adapt to the Fermi architecture? :huh:

Regards,

dtheodor

Hello dtheodor

your problem is, I quote:
results based on Visual Profiler 3.1.1:
Block size: 64 x 1 x 1
Occupancy limiting factor = Block-Size

Are you using CUDA Toolkit 3.2? Only CUDA Toolkit 3.2 officially provides full support for the GTX460…