GTX460 number of multiprocessors

Hello everyone,

There are two things I would like to discuss:

  1. I would like to ask why the GTX460 is not mentioned in Appendix A of the CUDA C Programming Guide v3.1.1. Furthermore, I have read that in the 400 series each SM has 32 cores (e.g. the GTX480 has 15 SMs with 32 CUDA cores/SM, thus 15 × 32 = 480 CUDA cores). However, if I apply the same calculation to the GTX460, which has 7 SMs (an 8th one is disabled, as far as I have read), then there should be 7 × 32 = 224 CUDA cores. But according to the specifications there are 336 cores, so each SM should have 336 / 7 = 48 cores instead of 32. Where am I going wrong? :unsure:

  2. In the GTX200 series, the GPUs have many SMs with few CUDA cores in each SM. In the GTX400 series the opposite holds, i.e. there are few SMs with many CUDA cores in each SM. I think this has an impact on the performance of the same code when it is executed on GPUs of these two series. For example, I had already developed a CUDA C code and run it on a GTX275. Here are the occupancy results based on Visual Profiler 3.1.1:

Kernel details : Grid size: 24576 x 22, Block size: 64 x 1 x 1
Register Ratio = 0.25 ( 4096 / 16384 ) [4 registers per thread]
Shared Memory Ratio = 0.5 ( 8192 / 16384 ) [796 bytes per Block]
Active Blocks per SM = 8 : 8
Active threads per SM = 512 : 1024
Occupancy = 0.5 ( 16 / 32 )
Achieved occupancy = 0.5 (on 30 SMs)
Occupancy limiting factor = Block-Size
Warning: Grid Size (540672) is not a multiple of available SMs (30).

I ran the same kernel on a GTX460, and here are the occupancy results:

Kernel details : Grid size: 24576 x 22, Block size: 64 x 1 x 1
Register Ratio = 0.25 ( 8192 / 32768 ) [12 registers per thread]
Shared Memory Ratio = 0.166667 ( 8192 / 49152 ) [772 bytes per Block]
Active Blocks per SM = 8 : 8
Active threads per SM = 512 : 1024
Occupancy = 0.5 ( 16 / 32 )
Achieved occupancy = 0.5 (on 7 SMs)
Occupancy limiting factor = Block-Size
Warning: Grid Size (540672) is not a multiple of available SMs (7).

First, let me clarify that I am certainly not an expert in CUDA programming, but what I observed when timing the same kernel on both GPUs is that the GTX275 processes the data approximately two times faster. :blink: I know that my kernel accesses external memory heavily and that the GTX275 has higher memory throughput than the GTX460 (127 GB/s versus 115.2 GB/s respectively), but I don’t think that is the only reason for this strange performance.

Any suggestions are great. Thank you very much and sorry for my long post.

Regards,
dtheodor

For your first question, what results do you see if you run deviceQuery from the SDK?
For the second one, what results do you see if you compile the code with the option -arch sm_13 and run the program on the GTX460?
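
The SM count that deviceQuery reports can also be read directly with the runtime API; here is a minimal sketch (note that the runtime reports the number of SMs, not the cores per SM, which vary by architecture):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    // multiProcessorCount is the number of SMs; the cores per SM depend
    // on the architecture (e.g. 32 on GF100, 48 on GF104) and are not
    // reported directly by the runtime.
    printf("Device: %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Number of SMs: %d\n", prop.multiProcessorCount);
    return 0;
}
```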

Simple. CUDA 3.1 came out before the GTX 460.

Check the hardware review sites (like AnandTech): they discuss the architecture changes from GF100 to GF104, one of which is the jump to 48 CUDA cores per SM. That accounts for the GTX460’s 7 × 48 = 336 cores.

Hard to tell without specifics on your code. I suggest that you first experiment with the block size and see how your performance changes. In my application, getting the wrong block size leads to a 50% performance degradation. And no, the “right” block size is not the one with the highest occupancy - it must be empirically determined.
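
A block-size sweep like the one suggested above could be sketched as follows, timing each configuration with CUDA events (the kernel and problem size here are placeholders, not the poster’s actual code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real workload.
__global__ void myKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Try several block sizes and time each launch.
    for (int block = 32; block <= 512; block *= 2) {
        int grid = (n + block - 1) / block;  // round up to cover all elements
        cudaEventRecord(start);
        myKernel<<<grid, block>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block size %3d: %.3f ms\n", block, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

The fastest block size found this way is often not the one with the highest theoretical occupancy, which is exactly the point made above.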

Hello,

First of all I would like to thank everyone for replying. Unfortunately I didn’t have access to the GTX460 until today, that’s why I couldn’t reply before. :)

Indeed the GTX460 has 48 cores per SM instead of the 32 that the other, more powerful 400-series GPUs have. In order to test my system, I ran the official Final Fantasy XIV benchmark from this link. Strangely, with the GTX275 I got a score of 3817, while with the GTX460 I got 3716. :blink:

It seems that the GTX275 is faster because of its better memory bandwidth, although it has fewer CUDA cores than the GTX460 (240 against 336 cores). Which means, and correct me if I am wrong, that newer GPUs with more CUDA cores are not always faster than older cards with fewer cores…

Once again thank you very much, and if anyone has any additional comments on that, they are most welcome.

Regards,

dtheodor

The GTX275 has 80 TMUs, while the GTX460 has 56…
Having fewer texture units is often a weak point of the Fermi architecture in video games.

Hello ED1980,

So this means that the lower score for the GTX460 could be due only to the fewer TMUs, or also to something else, like the lower memory bandwidth? The reason I ask is that, as mentioned above, I find it strange that my CUDA application performs better on a GTX275 than on a GTX460. So I decided to run the Final Fantasy benchmark to test how a real program behaves on these GPUs. And it seems that, just like my program, the benchmark behaves the same way, i.e. the GTX275 is faster than the GTX460.

Regards,

dtheodor

Hello dtheodor

The Final Fantasy benchmark, being a gaming application, can depend greatly on the number of TMUs, especially at high resolutions. A CUDA program, however, depends more on how it is tuned: code written for the GTX275 architecture (compute capability 1.x) needs to be adapted to the Fermi architecture (compute capability 2.x) for best performance. On top of that, your program depends on memory bandwidth, which is smaller on the GTX460 than on the GTX275…

Hello ED1980,

First of all I would like to thank you and everyone else taking part in this topic.

What I would expect (as a rather beginner CUDA user :) ) is that the GPU with more CUDA cores performs better than the one with fewer cores. I have read the Fermi Tuning and Compatibility Guides to see if any significant code changes are required. In order to adapt my program to the Fermi architecture, I did the following:

  1. added the following option to the command line in Visual Studio, in order to target compute capability 2.x during compilation:

-gencode=arch=compute_20,code=sm_20 -gencode=arch=compute_20,code=compute_20

  2. performed tests configuring the on-chip memory as either “cache preferred” or “shared memory preferred”, using the cudaFuncSetCacheConfig function.
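
For reference, step 2 could look roughly like this (the kernel name and launch configuration are placeholders, not the poster’s actual code):

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real workload.
__global__ void myKernel(float *data) {
    data[threadIdx.x] *= 2.0f;
}

int main() {
    // On Fermi, each SM has 64 KB of on-chip memory that can be split as
    // 48 KB L1 / 16 KB shared ("cache preferred") or
    // 16 KB L1 / 48 KB shared ("shared memory preferred").
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
    // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    float *d_data;
    cudaMalloc(&d_data, 64 * sizeof(float));
    myKernel<<<1, 64>>>(d_data);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

Timing the kernel under both settings shows which split suits the workload; kernels that use little shared memory usually benefit from the larger L1.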

I don’t use any double precision calculations, and I hope I use the correct compilation options, as shown above. Are there any other program changes required to adapt to the Fermi architecture? :huh:

Regards,

dtheodor

Hello dtheodor

your problem is, I quote:
results based on Visual Profiler 3.1.1:
Block size: 64 x 1 x 1
Occupancy limiting factor = Block-Size

Are you using CUDA Toolkit 3.2? Only CUDA Toolkit 3.2 officially provides full support for the GTX460…