Questions about differences between CUDA FAQ 2.1 and Programming Guide 2.3

I’m trying to put together a summary of what is contained within a multiprocessor with Compute Capability 1.3.

  1. Within the CUDA FAQ, under point 10, it states: "The Tesla C1060 consists of 30 multiprocessors, each of which is comprised of 8 scalar processor cores, for a total of 240 processors. There is 16KB of shared memory per multiprocessor. Each processor has a floating point unit which is capable of performing a scalar multiply-add and a multiply, and a 'superfunc' operation (such as rsqrt or sin/cos) per clock cycle." My interpretation of this is that there are 8 "superfunc" units per multiprocessor, one for each of the 8 scalar processors. However, the Programming Guide states in section 4.1: "A multiprocessor consists of eight Scalar Processor (SP) cores, two special function units for transcendentals, a multithreaded instruction unit, and on-chip shared memory." This, and the throughputs given in section 5.1.1.1, suggest there are 2, rather than 8, "superfunc" units per multiprocessor. Which one is correct, or, if both are correct, what is the difference between the two?

  2. The Programming Guide states in section 5.1.1.1: "sinf(x), cosf(x), tanf(x), sincosf(x) and corresponding double-precision instructions are much more expensive and even more so if the absolute value of x needs to be reduced." Is there a "superfunc" unit for double-precision transcendental operations? If not, how are they handled, and what is their throughput per multiprocessor per clock cycle?

  3. The Overview section in the thread about the CUDA Occupancy Calculator ( http://forums.nvidia.com/index.php?showtopic=31279 ) says: "Each multiprocessor on the device has a set of N registers available for use by CUDA thread programs. These registers are a shared resource that are allocated among the thread blocks executing on a multiprocessor." However, Figure 4.2 in the CUDA Programming Guide depicts one set of registers attached to each of M processors. Are the registers contiguous (in terms of silicon), or are they separately connected to each processor, as depicted in Figure 4.2? If they are separately connected, can Processor 1 use Processor 2's registers?

Thank you.

Good questions, answers below:

The programming guide is correct, there are two special function units per multiprocessor. I will fix the FAQ.
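
For what it's worth, the special function units are the hardware behind the fast-math intrinsics, so you can see the two paths side by side in a kernel. This is just an illustrative sketch (the kernel and variable names are made up), but `__sinf` and `sinf` are real CUDA device functions:

```cuda
#include <cuda_runtime.h>

// Each thread evaluates the same input two ways: the __sinf()
// intrinsic maps to a single special-function-unit instruction
// (fast, reduced accuracy), while sinf() expands to a software
// routine of ordinary SP instructions plus argument reduction.
__global__ void sfu_vs_software(const float *in, float *fast_out,
                                float *accurate_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        fast_out[i]     = __sinf(in[i]);  // SFU path
        accurate_out[i] = sinf(in[i]);    // software path
    }
}
```

Compiling with nvcc's -use_fast_math flag replaces calls like sinf() with their intrinsic counterparts throughout the file.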

The double precision unit does just mul/fma/add, so most of the special functions are implemented in software. I don’t have any performance numbers, but it wouldn’t be hard to test.
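
If anyone wants to test it, a minimal timing harness with CUDA events would look something like this (the kernel name and problem size are placeholders, and the inputs are just zero-filled, so treat the result as a rough throughput number):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Double-precision sin() on compute 1.3 is a software routine,
// so this kernel's runtime reflects that routine's cost.
__global__ void dp_sin_kernel(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sin(in[i]);
}

int main()
{
    const int n = 1 << 20;
    double *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(double));
    cudaMalloc(&d_out, n * sizeof(double));
    cudaMemset(d_in, 0, n * sizeof(double));  // sin(0) for every thread

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    dp_sin_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d double-precision sin() calls in %.3f ms\n", n, ms);

    // Per-multiprocessor-per-clock throughput can then be derived
    // from the device's clock rate and multiprocessor count.
    return 0;
}
```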

That diagram is a bit misleading, there is one register file per MP, shared between the processors.
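
You can query the size of that per-MP register file at runtime with cudaGetDeviceProperties; on compute 1.x devices the per-block register limit it reports equals the whole multiprocessor's register file, since one block may use all of it:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0

    // regsPerBlock is the number of 32-bit registers a block may use,
    // which on compute 1.x is the full per-multiprocessor register file.
    printf("Multiprocessors:       %d\n",        prop.multiProcessorCount);
    printf("Registers per MP:      %d\n",        prop.regsPerBlock);
    printf("Shared memory per MP:  %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```

On a Tesla C1060 this should report 30 multiprocessors, 16384 registers, and 16KB of shared memory, matching the figures discussed above.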

Thank you, I will update my research accordingly. :)