I’m trying to put together a summary of what is contained within a multiprocessor with Compute Capability 1.3.

Point 10 of the CUDA FAQ states: “The Tesla C1060 consists of 30 multiprocessors, each of which is comprised of 8 scalar processor cores, for a total of 240 processors. There is 16KB of shared memory per multiprocessor. Each processor has a floating point unit which is capable of performing a scalar multiply-add and a multiply, and a ‘superfunc’ operation (such as rsqrt or sin/cos) per clock cycle.” I read this as saying there are 8 “superfunc” units per multiprocessor, one for each of the 8 scalar processors. However, section 4.1 of the Programming Guide states: “A multiprocessor consists of eight Scalar Processor (SP) cores, two special function units for transcendentals, a multithreaded instruction unit, and on-chip shared memory.” This, together with the throughputs given in section 5.1.1.1, suggests there are 2, not 8, “superfunc” units per multiprocessor. Which one is correct? Or, if both are correct, how do the two descriptions differ?
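
To make the two readings concrete, here is a rough sketch of the arithmetic as I understand it. It assumes a warp of 32 threads and that each unit handles one thread per clock, which is my simplification rather than anything the documents state. The function and constant names are made up for illustration.

```cpp
// Back-of-the-envelope arithmetic only; not a model of the real hardware.
constexpr int kWarpSize = 32;               // threads per warp (assumed)
constexpr int kMultiprocessors = 30;        // Tesla C1060, per the FAQ
constexpr int kCoresPerMultiprocessor = 8;  // scalar processors, per the FAQ

// Total scalar processor cores on the device.
constexpr int total_cores() {
    return kMultiprocessors * kCoresPerMultiprocessor;
}

// Clock cycles needed to push one instruction through a full warp,
// ASSUMING each unit processes one thread per clock (hypothetical).
constexpr int cycles_per_warp(int units_per_multiprocessor) {
    return (kWarpSize + units_per_multiprocessor - 1) / units_per_multiprocessor;
}

// Sanity check against the FAQ's "240 processors".
static_assert(total_cores() == 240, "30 multiprocessors x 8 cores");
```

Under these assumptions, `cycles_per_warp(8)` evaluates to 4 and `cycles_per_warp(2)` to 16, so the two readings predict a 4x difference in “superfunc” throughput. Comparing those numbers with the per-warp cycle counts in section 5.1.1.1 is what leads me to suspect the 2-unit description.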

Section 5.1.1.1 of the Programming Guide states: “sinf(x), cosf(x), tanf(x), sincosf(x) and corresponding double-precision instructions are much more expensive and even more so if the absolute value of x needs to be reduced.” Is there a “superfunc” unit for double-precision transcendental operations? If not, how are they handled, and what is their throughput per multiprocessor per clock cycle?

The Overview section in the thread about the CUDA Occupancy Calculator ( http://forums.nvidia.com/index.php?showtopic=31279 ) says: “Each multiprocessor on the device has a set of N registers available for use by CUDA thread programs. These registers are a shared resource that are allocated among the thread blocks executing on a multiprocessor.” However, Figure 4.2 in the CUDA Programming Guide depicts one set of registers attached to each of the M processors. Are the registers contiguous (in terms of silicon), or are they separately connected to each processor, as depicted in Figure 4.2? If they are separately connected, can Processor 1 use Processor 2’s registers?
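
For reference, this is how I currently picture the “shared resource” description: a single pool of N registers per multiprocessor, divided among resident thread blocks. The sketch below is only that mental model. The value 16384 for N is my assumption for Compute Capability 1.3, the function name is made up, and it ignores allocation granularity and the other limits (shared memory, maximum resident blocks and threads) that the Occupancy Calculator also applies.

```cpp
// Simplified mental model of register sharing; not the real allocator.
constexpr int kRegistersPerMultiprocessor = 16384;  // ASSUMED N for compute capability 1.3

// How many thread blocks fit on one multiprocessor if registers were the
// only limit and were handed out from one pool with no rounding (hypothetical).
constexpr int blocks_limited_by_registers(int regs_per_thread, int threads_per_block) {
    return kRegistersPerMultiprocessor / (regs_per_thread * threads_per_block);
}
```

For example, `blocks_limited_by_registers(16, 256)` evaluates to 4 and `blocks_limited_by_registers(32, 256)` to 2. If the registers are instead physically split per processor as Figure 4.2 suggests, I would like to know whether this pooled view still holds.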
Thank you.