The FFT numbers are quite impressive. Is this implementation significantly faster than CUFFT 3.0? For comparison, when I run an FFT using the CUDA plug-in for Matlab’s fft2, I only get about a 3x speedup for pretty large problem sizes (e.g. 4096 by 4096) vs. the built-in Matlab function (which runs on only a single core). This is on a GTX 285, which one would expect to perform roughly similarly to the C1060, but the numbers in your charts are much, much better. In particular, you are getting 1-2 orders of magnitude speedup here, though of course there’s memory transfer time, etc.

Do you think the 2D implementation will show similar numbers? Any idea when 3.1 will be available? ;). cheers

This is very interesting. I do have a few questions about your material:

On slide 2, you show the single precision Gflops of a C1060 at 933 and of a C2050 at 1030. I thought Fermi would have about double the single precision performance of the previous core. Are these numbers correct?

On slides 3 and 8 you show the speedup of NVIDIA parts vs. Intel parts, but it isn’t clear whether the Intel runs were done with 1 thread, 4 threads (4 cores), or 8 threads (standard dual core Intel Xeon blade). Which is it?

On slide 4, there is a significant discontinuity in the single precision FFT results for the C2050 between transform sizes of 4096 and 8192. Do you know why? Is this the size where the data no longer fits in the SP’s caches?

On slide 8, the C2050 results are much worse for some of the apps (I’m interested in the Circuit run for my work) than for others. What is it about those data sets that hurts the C2050? As I asked above, it isn’t clear whether you are running the Intel jobs multi-threaded or not. If you are not, then there might not be any advantage at all on sparse matrix multiplication for circuit work.

Also on slide 8, it looks like there is not really an appreciable performance advantage on double precision between the C2050 and the C1060 even though there is a theoretical difference of 6.6X. Why is this?

On slide 9, there is no comparison to the Intel part that is on the rest of your slides. Why not?

Regarding your questions about sparse matrix-vector multiplication results:

Yes, the SpMV results represent “out of the box” performance of Cusp [1]. We haven’t introduced any Fermi-specific optimizations yet, but there are certainly opportunities to do so.

You are right that SpMV performance does not improve as dramatically as other benchmarks (e.g. DGEMM) when moving from the C1060 to the C2050. The reason for this is twofold:

SpMV is memory bandwidth limited, and the memory bandwidth of the C2050 is only ~44% greater than that of the C1060 (less with ECC enabled).

We haven’t performed any Fermi-specific tuning yet.

Adding to what nbell already said about SpMV, the CUFFT and CUBLAS codes were also run “out-of-the-box”, as were the Intel MKL codes.

In CUBLAS 3.1 for Fermi, we implemented a “double-blocking” algorithm for SGEMM and DGEMM, where the block size is 5x5 for SGEMM and 3x3 for DGEMM. These specialized high-performance kernels can only be used when the ‘m’ and ‘n’ dimensions of the input matrices are multiples of 80 for SGEMM and multiples of 48 for DGEMM, and the ‘k’ dimension is a multiple of 16 for both. Note: this only applies to the Fermi architecture and not to the Tesla architecture.
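As a quick sanity check on those dimension constraints, one can test whether a given (m, n, k) problem would qualify for the specialized kernels. The helper below is hypothetical (not part of CUBLAS); it just encodes the multiples quoted above:

```python
# Hypothetical helper: checks whether a GEMM problem satisfies the dimension
# constraints quoted above for CUBLAS 3.1's specialized Fermi kernels
# (m, n multiples of 80 for SGEMM / 48 for DGEMM, k a multiple of 16).
def uses_fast_fermi_kernel(m, n, k, precision="single"):
    mn_multiple = 80 if precision == "single" else 48
    return m % mn_multiple == 0 and n % mn_multiple == 0 and k % 16 == 0

# A power-of-two 4096 x 4096 x 4096 SGEMM does not qualify (4096 % 80 != 0),
# but slightly adjusted dimensions do.
print(uses_fast_fermi_kernel(4096, 4096, 4096, "single"))  # False
print(uses_fast_fermi_kernel(4000, 4000, 4096, "single"))  # True
print(uses_fast_fermi_kernel(3984, 3984, 4096, "double"))  # True (3984 = 48*83)
```

One practical consequence, if these constraints hold, is that padding matrices up to the nearest qualifying multiple can pay for itself on Fermi.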

4096 is the largest single-precision transform size for which the data can be fit completely into shared memory.

The Circuit matrix performs relatively poorly because it’s so small: it only has 170K rows and 958K nonzero values (see our paper [1]). Small matrices are processed so quickly that the time spent launching an individual kernel (about 10 microseconds) becomes significant. For example, the C2050 performs a single precision SpMV operation with the Circuit matrix in 0.293 milliseconds (293 microseconds), which works out to a rate of about 3400 SpMV operations per second.
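The launch-overhead argument is easy to check with back-of-the-envelope arithmetic from the numbers above (a sketch, not a measurement):

```python
# Back-of-the-envelope check of the Circuit-matrix numbers quoted above.
spmv_time = 293e-6       # one single precision SpMV on the C2050 (0.293 ms)
launch_overhead = 10e-6  # approximate cost of one kernel launch

ops_per_second = 1.0 / spmv_time
overhead_fraction = launch_overhead / spmv_time

print(round(ops_per_second))           # ~3413, i.e. the "3400 per second" above
print(round(overhead_fraction * 100))  # launch cost is already ~3% of runtime
```

For even smaller matrices (or multi-kernel SpMV formats), that fixed per-launch cost grows into a proportionally larger share of the total time, which is why tiny problems look disproportionately bad.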

The reason SpMV doesn’t get a 6.6x speedup on Fermi is that its performance is limited by memory bandwidth, not floating point throughput. The C2050 has approximately 44% more memory bandwidth than the C1060, so the expected performance increase is about 1.44x, not 6.6x. Other codes like DGEMM perform many more flops per byte read from memory, and are therefore limited by floating point throughput. Such codes will benefit more from Fermi’s double precision speed than memory-bound applications.
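The bandwidth-bound reasoning can be sketched numerically. The specific peak figures below (C1060 ≈ 102 GB/s, C2050 ≈ 144 GB/s with ECC off) are my assumption; the post itself only cites the ~44% ratio:

```python
# Simple bandwidth-bound speedup model: for a memory-bound kernel like SpMV,
# the expected speedup is just the ratio of memory bandwidths, regardless of
# the ~6.6x gap in peak double precision throughput.
c1060_bandwidth = 102.0  # GB/s (assumed peak, GT200-based Tesla)
c2050_bandwidth = 144.0  # GB/s (assumed peak, Fermi-based Tesla, ECC off)

expected_speedup = c2050_bandwidth / c1060_bandwidth
print(round(expected_speedup, 2))  # ~1.41x, close to the ~1.44x quoted above
```

With ECC enabled the effective bandwidth of the C2050 drops further, so the realized ratio can be even smaller than this model suggests.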

Anyway, I would suggest trying the Cusp sparse matrix library [2] on your matrices to see exactly what the performance will be. The MKL CPU results were all multithreaded (4 threads), so the performance comparison is accurate.

The main thing would be to take better advantage of the additional registers and shared memory that Fermi provides. For example, our CSR (vector) and COO kernels use shared memory sparingly (one value per thread), since it is scarce on the G80 and GT200 series processors. By using more than one value per thread we could eliminate some overhead in these kernels and hopefully make them faster.

Also, we haven’t thoroughly studied the impact of Fermi’s L1 cache and what partitioning of smem/L1 is best for SpMV (16/48 KB vs. 48/16 KB).

I haven’t done much coding for Fermi yet, but from my understanding you would have up to 49152 bytes of shared memory, i.e. 4096*3 floats? Or is the actual transform size needed bigger than that?

I’d have to verify it with the CUFFT developers, but I believe your calculation is correct. Fermi has up to 48K of shared memory (depending on runtime configuration), and 48K / 4 bytes per float is 12288 floats.

The note I posted about 4096 being the max that would fit entirely in smem came directly from the CUFFT team, so yes, that would imply that you need 3x the transform size available for the working set (at least for this particular type of transform), which comes out to exactly 3*4096 == 12288.

Yes, this seems to be valid. If one looks at the FFT benchmark for the C1060 (which can hold at most 4096 floats in shared memory), there is also a dip between 1024 and 2048, and this would make sense since 1024 < 4096/3 < 2048. Strangely, the dip isn’t quite as steep though…
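The shared-memory arithmetic in this exchange can be summarized in a few lines, assuming (as discussed above) a working set of roughly 3 floats per transform point:

```python
# Largest power-of-two 1D single-precision transform whose working set
# (assumed ~3 floats per point, per the discussion above) fits in shared memory.
def max_smem_transform(smem_bytes, floats_per_point=3, bytes_per_float=4):
    capacity = smem_bytes // bytes_per_float  # floats that fit in smem
    n = 1
    while n * 2 * floats_per_point <= capacity:
        n *= 2
    return n

print(max_smem_transform(16 * 1024))  # GT200 (C1060): 16 KB smem -> 1024 points
print(max_smem_transform(48 * 1024))  # Fermi (C2050): 48 KB smem -> 4096 points
```

This reproduces both observations in the thread: the C2050 dip after 4096 (3*4096 == 12288 floats exactly fills 48 KB) and the C1060 dip between 1024 and 2048.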