I have recently been learning CUDA on a Tesla P100 graphics card. I used the matrix multiplication sample from the NVIDIA CUDA samples ("cuda-samples-12.2\Samples\0_Introduction\matrixMul") to test floating-point performance and measured 1657.76 GFlop/s in single precision and 1078.98 GFlop/s in double precision. That is nowhere near the theoretical performance (only about 1/5 of it).
What causes this, and what methods can improve the measured floating-point performance? Is it a matter of optimized programming? Or can the theoretical floating-point performance of a graphics card only be reached with simple mathematical operations such as computing a*b+c?
thank you.
Yes, optimization is needed. Use cuBLAS; there are numerous questions about it on these forums.
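As a rough illustration of what such a measurement looks like (a minimal sketch, not the code from the sample; the matrix size N = 4096, the 10-iteration loop, and the all-ones input data are arbitrary choices for illustration), timing cublasSgemm with CUDA events could look like this:

```cpp
// Minimal SGEMM timing sketch. Size, iteration count and input data are
// arbitrary illustrative choices, not taken from matrixMulCUBLAS.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int N = 4096;                               // assumed square size
    const size_t bytes = (size_t)N * N * sizeof(float);

    std::vector<float> hA(N * N, 1.0f), hB(N * N, 1.0f);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;

    // Warm-up call so one-time setup cost is not included in the timing.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                &alpha, dA, N, dB, N, &beta, dC, N);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    const int iters = 10;
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N,
                    &alpha, dA, N, dB, N, &beta, dC, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // One N x N x N GEMM performs 2*N^3 floating-point operations.
    double gflops = 2.0 * N * N * N * iters / (ms * 1e6);
    printf("SGEMM: %.2f GFlop/s\n", gflops);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Build with nvcc and link against cuBLAS (-lcublas); the same pattern with cublasDgemm gives the double-precision number.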
Thank you for your answer. A follow-up question: does using the cuBLAS library to compute the matrix multiplication (for example, cuda-samples-12.2\Samples\4_CUDA_Libraries\matrixMulCUBLAS) represent the best possible speed for the graphics card?
Are there other ways to speed things up even further?
thank you.
to do better than cublas?
not for a typical programmer or use-case
There are certainly examples of people who have done better, but I know of no better general recommendation than cuBLAS for matrix-matrix multiplication, or for getting close to the published peak theoretical flops numbers.
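To put the numbers in perspective (illustrative arithmetic only, using an assumed size rather than the sample's defaults): an N x N x N GEMM performs 2*N^3 floating-point operations, so at N = 4096 that is about 137 GFLOP per call. At your measured 1657.76 GFlop/s such a call takes roughly 83 ms; a run near the card's theoretical peak, i.e. about five times faster going by the 1/5 ratio you quoted, would finish it in roughly 17 ms. That is the kind of gap a tuned library like cuBLAS closes.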
Thanks again for your answer
With some effort you can optimize other calculations, but probably not to the theoretical peak.
1/5 is not untypical; even after optimization it could be 1/2 and still be good, and more than 80% or 90% is very difficult.
There are many parameters to consider, e.g. also the memory bandwidth.
BTW, it is also difficult to fully utilize a CPU, even a single core: you would need hand-crafted AVX or SSE vector instructions in assembly to max out the computation speed.
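On the question of whether the peak is only reachable with simple a*b+c operations: a kernel made of nothing but chained fused multiply-adds (a rough throughput sketch; the grid/block sizes and iteration count below are arbitrary assumptions, not tuned for the P100) removes memory bandwidth from the picture and shows roughly where the raw arithmetic ceiling is:

```cpp
// Pure-FMA throughput sketch. Launch configuration and iteration count are
// arbitrary illustrative choices; only the instruction throughput matters,
// the numeric values computed are irrelevant.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fma_loop(float* out, int iters)
{
    float a = 1.0001f, b = 0.9999f, c = threadIdx.x * 1e-6f;
    for (int i = 0; i < iters; ++i) {
        // Each statement compiles to one fused multiply-add = 2 flops.
        a = a * b + c;
        b = b * c + a;
        c = c * a + b;
    }
    // Write the result so the compiler cannot eliminate the loop.
    out[blockIdx.x * blockDim.x + threadIdx.x] = a + b + c;
}

int main()
{
    const int blocks = 1024, threads = 256, iters = 100000;
    float* d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    fma_loop<<<blocks, threads>>>(d_out, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // 3 FMAs per loop iteration, 2 flops per FMA, per thread.
    double flops = 2.0 * 3.0 * iters * (double)blocks * threads;
    printf("FMA throughput: %.2f GFlop/s\n", flops / (ms * 1e6));

    cudaFree(d_out);
    return 0;
}
```

Because each FMA counts as two flops and the kernel does essentially no memory traffic, a loop like this can come much closer to the published peak than a real matrix multiply, which also has to keep data moving through shared memory and registers.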