Hi,
I have been using CUDA for a month now. My code runs about 100 times faster with CUDA than the same code in plain C. That’s a good start!
I’d like to estimate the maximum speed-up I can achieve with CUDA.
I’ve read a bit about bandwidth, and that a kernel can at best be limited by the theoretical maximum bandwidth of the graphics card. So my questions are:
Is this correct (what I said about bandwidth)?
How do I estimate the bandwidth?
Thanks for your help.
Vincent
I have seen some slides about this. What was done was basically this:
calculate how many GByte/s you are processing (reads and writes together)
calculate how many GFLOP/s you are achieving
Compare both against the device’s peak figures to see whether you have reached the maximum performance for your algorithm. You need to check both, because your kernel can be either bandwidth-bound or compute-bound.
Count how much data your kernel reads and writes (GB) and divide it by the time the kernel takes to run (s) → GB/s
Count how many floating-point operations your kernel performs (FLOP) and divide by the kernel’s run time (s) → FLOP/s
There is something I don’t understand.
The time a kernel takes is made up of memory reads/writes and of computation.
So, isn’t it possible to measure separately the time spent on memory accesses and the time spent on computation?
A kernel does not exclusively perform one or the other. Multiple warps are kept in flight on each multiprocessor. While some are reading memory, others are performing computations.
So, if you are limited by memory, you might see an effective bandwidth of, say, 70 GiB/s and only around 10 GFLOP/s. If you are instead limited by computation and not memory, then you might see ~150 GFLOP/s (or more if your code uses lots of MADs) and only, say, 1 GiB/s.
In the first memory limited case, you get all of the computations “for free” because of the interleaving of execution and memory operations.
But this will probably make a lot of difference: you were timing the first execution of the kernel, which includes some one-time overhead, so you should always time subsequent calls. Averaging over a few kernel calls is also smart.