These are my results from running cuBLAS DGEMM on 4 GPUs (Tesla M2050), using 2 streams per GPU:
Order | HtoD-ker-DtoH (in s) | Gflops
2048 | 0.029263 | 587.085047
4096 | 0.113735 | 1208.413911
6144 | 0.258755 | 1792.647372
8192 | 0.351895 | 3124.544576
I have verified that the results are correct, but I am concerned about the high Gflops values I am getting compared with the version that uses the default stream. I am calculating the Gflops using the formula:
Gflops = 2.0 * 10^-9 * (N^3 + N^2) / elapsed_time_in_s
For the version that uses multiple streams, do I need to modify this formula in any way?
Check your time measurements. Are you calling cudaThreadSynchronize() (cudaDeviceSynchronize() in newer CUDA versions) before taking them?
Or possibly the memcpy and kernel launches are overlapping, and hence you are getting very high Gflops.