CUBLAS dgemm performance query

Hi,

These are my results from running CUBLAS DGEMM on 4 GPUs (Tesla M2050), using 2 streams per GPU:

Order | HtoD-ker-DtoH (in s) | Gflops
2048 | 0.029263 | 587.085047
4096 | 0.113735 | 1208.413911
6144 | 0.258755 | 1792.647372
8192 | 0.351895 | 3124.544576


I have verified my results and they are correct, but I am concerned about the high Gflops values I am getting compared with the version that uses the default stream. I am calculating the Gflops using the formula:

Gflops = {2.0*10^-9*(N^3+N^2)}/elapsed_time_in_s

For the version that uses multiple streams, do I need to modify this formula in any way?

Thanks,

Sayan

Check your time measurements. Are you calling cudaThreadSynchronize() before taking them?
Or possibly the memcpy and kernel launches are overlapping, and hence you are getting very high Gflops.

You are absolutely right. I added a cudaStreamSynchronize() call before measuring the time, and I now get reasonable results:

Order | HtoD-ker-DtoH (in s) | Gflops
2048 | 0.051682 | 332.414951
4096 | 0.261301 | 525.979417
6144 | 0.7148 | 648.931824
8192 | 1.384706 | 794.039754
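One way to sanity-check numbers like these is to compare them with the hardware's theoretical peak. The sketch below assumes a double-precision peak of 515 Gflops per Tesla M2050 (the vendor's published figure; an assumption here, not something stated in this thread) and uses hypothetical helper names. Since the timed interval also includes the host-to-device and device-to-host copies, the achievable rate is lower still, so any value above the aggregate peak points to a timing error and not to real performance.

```cpp
#include <cstdio>

// Assumed double-precision peak per Tesla M2050, in Gflops (vendor figure).
constexpr double kPeakGflopsPerGpu = 515.0;

// Aggregate theoretical peak for a given number of identical GPUs.
double aggregate_peak(int num_gpus) {
    return kPeakGflopsPerGpu * num_gpus;
}

// A measured rate above the aggregate peak cannot be real.
bool is_plausible(double measured_gflops, int num_gpus) {
    return measured_gflops <= aggregate_peak(num_gpus);
}

// Reports whether a measured rate fits under the 4-GPU peak.
void report(double measured_gflops) {
    std::printf("%.1f Gflops vs. peak %.1f: %s\n", measured_gflops,
                aggregate_peak(4),
                is_plausible(measured_gflops, 4) ? "plausible"
                                                 : "timing error likely");
}
```

With 4 GPUs the aggregate peak is 2060 Gflops, so the 3124.5 Gflops measured before adding the synchronization fails the check, while the 794.0 Gflops measured afterwards passes.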

Crossposted to Stack Overflow: cuda - CUBLAS dgemm performance query

That's good to know!