CUDA 4 RC 64-bit performance

Hi everyone!

Could someone tell me why I am getting a huge performance hit when I compile a program as a 64-bit executable? I took the matrix multiplication kernel from the examples in the CUDA C Programming Guide, made minor modifications, and took several measurements (the kernels were executed on a GTX470 and a GTX260, respectively):

x64 (64-bit)

device: 0

setting device to number: 0

creating events

creating device arrays

allocating memory on device

copying data to device: 0

copy time 16.0369

kernel executed, calculating time

copying results

copy results time: 7.43722

done

done, total time: 780ms and kernel time 693.008

device: 1

setting device to number: 1

creating events

creating device arrays

allocating memory on device

copying data to device: 1

copy time 16.0518

kernel executed, calculating time

copying results

copy results time: 8.1408

done

done, total time: 2824ms and kernel time 2726.26

x86 (32-bit)

device: 0

setting device to number: 0

creating events

creating device arrays

allocating memory on device

copying data to device: 0

copy time 16.345

kernel executed, calculating time

copying results

copy results time: 7.29514

done

done, total time: 499ms and kernel time 340.643

device: 1

setting device to number: 1

creating events

creating device arrays

allocating memory on device

copying data to device: 1

copy time 16.3376

kernel executed, calculating time

copying results

copy results time: 7.96096

done

done, total time: 905ms and kernel time 815.011

The matrices are NxN with N=2048, single-precision float.

I once noticed the same thing with CUDA 3.1, although I don't recall it being as severe as in your results.

If you look at the PTX code (compile with nvcc's -keep flag to save the intermediate PTX files), you can compare the address calculations between the 32- and 64-bit builds. The 64-bit build requires some extra PTX instructions, which can explain some of the timing difference (but probably not all of it).
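As a rough illustration of where the extra instructions come from, here is a host-side C++ sketch that mimics forming the address of element [row][col] of a row-major float matrix with a 32-bit base pointer and with a 64-bit one. This is not PTX and not taken from the original kernel; the function names are hypothetical, and the exact instructions the compiler emits will differ.

```cpp
#include <cstdint>

// 32-bit build: the base pointer and the index are both 32 bits wide,
// so the scaling and the final add are plain 32-bit operations.
std::uint32_t elem_addr32(std::uint32_t base, std::uint32_t row,
                          std::uint32_t col, std::uint32_t width) {
    std::uint32_t index = row * width + col;   // 32-bit multiply-add
    return base + index * 4u;                  // 4 bytes per float
}

// 64-bit build: the index is still computed in 32 bits, but it has to be
// widened to 64 bits before it can be scaled and added to the 64-bit base.
std::uint64_t elem_addr64(std::uint64_t base, std::uint32_t row,
                          std::uint32_t col, std::uint32_t width) {
    std::uint32_t index = row * width + col;                 // 32-bit multiply-add
    std::uint64_t wide = static_cast<std::uint64_t>(index);  // extra widening step
    return base + wide * 4u;                                 // 64-bit scale and add
}
```

Every global load and store in the kernel goes through a calculation like this, and each 64-bit pointer also occupies two 32-bit registers, so the cost can add up in an inner loop.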

I also noticed that the performance difference between float and double computation is very small (in my case). It seems that performance is memory-bound in 64-bit mode.