Hi everyone!
Could someone tell me why I am getting a huge performance hit when I compile my program as a 64-bit executable? I took the matrix multiplication kernel from the examples in the CUDA C Programming Guide, made minor modifications, and took several measurements (the kernels were executed on a GTX 470 and a GTX 260, respectively):
x64 (64-bit build):
device: 0
setting device to number: 0
creating events
creating device arrays
allocating memory on device
copying data to device: 0
copy time 16.0369
kernel executed, calculating time
copying results
copy results time: 7.43722
done
done, total time: 780ms and kernel time 693.008
device: 1
setting device to number: 1
creating events
creating device arrays
allocating memory on device
copying data to device: 1
copy time 16.0518
kernel executed, calculating time
copying results
copy results time: 8.1408
done
done, total time: 2824ms and kernel time 2726.26
x86 (32-bit build):
device: 0
setting device to number: 0
creating events
creating device arrays
allocating memory on device
copying data to device: 0
copy time 16.345
kernel executed, calculating time
copying results
copy results time: 7.29514
done
done, total time: 499ms and kernel time 340.643
device: 1
setting device to number: 1
creating events
creating device arrays
allocating memory on device
copying data to device: 1
copy time 16.3376
kernel executed, calculating time
copying results
copy results time: 7.96096
done
done, total time: 905ms and kernel time 815.011
The matrices are N x N with N = 2048, single-precision float.
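
To put numbers on the slowdown, here is a rough sketch of the arithmetic on the kernel times from the logs above. It assumes the kernel times are in milliseconds (like the total times) and that the kernel performs the usual 2*N^3 floating-point operations of a plain matrix multiplication. The names `KERNEL_MS`, `slowdown`, and `gflops` are made up for this illustration and are not part of my program.

```python
# Kernel times copied from the logs above, keyed by (build, device number).
# Assumption: the values are in milliseconds.
KERNEL_MS = {
    ("x64", 0): 693.008,
    ("x64", 1): 2726.26,
    ("x86", 0): 340.643,
    ("x86", 1): 815.011,
}

N = 2048  # matrix dimension stated in the post


def slowdown(ms_64bit, ms_32bit):
    """Return how many times slower the 64-bit kernel is than the 32-bit one."""
    return ms_64bit / ms_32bit


def gflops(n, kernel_ms):
    """Estimate throughput in GFLOP/s.

    Assumption: a plain n x n matrix multiplication costs 2 * n**3
    floating-point operations (one multiply and one add per inner step).
    """
    flop_count = 2.0 * n ** 3
    seconds = kernel_ms * 1e-3
    return flop_count / seconds / 1e9


if __name__ == "__main__":
    for device in (0, 1):
        t64 = KERNEL_MS[("x64", device)]
        t32 = KERNEL_MS[("x86", device)]
        print(
            f"device {device}: x64 is {slowdown(t64, t32):.2f}x slower "
            f"({gflops(N, t64):.1f} vs {gflops(N, t32):.1f} GFLOP/s)"
        )
```

By this estimate the 64-bit kernel is roughly 2x slower on device 0 and roughly 3.3x slower on device 1, while the copy times are essentially the same in both builds, so the difference seems to be in the kernel itself.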