32-bit and 64-bit compiled codes

I have a question about a performance comparison of 32-bit and 64-bit compiled code running on a Tesla V100 32GB GPU.

I first compiled the CUDA 10.0 samples in the Win32 configuration and ran the apps on a Tesla V100 32GB GPU. Then I compiled the same samples in the x64 configuration and ran them on the same GPU.

The performance results disappointed me: the apps compiled in the Win32 configuration run faster than the apps compiled in the x64 configuration. I attached the results of the matrixMul app to this post.

The development environment summary is:
Microsoft Windows 10 Pro (x64) Build 18362.175, MS Visual Studio 2012, CUDA Version 10.0.130, CUDA Driver Version 412.29
Hardware Summary :
Intel Xeon Silver 4114 CPU @ 2.20GHz, 32 GB RAM, Tesla V100 32GB GPU


32-bit app development isn't supported any more in CUDA. (Yes, it is still supported in a limited way with VS2012, but not with any newer toolchain than that.) It may still work in some cases, but there are plenty of use cases that won't work, such as the use of CUDA libraries like cuBLAS.

https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html#x86-32-bit-support

That has often been the case historically when comparing 32-bit and 64-bit builds of the same app. An expectation of identical performance is unrealistic and has generally never held. To pick one example of a possible difference: 64-bit apps must use 64-bit pointers, which generally causes at least some reduction in performance.

Any sort of serious CUDA development work today needs to acknowledge that 64-bit app development is the only sensible path. There is no use in looking back at 32-bit app development, in spite of whatever benefits there may have been.

There were certainly some downsides to 32-bit app development. For example, it would have been impossible to use more than 4 GB of the 32 GB of memory on that V100 card.