64 vs 32 bit: Why is 64-bit code significantly slower than 32-bit code?

I have a computer with an AMD X6 1090T, 8 GB DDR3 and a Palit GTX 460 1 GB. The OS is Windows 7 64-bit.

Running my CUDA application takes 81 s for a 64-bit build and 62 s for a 32-bit build (I built the 32-bit code on another machine with XP 32-bit and copied over only the executable and a required DLL). I’m using Visual Studio 2008.

Is this normal? My understanding is that in 64-bit mode only pointers are longer; all other variables have the same size as in 32-bit mode. I do not have a lot of pointers in my CUDA code, so I cannot explain what is happening.
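That claim about sizes can be checked directly. A minimal sketch (the exact sizes assume a typical LLP64 Windows or LP64 Linux 64-bit build):

    #include <stdio.h>

    int main(void)
    {
        /* On common 64-bit ABIs (LLP64 on Windows, LP64 on Linux),
           int stays at 4 bytes while pointers grow from 4 to 8 bytes. */
        printf("sizeof(int)   = %u\n", (unsigned)sizeof(int));
        printf("sizeof(void*) = %u\n", (unsigned)sizeof(void*));
        return 0;
    }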

Can you look at the number of registers used in both cases? One possibility is that a small register increase pushed you past a resource threshold, so the number of concurrent blocks per multiprocessor dropped by one, reducing your overall occupancy and increasing the contribution of memory latency to your total runtime.

I’m not sure how best to get this information with the Visual Studio toolchain. On Linux, if you pass --ptxas-options=-v to nvcc, the number of registers per thread is reported on the command line. Presumably something similar is possible on Windows.
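For example (kernel.cu is a placeholder for your source file, and the exact output format varies a little between CUDA versions):

    nvcc --ptxas-options=-v -c kernel.cu

    ptxas info    : Compiling entry function 'myKernel' for 'sm_20'
    ptxas info    : Used 17 registers, 40 bytes cmem[0]

In Visual Studio, the same flag can be added wherever the project invokes nvcc, e.g. in the custom build rule’s command line.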

There is potentially another factor here that I have seen people complain about: Windows 7 (and Vista) have more overhead in their video driver model, which leads to some CUDA operations taking longer compared to Windows XP. Can you measure the time for your CUDA kernel calls (preferably many calls in a row)? I’m curious whether some of your initialization code takes a very long time on Win7 for some reason.
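A minimal timing sketch using CUDA events; myKernel here is a stand-in for your own kernels and launch configuration:

    #include <cstdio>

    // Trivial stand-in kernel; substitute your own.
    __global__ void myKernel(float *data) { data[threadIdx.x] *= 2.0f; }

    int main()
    {
        float *d_data;
        cudaMalloc(&d_data, 256 * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        for (int i = 0; i < 100; ++i)            // many calls in a row
            myKernel<<<1, 256>>>(d_data);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);              // wait for the GPU to finish

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds
        printf("100 kernel launches: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_data);
        return 0;
    }

Timing the very first CUDA call separately will also tell you whether context creation is what is slow.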

I will try to extract this info tomorrow, but the application (a neural network) is complex, with more than a dozen kernels. I first observed this problem more than six months ago, when we bought a new 64-bit machine. The same problem happens on four different cards: GTX 280, GTX 285, GTX 480 (but those were on different machines, so it wasn’t conclusive) and now the GTX 460 (the first time I have tested 32- and 64-bit code on the same 64-bit machine).

Why would the number of registers increase in 64-bit code? Is it a compiler peculiarity in 64-bit mode, or is it normal?

Considering that both the 32- and 64-bit builds run on the same Win 7 64-bit machine, I think the issue is not Win 7 driver overhead. To test that properly I would need XP 32-bit and Win 7 64-bit installed on the same machine, and I cannot install another OS on any of the machines. Is there any point in measuring this on different machines (I can do that)?

Known issue: CUDA programs are faster on XP and slower on Win7. At least they were a few months ago.

I’m NOT saying that XP is faster than Win 7! What I’m saying is that on the same machine (Win 7 64-bit), code compiled for the 32-bit architecture is faster than the same code compiled for 64-bit.

Since you’ve ruled out the XP vs. Win7 operating system issue, we’re back to the register usage hypothesis.

If you access anything in global memory (and possibly shared memory? I’m not sure how pointers to shared memory work for code compiled for compute capability 1.x), then you will have some pointers floating around in your code. These pointers require extra registers. One extra register isn’t normally a serious issue, but it could be, depending on your previous register requirements.

To give an example: suppose your kernel requires 16 registers per thread, the card has 8192 registers per multiprocessor, and you run with a block size of 256 threads. Then you can run two blocks at a time per multiprocessor, which may be important for hiding latency, depending on the kernel. If the kernel in 64-bit mode requires 17 registers per thread, you can only run one block, and your occupancy is cut in half. This could have noticeable effects.
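The arithmetic behind that, as a small self-contained sketch (the numbers are the illustrative ones from the paragraph above; real occupancy also depends on shared memory and per-SM block/thread limits):

    #include <stdio.h>

    /* Register-bound part of the occupancy calculation: how many blocks
       fit on one multiprocessor given its register budget. */
    static int blocks_per_sm(int regs_per_sm, int regs_per_thread,
                             int threads_per_block)
    {
        return regs_per_sm / (regs_per_thread * threads_per_block);
    }

    int main(void)
    {
        printf("%d\n", blocks_per_sm(8192, 16, 256)); /* 2 blocks */
        printf("%d\n", blocks_per_sm(8192, 17, 256)); /* 1 block: occupancy halved */
        return 0;
    }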

However, you mention that you have tried both GTX 200- and 400-series GPUs, which might count against this hypothesis. Those cards have different register limits and different latencies, so I’m a little surprised that switching from 32- to 64-bit has the same effect on all of them.

Yet another hypothesis is that you’ve found a code generation bug in the compiler when producing 64-bit code. Testing this would require comparing the PTX from a 32-bit and a 64-bit compile and seeing whether anything jumps out at you.
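Generating the two PTX files to diff is straightforward (kernel.cu is a placeholder for your source file):

    nvcc -m32 -ptx kernel.cu -o kernel_32.ptx
    nvcc -m64 -ptx kernel.cu -o kernel_64.ptx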

"I will try to extract this info tomorrow, but the application (neural network) is complex, having more than a dozen kernels. I observed this problem more than six months ago, when we bought a new 64 bit machine. The same problem happens on four different cards: GTX 280, GTX 285, GTX 480 (but they were on different machines, so it wasn’t conclusive) and now GTX 460 (first time when I tested 32&64 bit code on the same 64 bit machine).
"

Sorry, for some unknown reason I thought that when you bought the 64-bit machine you had migrated from XP to Win7, and that you were running the 32-bit version on XP and comparing performance. I think you need to try some tuning on the release machine, e.g. the number of registers per thread, block size, etc.
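One concrete knob for that kind of tuning is nvcc’s register cap, which forces ptxas to keep per-thread register usage at or below a limit (spilling to local memory if it has to). If capping the 64-bit build back to the 32-bit build’s register count restores the speed, that would confirm the occupancy hypothesis (kernel.cu is again a placeholder):

    nvcc -m64 --maxrregcount=16 --ptxas-options=-v -c kernel.cu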

"I will try to extract this info tomorrow, but the application (neural network) is complex, having more than a dozen kernels. I observed this problem more than six months ago, when we bought a new 64 bit machine. The same problem happens on four different cards: GTX 280, GTX 285, GTX 480 (but they were on different machines, so it wasn’t conclusive) and now GTX 460 (first time when I tested 32&64 bit code on the same 64 bit machine).
"

Sorry. I for unknwon reason thought when you bought 64 bit machine you had migrated from xp to win7. Also I thought that you run 32 bit version on xp and compare performance. I think you need to try some tuning i.e. register number per block etc on release machine.

Did you look at the resulting PTX code? You might find the difference there.

In an example piece of code compiled on a 64-bit machine, the pointers grow from 32 to 64 bits. That is not a problem in itself; however, creating such a 64-bit pointer can involve a conversion from 32 to 64 bits. I am not sure whether this additional conversion instruction still exists after ptxas compiles the PTX to the GPU binary, but it does exist in the PTX code.
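The conversion in question looks roughly like this in PTX (register and parameter names are illustrative, and the exact form depends on the CUDA version):

    // 32-bit build: addresses live in 32-bit registers throughout
    ld.param.u32    %r1, [param_0];

    // 64-bit build: parameters load as 64-bit values, and 32-bit indices
    // are widened before address arithmetic
    ld.param.u64    %rd1, [param_0];
    cvt.u64.u32     %rd2, %r3;        // the extra 32-to-64-bit conversion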
