Mixed 32/64 compilation on 0.9

My post on 64 bit was a bit tongue in cheek, but I really do have a problem with building CUDA apps on 64 bit. I thought it was my makefiles, but the standard ones do it too. The problem is that all device code is compiled 64 bit by default, meaning all device memory pointers are 8 bytes - even those in shared memory that are not accessible from the host! Even if ptxas partially converts back (it always assumes the top 32 bits == 0 to save registers), there is a significant waste of shared memory (the cubin definitely has both shared mem and device mem pointer arrays at 8 bytes per entry) and also of device memory and bandwidth, given that it makes no sense to hold host pointers on the device. That is a high price to pay for 64 bit longs implemented as 32 bit long long routines. Is the only way out to turn on -m32 for everything? Then 32 bit versions of all the CUDA libs should be provided as well, along with a switch facility. The 64 bit video driver should not have a problem with a 32 bit CUDA app.

OR is there a way to build and sensibly communicate within a mixed environment?

Thanks,
Eric

I guess this was a major design decision. Was there some info in the tools release notes (which were left out of the distro)? There is nothing about 64 bit at all in the 0.9 nvcc manual either.

EDIT: Removal of previous post. I was confused and didn’t realize that the problem was only with pointers.

There is, of course, a simple solution. Don't store large arrays of pointers on the device! Instead, store array indices, where you can control how many bits are stored to your liking.
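Something like this, perhaps (a minimal sketch; the kernel and names are invented for illustration): keep one base pointer as a kernel argument and store 32 bit indices in shared memory, so each table entry stays 4 bytes no matter how device pointers are compiled.

    __global__ void gather(float *out, const float *base, const int *offsets, int n)
    {
        __shared__ int s_off[256];          // 4 bytes per entry either way
        int tid = threadIdx.x;
        if (tid < n)
            s_off[tid] = offsets[tid];      // store indices, not pointers
        __syncthreads();
        if (tid < n)
            out[tid] = base[s_off[tid]];    // rebuild the address as base + index
    }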

Pointers are required to address device memory, and they are in shared memory for performance reasons only. It is not just the smem usage: every reference to a shared mem pointer array indexed by tid now gets a 2x bank conflict, whereas it was conflict free in 32 bit. I just wish I had been told in advance instead of having to waste time working it out. There is a tradeoff between host CUDA app performance and device performance on 64 bit that is only minor, but it might affect some people. There really is no choice for a mixed environment, as host and device are closely linked and pointers on the device would need to be ints on the host, meaning the structures would not be identical. That does not fit the programming model well. This step is an inconvenience until GPUs are 64 bit internally - G92? or later?
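For reference, the kind of pattern I mean (a hypothetical kernel, not my actual code): each thread picks up its own pointer from a shared memory table. Built -m32, each entry is 4 bytes and the tid-indexed read is conflict free; built 64 bit, each entry is 8 bytes and the same read spans two banks per thread.

    __global__ void deref_table(float *out, float *a, float *b)
    {
        __shared__ float *ptrs[256];                 // 4 bytes/entry in 32 bit, 8 in 64 bit
        int tid = threadIdx.x;
        ptrs[tid] = (tid & 1) ? a + tid : b + tid;   // build the per-thread pointer table
        __syncthreads();
        out[tid] = *ptrs[tid];                       // tid-indexed read: 2x conflict in 64 bit
    }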

Perhaps some method needs to be considered for handling disparate architectures on host and device - pointer sizes and endianness - then again, these MP cores will turn up next to single-threaded cores on the same die in general purpose CPUs in the not too distant future, and then they will always match. A whole new topic.

At this stage it is probably wise to set up a dual build environment, as you will find some things break in 32 bit and others break in 64 bit. Now that ptxas is so much more complicated there is a much bigger risk of bugs (they are always there in software), and since we can't see what code is going off to the device and there is no debugger on the device, it is getting much more risky to use commercially, and debugging tool problems will only get more difficult in the future.

Thanks, Eric

The technology is great; it is just the lack of information, and of willingness to supply it, that is a downer.

Well, 1.0 looks like a big improvement - the compiler looks useful enough to promote my 8800 hi-tech paperweight to active service. Everything went reasonably smoothly on FC7 x86_64. I installed the 32 and 64 bit toolkits. The nvcc manual says: "-m Specify 32 vs. 64 bit architecture. Currently only to be used when compiling with --cubin on linux64 platform." However, it appears to be much more useful than that, as one can build complete 32 bit binaries by feeding everything -m32. The emulator runs about 10% slower in 32 bit mode, as expected, and the important benchmarks like the bandwidth test give exactly the same numbers running 32 bit. So far I have not found anything that does not work in 32 bit. A refreshing change - but am I about to get bitten?

Can we please have a confirmation that 32 bit binaries are supported on the x86_64 platform (certainly should be)?

There is still nothing about 32 vs 64 bit in the manuals in 1.0, so can we please have a specification of the exact differences to be expected on the device, such as: do pointers always take 8 bytes, and are they aligned to 8 bytes, everywhere on the device (except in registers)?

This seems like the only difference that should be evident at run time, and ptxas should fix everything else… I wonder what is supposed to happen if one left-shifts a device pointer?
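To make the question concrete, this is the sort of runtime-visible difference I mean (a made-up struct, assuming pointers really do become 8 bytes with 8 byte alignment on the device):

    struct Node {
        float *next;    // 4 bytes in a 32 bit build, 8 bytes in a 64 bit build
        int    value;   // padded out to the pointer alignment in 64 bit
    };
    // sizeof(Node) on the device: 8 bytes in 32 bit, 16 bytes in 64 bit (4 bytes of padding).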

My tests on 0.9 gave the same register use in 32 and 64 bit, but then there were too many bugs to test far. Now I have a small kernel that uses 1 less register in 64 bit mode(?), and another larger kernel that spills 1 register with maxrregcount=32 in 64 bit mode and spills NONE with maxrregcount=16 in 32 bit mode! I suspect this is a bug, but nothing is documented to say so.

Eric

Why is Nvidia not responding here?

The 64 bit spec for the G80 must be a pretty simple document and not IP sensitive, surely?

I gave up on using 64 bit after less than a day, after spurious system lockups that did not happen in 32 bit mode (ptxas codegen problems). ptxas definitely scrambles something in device memory access in 64 bit mode, so that fully coalesced 32 bit reads get only 60% of the performance they get in 32 bit mode. This shows up in all my benchmarks.

Given the number of people that were waiting on the 64 bit release here, I don’t understand why there are not more questions…

Eric

PS waiting on this spec before posting any bugs.

EDIT: I forgot to mention that in 64 bit mode cudaThreadSynchronize() does not report kernel launch failures.

I don’t know what your problem is, I’m using the 64-bit linux version with no problems whatsoever.

Really? It shows up in none of mine. Every benchmark I have run, and every test run of my production code, shows nearly identical performance to the exact same code compiled under Windows XP.

I have noticed that, occasionally, the cubin compiled on XP will show a few more or a few fewer registers used compared to the one compiled on 64-bit Linux. And yes, I see it go both ways. One kernel that compiles to 27 regs in Linux uses only 25 in XP. Another uses 16 in Linux and 17 in XP.

In debug mode, I call cudaThreadSynchronize() and then call cudaGetLastError() and have not noticed any missed launch failures.
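For what it's worth, the check is nothing more elaborate than this (a stripped-down sketch with a dummy kernel):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void dummy(float *out) { out[threadIdx.x] = (float)threadIdx.x; }

    int main()
    {
        float *d_out;
        cudaMalloc((void **)&d_out, 256 * sizeof(float));

        dummy<<<1, 256>>>(d_out);

        /* Debug-mode check: wait for the kernel, then pick up any error. */
        cudaError_t err = cudaThreadSynchronize();
        if (err == cudaSuccess)
            err = cudaGetLastError();
        if (err != cudaSuccess)
            fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));

        cudaFree(d_out);
        return 0;
    }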

Well, I guess you are lucky - I have a 5 LOC kernel that compiles to 8 regs in 32 bit and 16 regs in 64 bit, and one can't squash it down much before it spills. The problem is in the global memory address calculation. I worked out how to change the code to something that is logically identical, and it now uses 7 regs in 64 bit and runs at full speed. So the 60% penalty was caused by whatever this problem is plus the reduced occupancy, but I still notice minor degradation in 64 bit performance in compute bound kernels that do some global access.
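To give a rough idea of the kind of rewrite (hypothetical code, not my actual kernel): keep the addressing as a single 32 bit index applied in the subscript, rather than carrying per-thread pointers through the arithmetic.

    // Before: per-thread pointer arithmetic, which drags 64 bit temporaries around.
    __global__ void copy_ptr(float *dst, const float *src, int stride)
    {
        const float *p = src + blockIdx.x * stride + threadIdx.x;
        float       *q = dst + blockIdx.x * stride + threadIdx.x;
        *q = *p;
    }

    // After: one 32 bit index, applied directly where the access happens.
    __global__ void copy_idx(float *dst, const float *src, int stride)
    {
        int i = blockIdx.x * stride + threadIdx.x;   // stays in a 32 bit register
        dst[i] = src[i];
    }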

Still waiting for a spec before reporting.

I said cudaThreadSynchronize() does not report - it should return an error if the kernel had a launch failure. I check its return value before calling cudaGetLastError(), and this does not work in 64 bit but is fine in 32 bit.

Can also send ptxas into an infinite loop easily…

Eric

Well we still don’t have a spec for what 64bit should look like on the device…

What Nvidia should have said in response to the above is: use the driver API and compile the host and device code separately. There, device pointers (CUdeviceptr) are an int on the host, so mixed environments work fine. All one loses is type checking of kernel call parameters, and if one always passes a struct then everything is tied together. One needs two versions of every struct in the headers, switched on __CUDACC__, with CUdeviceptr for all device pointers on the host side and the required char* or whatever on the device side. It is quite simple to use the Intel compiler for the host side, which also saves the crud in the text and bss segments that you would get if you used nvcc for the host side. 64 bit on the device is pretty brain damaged and irrelevant.
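A minimal sketch of the header trick (names made up; it assumes CUdeviceptr is still a plain unsigned int on the host side of the driver API):

    /* params.h - included by both the host build and the nvcc build */
    #ifdef __CUDACC__
    /* Device side: real pointers, usable inside the kernel. */
    struct Params {
        float *data;
        int    n;
    };
    #else
    #include <cuda.h>
    /* Host side: CUdeviceptr stands in for every device pointer. */
    struct Params {
        CUdeviceptr data;
        int         n;
    };
    #endif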

Eric