32 or 64 bit native integer?

When performing bitwise operations on ints/longs, how should I regard the hardware in that respect? Will it handle 64 bits at once?

You could find out by taking sizeof(long).

I believe the emulator will get this right (the emulator is a great tool), so you could save the hassle of actually asking the device (with cudaMalloc(), cudaMemcpy() and stuff), and instead just write a tiny kernel:

__global__ void myKernel(void)
{
#ifdef __DEVICE_EMULATION__
    printf("a long is %d bytes\n", (int)sizeof(long));
#endif
}


and call it as usual. Just compile with “make emu=1”, and the emulator will print the answer. (Strictly, the “#ifdef __DEVICE_EMULATION__” and “#endif” are superfluous, but they’re robust, and a habit I like.)

Thanks for your reply, but it isn’t the answer I am looking for.

For example, you have 32-bit and 64-bit processors. A 64 bit processor reads 64 bits at once, whereas a 32 bit processor needs two clock cycles.

So, to put my question a different way: does it make a difference in the number of clock cycles to load an int (32 bits) versus a long (64 bits)?

Think of the G80 architecture as 32 bit. There are special opcodes for fetch/store to device memory in 64 and 128 bit chunks but they interface to multiple 32 bit registers.

  1. As my little experiment above shows, a long is 4 bytes, the same as an int, so you have an answer right there: it takes the same number of cycles to load an int as a long, but only because sizeof(long) == sizeof(int); both are 4 bytes, and a long is not 8 bytes.

  2. I assume that this still isn’t the answer you’re looking for. I am guessing what you’re after, but the answer to questions of the general form ‘do 64 bit loads take the same number of cycles as 32 bit loads, or am I stuck loading 32 bits at a time?’ is (of course) ‘it depends’ (a real engineering answer :D), but NVIDIA claim a G80 can do 32 bit, 64 bit, or 128 bit memory accesses.

This NVIDIA paper explains how to improve over 32 bit memory access to get 64 or 128 bit memory transactions:


My interpretation of the paper is that this is not achieved through multiple 32 bit transactions, but is through one 64 or 128 bit memory transaction. This sounds plausible given the G80 has a 384 (or 320) bit wide memory interface (so is a G80 a 384 bit processor? No.).

I think a 32 bit vs 64 bit memory transaction question is somewhat orthogonal to whether or not a G80 is a 32 bit architecture. That becomes important when it comes to doing something with the bits. If it’s okay for your algorithm to deal with 2x32 bit items (or even 8x32 bits) concurrently rather than as a single 64 bit item, then you’re all set. In my limited experiments, a G80 can do 8x4 byte bit-wise operations concurrently on each multiprocessor.

If this doesn’t answer your question, would you like to say what you are trying to do? It may be easier to try to help if you put questions into context.

BarryB, I did not say multiple 32 bit transactions, just that 64 and 128 bit loads/stores interface to multiple registers. If you get your coalescing right then yes you will get 384 bit loads and stores in parallel. Has to be spread over more than 1 thread.

ed: by the register width definition you would have to credit the G80 with a 512 bit architecture as it can do 16 x 32 bitwise operations in 1 GPU clock cycle (per multiprocessor). It’s actually half that and the clock runs twice as fast at the bottom level.

final ed: to answer the original topic question, I think you should look at the G80 as a 1024 bit architecture (that is how many bits a single instruction controls). It has a skinny bus and various mechanisms to handle that. The programming model is 32 bit within a thread. The thread architecture is a lot more useful than a traditional 1024 bit machine, EXCEPT there is no carry from one 32 bit ALU to another.

I need a 64 bit int and will perform bitwise operators on it.

First I need to find out if I can use a 64 bit int at all.

Use 2 threads per 64 bit int - as long as you don’t want to add, subtract, multiply, divide, or shift, for that matter. Bitwise ops will all work.
But dynamic bit setting will not (as that involves shifting), so you probably need to redo your code for 32 bit.

Hmmm, but I need to shift and to compare. So that’s probably not gonna work?


Have we suddenly got new hardware or what?

When I compile on 0.9 Linux64 I get ptx code with 64 bit registers (rdN) and 64 bit arithmetic operations on device addresses and 64 bit longs with 64 bit arithmetic operations!

        add.s64         $rd7, $rd3, 127;        //
        shl.b64         $rd8, $rd7, 32;         //
        ld.param.u64    $rd9, [__cudaparm_retl];        //  id:36 __cudaparm_retl+0x0
        st.global.s64   [$rd9+0], $rd8; //  id:37

Is this just wound back to 32 bit by ptxas or do we really have full 64 bit integers?


ed: I presume all makefiles have to be updated to pass “-m32” directly to nvcc; perhaps the default should have been left as 32 bit?

ed: No, it does not generate linkable output otherwise, so back to the default 64 bit version; so how does ptxas handle it? (Also, libGLEW.a in the generic SDK is 32 bit and incompatible with everything else, and doc/CUDA_Release_Notes_1.0.txt is missing from the Linux64 tools package, though the README names that file. Has this release been downgraded from 1.0 to 0.9?)

ed: maybe not so well: ptxas: /home/lfriedman/cuda64-stuff/sw/gpgpu_rel1.0/compiler/…/…/gpgpu_rel1.0/cg/src/common/cop/codegen/nv50/cop_nv50_common.cpp:3548: Dag* TransformScalarizeDagNV50(LdStruct*, Dag*, void*, int): Assertion `0 && “Unsupported division opcode”’ failed.