When performing bitwise operators on ints/longs, how should I regard the hardware in that respect? Will it handle 64 bits at once?
You could find out by taking sizeof(long).
I believe the emulator will get this right (the emulator is a great tool), so you could save the hassle of actually asking the device (with cudaMalloc(), cudaMemcpy() and stuff), and instead just write a tiny kernel:
__global__ void myKernel(void)
{
#ifdef __DEVICE_EMULATION__
 printf("a long is %d bytes\n", sizeof(long));
#endif
}
and call it as usual. Just compile with “make emu=1”, and the emulator will calculate the answer. (Strictly, the “#ifdef __DEVICE_EMULATION__” and “#endif” are superfluous, but it’s robust, and a habit I like.)
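Calling it “as usual” might look like this (a minimal host-side sketch; the <<<1, 1>>> launch configuration is arbitrary, one thread is plenty for a printf):

int main(void)
{
    myKernel<<<1, 1>>>();     // launch one thread of the kernel above
    cudaThreadSynchronize();  // wait for the (emulated) kernel to finish
    return 0;
}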
Thanks for your reply, but it isn’t the answer I am looking for.
For example, you have 32-bit and 64-bit processors. The 64-bit processor reads 64 bits at once, whereas the 32-bit processor needs two clock cycles.
So, to put my question a different way: does it make a difference in the number of clock cycles to load an int (32-bit) versus a long (64-bit)?
Think of the G80 architecture as 32 bit. There are special opcodes for fetch/store to device memory in 64 and 128 bit chunks but they interface to multiple 32 bit registers.
Eric
-
As my little experiment above shows, a long is 4 bytes, the same as an int, so you have an answer right there: it takes the same number of cycles to load an int as a long, but only because sizeof(long) == sizeof(int). Both are 4 bytes; a long is not 8 bytes.
-
I assume that this still isn’t the answer you’re looking for. I am guessing what you’re after, but the answer to questions of the general form “do 64-bit loads take the same number of cycles as 32-bit loads, or am I stuck loading 32 bits at a time?” is (of course) “it depends” (a real engineering answer :D), but NVIDIA claim a G80 can do 32-bit, 64-bit, or 128-bit memory access.
This NVIDIA paper explains how to improve over 32 bit memory access to get 64 or 128 bit memory transactions:
http://developer.download.nvidia.com/compu…vectorLoads.pdf
My interpretation of the paper is that this is not achieved through multiple 32-bit transactions, but through a single 64- or 128-bit memory transaction. This sounds plausible given the G80 has a 384-bit (or 320-bit) wide memory interface (so is a G80 a 384-bit processor? No.).
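What the paper describes can be written with the built-in CUDA vector types (a sketch, assuming the pointers are naturally aligned to 8 or 16 bytes, which cudaMalloc() guarantees):

__global__ void copy64(int2 *out, const int2 *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];  // one 64-bit transaction into two 32-bit registers
}

__global__ void copy128(int4 *out, const int4 *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];  // one 128-bit transaction into four 32-bit registers
}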
I think the 32-bit vs 64-bit memory transaction question is somewhat orthogonal to whether or not a G80 is a 32-bit architecture. That becomes important when it comes to doing something with the bits. If it’s okay for your algorithm to deal with 2x32-bit items (or even 8x32 bits) concurrently rather than as a single 64-bit item, then you’re all set. In my limited experiments, a G80 can do 8x4-byte bitwise operations concurrently on each multiprocessor.
If this doesnât answer your question, would you like to say what you are trying to do? It may be easier to try to help if you put questions into context.
BarryB, I did not say multiple 32 bit transactions, just that 64 and 128 bit loads/stores interface to multiple registers. If you get your coalescing right then yes you will get 384 bit loads and stores in parallel. Has to be spread over more than 1 thread.
Eric
ed: by the register width definition you would have to credit the G80 with a 512-bit architecture, as it can do 16 x 32-bit bitwise operations in 1 GPU clock cycle (per multiprocessor). It’s actually half that, and the clock runs twice as fast at the bottom level.
final ed: to answer the original topic question, I think you should look at the G80 as a 1024-bit architecture (that is how many bits a single instruction controls). It has a skinny bus and various mechanisms to handle that. The programming model is 32-bit within a thread. The thread architecture is a lot more useful than a traditional 1024-bit machine, EXCEPT there is no carry from one 32-bit ALU to another.
I need a 64-bit int and will perform bitwise operators on it.
First I need to find out if I can use a 64-bit int at all.
Use 2 threads per 64-bit int - as long as you don’t want to add or subtract, multiply or divide, or shift for that matter. Bitwise ops will all work.
Eric
but dynamic bit setting will not (as that involves shifting) so you probably need to redo your code for 32 bit.
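Alternatively, you can keep both halves of a 64-bit value in a single thread as two 32-bit words: bitwise ops apply to each half independently, and a shift just has to carry bits across the word boundary by hand. A sketch (the helper names shl64/lt64 are mine; the shift assumes 0 < s < 32, since s == 0 and s >= 32 need their own cases - shifting a 32-bit word by 32 is undefined):

// a 64-bit value held as a (hi, lo) pair of 32-bit words in one thread

// 64-bit left shift by s, for 0 < s < 32
__device__ void shl64(unsigned int *hi, unsigned int *lo, int s)
{
    *hi = (*hi << s) | (*lo >> (32 - s));  // bits carried up from the low word
    *lo = *lo << s;
}

// unsigned 64-bit compare: returns 1 if a < b
__device__ int lt64(unsigned int ahi, unsigned int alo,
                    unsigned int bhi, unsigned int blo)
{
    return (ahi != bhi) ? (ahi < bhi) : (alo < blo);
}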
Hmmm, but I need to shift and to compare. So that’s probably not gonna work?
No
Have we suddenly got new hardware or what?
When I compile on 0.9 Linux64 I get ptx code with 64-bit registers ($rdN), 64-bit arithmetic operations on device addresses, and 64-bit longs with 64-bit arithmetic operations!
    add.s64     $rd7, $rd3, 127;     //
    shl.b64     $rd8, $rd7, 32;     //
    ld.param.u64   $rd9, [__cudaparm_retl];     //  id:36 __cudaparm_retl+0x0
    st.global.s64  [$rd9+0], $rd8; //  id:37
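(For reference, PTX of that shape comes from a kernel along these lines - a guessed minimal reproducer, not the actual source; the parameter name retl matches the __cudaparm_retl symbol above:)

__global__ void testKernel(long long *retl, long long a)
{
    *retl = (a + 127) << 32;  // add.s64, shl.b64, then st.global.s64
}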
Is this just wound back to 32 bit by ptxas or do we really have full 64 bit integers?
Eric
ed: I presume all makefiles have to be updated to pass “-m32” directly to nvcc; perhaps the default should have been left as 32 bit?
ed: No, it does not generate linkable output otherwise, so back to the default 64-bit version - so how does ptxas handle it? (Also, libGLEW.a in the generic SDK is 32-bit and so incompatible with everything else; also doc/CUDA_Release_Notes_1.0.txt is missing from the Linux64 tools package - the README gives that filename. Was this release downgraded from 1.0 to 0.9?)
ed: maybe not so well: ptxas: /home/lfriedman/cuda64-stuff/sw/gpgpu_rel1.0/compiler/…/…/gpgpu_rel1.0/cg/src/common/cop/codegen/nv50/cop_nv50_common.cpp:3548: Dag* TransformScalarizeDagNV50(LdStruct*, Dag*, void*, int): Assertion `0 && "Unsupported division opcode"' failed.