CUDA: speed of a memory read vs. a conditional expression


I’m working on MT19937 RNG in CUDA.

RNGs like this one rely on integer bit operations. The most important operation is

computing x*A, where x is 0 or 1 and A is a 32-bit integer that represents a matrix.

The original Mersenne Twister code computes it like this:

#define A 0x9908b0dfUL

    unsigned long mag01[2]={0x0UL, A};

    x = mag01[ y & 0x1UL];

The Mersenne Twister example in the CUDA SDK does this instead:

#define A 0x9908b0dfUL

    x = ( (y & 1) ? A : 0 );

The original code uses an “&” operation on an integer plus a memory read.
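For context, here is a minimal sketch of where this selection sits in the MT19937 state update, written as plain C so it can be checked on the host. The masks and the recurrence follow the reference mt19937ar.c as I remember it. The function names twist_lookup and twist_ternary and the parameter names are made up for this illustration, and uint32_t replaces unsigned long on the assumption that you want a 32-bit type on the GPU.

```c
#include <stdint.h>

#define MATRIX_A   0x9908b0dfU  /* the constant called A in this thread */
#define UPPER_MASK 0x80000000U  /* most significant bit of a state word */
#define LOWER_MASK 0x7fffffffU  /* least significant 31 bits */

/* One state-update step, table-lookup style (as in the original C code).
   mt_k, mt_k1, mt_km stand for mt[k], mt[k+1], mt[k+M] (hypothetical names). */
static uint32_t twist_lookup(uint32_t mt_k, uint32_t mt_k1, uint32_t mt_km)
{
    static const uint32_t mag01[2] = { 0x0U, MATRIX_A };
    uint32_t y = (mt_k & UPPER_MASK) | (mt_k1 & LOWER_MASK);
    return mt_km ^ (y >> 1) ^ mag01[y & 0x1U];  /* memory read indexed by low bit */
}

/* The same step, conditional-expression style (as in the SDK line above). */
static uint32_t twist_ternary(uint32_t mt_k, uint32_t mt_k1, uint32_t mt_km)
{
    uint32_t y = (mt_k & UPPER_MASK) | (mt_k1 & LOWER_MASK);
    return mt_km ^ (y >> 1) ^ ((y & 1U) ? MATRIX_A : 0U);  /* no memory read */
}
```

Both functions return the same word for any input, so the choice between them only affects how the compiler maps the selection to instructions and memory.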

In my CUDA application, I would keep mag01 in registers, constant memory, or shared memory.

The CUDA SDK version uses only the “?” operation. I think the CUDA SDK developers wanted to save registers.

Do you have any opinions on which one is better in terms of speed and memory use in the CUDA environment? My guess is that the ? operation is heavier than the & operation. Am I right?

? could be heavier, but that does not hold for every architecture or every piece of code (it is not always true even on x86).

If NVIDIA hardware has a “conditional set” operation, ? could be as fast as &.

But memory access is almost always evil.

My guess is that when you run a lot of threads on one multiprocessor, CUDA can hide most of the latency of both variants, so they could perform similarly.

The only way to know for sure is to benchmark. It would be useful for everyone if you ran the benchmark and posted the results here :-)

If you just want to avoid the branch and the compiler is not bright enough to do it by itself, the following code should work as well:
    x = (-(y & 1)) & A;
(and yes, it is bad style if y is signed, since it then assumes numbers are represented as two’s complement; with an unsigned y the negation wraps around and is well defined).
Also, your first example might end up using local memory, which would be very slow. Better check the generated code, either the PTX output or with decuda.
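To make the comparison concrete, here is a small sketch in plain C that puts the three variants from this thread side by side and checks that they agree. The function names are invented for the example, and uint32_t is an assumption about the type of y. It says nothing about speed on the GPU, only that the results are identical.

```c
#include <stdint.h>

#define MATRIX_A 0x9908b0dfU  /* the constant called A in this thread */

/* Variant 1: table lookup, needs mag01 to live somewhere in memory. */
static uint32_t select_lookup(uint32_t y)
{
    static const uint32_t mag01[2] = { 0x0U, MATRIX_A };
    return mag01[y & 0x1U];
}

/* Variant 2: conditional expression, as in the SDK sample. */
static uint32_t select_ternary(uint32_t y)
{
    return (y & 1U) ? MATRIX_A : 0U;
}

/* Variant 3: branch-free mask. (0 - bit) is 0 or all ones for unsigned
   operands, so the AND yields either 0 or MATRIX_A. */
static uint32_t select_mask(uint32_t y)
{
    return (uint32_t)(0U - (y & 1U)) & MATRIX_A;
}

/* Returns 1 if all three variants give the same result for y, else 0. */
static int variants_agree(uint32_t y)
{
    uint32_t a = select_lookup(y);
    return a == select_ternary(y) && a == select_mask(y);
}
```

Calling variants_agree on a handful of odd and even values is a quick sanity check before swapping one variant for another inside a kernel.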