Speed - Global memory access vs. bitwise operation

Which is faster with CUDA 4.0 on a compute capability 2.0 device:

reading and writing global memory, OR bitwise operations (shifts) at run time, performed repeatedly in a loop? The operations would be on 64-bit numbers (or even "simulated" 128-bit numbers, using a struct).

So far I have been using global memory, but the efficiency is really poor. My stack is stored in global memory, but it only holds 4-bit numbers (well… each one is stored as a char), so I thought I could store the stack as a single 64-bit number and use bitwise operations instead of array accesses to global memory. Using a 64-bit number as a stack of 4-bit numbers gives a capacity of 16, which is enough for me in most cases.
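For illustration, the packed stack described above might look roughly like the following. This is only a minimal sketch: the names `NibbleStack`, `push`, `pop` and the `HD` macro are hypothetical, and it assumes the newest entry is kept in the low 4 bits. It does not track depth, so a 17th push silently drops the oldest entry and a pop from an empty stack returns 0.

```cpp
// Sketch only: a 16-entry stack of 4-bit values packed into one 64-bit word.
// All names (NibbleStack, push, pop, HD) are hypothetical.

#ifdef __CUDACC__
#define HD __host__ __device__   // usable inside kernels when built with nvcc
#else
#define HD                       // plain C++ build, e.g. for testing on the host
#endif

struct NibbleStack {
    unsigned long long bits;     // newest entry lives in the low 4 bits
};

// Push: move the older entries up by one nibble, then store the new value.
HD inline void push(NibbleStack &s, unsigned v) {
    s.bits = (s.bits << 4) | (unsigned long long)(v & 0xFu);
}

// Pop: read the low nibble, then move the older entries back down.
HD inline unsigned pop(NibbleStack &s) {
    unsigned v = (unsigned)(s.bits & 0xFULL);
    s.bits >>= 4;
    return v;
}
```

If the `NibbleStack` is a local variable of the kernel and the calls are inlined, the compiler can normally keep `bits` in registers, so a push or pop costs a few integer instructions and no memory traffic.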

Do you have to use global memory?

In the past I placed the stack in shared memory (each thread had its own stack).


Today I would be tempted to place each thread's stack in local memory and rely on the top of the stack being in cache most of the time.

If that does not work: how deep does your stack have to be? If you only have 4 bits per item, could you fake a stack by shifting a register or two left or right by 4 bits on each push or pop?
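The "register or two" variant could be sketched as below, giving a depth of 32 entries from two 64-bit words. This is an assumption about how one might do it, not a tested recommendation: `NibbleStack128`, `push128`, `pop128` and the `HD` macro are made-up names, and there is no overflow or underflow checking.

```cpp
// Sketch only: a 32-entry stack of 4-bit values spread over two 64-bit words.
// All names (NibbleStack128, push128, pop128, HD) are hypothetical.

#ifdef __CUDACC__
#define HD __host__ __device__   // usable inside kernels when built with nvcc
#else
#define HD                       // plain C++ build, e.g. for testing on the host
#endif

struct NibbleStack128 {
    unsigned long long lo;       // 16 newest entries, newest in the low 4 bits
    unsigned long long hi;       // 16 older entries
};

// Push: shift both words left by one nibble, carrying the top nibble
// of lo into the bottom of hi, then store the new value in lo.
HD inline void push128(NibbleStack128 &s, unsigned v) {
    s.hi = (s.hi << 4) | (s.lo >> 60);
    s.lo = (s.lo << 4) | (unsigned long long)(v & 0xFu);
}

// Pop: read the low nibble of lo, then shift both words right by one
// nibble, carrying the bottom nibble of hi into the top of lo.
HD inline unsigned pop128(NibbleStack128 &s) {
    unsigned v = (unsigned)(s.lo & 0xFULL);
    s.lo = (s.lo >> 4) | (s.hi << 60);
    s.hi >>= 4;
    return v;
}
```

Each push or pop is a handful of shift and OR instructions on values that can stay in registers, which is the point of the suggestion; whether it actually beats a cached local-memory stack in a given kernel would have to be measured.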


