Speed - Global memory access vs. bitwise operation

Which is faster in CUDA 4.0 on a compute capability 2.0 device…

Writing and reading global memory, or bitwise operations (shifts) at run time? Either would happen more than once, inside a loop. The operations would act on 64-bit numbers (or even "simulated" 128-bit numbers, built from a struct).

So far I am using global memory, but the efficiency is really poor. My stack is stored in global memory, but it only holds 4-bit numbers (stored in the stack as char), so I thought I could store the stack in a 64-bit number and use logical operations instead of array accesses to global memory. A 64-bit number used as a stack of 4-bit numbers gives a capacity of 16, which is enough for most of my cases.
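The idea of packing sixteen 4-bit items into one 64-bit word could look roughly like the sketch below. It is only an illustration, not code from the question: the names `NibbleStack` and `HD` are hypothetical, the top of the stack is assumed to live in the low 4 bits, and the code is written so that it builds both as plain C++ and under nvcc.

```cpp
// Expands to CUDA qualifiers under nvcc and to nothing in a plain C++ build.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: up to 16 four-bit items packed into one 64-bit word.
// The most recently pushed item sits in the low 4 bits.
struct NibbleStack {
    unsigned long long bits;  // packed items
    int count;                // number of items currently stored

    HD NibbleStack() : bits(0), count(0) {}

    // Shift the existing items up by 4 bits and put v in the low nibble.
    HD bool push(unsigned v) {
        if (count == 16) return false;  // stack is full
        bits = (bits << 4) | (v & 0xFu);
        ++count;
        return true;
    }

    // Read the low nibble, then shift the remaining items down by 4 bits.
    HD bool pop(unsigned &v) {
        if (count == 0) return false;   // stack is empty
        v = static_cast<unsigned>(bits & 0xFu);
        bits >>= 4;
        --count;
        return true;
    }
};
```

If each thread keeps such a struct in a local variable, the compiler is free to hold it in registers, so a push or pop becomes a few shift and mask instructions. Whether that actually beats the global-memory version would have to be measured on the device.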

Do you have to use global memory?

In the past I placed the stack in shared memory (each thread had its own stack).

Today I would be tempted to place each thread's stack in local memory and rely on the top of the stack being in cache most of the time.

If that does not work: how deep does your stack have to be? If you only have 4 bits per item, could you fake a stack by shifting a register or two left or right by 4 bits for each push or pop?
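For the "register or two" case, a minimal sketch with two 64-bit words (32 items of 4 bits) might look like this. The names `NibbleStack128` and `HD` are hypothetical and the layout (newest item in the low nibble of `lo`, older items spilling into `hi`) is just one possible choice; the code is meant to build both as plain C++ and under nvcc.

```cpp
// Expands to CUDA qualifiers under nvcc and to nothing in a plain C++ build.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: a 32-deep stack of 4-bit items held in two 64-bit words.
struct NibbleStack128 {
    unsigned long long lo;  // the 16 most recently pushed items
    unsigned long long hi;  // older items that have been shifted out of lo
    int count;              // number of items currently stored

    HD NibbleStack128() : lo(0), hi(0), count(0) {}

    // Shift the whole 128-bit value left by 4 bits, carrying the top nibble
    // of lo into hi, then place v in the low nibble of lo.
    HD bool push(unsigned v) {
        if (count == 32) return false;  // stack is full
        hi = (hi << 4) | (lo >> 60);
        lo = (lo << 4) | (v & 0xFu);
        ++count;
        return true;
    }

    // Read the low nibble of lo, then shift the whole 128-bit value right by
    // 4 bits, bringing the low nibble of hi back into the top of lo.
    HD bool pop(unsigned &v) {
        if (count == 0) return false;   // stack is empty
        v = static_cast<unsigned>(lo & 0xFu);
        lo = (lo >> 4) | (hi << 60);
        hi >>= 4;
        --count;
        return true;
    }
};
```

Each push or pop here costs a handful of shifts and ORs and touches no memory as long as the struct stays in registers.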


Dr. W. B. Langdon,
    Department of Computer Science,
    University College London
    Gower Street, London WC1E 6BT, UK

