Which is faster in CUDA 4.0 on a compute capability 2.0 device…
writing / reading global memory, OR bitwise operations (shifts) at runtime (more than once… in a loop)? The operations will be done on 64-bit numbers (or even “simulated” 128-bit numbers, via a struct).
So far I am using global memory, but the efficiency is really poor. My stack is stored in global memory, but it only holds 4-bit numbers (well… stored in the stack as char), so I thought I could store the stack as a 64-bit number and use logical operations instead of array accesses to global memory. If I use a 64-bit number as a stack of 4-bit numbers, I get a capacity of 16, which is enough for me in most cases.
Do you have to use global memory?
In the past I placed the stack in shared memory (each thread had its own stack).
Today I would be tempted to place each thread’s stack in local memory
and rely on top of the stack being in cache most of the time.
If that does not work: how deep does your stack have to be? If you only have
4 bits per item, could you fake a stack by shifting a register or two
left or right by 4 bits for each push or pop?
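
A minimal sketch of that shift-register stack, assuming 4-bit items and a depth of at most 16 as described in the question. The names NibbleStack, ns_init, ns_push, ns_top and ns_pop are made up for illustration and are not part of CUDA; the macro guard at the top is only there so the same source also builds with an ordinary C++ compiler for testing on the host.

```cpp
// Let the same source build with nvcc (device code) or a plain C++ compiler.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper type: up to 16 four-bit items packed into one 64-bit word.
struct NibbleStack {
    unsigned long long bits;  // the top of the stack lives in the lowest 4 bits
};

// Start with an empty stack (all nibbles zero).
__host__ __device__ inline void ns_init(NibbleStack &s) {
    s.bits = 0ULL;
}

// Push: shift everything left by one nibble and put the new item at the bottom.
// Only the low 4 bits of v are kept. A 17th push silently drops the oldest item.
__host__ __device__ inline void ns_push(NibbleStack &s, unsigned int v) {
    s.bits = (s.bits << 4) | (unsigned long long)(v & 0xFu);
}

// Peek at the top item without removing it.
__host__ __device__ inline unsigned int ns_top(const NibbleStack &s) {
    return (unsigned int)(s.bits & 0xFULL);
}

// Pop: read the lowest nibble, then shift right to expose the next item.
// Popping an empty stack returns 0; no depth counter is kept in this sketch.
__host__ __device__ inline unsigned int ns_pop(NibbleStack &s) {
    unsigned int v = ns_top(s);
    s.bits >>= 4;
    return v;
}
```

Inside a kernel each thread would declare its own NibbleStack as a local variable, so the compiler can keep it in registers and push/pop never touch global memory. Whether this beats a cached local-memory array in a given kernel is something to measure; a deeper stack could use two such words, at the cost of carrying a nibble between them on each shift.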
Dr. W. B. Langdon,
Department of Computer Science,
University College London
Gower Street, London WC1E 6BT, UK