I have a few specific questions about the GeForce 8800 architecture and how it handles code and memory.
I would like to know where device code is stored on the card when it is ready to run. Is there dedicated memory for it, and if so, how much? Or is it stored in RAM and then cached, and if so, how big is the cache?
I understand that accessing device memory has a latency of hundreds of clock cycles, but could someone clarify what that means? Does it refer to memory clock cycles or GPU clock cycles? How long, in real time, does it actually take to read a 128-bit word from device memory?
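To make the units question concrete, the conversion would look like the following. The 400-cycle latency is only an illustrative round number, and the clock rates are the commonly quoted 8800 GTX figures (1.35 GHz shader, 900 MHz memory, 575 MHz core), taken here as assumptions rather than measured values:

```latex
t = \frac{N_{\text{cycles}}}{f_{\text{clock}}}
\qquad
\frac{400}{1.35\,\text{GHz}} \approx 296\,\text{ns},
\quad
\frac{400}{900\,\text{MHz}} \approx 444\,\text{ns},
\quad
\frac{400}{575\,\text{MHz}} \approx 696\,\text{ns}
```

Under these assumptions the real-time figure changes by more than a factor of two depending on which clock the "hundreds of cycles" are counted in.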
Is it possible to set up a sort of cache for device memory in software? For instance, can you create a buffer (in registers or shared memory) where you store data for processing, and then do calculations on that data while reading more data into a second buffer? Can this be done inside a single thread, instead of relying on multiple warps that get swapped out?
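To illustrate the kind of single-thread double buffering meant here, a minimal sketch follows. The names `process`, `prefetch_sum`, and `run_prefetch_sum` are hypothetical, the two "buffers" are just the registers `current` and `next`, and whether the load really overlaps the arithmetic depends on how the compiler and hardware schedule it, which is exactly the open question:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical per-element work; stands in for the real calculation.
__device__ float process(float x) { return x * x + 1.0f; }

// Each thread walks a strided slice of `in`, keeping two register buffers:
// `current` (being processed) and `next` (being loaded).
__global__ void prefetch_sum(const float *in, float *out, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float acc  = 0.0f;

    if (tid < n) {
        float current = in[tid];              // fill the first buffer
        for (int i = tid; i < n; i += stride) {
            int   next_i = i + stride;
            float next   = 0.0f;
            if (next_i < n)
                next = in[next_i];            // start loading the second buffer
            acc += process(current);          // does not depend on `next`
            current = next;                   // swap buffers
        }
    }
    out[tid] = acc;                           // `out` has one slot per thread
}

// Host driver (hypothetical): fills the input with 1.0f and returns the sum
// of all per-thread partial results.
double run_prefetch_sum(int n)
{
    const int threads = 128, blocks = 4;
    const int total   = threads * blocks;

    float *h_in  = (float *)malloc(n * sizeof(float));
    float *h_out = (float *)calloc(total, sizeof(float));
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    float *d_in = NULL, *d_out = NULL;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, total * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    prefetch_sum<<<blocks, threads>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, total * sizeof(float), cudaMemcpyDeviceToHost);

    double sum = 0.0;
    for (int i = 0; i < total; ++i) sum += h_out[i];

    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return sum;
}
```

With every input set to 1.0f, each element contributes 2.0, so `run_prefetch_sum(n)` should return 2n on a working device. The same pattern could use shared memory instead of registers, at the cost of extra synchronization between threads.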
If a register variable is declared in a specific scope (e.g., inside a device function), does the same register get used on every call to that function within a thread, or does the device use a different register for that variable each time? I ask because I kept getting launch failures when I looped over a set of data inside the kernel, but not when I looped over the same data on the CPU and called the kernel multiple times. And before anyone asks: no, I was not hitting the 5-second limit; the entire calculation takes about 0.05 seconds to run. Also, when I check the number of registers used, it shows fewer than 100.
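For reference, the two looping variants described above have roughly the following shape. This is a hypothetical sketch, not the actual code: `scale`, `kernel_loop_inside`, `kernel_one_chunk`, and `compare_variants` are made-up names, and the per-element work is a placeholder. The local `tmp` is the kind of scoped variable the register question is about:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical device function with a variable local to its scope.
__device__ float scale(float x)
{
    float tmp = x * 0.5f;   // same register on every call, or a new one?
    return tmp + 1.0f;
}

// Variant A: one launch, the kernel loops over all chunks itself.
__global__ void kernel_loop_inside(float *data, int chunk, int num_chunks)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= chunk) return;
    for (int c = 0; c < num_chunks; ++c)
        data[c * chunk + tid] = scale(data[c * chunk + tid]);
}

// Variant B: one chunk per launch, the host does the looping.
__global__ void kernel_one_chunk(float *data, int chunk)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < chunk)
        data[tid] = scale(data[tid]);
}

// Runs both variants on identical input. Returns the number of elements
// that differ, or -1 if the runtime reported an error (e.g. launch failure).
int compare_variants(int chunk, int num_chunks)
{
    const int    n     = chunk * num_chunks;
    const size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_a[i] = (float)i;

    float *d_a = NULL, *d_b = NULL;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_a, bytes, cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks  = (chunk + threads - 1) / threads;

    kernel_loop_inside<<<blocks, threads>>>(d_a, chunk, num_chunks);
    for (int c = 0; c < num_chunks; ++c)
        kernel_one_chunk<<<blocks, threads>>>(d_b + c * chunk, chunk);

    // Kernel errors surface at the next synchronizing call.
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess) err = cudaGetLastError();

    int mismatches = -1;
    if (err == cudaSuccess) {
        cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);
        cudaMemcpy(h_b, d_b, bytes, cudaMemcpyDeviceToHost);
        mismatches = 0;
        for (int i = 0; i < n; ++i)
            if (h_a[i] != h_b[i]) ++mismatches;
    } else {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_a); cudaFree(d_b);
    free(h_a); free(h_b);
    return mismatches;
}
```

Calling something like `compare_variants(256, 8)` and checking for a return value of 0 would show whether the two variants agree on a given device, and a -1 would reproduce the launch failure in a small, shareable test case.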
Thanks in advance.