Where is code stored on the device? And other interesting questions.

I have a few specific questions about the GeForce 8800 architecture and how it handles code and memory.

I would like to know where device code is stored on the card when it is ready to run. Is there dedicated memory for it, and if so, how much? Or is it stored in RAM and then cached, and if so, how big is the cache?

I understand that accessing device memory incurs hundreds of clock cycles of latency, but could someone clarify what that means? Does it mean memory clock cycles or GPU clock cycles? How long, in real time, does it actually take to read a 128-bit word from device memory?

Is it possible to set up a sort of software-managed cache for device memory? For instance, can you create a buffer (in registers or shared memory) where you store data for processing, and then do calculations on that data while you read more data into a second buffer? Can this be done inside the same thread, instead of relying on multiple warps swapping in and out?
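To make that second question concrete, here is a minimal sketch of the kind of thing I mean; the kernel name, the tile size, and the process() function are made-up placeholders:

#define TILE 128   // threads per block = elements per buffer (made-up size)

__device__ float process(float x)   // placeholder for the real per-element work
{
    return x * x + 1.0f;
}

__global__ void double_buffered(const float *in, float *out, int n_tiles)
{
    __shared__ float buf[2][TILE];
    const int tid = threadIdx.x;
    float acc = 0.0f;

    // Prime the first buffer.
    buf[0][tid] = in[tid];
    __syncthreads();

    for (int t = 0; t < n_tiles; ++t) {
        const int cur = t & 1;
        const int nxt = cur ^ 1;

        // Start loading the next tile before working on the current one.
        // (As far as I understand there is no asynchronous copy, so the
        // thread still stalls at this store until the load returns; any
        // real overlap would have to come from other warps.)
        if (t + 1 < n_tiles)
            buf[nxt][tid] = in[(t + 1) * TILE + tid];

        // Work on the current tile.
        acc += process(buf[cur][tid]);

        // Make sure the next tile is complete before the next iteration reads it.
        __syncthreads();
    }

    out[tid] = acc;
}

// Launched as, e.g.:  double_buffered<<<1, TILE>>>(d_in, d_out, n_tiles);
// with d_in holding n_tiles * TILE floats.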

If a register is declared in a specific scope (i.e. inside a device function), does the same register get used for each call to that function per thread, or does the device use another register for that variable? I ask because I kept getting launch failures when I looped over a set of data within the kernel, but not when I looped over the same data on the CPU and called the kernel multiple times. And before anyone asks, no, I was not hitting the 5 second limit; the entire calculation takes about 0.05 seconds to run. Also, when I check the number of registers used, it shows fewer than 100.

Thanks in advance.

How many threads? The multiprocessor has an 8K register file to be shared among all threads, i.e. #regs * #threads * sizeof(regs) <= 8K

Peter

edit: sorry, forget the sizeof; it has 8K 32-bit registers.
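As a quick illustrative check (the numbers are invented; plug in the per-thread register count the compiler reports for your kernel):

#include <stdio.h>

int main(void)
{
    const int regs_per_mp     = 8192;  // 8K 32-bit registers per multiprocessor
    const int regs_per_thread = 25;    // example: what the compiler reports for a kernel
    const int threads_per_blk = 256;   // example launch configuration

    const int max_threads = regs_per_mp / regs_per_thread;  // 327 here
    const int max_blocks  = max_threads / threads_per_blk;  // only whole blocks are resident

    printf("max resident threads per multiprocessor: %d\n", max_threads);
    printf("max resident blocks of %d threads:       %d\n", threads_per_blk, max_blocks);
    return 0;
}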

I will have a go at answering some of these questions from my understanding of the literature.

This is a replaced answer: the correct answer, from another recent topic, is that it is stored in video RAM and limited to 2 MB, which is quite a lot for the application areas that the current packaging is suitable for.

Now this is a really interesting one that I was also wondering about. When you calculate back from the headline memory throughput spec of 86 GB/s on the GTX, that does jibe with 1800 MHz and a 48-byte-wide bus, presumably in page mode, so it must be sequential (it has been a long time since I was involved in memory design). That equates to 1800 x 3 / 16 Mwords/sec available per multiprocessor, i.e. about 337 Mwords (128-bit words) per second.

Now calculate back from a stated latency of, say, 250 GPU clocks for a global read (it only makes sense to quote GPU clocks, as that is what one is counting in your program). One warp read is 32 x 32 / 128 = 8 words, so at 575 MHz that gives 18.4 Mwords/sec per multiprocessor. That is a huge gap - cut the headline rate in half for random access and there is still a big gap. Some of these extra cycles will be instruction fetches. Even if we are generous and say they meant memory clocks, there is still more than a factor of 2 - are we being had?
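For anyone who wants to play with the numbers, here is that arithmetic as a small host-side snippet; all the figures are the assumed spec values quoted above, nothing is measured:

#include <stdio.h>

int main(void)
{
    // Assumed spec figures from the discussion above.
    const double mem_rate       = 1800e6;  // effective memory transfer rate, Hz
    const double bus_bytes      = 48.0;    // 384-bit bus
    const double n_mp           = 16.0;    // multiprocessors on the GTX
    const double word_bytes     = 16.0;    // one 128-bit word
    const double gpu_clock      = 575e6;   // GPU clock, Hz
    const double latency_clocks = 250.0;   // quoted global read latency, GPU clocks
    const double words_per_read = 8.0;     // 32 threads x 32 bits = 8 x 128-bit words

    // Headline bandwidth and the share one multiprocessor could get.
    const double peak_bytes        = mem_rate * bus_bytes;            // ~86.4 GB/s
    const double peak_words_per_mp = peak_bytes / word_bytes / n_mp;  // ~337 Mwords/s

    // What one warp issuing dependent reads back to back would see.
    const double one_warp_words = gpu_clock / latency_clocks * words_per_read;  // ~18.4 Mwords/s

    printf("peak:     %.1f Mwords/s per multiprocessor\n", peak_words_per_mp / 1e6);
    printf("one warp: %.1f Mwords/s per multiprocessor\n", one_warp_words / 1e6);
    return 0;
}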

The whole architecture is oriented towards having enough warps that some warp always has something to do, with a very fast context switch (just a multiplexer). It really is a very efficient, intelligent caching system; I think it is the beauty of the architecture.

This may relate to another of my recent posts on overlaying auto storage: it could be that your top-level loop is being unrolled, your function is inlined that many times, and registers are not being recycled, so you are running out. Sounds like a worst-case scenario.

Hope that helps. Can anyone else comment on the discrepancy in memory throughput?

Eric

ed: Forgot there are 16 warps doing their thing per multiprocessor, so that's about 290 Mwords/sec and we are nearly there. Put the other way around, 1800 x 3 / 16 / 16 / 8 ≈ 2.6 M warp reads/sec per warp, i.e. 575 / 2.6 ≈ 218 GPU clocks of pure throughput available per read, so 200-300 sounds reasonable in practice.

There is a very simple test for my last hypothesis: try passing the number of loop iterations as an argument to your kernel. That will stop the compiler from unrolling the loop.
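Something along these lines, where the kernel names and work() are invented just to illustrate the two versions:

__device__ float work(float x, int i)   // placeholder for the real device function
{
    return x * 0.5f + (float)i;
}

// Trip count fixed at compile time: the compiler is free to fully unroll
// the loop and inline work() once per iteration.
#define N_PASSES 64
__global__ void kernel_fixed(float *data)
{
    float v = data[threadIdx.x];
    for (int i = 0; i < N_PASSES; ++i)
        v = work(v, i);
    data[threadIdx.x] = v;
}

// Trip count passed in at run time: the compiler cannot fully unroll.
__global__ void kernel_variable(float *data, int n_passes)
{
    float v = data[threadIdx.x];
    for (int i = 0; i < n_passes; ++i)
        v = work(v, i);
    data[threadIdx.x] = v;
}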
Eric

Ooh, good thought. I will give that a shot sometime soon.

Also, thanks for the quick reply.