Where Is the Kernel Code Stored?

Alex.K · June 18, 2011, 4:52pm

Hello,

I am writing a kernel for which the number of registers
is a limiting factor, hence I want to minimize their use.
One way to do that is through loop unrolling.

Furthermore, my kernel consists of several logical kernels
(e.g. one kernel for Matrix vector multiplication, one kernel
for transforming the resulting vector in some way, etc.) all
written in one big kernel. I read somewhere that uniting
logically different kernels in one big kernel is a good way to
save some clock cycles used for setting up a new kernel.

The problem is that both loop unrolling and kernel merging
lead to large kernel code.

Can someone please help me and tell me where the kernel code
is stored, and how the size of the kernel code affects both
speed and possibly other resources (e.g. shared memory)?

Thank you in advance.

hyqneuron · June 18, 2011, 5:11pm

Unrolling doesn’t reduce register usage. It reduces some loop instruction overhead. Don’t do excessive unrolling as it may overflow the instruction cache, increase pressure on global memory bandwidth and reduce L2 caching efficiency.

When you have too much register spillage, try breaking it into separate kernels.

Kernel machine code is stored in cubin files, which can be embedded in executables. You can use cuobjdump -elf cubin to check out the kernel size.

Alex.K · June 18, 2011, 6:20pm

Thank you for the answer.

Suppose the loop has two looping variables --for instance, if I am looping over a two dimensional splitting of a large matrix into smaller matrix blocks-- then that adds up to two registers per thread. If I also have 1024 threads, then that is 2048 extra registers, which is about 12% of the total number of registers for a block in a Tesla C1060. Doesn’t the complete unrolling of that loop reduce register usage?
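The arithmetic above can be checked against what the device itself reports. This is only a minimal sketch, assuming the CUDA runtime API is available and that device 0 is the card of interest; the helper names `registerShare` and `reportLoopCounterShare` are made up for illustration. It reads `regsPerBlock` and `maxThreadsPerBlock` from `cudaGetDeviceProperties` instead of hard-coding them. If the device reports 16384 registers per block, 2048 registers come to 12.5% of the budget.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: fraction of a block's register budget consumed by
// `regsPerThread` registers in each of `threads` threads.
double registerShare(int regsPerThread, int threads, int regsPerBlock) {
    return static_cast<double>(regsPerThread) * threads / regsPerBlock;
}

// Prints the register budget of device 0 and the share taken by two loop
// counters per thread at the largest block size the device allows.
// Returns false if the device query fails.
bool reportLoopCounterShare() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;
    int threads = prop.maxThreadsPerBlock;
    std::printf("%s: %d registers per block, up to %d threads per block\n",
                prop.name, prop.regsPerBlock, threads);
    std::printf("2 loop counters x %d threads = %.1f%% of the budget\n",
                threads, 100.0 * registerShare(2, threads, prop.regsPerBlock));
    return true;
}
```

Whether the two counters really occupy two dedicated registers is up to the compiler, so the printed share is an upper bound on what unrolling could save, not a measurement.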

Or is it the case that the larger code can lead to increased register usage that dwarfs the two extra registers needed for the looping variables?

It is implicit in your answer that during execution, the kernel code is stored in global memory and then cached in the instruction cache. This is what I wanted to know. I will also find the kernel size and compare it with the size of the instruction cache on my machine.

Maybe I should start another thread, but I’ll ask this here anyway: What is the downside to using multiple kernels?

Is it just the time needed to transfer the code and data from global memory to instruction cache, shared memory and registers (I assume that no RAM transfers are needed)?

Or is there some other time consuming activity involved in starting a new kernel?

Thank you again for the answer.

hyqneuron · June 19, 2011, 12:49am

Well, I still don’t get why you think unrolling would reduce register usage. Are you saying that unrolling would help you reduce the number of threads needed, and thus the total number of registers? That wouldn’t be called unrolling. That’s increasing the number of loop iterations per thread. Of course, this is always the way to go when you have a large dataset and want to avoid kernel/block scheduling overhead.

If the second iteration of the loop uses values calculated in the first iteration, and those values cannot be known at compile time, you might get a lot of register spillage, depending on what the compiler chooses to do.

That’s right about where the code lives. It’s in global memory, then loaded into L2 and then the instruction cache. The instruction cache for CC 1.x should be 8 KB, according to a paper that did some comprehensive microbenchmarking on G80.

As for the cost of starting a new kernel, I’m not sure how efficient the GigaThread Engine is. Few people are. Maybe someone else who has done some microbenchmarking could enlighten you. All I know is that the overhead involves something on the driver’s side, then the kernel launch request needs to be sent over PCI-E to the GPU. If the request is sent while another kernel is running, these overheads may be hidden. Once the kernel launch request enters the GPU, it is queued until the previous kernel is finished. Then something needs to be done with constant memory, which is used to pass the kernel function arguments. And of course, as you said, the GPU also needs to load the respective instructions if they’re not already cached.
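The point about overheads being hidden when a launch request arrives while another kernel is still running can be probed with a small timing experiment. The following is only a sketch, assuming the CUDA runtime API; `emptyKernel` and `microsecondsPerLaunch` are hypothetical names. It launches an empty kernel many times, either waiting for each launch to finish before issuing the next or letting the launches queue up, and reports the average host-side time per launch.

```cuda
#include <chrono>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

// Average wall-clock microseconds per launch of an empty kernel.
// syncEachLaunch = true waits for every kernel before issuing the next;
// syncEachLaunch = false lets the launches queue up and waits once at the end.
double microsecondsPerLaunch(int launches, bool syncEachLaunch) {
    emptyKernel<<<1, 1>>>();   // warm-up, so one-time setup is not timed
    cudaDeviceSynchronize();

    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < launches; ++i) {
        emptyKernel<<<1, 1>>>();
        if (syncEachLaunch) cudaDeviceSynchronize();
    }
    cudaDeviceSynchronize();   // make sure all queued kernels have finished
    auto stop = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / launches;
}
```

Comparing `microsecondsPerLaunch(1000, true)` with `microsecondsPerLaunch(1000, false)` gives a rough idea of how much of the launch cost overlaps with execution on a given driver and card. The numbers depend heavily on the system, so they are something to measure, not to assume.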

Alex.K · June 19, 2011, 3:31pm

To my mind, the code:

    int a = 5;
    for (int i = 0; i < 3; ++i)
    { a *= a; }

must use at least two registers, one for a and one for i (unless the compiler unrolls the loop completely), while the code:

    int a = 5;
    a *= a;
    a *= a;
    a *= a;

can use only one register.

That is the possibly simplistic reasoning I used for saying that loop unrolling can decrease register usage.
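One way to test this reasoning is to ask the compiler. The sketch below is an illustration under assumptions, not a statement about what any particular compiler version does: the kernel and helper names are invented, and the input is read from memory so that the compiler cannot fold the whole computation into a constant. `#pragma unroll 1` asks the compiler to keep the loop, plain `#pragma unroll` asks for complete unrolling, and `cudaFuncGetAttributes` reports how many registers per thread were actually allocated.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same arithmetic in both kernels; only the unroll hint differs.
__global__ void squareRolled(int *data) {
    int a = data[threadIdx.x];
    #pragma unroll 1                       // ask the compiler to keep the loop
    for (int i = 0; i < 3; ++i) { a *= a; }
    data[threadIdx.x] = a;
}

__global__ void squareUnrolled(int *data) {
    int a = data[threadIdx.x];
    #pragma unroll                         // ask for complete unrolling
    for (int i = 0; i < 3; ++i) { a *= a; }
    data[threadIdx.x] = a;
}

// Registers per thread the compiler allocated for `kernel`, or -1 on error.
template <typename Kernel>
int registersPerThread(Kernel kernel) {
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, kernel) != cudaSuccess) return -1;
    return attr.numRegs;
}

// Runs `kernel` with a single thread on one value and returns the result.
// Error checking is omitted to keep the sketch short.
template <typename Kernel>
int runOnce(Kernel kernel, int input) {
    int *d = nullptr, out = 0;
    cudaMalloc(&d, sizeof(int));
    cudaMemcpy(d, &input, sizeof(int), cudaMemcpyHostToDevice);
    kernel<<<1, 1>>>(d);
    cudaMemcpy(&out, d, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return out;
}

void printRegisterUsage() {
    std::printf("rolled:   %d registers per thread\n", registersPerThread(squareRolled));
    std::printf("unrolled: %d registers per thread\n", registersPerThread(squareUnrolled));
}
```

Both kernels compute the same value (390625 for an input of 5), so any difference in the reported counts comes from the unroll hint alone. Compiling with `nvcc -Xptxas -v` prints the same register counts at build time.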

Regarding the kernel start-up time, if starting the kernel involves transfer of whatever data between the CPU or RAM and the GPU then that involves large latency, so that may indeed be a strong reason against using multiple kernels. But if this is undocumented territory then I will be better off simply experimenting with a single kernel and with multiple kernels for my specific problem, to see if the latency can be hidden or not.
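Such an experiment could look roughly like the following. It is a sketch under stated assumptions: `scale` and `shift` are toy stand-ins for the logically separate kernels (not the actual matrix-vector code), `scaleThenShift` is their merged form, and `timeRounds` is a hypothetical helper that times either variant with CUDA events.

```cuda
#include <cuda_runtime.h>

// Two toy stages standing in for logically separate kernels.
__global__ void scale(float *v, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= s;
}

__global__ void shift(float *v, int n, float t) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] += t;
}

// The same two stages merged into one kernel.
__global__ void scaleThenShift(float *v, int n, float s, float t) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = v[i] * s + t;
}

// Runs `repeats` rounds of either the merged kernel or the two separate
// kernels on n floats. Returns elapsed GPU time in milliseconds, or -1 if
// the allocation fails.
float timeRounds(bool fused, int n, int repeats) {
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, n * sizeof(float));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int r = 0; r < repeats; ++r) {
        if (fused) {
            scaleThenShift<<<blocks, threads>>>(d, n, 0.5f, 1.0f);
        } else {
            scale<<<blocks, threads>>>(d, n, 0.5f);
            shift<<<blocks, threads>>>(d, n, 1.0f);
        }
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);            // wait until all rounds are done

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ms;
}
```

Calling `timeRounds(true, 1 << 20, 100)` and `timeRounds(false, 1 << 20, 100)` and comparing the results gives a first impression. Note that the merged version also saves one round trip through global memory per element, so any difference reflects more than launch overhead alone.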

Anyway, you’ve been very helpful – I have a good general idea of what I wanted to know (the loop unrolling discussion is a sideshow).