NVIDIA people, please pay attention, still have no meaningful answer How to estimate the proximity t

Romant · November 5, 2010, 2:58pm

Hi,

At the moment my kernel takes about 5 minutes to compile (for sm_20) with the --opencc-options -OPT:Olimit=0 option. The resulting .cubin file is 480 kilobytes and the .fatbin.c file is about 9 megabytes; it contains 459294 items in the static const unsigned long long _deviceText$sm_20$ array.

Is that a lot or not? How close is my kernel to the 2 million instruction limit?

Thanks in advance.

tera · November 5, 2010, 3:21pm

As current GPUs have (at least) 32-bit wide instructions, and 32 bits are 8 characters in a hexdump, you are still an order of magnitude away from the instruction limit.
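
The arithmetic behind such an estimate can be sketched as follows. This is only a rough illustration under stated assumptions: it takes the 480-kilobyte .cubin as the file to measure (the thread does not settle which file is meant), treats the 2 million figure as a count of machine instructions, and uses the minimum 32-bit instruction width mentioned above. The names `INSTRUCTION_LIMIT` and `max_instructions` are made up for the example.

```python
# Rough upper bound on the number of machine instructions in a compiled kernel,
# derived from the binary size and the minimum instruction width.

INSTRUCTION_LIMIT = 2_000_000  # limit quoted in this thread (assumed to count machine instructions)


def max_instructions(code_bytes, bytes_per_instruction=4):
    """Upper bound: every byte is assumed to be code, every instruction minimal width."""
    return code_bytes // bytes_per_instruction


# .cubin size reported in the first post, taking 1 kilobyte = 1024 bytes (assumption).
cubin_bytes = 480 * 1024

upper_bound = max_instructions(cubin_bytes)
headroom = INSTRUCTION_LIMIT / upper_bound

print("at most", upper_bound, "instructions")
print("limit is about", round(headroom, 1), "times larger")
```

Because the file also holds headers and other non-code data, and instructions may be wider than 32 bits, the real count can only be lower than this bound.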

How do you arrive at such large kernels? Is your source file of similar size, or do you use a lot of templating and/or loop unrolling? For loop unrolling, partial unrolling is often about as effective as complete unrolling, but greatly reduces the pressure on the instruction cache.
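
To see why partial unrolling keeps code size down, here is a deliberately crude model of how many instructions a single loop expands to. Everything in it is hypothetical: the function `unrolled_code_size`, the 20-instruction loop body, the 1024 iterations and the 3-instruction loop overhead are illustrative numbers, not measurements from any real kernel or compiler.

```python
def unrolled_code_size(trip_count, body_instructions, unroll_factor, loop_overhead=3):
    """Approximate number of instructions emitted for one loop (toy model)."""
    if unroll_factor >= trip_count:
        # Complete unrolling: one copy of the body per iteration, no loop control.
        return trip_count * body_instructions
    # Partial unrolling: unroll_factor copies of the body plus loop control.
    size = unroll_factor * body_instructions + loop_overhead
    if trip_count % unroll_factor != 0:
        # Leftover iterations are handled by a small remainder loop.
        size += body_instructions + loop_overhead
    return size


# Hypothetical loop: 1024 iterations, 20 instructions per iteration.
full = unrolled_code_size(1024, 20, unroll_factor=1024)
partial = unrolled_code_size(1024, 20, unroll_factor=4)

print("complete unrolling:", full, "instructions")
print("unrolled by 4:     ", partial, "instructions")
```

In CUDA C the unroll factor can be requested with `#pragma unroll N` in front of the loop, so trying a smaller factor is usually a one-line change.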

Romant · November 5, 2010, 4:47pm

I think the reason is the aggressive inlining that the PTX compiler does. There are about 10 core functions in my kernel that do the actual work (these 10 functions are fairly complex) and about 50 other functions that call those 10 core functions; it seems that all the calls to the core functions are inlined.

What exactly do you call the hexdump: the .cubin or the .fatbin.c? Please explain how you interpreted the numbers I gave and how you arrived at the ~200000 instructions approximation.
