NVIDIA people, please pay attention, still have no meaningful answer How to estimate the proximity t

Hi,

At the moment my kernel takes about 5 minutes to compile (for sm_20) with the --opencc-options -OPT:Olimit=0 option. The resulting .cubin file is 480 kilobytes and the .fatbin.c file is about 9 megabytes; it contains 459294 items in the static const unsigned long long _deviceText$sm_20$ array.

Is that a lot or not? How close is my kernel to the 2 million instruction limit?

Thanks in advance.

Current GPUs have instructions that are at least 32 bits wide, and 32 bits take 8 characters in a hexdump, so you are still about an order of magnitude away from the instruction limit.
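The "order of magnitude" claim can be checked with simple arithmetic. The sketch below is not the calculation used in the reply above; it is a rough upper bound that assumes every byte of the file is a densely packed instruction, that 1 kilobyte is 1024 bytes, and that unsigned long long is 64 bits wide. The function names are made up for illustration.

```python
# Rough upper bound on instruction count from the sizes quoted in the question.
# Assumption: every byte is counted as code. This overestimates, because a
# .cubin also holds headers, symbols and constant data.

INSTRUCTION_LIMIT = 2_000_000   # limit quoted in the question
CUBIN_BYTES = 480 * 1024        # "480 Kilobytes", assuming 1 KB = 1024 bytes
ARRAY_ITEMS = 459_294           # entries in the unsigned long long array
ITEM_BYTES = 8                  # unsigned long long assumed to be 64 bits


def max_instructions(code_bytes: int, instruction_bytes: int) -> int:
    """Upper bound: treat the whole blob as back-to-back instructions."""
    return code_bytes // instruction_bytes


def headroom(instruction_count: int) -> float:
    """How many times the estimate fits into the instruction limit."""
    return INSTRUCTION_LIMIT / instruction_count


if __name__ == "__main__":
    blobs = {".cubin": CUBIN_BYTES, ".fatbin.c array": ARRAY_ITEMS * ITEM_BYTES}
    for name, size in blobs.items():
        for width in (4, 8):  # 32-bit and 64-bit instruction encodings
            count = max_instructions(size, width)
            print(f"{name}, {width}-byte instructions: "
                  f"at most {count}, {headroom(count):.1f}x below the limit")
```

With the 480 KB .cubin this gives at most 122880 instructions at 4 bytes each, or 61440 at 8 bytes each, so the limit is more than ten times larger either way. The array in .fatbin.c is bigger than the .cubin and may hold more than machine code, so the bound derived from it is looser still.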

How do you arrive at such large kernels? Is your source file of similar size, or do you use a lot of templating and/or loop unrolling? With loop unrolling, partial unrolling is often about as effective as complete unrolling, but puts much less pressure on the instruction cache.

I think the reason is the aggressive inlining that the PTX compiler does. There are about 10 core functions in my kernel that do the actual work (these 10 functions are fairly complex) and about 50 other functions that use those 10 core functions; it seems that all calls to the core functions are inlined.
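To see why inlining alone could explain the size, here is a toy model of static code size. Only the function counts (10 core functions, 50 callers) come from the description above; the instruction counts and the number of core calls per caller are hypothetical placeholders, and nested inlining is ignored.

```python
# Toy model of static code size with and without inlining.
# Only CORE_FUNCTIONS and CALLERS come from the post; every instruction
# count passed to these functions is a hypothetical placeholder.

CORE_FUNCTIONS = 10
CALLERS = 50


def size_with_calls(core_size: int, caller_size: int) -> int:
    """Each core function is emitted once; callers only contain call sites."""
    return CORE_FUNCTIONS * core_size + CALLERS * caller_size


def size_fully_inlined(core_size: int, caller_size: int,
                       core_calls_per_caller: int) -> int:
    """Each call site gets its own copy of the core function body."""
    return CALLERS * (caller_size + core_calls_per_caller * core_size)


if __name__ == "__main__":
    # Hypothetical sizes: 1000 instructions per core function,
    # 100 per caller, 3 core calls in each caller.
    print("with real calls:", size_with_calls(1000, 100))
    print("fully inlined:  ", size_fully_inlined(1000, 100, 3))
```

Under these made-up numbers the inlined version is roughly ten times larger, which is the kind of growth described above. If inlining does turn out to be the cause, the `__noinline__` function qualifier in CUDA C is the documented hint for asking the compiler not to inline a device function; whether the compiler honors it in a given case would need to be checked.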

What exactly do you call the hexdump: the .cubin or the .fatbin.c? Please explain how you interpreted the numbers I gave and how you arrived at the ~200000 instruction approximation.
