In the CUDA C Programming Guide v7.5, page 207, it states that the "Maximum number of instructions per kernel" is 512 million.
I have a hard time grasping how to estimate my kernel's use of "instructions".
Does this refer to the compiled machine code (SASS) instructions, or to source-level constructs such as variable assignments, function calls and other "instructions" in the C sense?
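For what it is worth, the only way I have found so far to look at actual machine instructions is dumping the SASS with cuobjdump (the executable name below is just a placeholder), and counting those by hand hardly seems practical for a whole kernel:

    cuobjdump -sass my_app.exe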
I tried reading up on the finer details in various other forums to see what is said about this in general before diving into more CUDA-specific topics.
Unfortunately, far too many people seem to be of the opinion that one should refrain from writing "time critical code" in C.
I am personally opinionated enough to reply, "What's it to others what I do?" ;)
I seriously need to find this out: I am writing a specific piece of code that walks through some N^22 sets of equations, looking for results that match the expected predictions among some N^15 of the options encountered along the way.
Replacing four nested for-loops with a simple routine that derives the index as (blockIdx.x*f)+threadIdx.x, added to a specific given start value, already gave me a 1200% performance increase.
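To illustrate the idea, the derivation is essentially the following (f is the per-block factor from my code, start the given start value, and N only a placeholder for the per-dimension range; this is a sketch of the principle, not the production kernel):

    __global__ void walk(unsigned long long start, unsigned int f, unsigned int N)
    {
        /* one flat index per thread instead of four nested for-loops */
        unsigned long long idx = start + (unsigned long long)blockIdx.x * f + threadIdx.x;

        /* decompose the flat index back into the four former loop counters */
        unsigned int a =  idx                                    % N;
        unsigned int b = (idx / N)                               % N;
        unsigned int c = (idx / ((unsigned long long)N * N))     % N;
        unsigned int d = (idx / ((unsigned long long)N * N * N)) % N;
        /* a, b, c and d then feed the validation shown further down */
    }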
If I can now somehow figure out how many instructions my kernel requires to fully validate one test, I could potentially increase throughput by giving the kernel free rein to loop straight on to the next candidate, up to a set interval limit.
Doing so would mitigate a few issues I am faced with:
A: My GeForce is merely a Fermi-architecture card, hence compute capability 2.0.
B: No dynamic parallelism is available on Fermi.
C: I can further reduce memory allocation and the back-and-forth of data between host and device by simply sending a single parameter containing the first thread's starting point, letting the kernel keep going until just before a TDR (driver timeout) occurs on the GPU; see the rough sketch after this list.
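To make point C concrete, this is roughly the launch pattern I have in mind (validateBatch, GRID, BLOCK, BATCH and runAll are placeholders of mine; the point is only that a single start value travels to the device per launch and each thread grinds through a fixed number of candidates before returning, which keeps every launch comfortably short of the timeout):

    #include <cuda_runtime.h>

    #define GRID  1024u   /* placeholder launch configuration                */
    #define BLOCK  256u
    #define BATCH   64u   /* candidates each thread walks through per launch */

    __global__ void validateBatch(unsigned long long start, unsigned int batch)
    {
        unsigned long long base = start
            + ((unsigned long long)blockIdx.x * blockDim.x + threadIdx.x) * batch;

        for (unsigned int i = 0; i < batch; i++)
        {
            unsigned long long candidate = base + i;
            /* decode 'candidate' and run the validation shown further down;
               on a hit only the index is recorded */
            (void)candidate;
        }
    }

    void runAll(unsigned long long total)
    {
        /* host side: only the start value changes between launches */
        for (unsigned long long start = 0; start < total;
             start += (unsigned long long)GRID * BLOCK * BATCH)
        {
            validateBatch<<<GRID, BLOCK>>>(start, BATCH);
            cudaDeviceSynchronize();   /* each launch finishes well before the timeout */
        }
    }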
So, the question is (to oversimplify the matter and to provide a simple example of the type of evaluation performed in my kernel): how many instructions would the following require, given VS 2013, Windows 10 and CUDA 8.0 for a GeForce GTX 570?
The code provided is not CUDA but an excerpt of the current alpha code in C that simulates the routines.
I perform some 30-45 similar validations against a lookup table which should reside in global memory for the duration of the whole run.
The gotos may look hideous, but they outperform anything else I have tried over the past two years of developing this "monster"; OR outperforms AND by a factor of 2.
if (oX > 3)             { goto next3; }  // skip to the next validation
if (oX == 0 && oSY > 2) { goto next3; }  // skip to the next validation
if (oSY > 1)            { goto next3; }  // skip to the next validation

for (t = 1; t < 7; t++) {  /** !!! All six must match; any skew is not permitted here. */
    if (OQC[t] < OFSLIM[0][t] || OQC[t] > OFSLIM[1][t]) { goto next3; }  // skip to the next validation
}

/**
 * And here ends the validation process!
 * If I ever get here - good catch: store the index only, the data is known ;)
 */
for (t = 0; t < 4; t++) {
    HITEMM[id[t]][t]++;
}
next3: ;  // the next of the 30-45 validation blocks starts here
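For completeness, the device-side version I am sketching looks roughly like this (OFSLIM is the [2][7] limits table, uploaded once to global memory with cudaMemcpy; HITEMM is flattened to a single int array here, and the atomicAdd is only there because several threads can hit the same cell; none of these names or signatures should be taken as final):

    __device__ void validate(const int *OQC, int oX, int oSY,
                             const int (*OFSLIM)[7],  /* [2][7] limits, resident in global memory */
                             int *HITEMM,             /* flattened: row * 4 + t                   */
                             const int *id)
    {
        if (oX > 3)             return;  // skip to the next candidate
        if (oX == 0 && oSY > 2) return;  // skip
        if (oSY > 1)            return;  // skip

        // all six counts must fall inside the limits; any skew rejects
        for (int t = 1; t < 7; t++)
            if (OQC[t] < OFSLIM[0][t] || OQC[t] > OFSLIM[1][t]) return;

        // good catch: record the index only, the data itself is known
        for (int t = 0; t < 4; t++)
            atomicAdd(&HITEMM[id[t] * 4 + t], 1);
    }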