Kernel max instructions?

In the CUDA C Programming Guide v7.5, page 207, it states that the
“Maximum number of instructions per kernel” is 512 million.

I have a hard time getting a grasp on how to estimate my kernel’s use of “instructions”.
Does this refer to the compiled assembly instructions, or to direct variable assignments, function calls, and other “instructions” in the source code?

I tried reading up on the finer details of this in various other forums, to see what is said about it in general before diving into more CUDA-specific topics.

Unfortunately, far too many are of the apparent opinion that one should refrain from writing “time critical code” in C.

I am personally equally opinionated as to say, “What’s it to others what I do?” ;)

I seriously need to find this out, given that I am trying to write a specific piece of code that progresses through some N^22 sets of equations, seeking a result that matches expected predictions in some N^15 of the options met during the walkthrough.

Replacing four nested for-loops with a simple routine that derives the index from (blockIdx.x*f)+threadIdx.x, in addition to a specific given start value, gave me a 1200% performance increase.

If I can now somehow figure out how many instructions my kernel requires to fully validate one test, I could potentially increase throughput by giving the kernel free range to loop immediately to the next candidate, up to a set interval limit.

Doing so will mitigate a few issues I am faced with.

A: My GeForce is merely a Fermi-architecture card, hence compute capability 2.0.

B: No dynamic parallelism is available to Fermi.

C: I can further reduce memory allocation and the moving of data back and forth between host and device by simply sending a single parameter containing the first thread’s starting point, letting the GPU keep going before a TDR (timeout) occurs.

So, to oversimplify the matter and to provide a simple example of the type of evaluation performed in my kernel, the question is:

How many instructions would the following require, given
VS 2013, Windows 10, and CUDA 8.0, for a GeForce GTX 570?

The code provided is not CUDA but an example of the current alpha code, in C, simulating the routines.

I perform some 30-45 similar validations against a lookup table, which should reside in global memory for the duration of the whole run. The gotos may seem hideous, but they outperform anything else I have tried over the past 2 years of developing this “monster”.
OR outperforms AND by a factor of 2.

if (oX > 3)                 { goto next3; }   // skipping to next
if ((oX == 0) && (oSY > 2)) { goto next3; }   // skipping to next
if (oSY > 1)                { goto next3; }   // skipping to next
for (t = 1; t < 7; t++) {  /* !!! All must match. Any skew is not permitted here. */
    if (OQC[t] < OFSLIM[0][t] || OQC[t] > OFSLIM[1][t]) { goto next3; }  // skipping to next
}
/*
 *  And here ends the validation process!
 *  If I ever get here - good catch: store the index only, the data is known ;)
 */
for (t = 0; t < 4; t++) {
    HITEMM[id[t]][t]++;
}

The instructions referred to are assembly (i.e. SASS) machine instructions.

You can get an idea of how many there are for any particular CUDA kernel using the cuobjdump utility (for example, cuobjdump -sass on your compiled binary):

http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#cuobjdump

In my experience you’re unlikely to write a kernel that comes anywhere near the 512 Million limit.

Thanks again txbob.
Great link, will do some reading.

On the issue of not coming close to the limit…
If I set my mind to it I can do my best to play cliffhanger with any limit ;)

So, in short: a loop that keeps the kernel iterating over a given arbitrary number of candidates before it ends does not increase the instruction count, only the runtime, and hence the risk of a TDR (timeout)?

It then becomes, “written instructions” and not “executed instructions”?

I have no doubt that given a challenge, someone could create a pathological kernel that exceeds the limit. For serious work, however, I’ve never heard a report of anyone hitting the limit.

Yes, it is “written instructions”, not “executed instructions”. A kernel limited to 512 Million executed instructions would be severely limited for most uses. GPUs can execute on the order of ~1 Billion instructions per second, although that throughput is not likely to be achieved in practice, for a single thread. Nevertheless 512 Million executed instructions would be horribly short for many uses.

Great txbob.

One less headache to worry about, and I can get one step closer to actually seeing this run on the GPU, rather than as a simulation model or a framework for a proposed action, before finding myself back at the drafting side of things again. My windows are beginning to look somewhat cluttered for the lack of a whiteboard at home.

And I sure hope for a taste of those N^9 instructions per second.
Have a great day/night.

The normal technical term for “written instructions” is “static instruction count”, and the technical term for “executed instructions” is “dynamic instruction count”.

One reason you are unlikely to get anywhere close to the 512 million limit is that you will likely experience practical issues with the CUDA toolchain once you approach 100K instructions (such as lengthy build times). Having worked with a number of different use cases over the years, most real-life applications don’t lead to kernels anywhere near that size. If I recall correctly, the biggest kernel I ever encountered had 120K instructions, mostly driven by hundreds of calls to pow().

Note that compiler optimizations such as loop unrolling and function inlining often lead to the generation of many more machine instructions than might be suggested by a quick look at the source code.

As near as I can tell, the NVIDIA profiler terminology for “executed instructions” is “instructions executed”, or perhaps “executed instructions”, approximately:

http://docs.nvidia.com/cuda/profiler-users-guide/index.html#metrics-reference-3x

I guess that since the profiler is targeted at a broad audience of programmers with all kinds of backgrounds, it uses more colloquial terms, which are perfectly adequate for its purpose.

Thanks Njuffa.
It helps to have the terminology adjusted when searching for facts.
More options give alternative search results.
More precise options hopefully render more useful results ;)