In the CUDA C Programming Guide v7.5, page 207, it states that the "Maximum number of instructions per kernel" is 512 million.
I have a hard time grasping how to estimate my kernel's use of "instructions".
Does this refer to the compiled machine code (SASS) instructions, or to source-level constructs such as variable assignments, function calls and other "instructions" in the C sense?
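For what it is worth, the only way I have found so far to look at actual machine instructions is dumping the SASS with cuobjdump (the executable name below is just a placeholder), and counting those by hand hardly seems practical for a whole kernel:

    cuobjdump -sass my_app.exe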
I tried reading up on the finer details in various other forums to see what is said about this in general before diving into more CUDA-specific topics.
Unfortunately, far too many people seem to be of the opinion that one should refrain from writing "time critical code" in C.
I am personally opinionated enough to reply, "What's it to others what I do?" ;)
I seriously need to find this out: I am writing a specific piece of code that walks through some N^22 sets of equations, looking for results that match the expected predictions among some N^15 of the options encountered along the way.
Replacing four nested for-loops with a simple routine that derives the index as (blockIdx.x*f)+threadIdx.x, added to a specific given start value, already gave me a 1200% performance increase.
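To illustrate the idea, the derivation is essentially the following (f is the per-block factor from my code, start the given start value, and N only a placeholder for the per-dimension range; this is a sketch of the principle, not the production kernel):

    __global__ void walk(unsigned long long start, unsigned int f, unsigned int N)
    {
        /* one flat index per thread instead of four nested for-loops */
        unsigned long long idx = start + (unsigned long long)blockIdx.x * f + threadIdx.x;

        /* decompose the flat index back into the four former loop counters */
        unsigned int a =  idx                                    % N;
        unsigned int b = (idx / N)                               % N;
        unsigned int c = (idx / ((unsigned long long)N * N))     % N;
        unsigned int d = (idx / ((unsigned long long)N * N * N)) % N;
        /* a, b, c and d then feed the validation shown further down */
    }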
If I can now somehow figure out how many instructions my kernel requires to fully validate one test, I could potentially increase throughput by giving the kernel free rein to loop straight on to the next candidate, up to a set interval limit.
Doing so would mitigate a few issues I am faced with:
A: My GeForce is merely a Fermi-architecture card, hence compute capability 2.0.
B: No dynamic parallelism is available on Fermi.
C: I can further reduce memory allocation and the back-and-forth of data between host and device by simply sending a single parameter containing the first thread's starting point, letting the kernel keep going until just before a TDR (driver timeout) occurs on the GPU; see the rough sketch after this list.
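To make point C concrete, this is roughly the launch pattern I have in mind (validateBatch, GRID, BLOCK, BATCH and runAll are placeholders of mine; the point is only that a single start value travels to the device per launch and each thread grinds through a fixed number of candidates before returning, which keeps every launch comfortably short of the timeout):

    #include <cuda_runtime.h>

    #define GRID  1024u   /* placeholder launch configuration                */
    #define BLOCK  256u
    #define BATCH   64u   /* candidates each thread walks through per launch */

    __global__ void validateBatch(unsigned long long start, unsigned int batch)
    {
        unsigned long long base = start
            + ((unsigned long long)blockIdx.x * blockDim.x + threadIdx.x) * batch;

        for (unsigned int i = 0; i < batch; i++)
        {
            unsigned long long candidate = base + i;
            /* decode 'candidate' and run the validation shown further down;
               on a hit only the index is recorded */
            (void)candidate;
        }
    }

    void runAll(unsigned long long total)
    {
        /* host side: only the start value changes between launches */
        for (unsigned long long start = 0; start < total;
             start += (unsigned long long)GRID * BLOCK * BATCH)
        {
            validateBatch<<<GRID, BLOCK>>>(start, BATCH);
            cudaDeviceSynchronize();   /* each launch finishes well before the timeout */
        }
    }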
So, the question is (to oversimplify the matter and to provide a simple example of the type of evaluation performed in my kernel): how many instructions would the following require, given VS 2013, Windows 10 and CUDA 8.0 for a GeForce GTX 570?
The code provided is not CUDA but an excerpt of the current alpha code in C that simulates the routines.
I perform some 30-45 similar validations against a lookup table which should reside in global memory for the duration of the whole run.
The gotos may look hideous, but they outperform anything else I have tried over the past two years of developing this "monster"; OR outperforms AND by a factor of 2.
if (oX > 3)             { goto next3; }  // skip to the next validation
if (oX == 0 && oSY > 2) { goto next3; }  // skip to the next validation
if (oSY > 1)            { goto next3; }  // skip to the next validation

for (t = 1; t < 7; t++) {  /** !!! All six must match; any skew is not permitted here. */
    if (OQC[t] < OFSLIM[0][t] || OQC[t] > OFSLIM[1][t]) { goto next3; }  // skip to the next validation
}

/**
 * And here ends the validation process!
 * If I ever get here - good catch: store the index only, the data is known ;)
 */
for (t = 0; t < 4; t++) {
    HITEMM[id[t]][t]++;
}
next3: ;  // the next of the 30-45 validation blocks starts here
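For completeness, the device-side version I am sketching looks roughly like this (OFSLIM is the [2][7] limits table, uploaded once to global memory with cudaMemcpy; HITEMM is flattened to a single int array here, and the atomicAdd is only there because several threads can hit the same cell; none of these names or signatures should be taken as final):

    __device__ void validate(const int *OQC, int oX, int oSY,
                             const int (*OFSLIM)[7],  /* [2][7] limits, resident in global memory */
                             int *HITEMM,             /* flattened: row * 4 + t                   */
                             const int *id)
    {
        if (oX > 3)             return;  // skip to the next candidate
        if (oX == 0 && oSY > 2) return;  // skip
        if (oSY > 1)            return;  // skip

        // all six counts must fall inside the limits; any skew rejects
        for (int t = 1; t < 7; t++)
            if (OQC[t] < OFSLIM[0][t] || OQC[t] > OFSLIM[1][t]) return;

        // good catch: record the index only, the data itself is known
        for (int t = 0; t < 4; t++)
            atomicAdd(&HITEMM[id[t] * 4 + t], 1);
    }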