Consider this kernel:
__global__ void someFunc(){
    for(unsigned int i = 0; i < 10000000; i++){
        generateNextRandomNumber();
    }
}
It skips ahead in the random sequence by a given offset. generateNextRandomNumber() is a device function that returns the next random number and updates the state of the generator accordingly.
When I run this code in a single block with a single thread and measure the runtime, it takes no time to execute (I call cudaThreadSynchronize() after the kernel launch). No matter how large the loop bound is, the measured time is always 0 ms.
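For reference, the measurement looks roughly like the sketch below. This is a minimal reconstruction, not my exact code: the generator body is a hypothetical stand-in (a simple LCG step on local state, passed by reference, unlike the real no-argument function), and timing with CUDA events is an assumption about how the runtime is measured.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real generator (assumption): one LCG step
// on state that lives in a local variable of the calling thread.
__device__ unsigned int generateNextRandomNumber(unsigned int &state)
{
    state = state * 1664525u + 1013904223u;
    return state;
}

__global__ void someFunc()
{
    unsigned int state = 1u;
    for (unsigned int i = 0; i < 10000000; i++) {
        generateNextRandomNumber(state);  // result is discarded, as in my kernel
    }
}

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    someFunc<<<1, 1>>>();        // single block, single thread
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaThreadSynchronize();     // cudaDeviceSynchronize() on newer toolkits

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Note that nothing computed inside the loop is ever written out or otherwise used after the loop ends.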
I decided to check whether the compiler optimizes the code away. I use the CUDA build rule shipped with the 2.3 SDK. I disabled optimization for both the solution and the .cu file that contains the kernels (Properties → CUDA Build Rule v2.3.0 → General → Optimization set to Disabled (/Od)). I then rebuilt the project, keeping the preprocessed files, and looked at the PTX file:
//-----------------------------------------------------------
// Options:
//-----------------------------------------------------------
// Target:ptx, ISA:sm_13, Endian:little, Pointer Size:64
// -O3 (Optimization level)
// -g0 (Debug level)
// -m2 (Report advisories)
//-----------------------------------------------------------
Further down the file there was:
.entry _Z22testMersenneOffsetTimev
{
.loc 2 179 0
$LBB1__Z22testMersenneOffsetTimev:
.loc 2 192 0
exit;
$LDWend__Z22testMersenneOffsetTimev:
} // _Z22testMersenneOffsetTimev
I don't know how to read this assembly code, but it seems to me that it is heavily optimized: the header still reports -O3, and the kernel body contains nothing but exit.
I got the same results when passing the -Xopencc -O0 and/or -Xptxas -O0 command-line options.
Note that I did a project clean/rebuild every time, and the PTX file was updated.
Note that my kernel is in kernels.cu and I looked at the kernels.ptx file.
Note that changing the GPU architecture from 1.3 to 1.2 did have an effect on the PTX file (ISA:sm_13 became ISA:sm_12).
Questions:
- Am I doing something wrong, e.g. looking into the wrong file?
- If not, how do I disable optimization?
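One workaround I can think of, assuming the loop is being removed as dead code because nothing it computes is ever used, is to write the last generated value to global memory so the loop has an observable result. Below is a minimal sketch of that idea. The generator is again a hypothetical LCG stand-in for my real one, the names someFuncKeepAlive and d_out are made up for illustration, and I have not confirmed that this changes the generated PTX with the 2.3 toolchain.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real generator (assumption): one LCG step.
__device__ unsigned int generateNextRandomNumber(unsigned int &state)
{
    state = state * 1664525u + 1013904223u;
    return state;
}

// Same loop as before, but the final value is stored to global memory,
// so the computation has an effect that is visible outside the kernel.
__global__ void someFuncKeepAlive(unsigned int *out)
{
    unsigned int state = 1u;
    unsigned int last = 0u;
    for (unsigned int i = 0; i < 10000000; i++) {
        last = generateNextRandomNumber(state);
    }
    *out = last;
}

int main()
{
    unsigned int *d_out = NULL;
    cudaMalloc((void **)&d_out, sizeof(unsigned int));

    someFuncKeepAlive<<<1, 1>>>(d_out);  // single block, single thread

    // cudaMemcpy on the default stream waits for the kernel to finish.
    unsigned int h_out = 0u;
    cudaMemcpy(&h_out, d_out, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    printf("last value: %u\n", h_out);

    cudaFree(d_out);
    return 0;
}
```

This does not disable optimization, so it would not answer the second question. It only gives the compiler a reason to keep the loop.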
Sorry for the long read.