NVCC Loop Invariant Optimization

Hi,

I have a question regarding whether nvcc can do loop invariant optimization.

I have the following code snippet in a larger computation:

...
    #pragma unroll
    for (int kw_inner = 0; kw_inner < 7; ++kw_inner) {
      #pragma unroll
      for (int occ_c = 0; occ_c < 2; ++occ_c) {
        #pragma unroll
        for (int oc_block_c = 0; oc_block_c < 4; ++oc_block_c) {
          #pragma unroll
          for (int ic_block = 0; ic_block < 3; ++ic_block) {
            if (((((int)blockIdx.z) * 2) + (((int)threadIdx.z) / 8)) < n) {
                ...

The if condition does not involve any of the loop variables, so it should be possible to hoist it to the top level:

...
    if (((((int)blockIdx.z) * 2) + (((int)threadIdx.z) / 8)) < n) {
      #pragma unroll
      for (int kw_inner = 0; kw_inner < 7; ++kw_inner) {
        #pragma unroll
        for (int occ_c = 0; occ_c < 2; ++occ_c) {
          #pragma unroll
          for (int oc_block_c = 0; oc_block_c < 4; ++oc_block_c) {
            #pragma unroll
            for (int ic_block = 0; ic_block < 3; ++ic_block) {
                ...

The second kernel achieves 2x the performance of the first one. Does nvcc perform this kind of optimization?

My nvcc info:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

nvcc doesn’t perform this optimization in every imaginable case.

Here is another example:

https://devtalk.nvidia.com/default/topic/1057205/cuda-programming-and-performance/-solved-array-not-fully-transfered-to-device-memory/post/5360735/#5360735

If you’d like to see a change in nvcc behavior, you may wish to file a bug. The instructions are linked in a sticky post at the top of this sub-forum.

I would also encourage you to test on the most recently released compiler. Items get addressed all the time. I’m not saying I know for sure anything is fixed in CUDA 10.1, but I would test against CUDA 10.1 (or the latest CUDA) before filing a bug.

Thank you for your help! I’ll try 10.1.

An alternative hypothesis would be that the if-statement within the innermost loop interferes with the desired complete unrolling of all loops. This would explain the surprisingly large performance difference observed. Did you analyze the generated machine code to determine what exactly is responsible for the performance difference?
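One way to check this is to reduce the two variants to a small test case and compare the machine code generated for each. The following is only a minimal sketch, not the original kernel: the loop body (a sum over a 168-element input), the kernel names guard_inside and guard_hoisted, the helper count_mismatches, and the launch configuration are all assumptions made for illustration. The stand-in body is much simpler than the elided one, so the compiler may well treat it differently; substituting the real body would give a more meaningful comparison.

```cuda
#include <vector>
#include <cuda_runtime.h>

const int kElems = 7 * 2 * 4 * 3;  // total iterations of the four nested loops

// Variant 1: the loop-invariant guard sits inside the innermost loop.
__global__ void guard_inside(const float* in, float* out, int n) {
  float acc = 0.0f;
  #pragma unroll
  for (int kw_inner = 0; kw_inner < 7; ++kw_inner) {
    #pragma unroll
    for (int occ_c = 0; occ_c < 2; ++occ_c) {
      #pragma unroll
      for (int oc_block_c = 0; oc_block_c < 4; ++oc_block_c) {
        #pragma unroll
        for (int ic_block = 0; ic_block < 3; ++ic_block) {
          if (((((int)blockIdx.z) * 2) + (((int)threadIdx.z) / 8)) < n) {
            // Hypothetical body: stands in for the elided computation.
            acc += in[((kw_inner * 2 + occ_c) * 4 + oc_block_c) * 3 + ic_block];
          }
        }
      }
    }
  }
  out[blockIdx.z * blockDim.z + threadIdx.z] = acc;
}

// Variant 2: the same guard hoisted by hand above all four loops.
__global__ void guard_hoisted(const float* in, float* out, int n) {
  float acc = 0.0f;
  if (((((int)blockIdx.z) * 2) + (((int)threadIdx.z) / 8)) < n) {
    #pragma unroll
    for (int kw_inner = 0; kw_inner < 7; ++kw_inner) {
      #pragma unroll
      for (int occ_c = 0; occ_c < 2; ++occ_c) {
        #pragma unroll
        for (int oc_block_c = 0; oc_block_c < 4; ++oc_block_c) {
          #pragma unroll
          for (int ic_block = 0; ic_block < 3; ++ic_block) {
            acc += in[((kw_inner * 2 + occ_c) * 4 + oc_block_c) * 3 + ic_block];
          }
        }
      }
    }
  }
  out[blockIdx.z * blockDim.z + threadIdx.z] = acc;
}

// Runs both variants and returns the number of outputs that differ from the
// expected value (or from each other); returns -1 on a CUDA error.
int count_mismatches(int n) {
  const int gz = 4, bz = 16, threads = gz * bz;  // assumed launch shape
  std::vector<float> h_in(kElems, 1.0f), h_a(threads), h_b(threads);
  float *d_in = NULL, *d_a = NULL, *d_b = NULL;
  cudaError_t err = cudaMalloc(&d_in, kElems * sizeof(float));
  if (err == cudaSuccess) err = cudaMalloc(&d_a, threads * sizeof(float));
  if (err == cudaSuccess) err = cudaMalloc(&d_b, threads * sizeof(float));
  if (err == cudaSuccess)
    err = cudaMemcpy(d_in, h_in.data(), kElems * sizeof(float), cudaMemcpyHostToDevice);
  if (err == cudaSuccess) {
    guard_inside<<<dim3(1, 1, gz), dim3(1, 1, bz)>>>(d_in, d_a, n);
    guard_hoisted<<<dim3(1, 1, gz), dim3(1, 1, bz)>>>(d_in, d_b, n);
    err = cudaGetLastError();
  }
  if (err == cudaSuccess)
    err = cudaMemcpy(h_a.data(), d_a, threads * sizeof(float), cudaMemcpyDeviceToHost);
  if (err == cudaSuccess)
    err = cudaMemcpy(h_b.data(), d_b, threads * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_in); cudaFree(d_a); cudaFree(d_b);
  if (err != cudaSuccess) return -1;

  int mismatches = 0;
  for (int b = 0; b < gz; ++b) {
    for (int t = 0; t < bz; ++t) {
      // Threads that pass the guard sum 168 ones; the others write 0.
      const float expected = (b * 2 + t / 8 < n) ? (float)kElems : 0.0f;
      const int i = b * bz + t;
      if (h_a[i] != expected || h_b[i] != expected) ++mismatches;
    }
  }
  return mismatches;
}
```

Assuming the file is saved as repro.cu and the target is sm_70 (both placeholders), something like "nvcc -arch=sm_70 -cubin -o repro.cubin repro.cu" followed by "cuobjdump -sass repro.cubin" dumps the SASS for both kernels. Comparing the two listings should show whether the loops were fully unrolled in each variant and whether the guard is evaluated once or repeatedly.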

I haven’t checked the assembly code yet. As a quick check, I removed the unroll pragmas, but it didn’t help.