ptxas memory usage grows very large, then it hangs; possibly a bug

Hi,
Today I ran into a very strange problem, possibly a bug in nvcc/ptxas. When I add one line of code calling fmaxf() inside a kernel function, ptxas memory usage grows to 1.5 GB and it hangs indefinitely (more than 2 hours) while the CPU stays busy. -ptxas-options=-v does not print any info in the hanging case.

The code sketch looks like this:

__global__ void __launch_bounds__(64) standardize(Variance* __restrict__ cvariance, ....)
{
    float cvar[40*64] = {0};
    float gvar[40] = {0};

    for(int layer=0;layer<40;layer++)
    {
        for(int c=0;c<64;c++)
        {
            for(int x=0;x<360;x++)
            {
                cvar[layer*64+c] = fmaxf(cvar[layer*64+c], cvariance[layer].entry(x,c,1));
            }
            gvar[layer] = fmaxf(gvar[layer], cvar[layer*64+c]);  //// problematic line
        }
    }

    ....
}

Note that gvar is not referenced anywhere else.
However, if I comment out the problematic line or replace fmaxf() with the (> ? :) operator, compilation finishes in minutes, and I never notice ptxas showing up in the process list (I use the nvcc command).
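For reference, the (> ? :) replacement mentioned above could be wrapped in a small helper like the sketch below. The name max_ternary is hypothetical and not from the original code. One caveat: the two forms are not strictly equivalent when NaN is involved, because fmaxf() returns the numeric argument when the other one is NaN, while the ternary form returns its second argument whenever the comparison is false.

```cuda
#include <math.h>

// Allow the helper to build with a plain host compiler too (assumption:
// outside nvcc, the CUDA qualifiers can simply expand to nothing).
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper standing in for the "(> ? :)" workaround.
// Same result as fmaxf() for ordinary values; differs when b is NaN.
__host__ __device__ inline float max_ternary(float a, float b)
{
    return a > b ? a : b;  // any comparison with NaN is false, so b is returned
}
```

In the kernel sketch, the problematic line would then read gvar[layer] = max_ternary(gvar[layer], cvar[layer*64+c]);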

My setup is Win7 64-bit, CUDA Toolkit 10.1.105_418.96, Visual Studio 2017 Community Edition v15.8.0.
The compilation command is: nvcc --gpu-architecture=sm_61 --default-stream=per-thread --optimize=3 --x=cu *.cpp --use_fast_math --library=cuda,cudart_static

Any comments? Thank you very much.

Update:
If I change the problematic line to the following (and move it inside the inner loop):

gvar[layer] = fmaxf(gvar[layer], cvariance[layer].entry(x,c,1));

then the hang disappears as well.

So the compiler seems to have a problem resolving the dependency between gvar and cvar in the presence of fmaxf().
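As a sanity check that the rewrite in the update computes the same layer maximum as the original two-step form, here is a small sketch of both reductions over one layer. The function names and the flat v[x*channels + c] layout are assumptions for illustration only; the real entry() accessor is not shown in this thread.

```cuda
#include <math.h>

// Allow the sketch to build with a plain host compiler too (assumption:
// outside nvcc, the CUDA qualifiers can simply expand to nothing).
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Original form: reduce over x into a per-channel maximum (cvar),
// then fold that per-channel maximum into the layer maximum (gvar).
__host__ __device__ inline float layer_max_two_step(const float* v, int channels, int width)
{
    float g = 0.0f;
    for (int c = 0; c < channels; c++) {
        float cmax = 0.0f;
        for (int x = 0; x < width; x++)
            cmax = fmaxf(cmax, v[x * channels + c]);
        g = fmaxf(g, cmax);  // counterpart of the problematic line
    }
    return g;
}

// Update form: fold every input value directly into the layer maximum
// inside the inner loop, with no dependency on the per-channel result.
__host__ __device__ inline float layer_max_fused(const float* v, int channels, int width)
{
    float g = 0.0f;
    for (int c = 0; c < channels; c++)
        for (int x = 0; x < width; x++)
            g = fmaxf(g, v[x * channels + c]);
    return g;
}
```

Both functions start from 0 and only ever apply fmaxf(), so they should return the same value for the same input; only the dependency chain the compiler has to analyze differs.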

I usually recommend that you provide complete code (not a “sketch”) that demonstrates the issue and file a bug. The instructions for filing a bug are linked in a sticky post at the top of this forum.

Sorry, I am not allowed to publish the complete code.

It doesn’t need to be your whole code, just enough to allow reproduction of the issue. If the issue occurs based on what you say here, it should only require a few more lines. “Complete” means something that can actually be compiled and run; it does not mean your complete code.

Anyway, without such a test case, it’s unlikely that NVIDIA developers will be able to sort it out.
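A stand-alone reproducer along the lines requested above might look roughly like the sketch below. Everything outside the loop nest is an assumption made for illustration: the Variance struct, its entry() layout, the out parameter, and the file name repro.cu are placeholders, not the original poster's code. Whether this reduced version still triggers the ptxas hang is untested.

```cuda
// repro.cu: hypothetical reduced test case (placeholder types, see note above).
// Build with flags taken from the original post, for example:
//   nvcc --gpu-architecture=sm_61 --optimize=3 --use_fast_math repro.cu -o repro
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the real Variance type; the 360 x 64 x 2 layout is assumed.
struct Variance {
    float data[360 * 64 * 2];
    __device__ float entry(int x, int c, int k) const {
        return data[(x * 64 + c) * 2 + k];
    }
};

__global__ void __launch_bounds__(64)
standardize(Variance* __restrict__ cvariance, float* __restrict__ out)
{
    float cvar[40 * 64] = {0};
    float gvar[40] = {0};

    for (int layer = 0; layer < 40; layer++) {
        for (int c = 0; c < 64; c++) {
            for (int x = 0; x < 360; x++) {
                cvar[layer * 64 + c] = fmaxf(cvar[layer * 64 + c], cvariance[layer].entry(x, c, 1));
            }
            gvar[layer] = fmaxf(gvar[layer], cvar[layer * 64 + c]);  // line reported to trigger the hang
        }
    }

    // Store cvar so the loops are not optimized away entirely.
    // gvar is deliberately left unreferenced, as in the original report.
    for (int i = 0; i < 40 * 64; i++) {
        out[threadIdx.x * 40 * 64 + i] = cvar[i];
    }
}

int main()
{
    Variance* d_var = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_var, 40 * sizeof(Variance));
    cudaMemset(d_var, 0, 40 * sizeof(Variance));
    cudaMalloc(&d_out, 64 * 40 * 64 * sizeof(float));

    standardize<<<1, 64>>>(d_var, d_out);
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(d_var);
    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

If the hang only shows up with the real Variance type, swapping in a trimmed copy of that type (with proprietary details removed) would be the next thing to try before filing the bug.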