Would setting a low number of registers per thread cause a crash during compilation? I’m using PGI Fortran 10.5 with CUDA 2.3 and CUDA 3.0.
ptxas /tmp/pgcudaforNYj1s0R2fTW.ptx, line 0; fatal : (C9999) max reg limit too low
PGF90-F-0000-Internal compiler error. pgnvd job exited with nonzero status code 0 (gpu_utility.f95: 353)
PGF90/x86-64 Linux 10.5-0: compilation aborted
The PGI Fortran compiler sets the max reg limit that it passes to the NVIDIA PTX assembler based on the thread block size and the device type. For instance, for compute capability 1.0-1.2, there are 8K registers available; for Tesla (1.3) and Fermi (2.0), there are 16K available. If the compiler uses a thread block of, say, 16x16, it will divide the register count (8K or 16K) by the number of threads in a block (256), to make sure there are enough registers for at least one thread block. In this case, 8K/256 is 32, so each thread is limited to 32 registers.
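The division described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the compiler's actual code: the function name `max_regs_per_thread` is made up for the example, and 8K and 16K are taken to mean 8192 and 16384.

```python
# Illustrative sketch of the arithmetic described above; this is NOT PGI's
# real implementation, and the function name is hypothetical.

def max_regs_per_thread(register_file_size, block_x, block_y=1):
    """Per-thread register budget such that one whole thread block fits."""
    threads_per_block = block_x * block_y
    return register_file_size // threads_per_block

# Register counts as quoted in this post (8K = 8192, 16K = 16384).
print(max_regs_per_thread(8192, 16, 16))   # 8K registers, 16x16 block
print(max_regs_per_thread(16384, 16, 16))  # 16K registers, 16x16 block
```

For a 16x16 block this gives 32 registers per thread with 8K registers and 64 with 16K.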
I’m really surprised by this message from the PTX assembler; I hadn’t seen this one before.
There are two possible ways to affect this. If you are using the Accelerator model, use loop directives to explicitly set the thread block size for each loop, making the thread block smaller; this will allow more registers per thread. Alternatively, use the -ta=nvidia,maxregcount:n option, or -Mcuda=maxregcount:n for CUDA Fortran, to set the max reg count explicitly. You can run the compiler with ‘-v’ to see the invocation of the GPU compiler; this will be an invocation of ‘pgnvd’, and its ‘-regs’ argument is what sets the max reg limit for the PTX assembler. The -Minfo=accel messages will also tell you the thread block size being used for each loop.
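To get a feel for how much smaller a thread block has to be, the same division can be turned around. The sketch below is an assumption-laden illustration, not something the compiler exposes: the helper names are hypothetical, the figure of 40 registers per thread is an invented example, and it ignores any other hardware limits or allocation granularity.

```python
# Illustrative sketch only: inverts the "registers / threads per block" rule.
# Helper names and the 40-register figure are hypothetical.

def max_threads_per_block(register_file_size, regs_needed_per_thread):
    """Largest thread count for which one block still fits in the registers."""
    return register_file_size // regs_needed_per_thread

def block_fits(register_file_size, regs_needed_per_thread, block_x, block_y=1):
    """True if a block_x by block_y block stays within the register count."""
    limit = max_threads_per_block(register_file_size, regs_needed_per_thread)
    return block_x * block_y <= limit

# Hypothetical kernel needing 40 registers per thread, 8K (8192) registers.
print(max_threads_per_block(8192, 40))
print(block_fits(8192, 40, 16, 16))  # 256 threads
print(block_fits(8192, 40, 16, 8))   # 128 threads
```

Under these assumptions a 16x16 block is too large for such a kernel, while a 16x8 block leaves enough registers per thread.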
I have to agree with you: there should be no problem setting the value too low, as long as register spilling works.
But this is not a problem that PGI can solve. The message comes from the NVIDIA PTX assembler, ptxas, which we redistribute but which is provided by NVIDIA. I’m sorry to say it, but I don’t think we can help much here.