program does not work with PGI 19.4 + CUDA 10.1 under Ubuntu 18.04

I have a code which works correctly with PGI 19.4 + CUDA 10.0 under Windows 10 with an NVIDIA Quadro P4000.

I now have a new computer with a more powerful graphics card (Quadro RTX 5000), but the code does not run correctly on this new machine (it produces a lot of NaNs). This machine runs Ubuntu 18.04 with PGI 19.4 and CUDA 10.1.

I compiled the code with the following options:
for the Windows machine (working correctly):

FFLAGS = -fast -Mlarge_arrays -Mcuda=cc60,ptxinfo -ta=tesla:cc60

with the Ubuntu machine (not working correctly):

FFLAGS = -fast -Mlarge_arrays -Mcuda=cc75,ptxinfo -ta=tesla:cc75

I see some differences in the ptxinfo output during compilation.

pgfortran -c -fast -Mlarge_arrays -Mcuda=cc75,ptxinfo -ta=tesla:cc75 check_cuda-single.f90
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'check_mod_check_kernel4_' for 'sm_75'
ptxas info : Function properties for check_mod_check_kernel4_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 66 registers, 400 bytes cmem[0]
ptxas info : Function properties for check_mod_checkov0_
56 bytes stack frame, 56 bytes spill stores, 56 bytes spill loads

and

pgfortran -c -fast -Mlarge_arrays -Mcuda=cc60,ptxinfo -ta=tesla:cc60 check_cuda-single.f90
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'check_mod_check_kernel4_' for 'sm_60'
ptxas info : Function properties for check_mod_check_kernel4_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 61 registers, 368 bytes cmem[0], 36 bytes cmem[2]
ptxas info : Function properties for check_mod_checkov0_
40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads

See here: with the new machine, it says:

ptxas info : Used 66 registers, 400 bytes cmem[0]

but with the old, working machine it says:

ptxas info : Used 61 registers, 368 bytes cmem[0], 36 bytes cmem[2]

But is this the reason? If so, what should I do? Or could there be other reasons?

Thank you very much in advance!

Hi dypang,

But is it the reason? If it is so, what shall I do? Or if there are other reasons?

I highly doubt that CUDA 10.1 using a few more registers would be the issue, though unfortunately I can't tell what's wrong from the given information.

Given that you're using the same compiler version, the problem is more likely a difference between Windows and Linux, CUDA 10.0 vs. 10.1, or the devices. If you could post the compiler informational messages (i.e., add "-Minfo=accel") for both platforms, we can see whether there are any differences.

NaNs are often a sign that the device code is using uninitialized memory, so one possibility is that it works on the Windows system because the device memory happens to get initialized to zero. You can test this theory by setting the environment variable "PGI_ACC_FILL=1" to have the runtime initialize all device data allocated in an OpenACC data clause to zero. You can also use "PGI_ACC_FILL_VALUE" if you want to initialize to some value other than zero.
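To see why a single uninitialized value can turn an entire result array into NaNs, note that NaN propagates through every arithmetic operation. A tiny illustration (plain Python, no CUDA needed):

```python
import math

x = float("nan")      # stands in for a word of uninitialized device memory
y = x * 0.0 + 1.0     # NaN propagates through every arithmetic operation
print(math.isnan(y))  # True: one bad input poisons the whole result
print(x == x)         # False: NaN compares unequal even to itself
```

This is why zero-filling device memory with PGI_ACC_FILL=1 can mask (not fix) an uninitialized-data bug.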

Also, if you could post a reproducing example (or post a link to one), I can take a look to see if I can find the error. Note that we have been having issues with the Forum where it gives a "403" error when posting code, the cause of which our Web Master has been unable to determine. If you encounter this issue, or the code is too big to post, please let me know and I'll have our customer service folks contact you about getting a copy of the code.

Best Regards,
Mat

Dear Maclog,

Thank you very much for your reply. I just found that the subroutine that launches the GPU calculation is not actually executed. When I add

ierrSync = cudaGetLastError()
if (ierrSync /= cudaSuccess) write(*,*) "Sync kernel error:", cudaGetErrorString(ierrSync)

I got the following message:

Sync kernel error:
too many resources requested for launch

The strange thing is that the same code runs well on my older machine under Windows 10. This new machine has a more powerful GPU and more memory, yet it complains about "too many resources requested". So what should I do?

The subroutine is called in this way:

call overl_kernel2<<<(nsize-1)/nthr+1,nthr>>>(IAd,JAd,nx,ny,nc,nsize,abandd,lim,kmi,kma,llm)

Here, nsize=65536 and nthr=512. I tried changing these numbers, for instance reducing nsize from 65536 to 32768, but it does not help.
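As a quick sanity check of the launch arithmetic (a Python sketch; the 65536-registers-per-block figure is taken from the deviceQuery output, and this ignores allocation granularity and other per-block resources such as shared memory):

```python
# Grid/block arithmetic for:
#   call overl_kernel2<<<(nsize-1)/nthr+1, nthr>>>(...)
nsize = 65536   # number of elements
nthr  = 512     # threads per block
nblocks = (nsize - 1) // nthr + 1   # ceiling division, as in the chevron syntax
print(nblocks)  # 128 blocks of 512 threads

# "too many resources requested for launch" is a per-block check: one common
# cause is registers_per_thread * threads_per_block exceeding the
# registers-per-block limit reported by deviceQuery.
regs_per_block_limit = 65536
max_regs_per_thread = regs_per_block_limit // nthr
print(max_regs_per_thread)  # 128: a kernel needing more regs/thread fails at nthr=512
```

This also shows why shrinking nsize does not help: it only reduces the number of blocks, while the failing check is per block.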

Thank you very much again!

Pang

The information about this machine from deviceQuery is:

Device 0: "Quadro RTX 5000"
CUDA Driver Version / Runtime Version 10.1 / 10.1
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 16125 MBytes (16908615680 bytes)
(48) Multiprocessors, ( 64) CUDA Cores/MP: 3072 CUDA Cores
GPU Max Clock rate: 1545 MHz (1.54 GHz)
Memory Clock rate: 7001 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 1
Result = PASS

With this diagnostic information, I found that my problem was solved by decreasing the number of threads per block!

I now need to find the best combination of 1) the number of threads per block and 2) the number of elements calculated per run (which depends on the GPU memory).
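A starting point for picking the block size can be sketched from the register demand alone (a simplified Python illustration; the regs-per-thread value 160 is hypothetical, the real number comes from the -Mcuda=ptxinfo output, and this ignores register allocation granularity and shared-memory limits):

```python
REG_FILE_PER_BLOCK = 65536    # registers per block, from the deviceQuery output
WARP_SIZE = 32
MAX_THREADS_PER_BLOCK = 1024  # hardware maximum, also from deviceQuery

def max_block_size(regs_per_thread):
    """Largest warp-multiple block size whose register demand still fits."""
    limit = min(REG_FILE_PER_BLOCK // regs_per_thread, MAX_THREADS_PER_BLOCK)
    return (limit // WARP_SIZE) * WARP_SIZE  # round down to a whole warp

print(max_block_size(160))  # 384: a 512-thread launch would fail, 384 fits
print(max_block_size(66))   # 992: the kernel shown in the ptxinfo above is not the limiter
```

From there, benchmarking a few block sizes at or below this bound (and sizing the per-run element count to fit the 16 GB of device memory) should narrow down the best combination.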