Why do different versions of CUDA affect the results?

Yesterday, Windows 10 upgraded the NVIDIA driver to version 512.15, and since then my code no longer produces a correct result. After I downgraded to NVIDIA driver version 456.81, the result was correct again.
The environment is as follows:
OS: Windows 10 (19044.1826)
PGI Compiler: PGI 19.10 (includes CUDA 9.2, 10.0 and 10.1)
NVIDIA Driver: 456.81 and 512.15
Why does NVIDIA driver version 456.81 give a correct result, while version 512.15 does not?

I also ran a test on RHEL 8. The environment is as follows:
OS: RHEL 8.5
PGI Compilers: HPC-SDK 21.5 (including CUDA 11.3)
NVIDIA Driver: 470.86
It also does not produce a correct result. Why?

Thanks a lot!

Hi ysliu,

Sorry, no idea what’s wrong.

What devices are you using? Are they still supported by the CUDA drivers?

By “code cannot obtain a correct result”, are you indeed getting incorrect results or are your kernels not running on the device?

What language are you using? CUDA C? CUDA Fortran? OpenACC?

What is the output from the “pgaccelinfo” utility?

Internally we do have 515.38 installed on several of our Linux systems, and they seem to be fine. Though these have A100s installed.

-Mat

Hi Mat,
Thank you for your reply.
The GPU is a TITAN Xp with 12 GB of memory.
I tried several NVIDIA driver packages, including 11.1.1_456.81_win10, 11.2.0_460.89_win10 and 11.5.1_496.13.
Unfortunately, only 11.1.1_456.81_win10 and 11.2.0_460.89_win10 give a correct result with PGI 19.10, while 11.5.1_496.13 does not.
However, the CUDA driver is backward compatible, which means that an application built with an older CUDA toolkit should work on a newer NVIDIA driver. Why does it not work here?
By “code cannot obtain a correct result” I mean that the code runs on the device but the result is wrong.
I use OpenACC in Fortran.
The result is wrong with 11.5.1_496.13. I’m not sure which version first fails; it may be 11.3 or 11.4. However, it seems to work with the 11.2 driver.
I’m confused.
Thanks a lot!

The pgaccelinfo output is as follows:

 PGI$ pgaccelinfo

CUDA Driver Version: 11010

Device Number: 0
Device Name: TITAN Xp
Device Revision Number: 6.1
Global Memory Size: 12884901888
Number of Multiprocessors: 30
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1582 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 5705 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 3145728 bytes
Max Threads Per SMP: 2048
Async Engines: 5
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: No
Preemption Supported: Yes
Cooperative Launch: Yes
Multi-Device: Yes
PGI Default Target: -ta=tesla:cc60

Device Number: 1
Device Name: TITAN Xp
Device Revision Number: 6.1
Global Memory Size: 12884901888
Number of Multiprocessors: 30
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1582 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 5705 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 3145728 bytes
Max Threads Per SMP: 2048
Async Engines: 5
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: No
Preemption Supported: Yes
Cooperative Launch: Yes
Multi-Device: Yes
PGI Default Target: -ta=tesla:cc60

Device Number: 2
Device Name: Quadro P400
Device Revision Number: 6.1
Global Memory Size: 2147483648
Number of Multiprocessors: 2
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1252 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 2005 MHz
Memory Bus Width: 64 bits
L2 Cache Size: 524288 bytes
Max Threads Per SMP: 2048
Async Engines: 5
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: No
Preemption Supported: Yes
Cooperative Launch: Yes
Multi-Device: Yes
PGI Default Target: -ta=tesla:cc60
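
For reference, the "CUDA Driver Version: 11010" line at the top of this output encodes the CUDA version as 1000*major + 10*minor, so it reads as CUDA 11.1, consistent with the 11.1.1_456.81 driver package. Below is a minimal Python sketch of that decoding and of the usual compatibility rule (the driver's CUDA version should be at least the toolkit version the program was built with). The function names are hypothetical helpers for illustration, not part of any NVIDIA or PGI tool.

```python
def decode_cuda_version(encoded):
    """Split an encoded CUDA version (1000*major + 10*minor) into (major, minor)."""
    return encoded // 1000, (encoded % 1000) // 10


def driver_supports_toolkit(driver_encoded, toolkit):
    """Return True if the driver's CUDA version is at least the toolkit's.

    toolkit is a (major, minor) tuple, e.g. (10, 1) for CUDA 10.1.
    This only checks the version rule; it says nothing about whether
    a particular program computes correct results.
    """
    return decode_cuda_version(driver_encoded) >= tuple(toolkit)


if __name__ == "__main__":
    major, minor = decode_cuda_version(11010)  # value reported by pgaccelinfo
    print("Driver supports CUDA %d.%d" % (major, minor))
    # PGI 19.10 ships CUDA 9.2, 10.0 and 10.1 according to the first post.
    for toolkit in [(9, 2), (10, 0), (10, 1)]:
        print(toolkit, driver_supports_toolkit(11010, toolkit))
```

By this rule every toolkit bundled with PGI 19.10 is older than what the driver supports, so a plain version mismatch would not by itself explain the wrong result.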

For driver issues, I would expect the device to not be recognized or the program to get a runtime error. Wrong results are typically caused by issues with the program or compiler, but since neither has changed, I have no idea why this would occur.

Are you able to provide a minimal reproducing example that I can use to investigate?
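
If the full program is too large to share, one way to work toward a reproducer is to dump an intermediate array from a run on the working driver and from a run on the failing driver, then look for the first element where they diverge. The following is only a sketch in Python under stated assumptions: the arrays are written as plain text with one value per line in a format Python's float() accepts (for example E-format output), and the file names and tolerances are hypothetical.

```python
import math


def load_values(path):
    """Read one floating-point value per non-empty line from a text file."""
    with open(path) as f:
        return [float(line) for line in f if line.strip()]


def first_mismatch(reference, candidate, rel_tol=1e-9, abs_tol=1e-12):
    """Return the index of the first differing element, or None if all match.

    Values are compared with math.isclose, so a NaN on either side is
    always reported as a mismatch. If one sequence is shorter, the index
    just past the shorter one is returned.
    """
    for i, (r, c) in enumerate(zip(reference, candidate)):
        if not math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol):
            return i
    if len(reference) != len(candidate):
        return min(len(reference), len(candidate))
    return None


if __name__ == "__main__":
    # In-memory stand-ins for dumps from the two driver versions.
    # With real dumps one would call, e.g.,
    #   load_values("dump_456.81.txt") and load_values("dump_496.13.txt")
    # where both file names are hypothetical.
    good = [1.0, 2.0, 3.0]
    bad = [1.0, 2.0, 3.5]
    print("first mismatch at index:", first_mismatch(good, bad))
```

Repeating the comparison on arrays dumped after each major compute region should narrow the problem down to the first region whose output differs, and that region is a natural starting point for a small standalone test case.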

Yes. The only change is the driver version. Thanks a lot!
Sorry, the program is very large (more than 50,000 lines).
I appreciate your suggestion.
I will check it again.
Thanks a lot!