Inconsistency between the release and debug builds of the same CUDA kernel

As there are several source files involved, I have packaged them and uploaded the archive: src.zip (4.1 KB).
The source code implements Montgomery modular reduction over a finite field, using different implementation approaches on the host and device sides. The device entry point is the ‘kernel’ function in ‘main.cu’.
The debug and release builds of the same code produce different results: the release build outputs correct results, but the debug build outputs incorrect ones.
The command to compile the release version is:
nvcc -o test main.cu
The output of release version:

e98b9564, a92043ac, b25e5075, 70d69a83, 2f4a1a59, 1f8ade1c, 8c1d97e5, 343b588d, 108ce2db, d4df2d9b, f276f5d6, 1795837
e98b9564, a92043ac, b25e5075, 70d69a83, 2f4a1a59, 1f8ade1c, 8c1d97e5, 343b588d, 108ce2db, d4df2d9b, f276f5d6, 1795837

The command to compile the debug version is:
nvcc -o test -G main.cu
The output of debug version:

e98b9564, a92043ac, b25e5075, 70d69a83, 2f4a1a59, 1f8ade1c, 8c1d97e5, 343b588d, 108ce2db, d4df2d9b, f276f5d6, 1795837
54331d22, 6410f330, 9badc234, 28c1d693, 8acd7b6b, c1f71e54, c66c6c90, d3b5a2ef, e87388f6, 9854398c, 34839e1, c812a3

The two lines of numbers in the output should be exactly the same.

Output of nvidia-smi:

Sun Jul 23 16:02:54 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3080         On | 00000000:01:00.0 Off |                  N/A |
| 30%   40C    P8               36W / 320W|      1MiB / 10240MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3080         On | 00000000:25:00.0 Off |                  N/A |
| 30%   31C    P8               23W / 320W|      1MiB / 10240MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3080         On | 00000000:41:00.0 Off |                  N/A |
| 30%   30C    P8               22W / 320W|      1MiB / 10240MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 3080         On | 00000000:61:00.0 Off |                  N/A |
| 30%   28C    P8               26W / 320W|      1MiB / 10240MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce RTX 3080         On | 00000000:81:00.0 Off |                  N/A |
| 30%   29C    P8               25W / 320W|      1MiB / 10240MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce RTX 3080         On | 00000000:A1:00.0 Off |                  N/A |
| 30%   29C    P8               20W / 320W|      1MiB / 10240MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce RTX 3080         On | 00000000:C1:00.0 Off |                  N/A |
| 30%   28C    P8               29W / 320W|      1MiB / 10240MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce RTX 3080         On | 00000000:E1:00.0 Off |                  N/A |
| 30%   27C    P8               17W / 320W|      1MiB / 10240MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Output of cuda-gdb --version:

NVIDIA (R) CUDA Debugger

CUDA Toolkit 12.1 release

Portions Copyright (C) 2007-2023 NVIDIA Corporation

GNU gdb (GDB) 12.1

Copyright (C) 2022 Free Software Foundation, Inc.

License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>

This is free software: you are free to change and redistribute it.

There is NO WARRANTY, to the extent permitted by law.

What raises a red flag in this code for me is that the carry flag is used to transport information between multiple asm() statements. My understanding is that there is no guarantee that this will have the desired effect. Adding the __volatile__ qualifier does not change that, i.e. it does not regulate what happens in between multiple asm() statements.

To be on the safe side, you would want to change the code such that carry flag generation and carry flag consumption are always contained in the same asm() statement. Information that needs to flow between asm() blocks must be bound to variables.
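To illustrate the suggestion, here is a minimal sketch of a multi-limb addition in which the carry flag never has to survive between two asm() statements. It is not taken from src.zip: the names `add3_carry`, `add_kernel`, `device_add_matches_host` and the 6-limb size are made up for the example. Each asm() statement first recreates the flag from an ordinary variable, runs the add.cc/addc chain, and then copies the final flag back into an ordinary variable.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

constexpr int LIMBS = 6;  // illustrative size: 6 x 32-bit limbs, little-endian

// Hypothetical helper, not taken from src.zip.
// Computes r = a + b + carry_in over 3 limbs and returns the carry-out (0 or 1).
// The carry flag is created, consumed, and copied into a register inside ONE
// asm statement, so nothing relies on CC.CF surviving between asm() statements.
__device__ uint32_t add3_carry(uint32_t *r, const uint32_t *a, const uint32_t *b,
                               uint32_t carry_in)
{
    uint32_t r0, r1, r2, carry_out, scratch;
    asm volatile(
        "add.cc.u32  %4, %11, 0xffffffff;\n\t"  // CF = carry_in (must be 0 or 1)
        "addc.cc.u32 %0, %5, %8;\n\t"           // r0 = a0 + b0 + CF
        "addc.cc.u32 %1, %6, %9;\n\t"           // r1 = a1 + b1 + CF
        "addc.cc.u32 %2, %7, %10;\n\t"          // r2 = a2 + b2 + CF
        "addc.u32    %3, 0, 0;\n\t"             // carry_out = CF, now a plain variable
        : "=r"(r0), "=r"(r1), "=r"(r2), "=r"(carry_out), "=r"(scratch)
        : "r"(a[0]), "r"(a[1]), "r"(a[2]),
          "r"(b[0]), "r"(b[1]), "r"(b[2]), "r"(carry_in));
    (void)scratch;  // only needed as a destination for the flag-setting add
    r[0] = r0; r[1] = r1; r[2] = r2;
    return carry_out;
}

// Adds two LIMBS-limb numbers in chunks of 3 limbs. The carry travels between
// the asm() statements in the C++ variable `carry`, not in the flag register.
// r must have room for LIMBS + 1 words; the last word receives the final carry.
__global__ void add_kernel(uint32_t *r, const uint32_t *a, const uint32_t *b)
{
    uint32_t carry = 0;
    for (int i = 0; i < LIMBS; i += 3)
        carry = add3_carry(r + i, a + i, b + i, carry);
    r[LIMBS] = carry;
}

// Runs the kernel on one test vector whose carries cross the chunk boundary
// and compares against a plain 64-bit host reference.
bool device_add_matches_host()
{
    const uint32_t a[LIMBS] = {0xffffffffu, 0xffffffffu, 0xffffffffu,
                               0x00000001u, 0x00000000u, 0xfffffffeu};
    const uint32_t b[LIMBS] = {0x00000001u, 0x00000000u, 0x00000000u,
                               0xffffffffu, 0xffffffffu, 0x00000001u};

    uint32_t expect[LIMBS + 1];
    uint64_t acc = 0;
    for (int i = 0; i < LIMBS; ++i) {
        acc += (uint64_t)a[i] + b[i];
        expect[i] = (uint32_t)acc;
        acc >>= 32;
    }
    expect[LIMBS] = (uint32_t)acc;

    uint32_t *d = nullptr;  // device layout: a | b | r (LIMBS + 1 words)
    const size_t n = sizeof(a);
    if (cudaMalloc(&d, 3 * n + sizeof(uint32_t)) != cudaSuccess) return false;
    cudaMemcpy(d, a, n, cudaMemcpyHostToDevice);
    cudaMemcpy(d + LIMBS, b, n, cudaMemcpyHostToDevice);
    add_kernel<<<1, 1>>>(d + 2 * LIMBS, d, d + LIMBS);

    uint32_t got[LIMBS + 1] = {};
    cudaError_t err = cudaMemcpy(got, d + 2 * LIMBS, sizeof(got),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;

    for (int i = 0; i <= LIMBS; ++i)
        if (got[i] != expect[i]) return false;
    return true;
}
```

Calling `device_add_matches_host()` from `main` and building once with `nvcc` and once with `nvcc -G` is one way to check that a helper written in this style behaves the same in both builds. Whether this is actually the cause of the mismatch in the posted code would still need to be confirmed against the sources in src.zip.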

