I have a project where I’m trying to offload some loops to the GPU. It compiles, but at runtime it fails with:
Current file: /.../file.inc
function: function_name
line: 172
This file was compiled: -acc=gpu -gpu=cc80 -gpu=cc86
I’m struggling to find the cause of this. The reported line points to a do concurrent loop. I have tried export NVCOMPILER_TERM=trace, but I get no additional information. I am also struggling to reduce it to a minimal reproducible example, because a single test file compiles and runs fine. For example:
program doconcurrent
  implicit none
  integer :: i, n
  real :: a, x(100), y(100)

  n = 100
  a = 2.0
  x = [(real(i), i=1,n)]
  y = [(real(i), i=1,n)]

  ! no locality spec: x and y keep their default locality (see the note below)
  do concurrent (i = 1:n)
    y(i) = y(i) + a*x(i)
  enddo

  print *, "Results:"
  do i = 1, n
    print *, "y(", i, ") = ", y(i)
  enddo
end program doconcurrent
This was compiled with nvfortran -stdpar=gpu -acc=gpu -gpu=cc80 -gpu=cc86 doconcurrent.f90.
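A small aside on locality: my first version of this test used local(x, y) on the do concurrent, but as far as I understand the Fortran 2018 locality rules, local gives each iteration its own undefined copy of the listed variables, so x and y would neither carry their initial values into the loop nor return results from it. That is why the loop above has no locality spec. As far as I can tell, local is only meant for per-iteration temporaries, roughly like this (tmp is just an illustrative scratch variable, declared as real alongside x and y):

do concurrent (i = 1:n) local(tmp)
  tmp = a*x(i)        ! tmp is recomputed every iteration, so local(tmp) is safe
  y(i) = y(i) + tmp   ! x and y keep their default locality and stay shared
enddo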
Both nvidia-smi and nvaccelinfo seem to report the GPU correctly. nvaccelinfo shows:
CUDA Driver Version: 12080
NVRM version: NVIDIA UNIX Open Kernel Module for x86_64 570.133.07 Release Build (dvs-builder@U22-I3-G01-1-1) Fri Mar 14 12:57:14 UTC 2025
Device Number: 0
Device Name: NVIDIA GeForce RTX 3060 Ti
Device Revision Number: 8.6
Global Memory Size: 8232108032
Number of Multiprocessors: 38
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1740 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 7001 MHz
Memory Bus Width: 256 bits
L2 Cache Size: 3145728 bytes
Max Threads Per SMP: 1536
Async Engines: 2
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: Yes
Preemption Supported: Yes
Cooperative Launch: Yes
Unified Memory: HMM
Memory Models Flags: -gpu=mem:separate, -gpu=mem:managed, -gpu=mem:unified
Default Target: cc86
and nvidia-smi shows:
Wed May 14 09:12:01 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07 Driver Version: 570.133.07 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3060 Ti Off | 00000000:01:00.0 Off | N/A |
| 0% 34C P8 7W / 220W | 32MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 181012 G /usr/lib/xorg/Xorg 9MiB |
| 0 N/A N/A 181438 G /usr/bin/gnome-shell 3MiB |
+-----------------------------------------------------------------------------------------+
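Is there anything else I could run to get more detail at the point of failure? I was thinking of something along these lines, assuming I am reading the HPC SDK documentation correctly (my_app just stands in for the real binary):

export NV_ACC_NOTIFY=3        # log kernel launches and data transfers at runtime
compute-sanitizer ./my_app    # check the kernels for invalid memory accesses

but I would welcome any pointers on whether that is the right direction.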
I don’t know whether it helps narrow down the cause, but the full project also uses MPI. Thank you in advance!