CUDA Fortran Error

Hi, I’m Linwei.
I compiled a program with CUDA Fortran, but it fails with the following error.

Software info: PGI pgfortran 19.10-0 (LLVM 64-bit target on x86-64 Linux, -tp sandybridge); nvcc: Cuda compilation tools, release 11.0, V11.0.194; OS: CentOS Linux release 7.8.2003 (Core).

Error information:

0: ALLOCATE: 0 bytes requested; not enough memory: 0(no error)

/opt/pgi/linux86-64-llvm/19.10/lib/libpgf90.so(__fort_abort+0x4d) [0x7f499ce9d72d]

/opt/pgi/linux86-64-llvm/19.10/lib/libcudafor.so(+0x74ec7) [0x7f49aeb4cec7]

/opt/pgi/linux86-64-llvm/19.10/lib/libcudafor.so(pgf90_dev_auto_alloc04_i8+0x29) [0x7f49aeb4cf31]

./a.out() [0x401833]

/usr/lib/gcc/x86_64-redhat-linux/4.8.5/…/…/…/…/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f499adcb555]

./a.out() [0x4016d9]

My objective is to use the GPU to reduce calculation time, using the `device` attribute to transfer data, but I can’t resolve this error. I hope an engineer can help me.

Thanks

This seems to indicate that the device you’re using doesn’t have enough free memory to satisfy this allocation.

What device are you using and how much memory are you trying to allocate?

Do you have a small reproducing example that you can share? That would help to determine the problem.
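In the meantime, one pattern worth checking: the backtrace goes through `pgf90_dev_auto_alloc04_i8`, which is the runtime routine that allocates automatic arrays inside device code. If the size that reaches the device is zero or uninitialized, you get exactly a “0 bytes requested” abort. A minimal sketch of that pattern (hypothetical names, not your actual code):

```fortran
! Sketch only: an automatic array inside a device kernel is allocated
! at launch time via pgf90_dev_auto_alloc. If n arrives as 0 (e.g. it
! was never set, or not passed by value), the allocation aborts with
! "ALLOCATE: 0 bytes requested; not enough memory".
module kernels
  use cudafor
contains
  attributes(global) subroutine work(a, n)
    integer, value :: n
    real :: a(n)        ! device dummy array
    real :: tmp(n)      ! automatic array -> device-side allocation
    integer :: i
    i = threadIdx%x
    if (i <= n) then
      tmp(i) = 2.0 * a(i)
      a(i)   = tmp(i)
    end if
  end subroutine
end module
```

If your code uses automatic arrays like `tmp` above, verify that every size variable is set on the host and passed with the `value` attribute before the kernel launch. This is only one possible cause; a reproducer would confirm it.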

This is the cuda-memcheck usage information:

Usage: cuda-memcheck [options] [your-program] [your-program-options]
Options:
--binary-patching <yes|no> [Default : yes]
    Control the binary patching of the device code. This is enabled by default.
    Disabling this option will result in a loss of precision for error reporting.
--check-api-memory-access <yes|no> [Default : yes]
    Check cudaMemcpy/cudaMemset for accesses to device memory
--check-deprecated-instr <yes|no> [Default : no]
    Check for usage of deprecated instructions.
    If deprecated instruction usage is found, an error will be reported.
    Which instructions are checked might depend on the selected tool.
    This is disabled by default.
--check-device-heap <yes|no> [Default : yes]
    Check allocations on the device heap. This is enabled by default.
--demangle <full|simple|no> [Default : full]
    Demangle function names
    full : Show full name and prototype
    simple : Show only device kernel name
    no : Show mangled names
--destroy-on-device-error <context|kernel> [Default : context]
    Behavior of cuda-memcheck on a precise device error.
    NOTE: Imprecise errors will always destroy the context.
    context : CUDA Context is terminated with an error.
    kernel : Kernel is terminated. Subsequent kernel launches are still allowed.
--error-exitcode [Default : 0]
    When this is set, memcheck will return the given exitcode when any errors are detected
--filter key1=val1,key2=val2,...
    The filter option can be used to control the kernels that will be checked by the tool.
    Multiple filter options can be defined. Each option is additive, so kernels matching
    any specified filter will be checked.
    Filters are specified as key value pairs, with each pair separated by a ','.
    Keys have both a long form, and a shorter form for convenience.
    Valid values for keys are:
    kernel_name, kne : The value is the full demangled name of the kernel
    kernel_substring, kns : The value is a substring present in the demangled name of the kernel
    NOTE: The name and substring keys cannot be simultaneously specified
--flush-to-disk <yes|no> [Default : no]
    Flush errors to disk. This can be enabled to ensure all errors are flushed down.
--force-blocking-launches <yes|no> [Default : no]
    Force launches to be blocking.
-h | --help
    Show this message.
--help-debug
    Show information about debug only flags
--language <c|fortran> [Default : c]
    This option can be used to enable language specific behavior. When set to fortran, the thread and block indices
    of messages printed by cuda-memcheck will start with a 1-based offset to match Fortran semantics.
--log-file
    File where cuda-memcheck will write all of its text output. If not specified, memcheck output is written to stdout.
    The sequence %p in the string name will be replaced by the pid of the cuda-memcheck application.
    The sequence %q{FOO} will be replaced by the value of the environment variable FOO. If the environment variable
    is not defined, it will be replaced by an empty string.
    The sequence %% is replaced with a literal % in the file name.
    Any other character following % will cause the entire string to be ignored.
    If the file cannot be written to for any reason, including an invalid path, insufficient permissions, or the disk being full,
    the output will go to stdout.
--leak-check <full|no> [Default : no]
    Print leak information for CUDA allocations.
    NOTE: Program must end with cudaDeviceReset() for this to work.
--prefix
    Changes the prefix string displayed by cuda-memcheck.
--print-level <info|warn|error|fatal> [Default : warn]
    Set the minimum level of errors to print
--print-limit [Default is : 10000]
    When this is set, memcheck will stop printing errors after reaching the given number
    of errors. Use 0 for unlimited printing.
--read
    Reads error records from a given file.
--racecheck-report <all|hazard|analysis> [Default : analysis]
    The reporting mode that applies to racecheck.
    all : Report all hazards and race analysis reports.
    hazard : Report only hazards.
    analysis : Report only race analysis results.
--report-api-errors <all|explicit|no> [Default : explicit]
    Print errors if any API call fails
    all : Report all CUDA API errors, including those APIs invoked implicitly
    explicit : Report errors in explicit CUDA API calls only
    no : Disable reporting of CUDA API errors
--save
    Saves the error record to file.
    The sequence %p in the string name will be replaced by the pid of the cuda-memcheck application.
    The sequence %q{FOO} will be replaced by the value of the environment variable FOO. If the environment variable
    is not defined, it will be replaced by an empty string.
    The sequence %% is replaced with a literal % in the file name.
    Any other character following % will cause an error.
--show-backtrace <yes|host|device|no> [Default : yes]
    Display a backtrace on error.
    no : No backtrace shown
    host : Only host backtrace shown
    device : Only device backtrace shown for precise errors
    yes : Host and device backtraces shown
    See the manual for more information
--tool <memcheck|racecheck|synccheck|initcheck> [Default : memcheck]
    Set the tool to use.
    memcheck : Memory access checking
    racecheck : Shared memory hazard checking
    Note : This disables memcheck, so make sure the app is error free.
    synccheck : Synchronization checking
    initcheck : Global memory initialization checking
--track-unused-memory <yes|no> [Default : no]
    Check for unused memory allocations. This requires the initcheck tool.
-V | --version
    Print the version of cuda-memcheck.
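One detail from the usage text above matters here: `--leak-check full` only reports leaks if the program ends with `cudaDeviceReset()`. A minimal CUDA Fortran skeleton of that pattern (a sketch with hypothetical names, not the attached code):

```fortran
! Sketch only: ending a CUDA Fortran program with cudaDeviceReset so
! that "cuda-memcheck --leak-check full ./a.out" can report leaks.
program main
  use cudafor
  implicit none
  real, device, allocatable :: d_a(:)
  integer :: istat
  allocate (d_a(1024))            ! device allocation to be tracked
  ! ... kernel launches and data transfers would go here ...
  deallocate (d_a)
  istat = cudaDeviceReset()       ! required for --leak-check to work
end program
```

The tool itself would then be run as, for example, `cuda-memcheck --tool memcheck --language fortran ./a.out`.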

And this is the nvidia-smi info:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 1660    Off  | 00000000:03:00.0  On |                  N/A |
| 32%   31C    P8     4W / 120W |    240MiB /  5944MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2200      G   /usr/bin/X                        120MiB |
|    0   N/A  N/A      3449      G   /usr/bin/gnome-shell              117MiB |
+-----------------------------------------------------------------------------+

This is the pgaccelinfo output:

CUDA Driver Version: 11000
NVRM version: NVIDIA UNIX x86_64 Kernel Module 450.51.05 Sun Jun 28 10:33:40 UTC 2020

Device Number: 0
Device Name: GeForce GTX 1660
Device Revision Number: 7.5
Global Memory Size: 6232997888
Number of Multiprocessors: 22
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 65536
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 2147483647 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1800 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 4001 MHz
Memory Bus Width: 192 bits
L2 Cache Size: 1572864 bytes
Max Threads Per SMP: 1024
Async Engines: 3
Unified Addressing: Yes
Managed Memory: Yes
Concurrent Managed Memory: Yes
Preemption Supported: Yes
Cooperative Launch: Yes
Multi-Device: Yes
PGI Default Target: -ta=tesla:cc75
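Given the cc75 (compute capability 7.5) target reported above, the build line for this device would typically look like the following. This is a sketch only; `main.cuf` is a placeholder for the actual source file:

```shell
# Hypothetical build line for a GTX 1660 (cc75) with PGI 19.10.
# -Mcuda enables CUDA Fortran; -g keeps symbols for readable backtraces.
pgfortran -Mcuda=cc75 -g main.cuf -o a.out
```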

This is the code example:

nvidiadebug.txt (32.4 KB)

I’m trying to use GPU acceleration. Could an engineer provide an example of how to debug this? Thanks a lot!