Async directive

When I ran a job written in OpenACC, an error occurred.
Could you let me know why I got the error message?
When I then set the environment variable “PGI_ACC_BUFFERSIZE” and reduced its value to 4M, no error occurred.
Could you let me know why I did NOT get the error message when I set the environment variable “PGI_ACC_BUFFERSIZE”?

Here is the error message from standard output.

total/free CUDA memory: 1011023872/998965248
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 2.1
host:0x602d00 device:0x400540000 size:512 presentcount:1+0 line:11 name:a
allocated block device:0x400540000 size:512 thread:1
call to cuMemHostAlloc returned error 2: Out of memory


And here is the source code. It is very simple.

program async_test
  implicit none
  integer(kind=4),parameter :: n = 128
  integer(kind=4),dimension(n) :: a
  integer(kind=4) :: i

  do i = 1,n
    a(i) = i
  end do

  !$acc data create(a(:))
  do i = 1,n
    !$acc update device(a(i:i)) async(i)
  end do
  !$acc wait
  !$acc end data

end program async_test



Here is the result of “deviceQuery”.


CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “GeForce GT 610”
CUDA Driver Version / Runtime Version 8.0 / 7.5
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 964 MBytes (1011023872 bytes)
( 1) Multiprocessors, ( 48) CUDA Cores/MP: 48 CUDA Cores
GPU Max Clock rate: 1620 MHz (1.62 GHz)
Memory Clock rate: 500 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 65536 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65535), 3D=(2048, 2048, 2048)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (65535, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 11 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GT 610
Result = PASS

I compiled the OpenACC job with the PGI compiler.
% pgfortran -V

pgfortran 16.5-0 64-bit target on x86-64 Linux -tp p7
The Portland Group - PGI Compilers and Tools
Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved

Hi KOUCHI_Hiroyuki,

Every async queue will need its own set of buffers. Given you’re creating 128 queues and the default buffer size is 16 MB, it makes sense that you’d run out of memory on this device.

Reducing the buffer size works, but you can also use a fixed number of queues.
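
A minimal sketch of the fixed-queue approach, based on the test program posted above. The name nqueues and the value 4 are illustrative assumptions, not values recommended in this thread. Each update is mapped onto one of a small set of queue IDs, so the runtime should only need buffers for that many queues rather than for all 128.

```fortran
program async_fixed_queues
  implicit none
  integer(kind=4), parameter :: n = 128
  ! Hypothetical queue count, chosen only for illustration.
  integer(kind=4), parameter :: nqueues = 4
  integer(kind=4), dimension(n) :: a
  integer(kind=4) :: i

  do i = 1, n
    a(i) = i
  end do

  !$acc data create(a(:))
  do i = 1, n
    ! Cycle through queue IDs 1..nqueues instead of using one queue per element.
    !$acc update device(a(i:i)) async(mod(i - 1, nqueues) + 1)
  end do
  ! Wait for all outstanding async queues before leaving the data region.
  !$acc wait
  !$acc end data

  print *, 'async queues used:', nqueues
end program async_fixed_queues
```

A suitable queue count depends on how much overlap the transfers actually need. With a single copy engine, as on this GT 610, a handful of queues is probably enough.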

- Mat

Dear Mat-san,

Thank you for the reply.
I did not know that the default value of the environment variable is 16M.
As you said, my job used up the physical memory on my machine.

16M x 128 queues = 2048M, but the amount of physical memory on my machine is 1024M.

And when I reduced the buffer size to 7M, my job finished normally, as you said.
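
The estimate above can be reproduced with a short program. This is only a rough sketch that assumes exactly one buffer per queue. Since each queue has its own set of buffers, the real total may be higher, so the figures should be read as lower bounds.

```fortran
program buffer_estimate
  implicit none
  ! Number of async queues created by the test program (one per array element).
  integer, parameter :: nqueues = 128
  ! Buffer sizes in MB: the 16 MB default and the reduced 7 MB value from this thread.
  integer, parameter :: sizes_mb(2) = [16, 7]
  integer :: k

  do k = 1, size(sizes_mb)
    ! Assumes one buffer per queue, so this is a lower bound on the total.
    print '(a,i0,a,i0,a)', 'buffer size ', sizes_mb(k), ' MB -> total ', &
          sizes_mb(k) * nqueues, ' MB'
  end do
end program buffer_estimate
```

Under that assumption, 16 MB gives 2048 MB and 7 MB gives 896 MB, which is consistent with the job failing at the default size and finishing at 7M.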

Could I ask a few more questions?

  1. Configuration of “memlock”
    Could you let me know whether the “memlock” setting is related to this problem?
    When my job failed with the error message, I suspected that pinned memory was the cause.
    So I modified the configuration file /etc/security/limits.conf on my machine and increased the memlock limit.
    Here is the relevant part of the configuration file.

    * soft memlock unlimited
    * hard memlock unlimited

    If so, I suppose I need to increase the memlock limit whenever I run an OpenACC job built with the PGI compiler.

  2. CUDA Environment
    What about the “CUDA” environment?
    Could you let me know whether the environment variable “PGI_ACC_BUFFERSIZE” is related to the “CUDA” environment?

Sincerely yours,

  1. Configuration of “memlock”

I’ve personally never encountered a problem with running out of pinned memory. I suppose it could cause a problem if set too low.

  2. CUDA Environment
    What about the “CUDA” environment?

I’m not clear on what you mean by the “CUDA” environment or what you’re asking. PGI_ACC_BUFFERSIZE controls the buffer size the PGI runtime uses when transferring data between the host and device. The PGI runtime does use CUDA API calls to manage this movement when using an NVIDIA device, but the buffer size is controlled by PGI.

- Mat

Dear Mat-san,

Thank you for your advice.
After I edited /etc/security/limits.conf to raise “memlock” to
unlimited and reduced the value of the environment variable “PGI_ACC_BUFFERSIZE”,
my job ran normally with “async” directives. I think both settings contributed to the problem.

My question was unclear: I wanted to know whether PGI_ACC_BUFFERSIZE applies only to the PGI runtime.

At the moment, my job runs without problems with “async” directives.

Sincerely yours,