integer, device :: temp1_d(1382400000)
integer :: temp1(1382400000), istat
If I define these two arrays and execute temp1_d = temp1 or temp1 = temp1_d, I get the error “0: copyout MemcpyAsync (host=0x000000014013B240, dev=0x0000004309800000, size=5529600000, stream=1) FAILED: 11(invalid argument)”

My GPU is a Titan Xp with 12 GB of memory. What is the problem?

(Note that this is the same answer as in your other post: What is the greatest size of data can transform)


I tried to write a reproducer so I could see what’s wrong, but it worked fine for me (see below). Can you please either modify my example so that it captures your error, or write an example of your own that causes the failure?

Note that Windows doesn’t support the medium memory model. So if you’re on Windows, you must allocate the arrays dynamically instead of declaring them with a fixed size (and compile with -Mlarge_arrays).
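
For reference, here is a minimal sketch of what the allocatable variant might look like. I have not run it on Windows; the program name foo_alloc, the parameter n, and the error messages are only illustrative, and the array size is taken from your declaration.

```fortran
program foo_alloc
  use cudafor
  implicit none

  ! Illustrative names; n matches the size of the arrays in the question.
  integer(kind=8), parameter :: n = 1382400000_8
  integer, allocatable         :: temp1(:)     ! host array
  integer, allocatable, device :: temp1_d(:)   ! device array
  integer :: istat

  ! Allocate at run time instead of declaring fixed-size arrays.
  allocate(temp1(n), stat=istat)
  if (istat /= 0) stop 'host allocation failed'
  allocate(temp1_d(n), stat=istat)
  if (istat /= 0) stop 'device allocation failed'

  temp1 = 0
  temp1_d = temp1      ! host-to-device copy
  temp1 = temp1_d      ! device-to-host copy
  print *, temp1(n)

  deallocate(temp1_d)
  deallocate(temp1)
end program foo_alloc
```

Assuming the file is saved as test_alloc.cuf, it would be compiled with something like "pgf90 test_alloc.cuf -Mlarge_arrays". Checking istat after each allocate should also show whether the failure is in the allocation itself or only in the copy.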


Note that I don’t have a Titan Xp, so I used the closest card I have, which is a Titan X (Pascal). The two should behave similarly, though.

% pgaccelinfo 

CUDA Driver Version:           9020 
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  396.26  Mon Apr 30 18:01:39 PDT 2018 

Device Number:                 0 
Device Name:                   TITAN X (Pascal) 
Device Revision Number:        6.1 
Global Memory Size:            12788498432 
Number of Multiprocessors:     28 
Concurrent Copy and Execution: Yes 
Total Constant Memory:         65536 
Total Shared Memory per Block: 49152 
Registers per Block:           65536 
Warp Size:                     32 
Maximum Threads per Block:     1024 
Maximum Block Dimensions:      1024, 1024, 64 
Maximum Grid Dimensions:       2147483647 x 65535 x 65535 
Maximum Memory Pitch:          2147483647B 
Texture Alignment:             512B 
Clock Rate:                    1531 MHz 
Execution Timeout:             No 
Integrated Device:             No 
Can Map Host Memory:           Yes 
Compute Mode:                  default 
Concurrent Kernels:            Yes 
ECC Enabled:                   No 
Memory Clock Rate:             5005 MHz 
Memory Bus Width:              384 bits 
L2 Cache Size:                 3145728 bytes 
Max Threads Per SMP:           2048 
Async Engines:                 2 
Unified Addressing:            Yes 
Managed Memory:                Yes 
Concurrent Managed Memory:     Yes 
Preemption Supported:          Yes 
Cooperative Launch:            Yes 
  Multi-Device:                Yes 
PGI Default Target:            -ta=tesla:cc60 

% cat test.cuf 
module bar 

 use cudafor 

attributes(global) subroutine test(temp1) 
   integer, dimension(:) :: temp1 
   integer i 
   i = threadidx%x 
   temp1(1382400000-i) = i 

end subroutine test 

end module bar 

program foo 

use cudafor 
use bar 

integer, device :: temp1_d(1382400000) 
integer :: temp1(1382400000), istat 

temp1_d = temp1 
call test<<<1,32>>>(temp1_d) 
temp1 = temp1_d 
print *, temp1(1382400000-32:1382400000) 

end program foo 
% pgf90 test.cuf -mcmodel=medium -V18.4 
% a.out 
           32           31           30           29           28           27 
           26           25           24           23           22           21 
           20           19           18           17           16           15 
           14           13           12           11           10            9 
            8            7            6            5            4            3 
            2            1            0