What is the maximum amount of data that can be transferred between the device and the host?


The theoretical or the practical limit? Theoretically it’s 2^64 bytes, but in practice the limit is the size of the device memory. However, if you use CUDA Unified Memory with CUDA 9.0 or later on a Pascal or Volta GPU, you can oversubscribe the GPU memory.
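For the Unified Memory route, a minimal CUDA Fortran sketch might look like the following. The program name, array name, and element count are placeholders chosen for illustration, and whether oversubscription actually works depends on your GPU, driver, and OS.

```fortran
program managed_sketch
  use cudafor
  implicit none
  ! Hypothetical element count; choose a size suited to your system.
  integer(kind=8), parameter :: n = 1000000_8
  ! "managed" places the array in CUDA Unified Memory.
  integer, managed, allocatable :: a(:)
  integer(kind=8) :: i
  integer :: istat

  allocate(a(n), stat=istat)
  if (istat /= 0) stop 'allocation failed'

  ! CUF kernel directive: the loop below runs on the device.
  !$cuf kernel do <<<*,*>>>
  do i = 1, n
     a(i) = 1
  end do

  ! Wait for the device to finish before reading managed data on the host.
  istat = cudaDeviceSynchronize()
  print *, a(1), a(n)

  deallocate(a)
end program managed_sketch
```

No explicit host-to-device copy appears here because the CUDA driver migrates managed pages on demand.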

Note that if any single dynamically allocated object is larger than 2GB, you need to add the flag “-Mlarge_arrays”. If you have a large static object and are on Linux, add “-mcmodel=medium” instead. By default we use 32-bit offsets for addressing since that’s better for performance. These flags switch the compiler to 64-bit offsets.
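To check whether an array crosses the 2GB threshold, multiply its element count by its element size. The sketch below does this in standard Fortran for an illustrative count of 1382400000 default integers; the 4-byte element size is an assumption about the default integer kind, and the program and variable names are made up for the example.

```fortran
program size_check
  use iso_fortran_env, only: int64
  implicit none
  ! Illustrative element count; substitute your own array size.
  integer(int64), parameter :: n = 1382400000_int64
  ! Assumption: default INTEGER occupies 4 bytes.
  integer(int64), parameter :: elem_bytes = 4_int64
  ! 2GB expressed in bytes (2**31), computed in 64-bit arithmetic.
  integer(int64), parameter :: two_gb = 2_int64**31
  integer(int64) :: nbytes

  nbytes = n * elem_bytes
  print *, 'array size in bytes: ', nbytes
  if (nbytes > two_gb) then
     print *, 'larger than 2GB: 64-bit offsets are needed'
  else
     print *, 'fits within 32-bit offsets'
  end if
end program size_check
```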


integer, device :: temp1_d(1382400000)
integer :: temp1(1382400000), istat
If I define these two arrays and execute temp1_d = temp1 or temp1 = temp1_d, I get the error “0: copyout MemcpyAsync (host=0x000000014013B240, dev=0x0000004309800000, size=5529600000, stream=1) FAILED: 11(invalid argument)”

My GPU is a Titan Xp with 12GB of memory. What is the problem?


I tried to write a reproducer so I could see what’s wrong, but it worked fine for me (see below). Can you please either modify my example to better capture your error or write an example that causes the failure?

Note that Windows doesn’t support the medium memory model. So if you’re on Windows, you must allocate the arrays dynamically instead of declaring them with a fixed size, and compile with -Mlarge_arrays.
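As a rough sketch of the allocatable approach, the declarations could be changed along these lines. The program name and source file name are hypothetical, and I have not run this exact variant, so treat it as a starting point rather than a verified fix.

```fortran
! Hypothetical build line: pgf90 -Mlarge_arrays test_alloc.cuf
program foo_alloc
  use cudafor
  implicit none
  ! 64-bit element count, same size as the arrays in the question.
  integer(kind=8), parameter :: n = 1382400000_8
  integer, allocatable :: temp1(:)            ! host array
  integer, device, allocatable :: temp1_d(:)  ! device array
  integer :: istat

  allocate(temp1(n), stat=istat)
  if (istat /= 0) stop 'host allocation failed'
  allocate(temp1_d(n), stat=istat)
  if (istat /= 0) stop 'device allocation failed'

  temp1 = 0
  temp1_d = temp1   ! host-to-device copy
  temp1 = temp1_d   ! device-to-host copy
  print *, temp1(n)

  deallocate(temp1_d)
  deallocate(temp1)
end program foo_alloc
```

Checking the stat= value after each allocate makes it easier to tell an out-of-memory condition apart from a failed copy.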


Note that I don’t have a Titan Xp, so I used the closest card I have, which is a Titan X (Pascal). The two should behave similarly.

% pgaccelinfo

CUDA Driver Version:           9020
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  396.26  Mon Apr 30 18:01:39 PDT 2018

Device Number:                 0
Device Name:                   TITAN X (Pascal)
Device Revision Number:        6.1
Global Memory Size:            12788498432
Number of Multiprocessors:     28
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1531 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             5005 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 3145728 bytes
Max Threads Per SMP:           2048
Async Engines:                 2
Unified Addressing:            Yes
Managed Memory:                Yes
Concurrent Managed Memory:     Yes
Preemption Supported:          Yes
Cooperative Launch:            Yes
  Multi-Device:                Yes
PGI Default Target:            -ta=tesla:cc60

% cat test.cuf
module bar

 use cudafor

attributes(global) subroutine test(temp1)
   integer, dimension(:) :: temp1
   integer i
   i = threadidx%x
   temp1(1382400000-i) = i

end subroutine test

end module bar

program foo

use cudafor
use bar

integer, device :: temp1_d(1382400000)
integer :: temp1(1382400000), istat

temp1_d = temp1
call test<<<1,32>>>(temp1_d)
temp1 = temp1_d
print *, temp1(1382400000-32:1382400000)

end program foo
% pgf90 test.cuf -mcmodel=medium -V18.4
% a.out
           32           31           30           29           28           27
           26           25           24           23           22           21
           20           19           18           17           16           15
           14           13           12           11           10            9
            8            7            6            5            4            3
            2            1            0

Thank you very much!