cuMemcpyDtoH error 999

I’m testing the OpenACC directives with the following function:

void update(int ***restrict solid, float ****restrict fp, float ***restrict rho, float ***restrict ux, float ***restrict uy, float ***restrict uz, int nx, int ny, int nz, float mass, int time_total)
{
    int x, y, z, t;
    for (t = 0; t <= time_total; t++)
    {
        mass = 0.0;

        #pragma acc kernels loop present(solid[0:nx-1][0:ny-1][0:nz-1], fp[0:nx-1][0:ny-1][0:nz-1][0:18], rho[0:nx-1][0:ny-1][0:nz-1], ux[0:nx-1][0:ny-1][0:nz-1], uy[0:nx-1][0:ny-1][0:nz-1], uz[0:nx-1][0:ny-1][0:nz-1])
        for (x = 0; x < nx; x++)
        {
            for (y = 0; y < ny; y++)
            {
                for (z = 0; z < nz; z++)
                {
                    if (!solid[x][y][z])
                    {
                        rho[x][y][z] = fp[x][y][z][0]+fp[x][y][z][1]+fp[x][y][z][2]+fp[x][y][z][3]+fp[x][y][z][4]
                                     +fp[x][y][z][5]+fp[x][y][z][6]+fp[x][y][z][7]+fp[x][y][z][8]+fp[x][y][z][9]
                                     +fp[x][y][z][10]+fp[x][y][z][11]+fp[x][y][z][12]+fp[x][y][z][13]+fp[x][y][z][14]
                                     +fp[x][y][z][15]+fp[x][y][z][16]+fp[x][y][z][17]+fp[x][y][z][18];

                        ux[x][y][z] = (fp[x][y][z][1]+fp[x][y][z][2]+fp[x][y][z][8]-fp[x][y][z][4]-fp[x][y][z][5]-fp[x][y][z][6]+fp[x][y][z][15]+fp[x][y][z][18]-fp[x][y][z][16]-fp[x][y][z][17])/rho[x][y][z];

                        uy[x][y][z] = (fp[x][y][z][2]+fp[x][y][z][3]+fp[x][y][z][4]-fp[x][y][z][6]-fp[x][y][z][7]-fp[x][y][z][8]+fp[x][y][z][9]+fp[x][y][z][14]-fp[x][y][z][11]-fp[x][y][z][12])/rho[x][y][z];

                        uz[x][y][z] = (fp[x][y][z][9]+fp[x][y][z][10]+fp[x][y][z][11]-fp[x][y][z][12]-fp[x][y][z][13]-fp[x][y][z][14]+fp[x][y][z][15]+fp[x][y][z][16]-fp[x][y][z][17]-fp[x][y][z][18])/rho[x][y][z];

                        mass += rho[x][y][z];
                    }
                }
            }
        }
    }
}


And the function is called from the main with:

#pragma acc data copy(s[0:nx-1][0:ny-1][0:nz-1], f[0:nx-1][0:ny-1][0:nz-1][0:18], r[0:nx-1][0:ny-1][0:nz-1], uxx[0:nx-1][0:ny-1][0:nz-1], uyy[0:nx-1][0:ny-1][0:nz-1], uzz[0:nx-1][0:ny-1][0:nz-1])
update(s, f, r, uxx, uyy, uzz, nx, ny, nz, mass, time_total);

The info messages and error I got are the following:

PGI$ pgcc -acc -Minfo=accel -fast lbm3Dacc4.c
NOTE: your trial license will expire in 10 days, 6.38 hours.
268, Generating present(uz[0:nx-1][0:ny-1][0:nz-1])
Generating present(uy[0:nx-1][0:ny-1][0:nz-1])
Generating present(ux[0:nx-1][0:ny-1][0:nz-1])
Generating present(rho[0:nx-1][0:ny-1][0:nz-1])
Generating present(fp[0:nx-1][0:ny-1][0:nz-1][0:18])
Generating present(solid[0:nx-1][0:ny-1][0:nz-1])
Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
269, Loop is parallelizable
271, Loop is parallelizable
273, Loop is parallelizable
Accelerator kernel generated
269, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.z */
271, #pragma acc loop gang, vector(4) /* blockIdx.x threadIdx.y */
273, #pragma acc loop vector(16) /* threadIdx.x */
CC 1.0 : 24 registers; 1136 shared, 72 constant, 0 local memory bytes; 33% occupancy
CC 2.0 : 32 registers; 1032 shared, 136 constant, 0 local memory bytes; 66% occupancy
282, Sum reduction generated for mass
368, Generating copy(uzz[0:nx-1][0:ny-1][0:nz-1])
Generating copy(uyy[0:nx-1][0:ny-1][0:nz-1])
Generating copy(uxx[0:nx-1][0:ny-1][0:nz-1])
Generating copy(r[0:nx-1][0:ny-1][0:nz-1])
Generating copy(f[0:nx-1][0:ny-1][0:nz-1][0:18])
Generating copy(s[0:nx-1][0:ny-1][0:nz-1])

PGI$ lbm3Dacc4.exe
call to cuMemcpyDtoH returned error 999: Unknown
CUDA driver version: 4000

I tried to strictly follow the program structure of the OpenACC example file “acc_c3a.c”. I should note that I allocated the arrays with calloc instead of malloc.

I would appreciate any advice about this error. Regards!

Hi amunozf,

It looks like the compiler is functioning correctly, but there is some issue with your driver. Our default is to use CUDA 4.0, so the generated code should work fine with a 4.0 driver; why it doesn’t, I’m not sure.

What is the output from the command “pgaccelinfo”? Can you try updating your driver to the latest CUDA development version (



After upgrading the drivers as suggested, we still weren’t able to fix the cuMemcpyDtoH error. Below are the results we got:

PGI$ pgcc -acc lbm3Dacc4.c
NOTE: your trial license will expire in 0 days, 7.86 hours.
PGI$ lbm3Dacc4.exe
call to cuMemcpyDtoH returned error 999: Unknown
CUDA driver version: 4020
PGI$ pgaccelinfo
CUDA Driver Version: 4020

Device Number: 0
Device Name: GeForce GTX 580
Device Revision Number: 2.0
Global Memory Size: 1610285056
Number of Multiprocessors: 16
Number of Cores: 512
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1544 MHz
Execution Timeout: Yes
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: default
Concurrent Kernels: Yes
ECC Enabled: No
Memory Clock Rate: 2004 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 786432 bytes
Max Threads Per SMP: 1536
Async Engines: 1
Unified Addressing: No
Current free memory: 1433202688
Upload time (4MB): 940 microseconds ( 620 ms pinned)
Download time: 940 microseconds ( 780 ms pinned)
Upload bandwidth: 4462 MB/sec (6765 MB/sec pinned)
Download bandwidth: 4462 MB/sec (5377 MB/sec pinned)

Because we ran out of time, we’ll wait for future releases of the acc engine.



Hi Antony,

Having a complete example would be useful, since I’m not sure how your arrays are declared or allocated. One major difference between the PGI Accelerator Model and OpenACC is that in the PGI model the copy clauses take a range (lower bound to upper bound), while in OpenACC they take the starting element and the number of elements to copy. So in your case, the error may be caused by using “nx-1”, “ny-1”, and “nz-1”, assuming that your arrays have nx, ny, and nz elements; the OpenACC sections would then need nx, ny, and nz as their lengths. Also, OpenACC requires arrays to be contiguous and not arrays of pointers.

  • Mat
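
To make the two points above concrete, here is a minimal sketch, not the original program. It assumes the density field is stored as one contiguous block and indexed by hand, and the names alloc_field, idx3, and total_mass are hypothetical. The section rho[0:n] reads as "start at element 0, n elements", so the length is the full element count rather than the last index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Row-major offset of element (x, y, z) in a contiguous nx*ny*nz block. */
static inline size_t idx3(size_t x, size_t y, size_t z, size_t ny, size_t nz)
{
    return (x * ny + y) * nz + z;
}

/* Hypothetical helper: one zero-initialised contiguous allocation
   instead of an array of pointers to rows. */
float *alloc_field(int nx, int ny, int nz)
{
    return calloc((size_t)nx * ny * nz, sizeof(float));
}

/* Hypothetical helper: sums the field, mirroring the mass reduction.
   OpenACC sections are [start:length], so the length here is n,
   not n-1. Without an OpenACC compiler the pragma is ignored and the
   loops simply run on the host. */
float total_mass(const float *restrict rho, int nx, int ny, int nz)
{
    size_t n = (size_t)nx * ny * nz;
    float mass = 0.0f;
    int x, y, z;

    assert(rho != NULL);

    #pragma acc kernels loop copyin(rho[0:n]) reduction(+:mass)
    for (x = 0; x < nx; x++)
        for (y = 0; y < ny; y++)
            for (z = 0; z < nz; z++)
                mass += rho[idx3(x, y, z, ny, nz)];

    return mass;
}
```

A caller would allocate each field once with alloc_field(nx, ny, nz), fill it through idx3, and pass the flat pointer to the accelerator region. Whether this resolves the cuMemcpyDtoH failure in the original code is not confirmed in this thread; it only illustrates the section syntax and the contiguous layout.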