Illegal address during kernel execution for large grids

Hi,
for the same code described in another post of mine (https://forums.developer.nvidia.com/t/from-four-nested-loops-to-3d-blocks/134127/1) I noticed a strange error which may be related to the one reported by another user in this post: https://forums.developer.nvidia.com/t/how-dose-pgi-manage-collapse-clause/134079/1

The error I am speaking about is the following one:

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

and it occurs only for problem sizes above a certain threshold.

As shown in my previous post, my code contains 4 nested loops. The resulting executable runs fine and produces correct results up to a size of 48x48x48x38; increasing any of the four dimensions beyond that produces the error above.
The same error is produced by both approach A and approach B from my previous post: https://forums.developer.nvidia.com/t/from-four-nested-loops-to-3d-blocks/134127/1

The same code, compiled for the host, runs fine on the CPU for any problem size that fits in memory.

Running cuda-memcheck on a GPU executable built with approach B gave me the following output (truncated here: it continues with the same kind of errors for other threads…).

========= CUDA-MEMCHECK
========= Program hit error 201 on CUDA API call to cuCtxAttach 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/libcuda.so (cuCtxAttach + 0x182) [0x144de2]
=========     Host Frame:./test-acc [0x12cd9]
=========
========= Program hit error 1 on CUDA API call to cuPointerGetAttribute 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/libcuda.so (cuPointerGetAttribute + 0x1b5) [0x13a9e5]
=========     Host Frame:./test-acc [0x15d46]
=========
========= Invalid __global__ read of size 8
=========     at 0x00000c10 in Deo_123_gpu
=========     by thread (479,0,0) in block (0,0,0)
=========     Address 0x228b681df8 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so (cuLaunchKernel + 0x331) [0x138291]
=========     Host Frame:./test-acc [0x14b01]
=========
========= Invalid __global__ read of size 8
=========     at 0x00000c10 in Deo_123_gpu
=========     by thread (478,0,0) in block (0,0,0)
=========     Address 0x228b681de8 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so (cuLaunchKernel + 0x331) [0x138291]
=========     Host Frame:./test-acc [0x14b01]
=========
========= Invalid __global__ read of size 8
=========     at 0x00000c10 in Deo_123_gpu
=========     by thread (477,0,0) in block (0,0,0)
=========     Address 0x228b681dd8 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so (cuLaunchKernel + 0x331) [0x138291]
=========     Host Frame:./test-acc [0x14b01]

...

Given that the same code performs no out-of-bounds accesses on the CPU, and that it runs correctly on the GPU for smaller dimensions, I suspect that for larger dimensions the threads are somehow not scheduled correctly… does anyone have a hint about what could be causing this problem?

p.s.
The data transferred from host to device fits in both host and device memory.


Thanks in advance,

Enrico

Hi Enrico,

It could be a problem in your code, or it could be a problem with the PGI-generated device code. Can you send me the code so I can investigate? If so, please send it to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me.

You can try running the CPU version under Valgrind (www.valgrind.org) to see if anything pops out.

  • Mat

Hi,
thanks for your reply.

I ran the CPU version under Valgrind as you suggested, and no issues were found apart from a couple of warnings which, as far as I understand, relate to the allocation/deallocation of large memory blocks and are not evidence of code errors.

==18340== Memcheck, a memory error detector
==18340== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==18340== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==18340== Command: ./test-acc-host
==18340== 
==18340== Warning: set address range perms: large range [0x3aeed080, 0xf12ed080) (undefined)

==18340== Warning: set address range perms: large range [0x3aeed068, 0xf12ed098) (noaccess)

==18340== 
==18340== HEAP SUMMARY:
==18340==     in use at exit: 0 bytes in 0 blocks
==18340==   total heap usage: 31 allocs, 31 frees, 3,312,523,517 bytes allocated
==18340== 
==18340== All heap blocks were freed -- no leaks are possible
==18340== 
==18340== For counts of detected and suppressed errors, rerun with: -v
==18340== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 6 from 6)

Anyhow, I am going to send you the code…

Thanks in advance,

Enrico