Illegal address during kernel execution for large grids

Hi,
for the same code described in another post of mine (https://forums.developer.nvidia.com/t/from-four-nested-loops-to-3d-blocks/134127/1) I noticed a strange error which may be related to the one reported by another user in this post: https://forums.developer.nvidia.com/t/how-dose-pgi-manage-collapse-clause/134079/1

The error I am speaking about is the following one:

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

and it occurs only for problem sizes above a certain threshold.

As shown in my previous post, my code contains 4 nested loops. The resulting executable runs fine and produces correct results up to a size of 48x48x48x38; increasing any of the four dimensions beyond that produces the error above.
The same error is produced by both approach A and approach B from my previous post: https://forums.developer.nvidia.com/t/from-four-nested-loops-to-3d-blocks/134127/1

The same code, compiled for the host, runs fine on the CPU for any problem size that fits in memory.

Running cuda-memcheck on a GPU executable built with approach B gave me the following output (truncated here: it continues with the same kind of errors for other threads…).

========= CUDA-MEMCHECK
========= Program hit error 201 on CUDA API call to cuCtxAttach 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/libcuda.so (cuCtxAttach + 0x182) [0x144de2]
=========     Host Frame:./test-acc [0x12cd9]
=========
========= Program hit error 1 on CUDA API call to cuPointerGetAttribute 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/libcuda.so (cuPointerGetAttribute + 0x1b5) [0x13a9e5]
=========     Host Frame:./test-acc [0x15d46]
=========
========= Invalid __global__ read of size 8
=========     at 0x00000c10 in Deo_123_gpu
=========     by thread (479,0,0) in block (0,0,0)
=========     Address 0x228b681df8 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so (cuLaunchKernel + 0x331) [0x138291]
=========     Host Frame:./test-acc [0x14b01]
=========
========= Invalid __global__ read of size 8
=========     at 0x00000c10 in Deo_123_gpu
=========     by thread (478,0,0) in block (0,0,0)
=========     Address 0x228b681de8 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so (cuLaunchKernel + 0x331) [0x138291]
=========     Host Frame:./test-acc [0x14b01]
=========
========= Invalid __global__ read of size 8
=========     at 0x00000c10 in Deo_123_gpu
=========     by thread (477,0,0) in block (0,0,0)
=========     Address 0x228b681dd8 is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib64/libcuda.so (cuLaunchKernel + 0x331) [0x138291]
=========     Host Frame:./test-acc [0x14b01]

...

Given that the same code performs no out-of-bounds accesses on the CPU, and that it runs correctly on the GPU for smaller dimensions, I suspect that for larger dimensions the threads are somehow not scheduled correctly… does anyone have a hint about what could be causing this problem?

p.s.
The data transferred from host to device fits in both host and device memory.


Thanks in advance,

Enrico

Hi Enrico,

It could be a problem in your code, or it could be a problem with the PGI-generated device code. Can you send me the code so I can investigate? If so, please send it to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me.

You can try running the CPU version under Valgrind (www.valgrind.org) to see if anything pops out.

  • Mat

Hi,
thanks for your reply.

I ran the CPU version under Valgrind as you suggested, and no issues were found apart from a couple of warnings which, as far as I understand, relate to the allocation/deallocation of large memory blocks and are not evidence of code errors.

==18340== Memcheck, a memory error detector
==18340== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
==18340== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
==18340== Command: ./test-acc-host
==18340== 
==18340== Warning: set address range perms: large range [0x3aeed080, 0xf12ed080) (undefined)

==18340== Warning: set address range perms: large range [0x3aeed068, 0xf12ed098) (noaccess)

==18340== 
==18340== HEAP SUMMARY:
==18340==     in use at exit: 0 bytes in 0 blocks
==18340==   total heap usage: 31 allocs, 31 frees, 3,312,523,517 bytes allocated
==18340== 
==18340== All heap blocks were freed -- no leaks are possible
==18340== 
==18340== For counts of detected and suppressed errors, rerun with: -v
==18340== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 6 from 6)

Anyhow, I am going to send you the code…

Thanks in advance,

Enrico