Hi,
For the same code described in another post of mine, "From four nested loops to 3D blocks", I noticed a strange error which may be related to the one reported by another user in the post "how dose PGI manage collapse clause".
The error I am seeing is the following:
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
and it occurs only above a certain problem size.
As shown in my previous post, the code contains four nested loops (a minimal sketch of their structure is given below). The executable runs fine and produces correct results up to a size of 48x48x48x38; further increasing any one of the four dimensions produces the error above.
The same error appears with both the A and B approaches described in that post.
The same code compiled for the host runs fine on the CPU for any problem size that fits in memory.
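For context, here is a minimal sketch of the loop structure (C with OpenACC; the array name, element type, loop body, and data clause are placeholders, and the actual code with the A/B gang/vector mappings is in the linked post):

#include <stdlib.h>

/* Placeholder dimensions: the largest size that still works;
   bumping any one of them triggers the illegal-address error. */
#define N1 48
#define N2 48
#define N3 48
#define N4 38

int main(void)
{
    double *a = malloc(sizeof(double) * N1 * N2 * N3 * N4);

    /* Four nested loops collapsed onto the GPU; the body here is
       only a placeholder for the real update. */
    #pragma acc parallel loop collapse(4) copy(a[0:N1*N2*N3*N4])
    for (int i = 0; i < N1; ++i)
        for (int j = 0; j < N2; ++j)
            for (int k = 0; k < N3; ++k)
                for (int l = 0; l < N4; ++l)
                    a[((i*N2 + j)*N3 + k)*N4 + l] *= 2.0;

    free(a);
    return 0;
}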
Running cuda-memcheck against an executable built for the GPU with approach B gives the output below (truncated here; it continues with the same kind of errors for other threads).
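The run was a plain invocation, something along these lines (I have not copied the exact command here):

cuda-memcheck ./test-acc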
========= CUDA-MEMCHECK
========= Program hit error 201 on CUDA API call to cuCtxAttach
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib64/libcuda.so (cuCtxAttach + 0x182) [0x144de2]
========= Host Frame:./test-acc [0x12cd9]
=========
========= Program hit error 1 on CUDA API call to cuPointerGetAttribute
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib64/libcuda.so (cuPointerGetAttribute + 0x1b5) [0x13a9e5]
========= Host Frame:./test-acc [0x15d46]
=========
========= (the cuPointerGetAttribute error above is repeated five more times)
=========
========= Invalid __global__ read of size 8
========= at 0x00000c10 in Deo_123_gpu
========= by thread (479,0,0) in block (0,0,0)
========= Address 0x228b681df8 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/libcuda.so (cuLaunchKernel + 0x331) [0x138291]
========= Host Frame:./test-acc [0x14b01]
=========
========= Invalid __global__ read of size 8
========= at 0x00000c10 in Deo_123_gpu
========= by thread (478,0,0) in block (0,0,0)
========= Address 0x228b681de8 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/libcuda.so (cuLaunchKernel + 0x331) [0x138291]
========= Host Frame:./test-acc [0x14b01]
=========
========= Invalid __global__ read of size 8
========= at 0x00000c10 in Deo_123_gpu
========= by thread (477,0,0) in block (0,0,0)
========= Address 0x228b681dd8 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/libcuda.so (cuLaunchKernel + 0x331) [0x138291]
========= Host Frame:./test-acc [0x14b01]
...
Given that the same code performs no out-of-bounds accesses when running on the CPU, and that it runs correctly on the GPU for smaller sizes, my guess is that for larger sizes the threads are somehow not scheduled correctly. Does anyone have a hint about what could be causing this problem?
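For what it's worth, the sizes involved look harmless on paper; a quick check like the one below (plain C, assuming just one double per point of the 4D grid, which certainly underestimates the real footprint) shows that the collapsed trip count is nowhere near any 32-bit limit, and gives a lower bound on the data size:

#include <stdio.h>

int main(void)
{
    /* Largest working size; bumping any dimension by one fails. */
    long n1 = 48, n2 = 48, n3 = 48, n4 = 38;

    long trips = n1 * n2 * n3 * n4;            /* collapsed iterations */
    long bytes = trips * (long)sizeof(double); /* one double per point */

    printf("collapsed trip count: %ld\n", trips);  /* 4202496 */
    printf("bytes (1 double per point): %ld (~%.1f MB)\n",
           bytes, bytes / 1e6);                    /* ~33.6 MB */
    return 0;
}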
p.s.
The data transferred from host to device fits in both the host and the device memory.
Thanks in advance,
Enrico