I am trying to pass an n-by-n-by-n structure to a kernel function. I sometimes get an illegal memory access error when I copy the memory from device to host.
The size of 6291456 bytes corresponds to n = 64: each element of the structure holds three 8-byte integers/doubles, so 64 x 64 x 64 x 3 x 8 = 6291456.
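As a quick sanity check on that arithmetic, here is a minimal C sketch of the size computation. The struct layout and field names are assumptions for illustration only (the post says just that each element has three 8-byte integer/double fields); the real derived type may differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical element layout: three 8-byte fields, as described in the post.
   The field names are placeholders, not taken from the original code. */
typedef struct {
    int64_t id;
    double  a;
    double  b;
} grid_cell;

/* Total number of bytes needed for an n-by-n-by-n grid of grid_cell. */
size_t grid_bytes(size_t n) {
    return n * n * n * sizeof(grid_cell);
}
```

With this layout, grid_bytes(64) gives 6291456, matching the size in the error message, so the transfer size itself looks consistent with n = 64.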
The puzzling thing is that the error occurs intermittently at runtime. Sometimes the program runs fine but then fails when I run it under cuda-memcheck.
I previously had a bug where the true size of the structure was not extracted correctly, but this time I read in the structure and extract the dimensions using size(grid, dim), where dim is the dimension to be extracted.
The output from cuda-memcheck consists mostly of messages like the following:
========= Invalid __global__ read of size 8
========= at 0x0002f628 in photon_module_shoot_photon_
========= by thread (31,0,0) in block (0,0,0)
========= Address 0x103164011f0 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
...
Your help is greatly appreciated, as I am not familiar with memcheck and could not decipher the error messages. I would also appreciate being able to send in the code and error messages, if possible.
I have forwarded your question for a possible answer.
I anticipate they will ask if you can provide a program that demonstrates the problem. Should you not want to share it with the Forum, you can send it to trs@pgroup.com.
An “illegal memory access” is a generic error, similar to a seg fault on the host, in which a bad address is referenced on the device. It can occur under several circumstances, with the most common causes being accessing memory beyond the end of an array, accessing a host pointer on the device, a single object using more than 2GB of memory without the -Mlarge_array flag, or using too much heap or stack space on the device.
Note that the error is most likely occurring in the kernel before the memcpy (as seen in the cuda-memcheck output), not in the memcpy itself.
Given the cuda-memcheck error and the fact that the code succeeds sometimes, my best guess is that you have an out-of-bounds error.
Is this OpenACC or CUDA Fortran? In OpenACC I’d recommend running the binary with Valgrind (www.valgrind.org) to see if it finds any memory errors. For CUDA Fortran, compile in emulation mode (-Mcuda=emu) and then run under Valgrind.
Indeed, the error happened in the kernel, before the actual memory copy took place. I was just unsure which approach to use to avoid the out-of-bounds error.
I have sent my CUDA Fortran code and error messages to trs@pgroup.com. In the meantime, I will try emulation mode plus Valgrind, as Mat suggested.
I took a look at the code. I think the problem is that you’re using an uninitialized value as the index into an array. Hence the failure is intermittent depending upon what value happens to be in the variable. I sent more details in my email response.
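To illustrate the failure mode described here, the following is a minimal C sketch of a bounds guard for a 1-based, column-major (Fortran-order) n-by-n-by-n grid. The function names are hypothetical and this is not the poster's code. It only shows how validating an index before use turns a garbage (for example, uninitialized) index into a detectable error instead of an out-of-bounds read.

```c
/* Returns 1 if (i, j, k) is a valid 1-based index into an n-by-n-by-n grid,
   0 otherwise. Hypothetical helper, for illustration only. */
int in_bounds(long i, long j, long k, long n) {
    return i >= 1 && i <= n &&
           j >= 1 && j <= n &&
           k >= 1 && k <= n;
}

/* Column-major (Fortran-order) linear offset for 1-based indices.
   Returns -1 when any index is out of range, so the caller can detect
   a bad index before touching memory. */
long linear_offset(long i, long j, long k, long n) {
    if (!in_bounds(i, j, k, n)) {
        return -1;
    }
    return (i - 1) + n * ((j - 1) + n * (k - 1));
}
```

For n = 64, valid offsets run from 0 (for index 1,1,1) to 262143 (for index 64,64,64). An index of 0 or 65 in any dimension returns -1. A similar check inside the kernel, just before the array access, can help locate which index is bad. Note that a guard cannot catch an uninitialized index that happens to hold an in-range value, so the real fix is still to initialize the variable.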