Using cuda-memcheck

Hi all,

In a previous post I mentioned that my Fortran code generates NaN (not a number) errors in the middle of execution. I used the cuda-memcheck tool to diagnose the problem. I am not familiar with it, so I am posting what I get when I run it on my executable (Quick5.exe): 12 severe errors.
I am compiling the code with the PGF 13.9 Fortran compiler (and CUDA Toolkit 5.0) in Microsoft VS 2010.

========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\cudart64_50_35.dll (cudaLaunch + 0x1a9) [0x234c9]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (kernels_getreynvarqnj_kernel_ + 0x2a0) [0x4a510]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (reyneq3_ + 0x1dc0) [0x87a50]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (vcycle_ + 0x3c29) [0x98239]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (fullmult_ + 0x74d) [0x989fd]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (initcasepreadapt_ + 0x2e8) [0x6bb58]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (MAIN_ + 0x7ca4) [0x67954]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (main + 0x70) [0x10e0]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (__tmainCRTStartup + 0x136) [0x11e6e6]
========= Host Frame:C:\Windows\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x1a) [0x1832]
========= Host Frame:C:\Windows\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x5d609]

========= Invalid local write of size 8
========= at 0x00000190 in kernels_getreynvarqnj_kernel_
========= by thread (4,12,0) in block (0,2,0)
========= Address 0x00fffc08 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:C:\Windows\SYSTEM32\nvcuda.dll (cuLaunchKernel + 0x1b2) [0xe042]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\cudart64_50_35.dll [0x3706]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\cudart64_50_35.dll (cudaLaunch + 0x1a9) [0x234c9]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (kernels_getreynvarqnj_kernel_ + 0x2a0) [0x4a510]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (reyneq3_ + 0x1dc0) [0x87a50]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (vcycle_ + 0x3c29) [0x98239]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (fullmult_ + 0x74d) [0x989fd]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (initcasepreadapt_ + 0x2e8) [0x6bb58]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (MAIN_ + 0x7ca4) [0x67954]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (main + 0x70) [0x10e0]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (__tmainCRTStartup + 0x136) [0x11e6e6]
========= Host Frame:C:\Windows\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x1a) [0x1832]
========= Host Frame:C:\Windows\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x5d609]

========= Invalid local write of size 8
========= at 0x00000190 in kernels_getreynvarqnj_kernel_
========= by thread (3,12,0) in block (0,2,0)
========= Address 0x00fffc08 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:C:\Windows\SYSTEM32\nvcuda.dll (cuLaunchKernel + 0x1b2) [0xe042]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\cudart64_50_35.dll [0x3706]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\cudart64_50_35.dll (cudaLaunch + 0x1a9) [0x234c9]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (kernels_getreynvarqnj_kernel_ + 0x2a0) [0x4a510]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (reyneq3_ + 0x1dc0) [0x87a50]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (vcycle_ + 0x3c29) [0x98239]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (fullmult_ + 0x74d) [0x989fd]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (initcasepreadapt_ + 0x2e8) [0x6bb58]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (MAIN_ + 0x7ca4) [0x67954]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (main + 0x70) [0x10e0]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (__tmainCRTStartup + 0x136) [0x11e6e6]
========= Host Frame:C:\Windows\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x1a) [0x1832]
========= Host Frame:C:\Windows\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x5d609]

========= Invalid local write of size 8
========= at 0x00000190 in kernels_getreynvarqnj_kernel_
========= by thread (2,12,0) in block (0,2,0)
========= Address 0x00fffc08 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:C:\Windows\SYSTEM32\nvcuda.dll (cuLaunchKernel + 0x1b2) [0xe042]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\cudart64_50_35.dll [0x3706]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\cudart64_50_35.dll (cudaLaunch + 0x1a9) [0x234c9]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (kernels_getreynvarqnj_kernel_ + 0x2a0) [0x4a510]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (reyneq3_ + 0x1dc0) [0x87a50]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (vcycle_ + 0x3c29) [0x98239]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (fullmult_ + 0x74d) [0x989fd]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (initcasepreadapt_ + 0x2e8) [0x6bb58]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (MAIN_ + 0x7ca4) [0x67954]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (main + 0x70) [0x10e0]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (__tmainCRTStartup + 0x136) [0x11e6e6]
========= Host Frame:C:\Windows\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x1a) [0x1832]
========= Host Frame:C:\Windows\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x5d609]

========= Invalid local write of size 8
========= at 0x00000190 in kernels_getreynvarqnj_kernel_
========= by thread (1,12,0) in block (0,2,0)
========= Address 0x00fffc08 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:C:\Windows\SYSTEM32\nvcuda.dll (cuLaunchKernel + 0x1b2) [0xe042]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\cudart64_50_35.dll [0x3706]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\cudart64_50_35.dll (cudaLaunch + 0x1a9) [0x234c9]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (kernels_getreynvarqnj_kernel_ + 0x2a0) [0x4a510]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (reyneq3_ + 0x1dc0) [0x87a50]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (vcycle_ + 0x3c29) [0x98239]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (fullmult_ + 0x74d) [0x989fd]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (initcasepreadapt_ + 0x2e8) [0x6bb58]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (MAIN_ + 0x7ca4) [0x67954]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (main + 0x70) [0x10e0]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (__tmainCRTStartup + 0x136) [0x11e6e6]
========= Host Frame:C:\Windows\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x1a) [0x1832]
========= Host Frame:C:\Windows\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x5d609]

========= Invalid local write of size 8
========= at 0x00000190 in kernels_getreynvarqnj_kernel_
========= by thread (0,12,0) in block (0,2,0)
========= Address 0x00fffc08 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:C:\Windows\SYSTEM32\nvcuda.dll (cuLaunchKernel + 0x1b2) [0xe042]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\cudart64_50_35.dll [0x3706]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\cudart64_50_35.dll (cudaLaunch + 0x1a9) [0x234c9]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (kernels_getreynvarqnj_kernel_ + 0x2a0) [0x4a510]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (reyneq3_ + 0x1dc0) [0x87a50]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (vcycle_ + 0x3c29) [0x98239]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (fullmult_ + 0x74d) [0x989fd]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (initcasepreadapt_ + 0x2e8) [0x6bb58]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (MAIN_ + 0x7ca4) [0x67954]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (main + 0x70) [0x10e0]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (__tmainCRTStartup + 0x136) [0x11e6e6]
========= Host Frame:C:\Windows\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x1a) [0x1832]
========= Host Frame:C:\Windows\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x5d609]

========= Program hit error 30 on CUDA API call to cudaThreadSynchronize
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\Windows\SYSTEM32\nvcuda.dll (cuProfilerStop + 0xa0432) [0xbfc12]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\cudart64_50_35.dll (cudaThreadSynchronize + 0x218) [0x1e1b8]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (cudathreadsynchronize_ + 0x12) [0xaa312]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (reyneq3_ + 0x1dc8) [0x87a58]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (vcycle_ + 0x3c29) [0x98239]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (fullmult_ + 0x74d) [0x989fd]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (initcasepreadapt_ + 0x2e8) [0x6bb58]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (MAIN_ + 0x7ca4) [0x67954]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (main + 0x70) [0x10e0]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (__tmainCRTStartup + 0x136) [0x11e6e6]
========= Host Frame:C:\Windows\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x1a) [0x1832]
========= Host Frame:C:\Windows\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x5d609]

========= Program hit error 30 on CUDA API call to cudaLaunch
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\Windows\SYSTEM32\nvcuda.dll (cuProfilerStop + 0xa0432) [0xbfc12]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\cudart64_50_35.dll (cudaLaunch + 0x2a5) [0x235c5]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (kernels_getreynvarak_kernel_ + 0x36e) [0x4a88e]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (reyneq3_ + 0x207e) [0x87d0e]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (vcycle_ + 0x3c29) [0x98239]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (fullmult_ + 0x74d) [0x989fd]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (initcasepreadapt_ + 0x2e8) [0x6bb58]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (MAIN_ + 0x7ca4) [0x67954]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (main + 0x70) [0x10e0]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (__tmainCRTStartup + 0x136) [0x11e6e6]
========= Host Frame:C:\Windows\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x1a) [0x1832]
========= Host Frame:C:\Windows\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x5d609]

========= Program hit error 30 on CUDA API call to cudaThreadSynchronize
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\Windows\SYSTEM32\nvcuda.dll (cuProfilerStop + 0xa0432) [0xbfc12]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\cudart64_50_35.dll (cudaLaunch + 0x2a5) [0x235c5]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (kernels_getreynvarak_kernel_ + 0x36e) [0x4a88e]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (reyneq3_ + 0x207e) [0x87d0e]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (vcycle_ + 0x3c29) [0x98239]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (fullmult_ + 0x74d) [0x989fd]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (initcasepreadapt_ + 0x2e8) [0x6bb58]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (MAIN_ + 0x7ca4) [0x67954]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (main + 0x70) [0x10e0]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (__tmainCRTStartup + 0x136) [0x11e6e6]
========= Host Frame:C:\Windows\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x1a) [0x1832]
========= Host Frame:C:\Windows\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x5d609]

========= Program hit error 30 on CUDA API call to cudaThreadSynchronize
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\Windows\SYSTEM32\nvcuda.dll (cuProfilerStop + 0xa0432) [0xbfc12]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\cudart64_50_35.dll (cudaThreadSynchronize + 0x218) [0x1e1b8]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (cudathreadsynchronize_ + 0x12) [0xaa312]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (reyneq3_ + 0x2086) [0x87d16]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (vcycle_ + 0x3c29) [0x98239]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (fullmult_ + 0x74d) [0x989fd]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (initcasepreadapt_ + 0x2e8) [0x6bb58]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (MAIN_ + 0x7ca4) [0x67954]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (main + 0x70) [0x10e0]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (__tmainCRTStartup + 0x136) [0x11e6e6]
========= Host Frame:C:\Windows\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x1a) [0x1832]
========= Host Frame:C:\Windows\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x5d609]

========= Program hit error 30 on CUDA API call to cudaMemcpy
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\Windows\SYSTEM32\nvcuda.dll (cuProfilerStop + 0xa0432) [0xbfc12]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\cudart64_50_35.dll (cudaMemcpy + 0x2ae) [0x27dae]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (pgf90_dev_copyout + 0x4c) [0xa727c]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (reyneq3_ + 0x21d3) [0x87e63]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (vcycle_ + 0x3c29) [0x98239]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (fullmult_ + 0x74d) [0x989fd]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (initcasepreadapt_ + 0x2e8) [0x6bb58]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (MAIN_ + 0x7ca4) [0x67954]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (main + 0x70) [0x10e0]
========= Host Frame:C:\Users\Dolf\Desktop\quick 5 test results\run\Quick5.exe (__tmainCRTStartup + 0x136) [0x11e6e6]
========= Host Frame:C:\Windows\system32\KERNEL32.DLL (BaseThreadInitThunk + 0x1a) [0x1832]
========= Host Frame:C:\Windows\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x5d609]

========= ERROR SUMMARY: 12 errors

Any ideas? Which ones are the 12 errors I need to fix?

Thanks,
Dolf

Hi Dolf,

The out-of-bounds errors are bad and should be fixed. Easiest thing to do would be to compile in emulation mode (-Mcuda=emu) and add bounds checking (-Mbounds). Hopefully this will show the same error and the exact spot where it occurs.

Program hit error 30 on CUDA API call to cudaThreadSynchronize

I believe this means an out-of-bounds error in shared memory, so it may just be a continuation of the same error.

  • Mat

What does that mean?
How come I get that error even though I check for the right threads at the beginning of the kernel subroutine?


attributes (global) subroutine GetReynVarqnj_kernel(nx,ny,ndx,ndy, &
      iqpo,p,hnew,hjmin,hjmax,cohjmx,s,l,kd,zdatLow,qndatLow)

   implicit none
   integer :: i, j, k
   integer, value :: nx,ny,ndx,ndy,s,l,kd,iqpo
   real(8) :: qnj(nx,ny)
   real(8) :: zdatLow(s), qndatLow(s)
   real(8) :: zdatMid(l), qndatMid(l)
   real(8) :: zdatHigh(kd), qndatHigh(kd)
   real(8) :: p(nx,ny)
   real(8) :: hnew(ndx,ndy),hjmin(ndx,ndy),hjmax(ndx,ndy), &
      cohjmx(ndx,ndy)
   integer :: n(2)

   i = (blockidx%x - 1) * blockDim%x + threadidx%x
   j = (blockidx%y - 1) * blockDim%y + threadidx%y

   n(1) = size(p,1) - 1
   n(2) = size(p,2)

   if (i .ge. 2 .AND. i .le. n(1) ) then
      if (j .ge. 2 .AND. j .le. n(2) ) then

Easiest thing to do would be to compile in emulation mode (-Mcuda=emu) and add bounds checking (-Mbounds).

I tried this option in release mode with bounds checking, but it takes an awfully long time to run (it gets stuck in the middle). Maybe I should do it in debug instead? Should I run it under cuda-memcheck, or just by itself?

Thanks,

Dolf

What does that mean?

An out-of-bounds error means that you are accessing memory (either reading or writing) beyond the number of elements in the array. At best this is benign; at worst it will give you wrong answers or cause a memory access violation.

How come I get that error even though I check for the right threads at the beginning of the kernel subroutine?

You have other arrays besides p; the out-of-bounds reference could be coming from one of these. Check the sizes of the other arrays (s, l, kd, ndx, ndy) and whether they are being accessed outside these ranges.
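
To make that check concrete, here is a minimal host-side sketch in plain Fortran (no GPU needed). It converts the zero-based thread and block coordinates that cuda-memcheck printed, thread (4,12,0) in block (0,2,0), into the kernel's 1-based i and j, then tests them against the guard on p and against the (ndx,ndy) extents. The 16 x 16 block size, all extent values, and the helper names are made-up placeholders, not taken from the real program; substitute the values your host code actually passes.

```fortran
program check_extents
  implicit none
  ! Hypothetical launch configuration: 16 x 16 threads per block.
  integer, parameter :: bdx = 16, bdy = 16
  ! Hypothetical extents; replace with the values passed to the kernel.
  integer, parameter :: nx = 34, ny = 66, ndx = 32, ndy = 40
  integer :: i, j

  ! cuda-memcheck reports zero-based thread and block coordinates.
  call to_fortran_index(4, 12, 0, 2, bdx, bdy, i, j)
  print '(a,i0,a,i0)', 'i = ', i, ', j = ', j

  ! The guard posted in the kernel only looks at the extents of p(nx,ny).
  print '(a,l1)', 'passes guard on p:       ', passes_guard(i, j, nx, ny)
  ! hnew, hjmin, hjmax and cohjmx are dimensioned (ndx,ndy) instead.
  print '(a,l1)', 'inside (ndx,ndy) arrays: ', &
       in_range(i, 1, ndx) .and. in_range(j, 1, ndy)

contains

  ! Same formula as the kernel, written for zero-based inputs:
  ! (blockidx - 1) * blockDim + threadidx with blockidx = b0 + 1, threadidx = t0 + 1.
  subroutine to_fortran_index(tx0, ty0, bx0, by0, bdimx, bdimy, ii, jj)
    integer, intent(in)  :: tx0, ty0, bx0, by0, bdimx, bdimy
    integer, intent(out) :: ii, jj
    ii = bx0 * bdimx + tx0 + 1
    jj = by0 * bdimy + ty0 + 1
  end subroutine to_fortran_index

  ! True when lo <= idx <= hi.
  logical function in_range(idx, lo, hi)
    integer, intent(in) :: idx, lo, hi
    in_range = idx >= lo .and. idx <= hi
  end function in_range

  ! Mirrors the posted guard: 2 <= i <= size(p,1) - 1 and 2 <= j <= size(p,2).
  logical function passes_guard(ii, jj, n1, n2)
    integer, intent(in) :: ii, jj, n1, n2
    passes_guard = in_range(ii, 2, n1 - 1) .and. in_range(jj, 2, n2)
  end function passes_guard

end program check_extents
```

With these placeholder numbers the thread passes the guard on p yet falls outside the (ndx,ndy) arrays, which is the kind of mismatch to look for. The same test applies to s, l and kd for any index used on the 1-D arrays. Since the report says "local write", arrays declared inside the kernel rather than passed as arguments may be worth checking as well.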

maybe I should do it in debug instead?

Debug mode uses emulation mode as well, so it would be no better. On-device debugging will be available early next year.

or just by itself??

You can go back to the old debugging method, i.e. print statements. However, printing from the device isn’t formatted, so the output from several threads can be intermixed. Still, that would be my next step.

  • Mat