Unspecified launch failure


I’ve recently run into the following error:

Unspecified launch failure

from cudaGetErrorString(cudaGetLastError())

I realize that this usually indicates the GPU equivalent of a “segmentation fault” (an out-of-bounds memory access inside the kernel), but I can’t explain it that way either. Here are snippets of the code in question:

The kernel:

    attributes(global) subroutine fft_kernel( Nq1, Nq2, Ngrid, Na, Nmode, Nind, Nline, AqqRealDev, AqqImgDev, phaseDev, TermIndexDev, AindDev )

        implicit none
        integer, parameter :: nspace = 3
        integer, value  :: Ngrid, Na, Nmode, Nind, Nq1, Nq2, Nline
        integer :: ii, jj, kk, inz
        real*4                              :: phasefactor
        real*4, dimension(Ngrid, Na, Na) :: phaseDev
        real*4, dimension(Nline,0:2*nspace) :: TermIndexDev
        real*4, dimension(-Nind:Nind) :: AindDev
        real*4, dimension(Nmode,Nmode,Nmode) :: AqqRealDev
        real*4, dimension(Nmode,Nmode,Nmode) :: AqqImgDev
        real*4                          :: tmp

        inz = blockIdx%x * blockDim%x + threadIdx%x
        ii = TermIndexDev(inz, 1)
        jj = TermIndexDev(inz, 2)
        kk = TermIndexDev(inz, 3)

        phasefactor = phaseDev(Nq1, TermIndexDev(inz, 4), TermIndexDev(inz, 6)) + phaseDev(Nq2, TermIndexDev(inz, 5), TermIndexDev(inz, 6))

        tmp = AindDev(TermIndexDev(inz, 0)) * cos(phasefactor)
        AqqRealDev(ii, jj, kk) = AqqRealDev(ii,jj,kk) + tmp

    end subroutine fft_kernel

I’ve verified that the code executes up until the last assignment to AqqRealDev. I tried changing the last few lines to:

    tmp = AindDev(TermIndexDev(inz, 0)) * cos(phasefactor)
    tmp2 = AqqRealDev(ii,jj,kk)
    tmp3 = tmp2 + tmp
    AqqRealDev(ii,jj,kk) = tmp3

If I comment out the last line, the code executes without error. If I run it as above, I get the unspecified launch failure again.

Ideas? Have I missed something glaringly obvious?

Further information…

I compiled in device emulation mode and checked everything in the debugger, where it all looked fine: all of my device variables have reasonable memory addresses and can be indexed. When I run the compiled CUDA Fortran code on the GPU, it executes the kernel, returns from it, and then seg faults while copying from device back to host. Since kernel launches are asynchronous, I can only assume that the fault actually happens inside the kernel and only surfaces at the copy, but I can’t figure out where. It looks like it may be an off-by-one error or a stray out-of-bounds index somewhere. I’ll post more after I narrow it down.
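One thing I still need to rule out is the thread index itself. CUDA Fortran numbers blockIdx%x and threadIdx%x from 1, so I wrote a small host-only sketch to see which rows of TermIndexDev the expression in my kernel would touch. The sizes below (nline, block_dim, grid_dim) are hypothetical values chosen for illustration, not my real launch configuration, and the loops only emulate the indices on the CPU.

```fortran
program index_check
    implicit none
    ! Hypothetical sizes, chosen only for illustration (not my real launch).
    integer, parameter :: nline = 10
    integer, parameter :: block_dim = 4
    integer, parameter :: grid_dim = (nline + block_dim - 1) / block_dim
    integer :: b, t, inz_posted, inz_shifted
    integer :: lo_posted, hi_posted, lo_shifted, hi_shifted

    lo_posted  = huge(0)
    hi_posted  = 0
    lo_shifted = huge(0)
    hi_shifted = 0

    ! Emulate CUDA Fortran's 1-based blockIdx%x (b) and threadIdx%x (t).
    do b = 1, grid_dim
        do t = 1, block_dim
            inz_posted  = b * block_dim + t         ! expression from the kernel
            inz_shifted = (b - 1) * block_dim + t   ! usual 1-based global index
            lo_posted   = min(lo_posted, inz_posted)
            hi_posted   = max(hi_posted, inz_posted)
            lo_shifted  = min(lo_shifted, inz_shifted)
            hi_shifted  = max(hi_shifted, inz_shifted)
        end do
    end do

    print '(a,i0,a,i0,a,i0)', 'posted:  ', lo_posted, ' to ', hi_posted, ', valid rows 1 to ', nline
    print '(a,i0,a,i0,a,i0)', 'shifted: ', lo_shifted, ' to ', hi_shifted, ', valid rows 1 to ', nline
end program index_check
```

With these made-up sizes, the posted expression covers rows 5 to 16, so it skips the first block of rows and runs well past row 10. The shifted form covers 1 to 12, which still overshoots when Nline is not a multiple of the block size. If this turns out to be my problem, the kernel would need `inz = (blockIdx%x - 1) * blockDim%x + threadIdx%x` followed by `if (inz > Nline) return` before any use of TermIndexDev.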

As always, tips are appreciated.