cryptic 'invalid device function'... when returning value from shared mem

I’m seeing a CUDA error of “invalid device function” from cudaGetLastError when a kernel uses a function that returns a pointer residing in shared memory (the data it points to does not). I am not using a lot of shared memory: I can declare (and write to, so the compiler doesn’t optimize it away) a 14 KB array of shared memory (probably more, but it did fail at 16 KB, as it should).
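For reference, the per-block shared-memory limit can be queried at runtime rather than guessed at. This is a minimal sketch using the standard CUDA runtime API (not part of my original code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    // sharedMemPerBlock is the hard per-block limit (16 KB on early hardware)
    printf("shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```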

I’m not exactly sure why this happens; it seems to break other kernels that don’t use the code, even when the kernel that returns the value from shared memory never runs.

If there are any NVIDIA employees who would like to look at it, I would be happy to email all of the code for a [possible] bug report. It’s very possible there are errors in my code, but the error message should be made less cryptic, and kernels shouldn’t be able to mess each other up.

A snippet of the code that doesn’t work is:

__device__ Range *copy_range_ps(Range *r) {
    __shared__ Range *next;
    SERIAL {
        next = NEXT_RANGE(r, num_bounds);
        *next = *r;
        next->history_.clear();
        next->previous_ = r;
    }
    __syncthreads();
    PAR_FOR(a, num_bounds) {
        next->values_[a] = r->values_[a];
    }
    __syncthreads();
    return next;
}

If I change it to the following, it stops killing other kernels.

/// Copies a range on the "stack".
__device__ Range *copy_range_ps(Range *r) {
    Range *next;
    next = NEXT_RANGE(r, num_bounds);
    SERIAL {
        *next = *r;
        next->history_.clear();
        next->previous_ = r;
    }
    __syncthreads();
    PAR_FOR(a, num_bounds) {
        next->values_[a] = r->values_[a];
    }
    __syncthreads();
    return next;
}

The following calling code seems to be causing the problem; interestingly, other call sites don’t seem to affect it.

__device__ Range *restore_checkpoint_u(int *msg_enc, int *bits_enc) {
    __shared__ Range *next;
    POSTSYNC_SERIAL(next_cp = (next_cp - 1);
                    *bits_enc = next_cp->bits_enc_);
    next = copy_range_ps(next_cp->previous_);
    follow_history_u(next_cp->history_, next);
    return next;
}

Thanks in advance.

Also, any other kernel fails, including an empty

__global__ void emptykernel() { }

launched with 1 block x 1 thread. cudaGetLastError does not report an error before the kernel is invoked.
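For completeness, here is a minimal sketch of the launch-and-check pattern I mean (the error-checking calls are standard CUDA runtime API; the empty kernel just stands in for any kernel):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptykernel() { }

int main() {
    emptykernel<<<1, 1>>>();               // 1 block x 1 thread
    cudaError_t err = cudaGetLastError();  // reports launch-time errors
    if (err != cudaSuccess)
        printf("launch: %s\n", cudaGetErrorString(err));
    err = cudaThreadSynchronize();         // reports execution-time errors
    if (err != cudaSuccess)
        printf("sync: %s\n", cudaGetErrorString(err));
    return 0;
}
```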

I also forgot to mention that doing a cudaMemcpyFromSymbol before any kernels run results in ‘invalid device symbol’.
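The copy in question looks roughly like this (a sketch; the `__device__` declaration of `num_bounds` here is a hypothetical stand-in for whatever symbol is being read):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int num_bounds;  // hypothetical device symbol for illustration

int main() {
    int host_val = 0;
    // copies sizeof(int) bytes from the device symbol into host memory;
    // this is the call that returns 'invalid device symbol' for me
    cudaError_t err = cudaMemcpyFromSymbol(&host_val, num_bounds, sizeof(int));
    if (err != cudaSuccess)
        printf("%s\n", cudaGetErrorString(err));
    return 0;
}
```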

To add to the confusion, I did find a temporary workaround, in a completely different function that was not even called by the copy-range function. (I could also comment out one call site of copy_range_ps and it would work, but I really do need the copy range...) Lucky that I was keeping this project in svn.

Deleting the two “if (…) atomicMax / atomicMin” blocks below seems to make it work. Fortunately, I don’t seem to need them immediately, but it would be nice to find out what was wrong. The pointers are not in shared memory, and I have other atomic function code in other kernels which seems to work fine.

__device__ void follow_history_u(History history, Range *r) {
    for (int hist_idx = 0; hist_idx < history.get_count(); hist_idx++) {
        SERIAL {
            r->history_.add(history[hist_idx]);
        }
        PAR_FOR(a, num_bounds) {
            if (history[hist_idx]) {
                r->values_[a].lower_bound_ = __fadd_rn(r->lb(a), -0.5f);
                if (r->values_[a].lower_bound_ <= 0.f) {
                    atomicMax(&(r->min_idx_), a);
                }
            }
            r->values_[a].lower_bound_ = __fmul_rn(r->lb(a), 2.f);
            if (r->lb(a) > 1.f) {
                atomicMin(&(r->upper_bound_idx_), a);
            }
        }
    }
}
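One thing worth checking, since the atomics are implicated (this is an assumption on my part, not something I have confirmed): global-memory atomics like atomicMax/atomicMin require compute capability 1.1, so if the file is compiled without the right -arch flag, the driver can reject the resulting kernel with exactly this “invalid device function” error. Something like:

```shell
# tell nvcc to target sm_11 so atomic instructions are legal in the generated code
nvcc -arch=sm_11 -o myprog myprog.cu
```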