I’m seeing a cuda error from “cudaGetLastError” of “invalid device function” when I have a kernel that uses a function returning a pointer, which resides in shared memory [the data it points to does not]. I am not using a lot of shared memory – I can declare (and write to, so the compiler doesn’t optimize it away) a 14kb array of shared memory [probably more, but it did fail at 16k, as it should].
I’m not exactly sure why this happens; it seems to mess up other kernels that don’t use the code (even when the kernel returning the value from shared mem never returns).
If there are any nVidia employees who would like to look at it, I would be happy to email all of the code for a [possible] bug. It’s very possible there are errors in my code, but the error should be made less cryptic, and kernels shouldn’t be able to mess eachother up.
A snippet of the code that doesn’t work is:
__device__ Range *copy_range_ps(Range *r) {
__shared__ Range *next;
SERIAL {
next = NEXT_RANGE(r, num_bounds);
*next = *r;
next->history_.clear();
next->previous_ = r;
}
__syncthreads();
PAR_FOR(a, num_bounds) {
next->values_[a] = r->values_[a];
}
__syncthreads();
return next;
}
If I change it to the following, it stops killing other kernels.
/// copies a range on the "stack".
__device__ Range *copy_range_ps(Range *r) {
Range *next;
next = NEXT_RANGE(r, num_bounds);
SERIAL {
*next = *r;
next->history_.clear();
next->previous_ = r;
}
__syncthreads();
PAR_FOR(a, num_bounds) {
next->values_[a] = r->values_[a];
}
__syncthreads();
return next;
}
The following calling code seems to be causing the problem; interestingly, other call sites don’t seem to affect it.
__device__ Range *restore_checkpoint_u(int *msg_enc, int *bits_enc) {
__shared__ Range *next;
POSTSYNC_SERIAL(next_cp = (next_cp - 1);
*bits_enc = next_cp->bits_enc_);
next = copy_range_ps(next_cp->previous_);
follow_history_u(next_cp->history_, next);
return next;
}
Thanks in advance.