Trouble using std::tuple in CUDA device code compiled with nvcc

I believe CUDA 9.2 supports std::tuple according to

But I often encounter weird memory errors when device code that uses std::tuple is compiled with nvcc 9.2 or 10, while the same code works fine when compiled with clang 7 against CUDA 9.2 or 10. cuda-gdb can only trace the errors back to the kernel launch, but no further.

The other weird thing is that replacing std::tuple with thrust::tuple avoids the error entirely. But thrust::tuple is not a variadic template, so it is of limited use in my application.
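
To make the variadic requirement concrete, here is a minimal sketch of a variadic tuple built as a plain recursive aggregate, assuming C++14. The names `simple_tuple`, `simple_get`, `simple_tuple_getter`, `simple_tuple_demo` and the macro `SIMPLE_TUPLE_HD` are hypothetical and come from neither Thrust, Kokkos, nor the standard library. It is not the code that triggers the error, and it has not been verified under nvcc; it only shows the shape of a type with no user-provided constructors or assignment operators.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical macro: marks functions as callable from host and device when
// compiled as CUDA, and expands to nothing for a plain C++ compiler.
#ifdef __CUDACC__
#define SIMPLE_TUPLE_HD __host__ __device__
#else
#define SIMPLE_TUPLE_HD
#endif

// Hypothetical variadic tuple: a recursive aggregate of (head, tail).
template <typename... Ts>
struct simple_tuple {};  // empty tuple terminates the recursion

template <typename Head, typename... Tail>
struct simple_tuple<Head, Tail...> {
    Head head;
    simple_tuple<Tail...> tail;
};

// Helper that walks I steps down the tail chain, then returns head.
template <std::size_t I>
struct simple_tuple_getter {
    template <typename Tuple>
    SIMPLE_TUPLE_HD static constexpr auto& get(Tuple& t) {
        return simple_tuple_getter<I - 1>::get(t.tail);
    }
};

template <>
struct simple_tuple_getter<0> {
    template <typename Tuple>
    SIMPLE_TUPLE_HD static constexpr auto& get(Tuple& t) {
        return t.head;
    }
};

// Element access for mutable and const tuples.
template <std::size_t I, typename... Ts>
SIMPLE_TUPLE_HD constexpr auto& simple_get(simple_tuple<Ts...>& t) {
    return simple_tuple_getter<I>::get(t);
}

template <std::size_t I, typename... Ts>
SIMPLE_TUPLE_HD constexpr const auto& simple_get(const simple_tuple<Ts...>& t) {
    return simple_tuple_getter<I>::get(t);
}

// Because it is an aggregate of trivial members, the type is trivially copyable.
static_assert(
    std::is_trivially_copyable<simple_tuple<int, float, double>>::value,
    "simple_tuple of trivial types should be trivially copyable");

// Small host-side check: build a tuple, modify one element, read values back.
inline int simple_tuple_demo() {
    // Aggregate initialization with brace elision fills head/tail in order.
    simple_tuple<int, float, double> t{1, 2.5f, 3.0};
    simple_get<0>(t) = 4;
    assert(simple_get<1>(t) == 2.5f);
    return simple_get<0>(t) + static_cast<int>(simple_get<2>(t));  // 4 + 3
}
```

Whether a type like this would behave any differently from std::tuple under nvcc is exactly the open question; the sketch only illustrates what "variadic" means for my use case.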

I haven’t had the time to isolate a minimal reproducing example because the code uses Kokkos, which in turn uses the CUDA backend. But here’s an example of the issue:

The main point of this post is to ask whether there are any known limitations of nvcc with std::tuple, especially when shared memory is involved. If someone can comment on this, I’d also like to know why using std::tuple vs. thrust::tuple makes a difference. Thanks in advance!