My card definitely supports the arch I compiled for, sm_70. I am working on a large scientific program that uses relocatable device code (done under CMake), and have not encountered this issue before. There is one small kernel I have which is using an automatically generated device move constructor from “= default”. Upon launch, I get a 0x92 error from cudaLaunchKernel, i.e. “invalid device function”.
I’m really confused by this. If every source file was compiled for sm_70, how could I possibly get this error at run time? No other kernels seem to suffer from this. I’m stumped on how to debug this.
This other post investigated the generated fatbin files to verify the correct arch code was generated. I’ve done the same and do see sm_70 is what I’ve got in the linked library I generated:
thebeast% cuobjdump -ptx lib/libopenmc.so
Fatbin elf code:
================
arch = sm_70
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
OK, nice. Now I try dumping my sass for the whole shared library I’ve created. A related kernel that I’m NOT running into the issue with is here, nice. Here are the first few lines:
Function : _ZN6openmc30run_move_constructor_on_deviceINS_19IncoherentElasticXSEEEvPT_
.headerflags @"EF_CUDA_SM70 EF_CUDA_PTX_SM(EF_CUDA_SM70)"
/*0000*/ MOV R1, c[0x0][0x28] ; /* 0x00000a0000017a02 */
/* 0x003fde0000000f00 */
/*0010*/ @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ; /* 0x000000fffffff389 */
/* 0x000fe200000e00ff */
/*0020*/ IADD3 R1, R1, -0x20, RZ ; /* 0xffffffe001017810 */
/* 0x003fde0007ffe0ff */
/*0030*/ S2R R0, SR_LMEMHIOFF ; /* 0x0000000000007919 */
/* 0x00321e0000003700 */
/*0040*/ ISETP.GE.U32.AND P0, PT, R1, R0, PT ; /* 0x000000000100720c */
/* 0x003fde0003f06070 */
/*0050*/ @P0 BRA 0x70 ; /* 0x0000001000000947 */
/* 0x003fde0003800000 */
/*0060*/ BPT.TRAP 0x1 ; /* 0x000000040000795c */
/* 0x003fde0000300000 */
/*0070*/ IADD3 R0, R1, RZ, RZ ; /* 0x000000ff01007210 */
So, I’m looking for one called run_move_constructor_on_device<IncoherentInelasticAE>. Grepping for a regex that might reveal the mangled name for that has not turned up anything. What’s strange is that the template instantiation that generated the above SASS code (run_move_constructor_on_device<IncoherentElasticXS>) is in the same source file where I should be instantiating run_move_constructor_on_device<IncoherentInelasticAE>. A possible compiler bug?
Here’s what the kernel looks like.
// This sets up vtables in device memory
template<typename T>
__global__ void run_move_constructor_on_device(T* __restrict__ dist)
{
// TODO: it may be faster to allocate a buffer in memory here.
static_assert(std::is_move_constructible<T>::value,
"Polymorphic objects to be put on device must be move constructible.");
char buffer[sizeof(T) + alignof(T)];
char* aligned_buffer =
buffer + alignof(T) - reinterpret_cast<intptr_t>(buffer) % alignof(T);
T* tmp = new (aligned_buffer) T(std::move(*dist));
new (dist) T(std::move(*tmp));
}
Yes I know that polymorphism is not optimal to use. But all this does is set up the vtables on the device. I could also share some details on what “T” here is, but am not sure it would help. Source code is here. Unfortunately it is big and I can provide guidance as necessary.
PS: I need to push some new commits to this to reproduce this error.
thebeast% nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0