I’ve added error checks to all CUDA calls and found that the program fails at the very first CUDA call, see below:
cudaErrorUnknown: unknown error
If we run the program under cuda-memcheck, it additionally reports error 999, e.g., Program hit cudaErrorUnknown (error 999) due to “unknown error” on CUDA API call to cudaHostAlloc.
The error goes away if we either 1) remove __launch_bounds__(512, 2) or 2) change the launch bound parameters to 256, 2.
I am unable to connect the dots at this point. I cannot recall encountering cudaErrorUnknown in more than a dozen years of CUDA programming.
What CUDA version is this, and what is the driver version? What GPU are you running on and for which architecture are you compiling the code? Can you post a minimal reproducer code (a small program others can cut & paste, compile, and run) that demonstrates the issue?
A general approach to weird error scenarios is to update to the latest CUDA version and driver package to make sure one is not exposed to bugs that are known and fixed. Then double check that the target architecture(s) specified at compile time cover the GPU(s) one intends to run on, because the compiler default for the architecture rarely matches one’s actual requirements and one would want to eliminate possible issues that might arise from JIT compilation.
Why does a call to the div routine not need registers to be saved to and restored from the stack?
Many modern calling conventions on processors of all kinds are mostly register-based and use the stack only in particular circumstances (e.g. many function arguments, or struct arguments and / or returns), i.e. rarely. As far as I am aware, calling conventions for GPUs are not publicly documented.
If a call happens, registers should be pushed. In the call to div, no registers are saved in the div callee function; the caller only passes arguments to it.
That is a misconception. Register-based calling conventions typically reserve a number of registers to pass arguments to the called subroutine, and these do not have to be preserved. A typical number would be 4 to 7. Additional registers may be defined as scratch space that the subroutine likewise does not need to preserve. For subroutines with modest register requirements (a typical number would be 6 to 8) it is therefore possible that no registers need to be saved at the start of a subroutine and restored later. The CALL / RETURN mechanism itself may use the stack to store the return address, but some architectures use a link register instead.
As I stated, the calling convention(s) for NVIDIA GPUs are not (to my knowledge) publicly documented. They may change over time between CUDA versions, might be different for different GPU architectures, and there might be more than one calling convention depending on whether a function is a built-in function or a user-defined one. With some targeted experiments this information could be reverse engineered.
It might be instructive to study the calling conventions used for x86-64 or ARM, both of which are register based.