Hi Forum,
I am trying to use inline PTX asm to implement ldg128 and stg128 for CUDA global memory access. Here is the main code for this part:
__device__ __forceinline__ void ldg128(const __half2* addr, __half2 &reg0, __half2 &reg1, __half2 &reg2, __half2 &reg3){
asm volatile(
"ld.global.nc.v4.b32 {%0, %1, %2, %3}, [%4];\n"
: "=r"(__HALF2_TO_UI(reg0)),
"=r"(__HALF2_TO_UI(reg1)),
"=r"(__HALF2_TO_UI(reg2)),
"=r"(__HALF2_TO_UI(reg3))
: "l"(addr)
);
}
__device__ __forceinline__ void stg128(__half2* addr, __half2 &reg0, __half2 &reg1, __half2 &reg2, __half2 &reg3) {
asm volatile(
"st.global.v4.b32 [%0], {%1, %2, %3, %4};\n"
:
: "l"(addr),
"r"(__HALF2_TO_UI(reg0)),
"r"(__HALF2_TO_UI(reg1)),
"r"(__HALF2_TO_UI(reg2)),
"r"(__HALF2_TO_UI(reg3))
);
}
To compile, I have included
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cuda_fp16.hpp>
and added my CUDA arch in CMake (so functions like __hfma and __hfma2 work fine). When I compile the code above, I get the following errors:
common.cu(41): error: identifier "__HALF2_TO_UI" is undefined
common.cu(41): error: an asm operand must have scalar type
common.cu(42): error: an asm operand must have scalar type
common.cu(43): error: an asm operand must have scalar type
common.cu(44): error: an asm operand must have scalar type
May I know what I did wrong? I ran cat on /usr/local/cuda/include/cuda_fp16.hpp and found the definition of __HALF2_TO_UI, but it seems the compiler cannot find it…
Thank you for your help!
Chengzhe
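
For reference, here is one possible workaround, offered as a sketch rather than a definitive fix. It assumes that __HALF2_TO_UI is an internal helper that cuda_fp16.hpp #undefs again at the end of the header (which would explain why user code cannot see it), so it avoids the macro and passes plain unsigned int temporaries to the asm instead. In inline asm the operands are numbered in the order they are listed, outputs first, so the load uses %0..%3 for the registers and %4 for the address. The copy128 kernel and the roundtrip_ok helper are hypothetical names added only to exercise the two functions; the address is assumed to be 16-byte aligned, and the data read through ld.global.nc is assumed not to be written during the kernel.

```cuda
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cstring>

// Load four __half2 values (16 bytes) with one read-only vectorized load.
// Assumption: addr is 16-byte aligned and the data is not written by this kernel.
__device__ __forceinline__ void ldg128(const __half2* addr,
                                       __half2& reg0, __half2& reg1,
                                       __half2& reg2, __half2& reg3) {
    unsigned int r0, r1, r2, r3;  // scalar temporaries satisfy the "r" constraint
    asm volatile(
        "ld.global.nc.v4.b32 {%0, %1, %2, %3}, [%4];\n"
        : "=r"(r0), "=r"(r1), "=r"(r2), "=r"(r3)  // outputs are %0..%3
        : "l"(addr));                             // the address is %4
    // Reinterpret the 32-bit patterns as __half2, as the macro would have done.
    reg0 = *reinterpret_cast<__half2*>(&r0);
    reg1 = *reinterpret_cast<__half2*>(&r1);
    reg2 = *reinterpret_cast<__half2*>(&r2);
    reg3 = *reinterpret_cast<__half2*>(&r3);
}

// Store four __half2 values (16 bytes) with one vectorized store.
__device__ __forceinline__ void stg128(__half2* addr,
                                       const __half2& reg0, const __half2& reg1,
                                       const __half2& reg2, const __half2& reg3) {
    asm volatile(
        "st.global.v4.b32 [%0], {%1, %2, %3, %4};\n"
        :
        : "l"(addr),
          "r"(*reinterpret_cast<const unsigned int*>(&reg0)),
          "r"(*reinterpret_cast<const unsigned int*>(&reg1)),
          "r"(*reinterpret_cast<const unsigned int*>(&reg2)),
          "r"(*reinterpret_cast<const unsigned int*>(&reg3))
        : "memory");
}

// Hypothetical test kernel: each thread copies four __half2 values.
__global__ void copy128(const __half2* in, __half2* out) {
    int i = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    __half2 a, b, c, d;
    ldg128(in + i, a, b, c, d);
    stg128(out + i, a, b, c, d);
}

// Hypothetical host helper: copies raw 32-bit patterns through the kernel
// and reports whether they come back bit-for-bit identical.
bool roundtrip_ok() {
    const int n = 4 * 32;  // 32 threads, four __half2 values each
    unsigned int h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) { h_in[i] = 0x3C003C00u + i; h_out[i] = 0; }

    __half2 *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) { cudaFree(d_in); return false; }

    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    copy128<<<1, 32>>>(d_in, d_out);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess && std::memcmp(h_in, h_out, sizeof(h_in)) == 0;
}
```

Calling roundtrip_ok() from main should be enough to check that both functions compile and move the data unchanged. Using local temporaries rather than a copy of the macro keeps the asm operands as ordinary scalar lvalues, which is what the "an asm operand must have scalar type" error is asking for.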