I am computing 2^x (aka ex2
). For the rest of my CUDA kernel, I am using half2 vectorized instructions, so I would like to use the half2 ex2 instruction. I am using an NVIDIA A10 GPU.
In the ptx manual, it says the ex2.approx.f16x2
can run ex2
on a half2 data type.
Using CUDA
I was thinking maybe the h2exp2()
cuda function would call the ex2.approx.f16x2
instruction.
See the below code example:
#include <cuda_fp16.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
__global__ void myKernel_half2(__half2* x){
auto block = cg::this_thread_block();
int32_t gid = block.group_index().x;
int32_t tid = block.thread_index().x;
int32_t idx = gid*256 + tid;
half2 input = x[idx];
half2 output;
output = h2exp2(x[idx]);
x[idx] = output;
}
I compile the above with -gencode=arch=compute_86,code=sm_86
and this is the SASS that I see:
MOV R1, c[0x0][0x28]
S2R R2, SR_CTAID.X
MOV R5, 0x4
ULDC.64 UR4, c[0x0][0x118]
S2R R3, SR_TID.X
LEA R2, R2, R3, 0x8
IMAD.WIDE R2, R2, R5, c[0x0][0x160]
LDG.E R0, [R2.64]
HADD2.F32 R4, -RZ, R0.reuse.H0_H0
HADD2.F32 R0, -RZ, R0.H1_H1
MUFU.EX2 R4, R4
MUFU.EX2 R0, R0
FFMA R5, R4, 5.9604644775390625e-08, R4
FFMA R6, R0, 5.9604644775390625e-08, R0
F2FP.PACK_AB R5, R6, R5
STG.E [R2.64], R5
EXIT
Unfortunately, this SASS is not calling ex2.approx.f16x2
, and it is doing something more complicated. It converts the numbers to fp32, runs two calls to EX2 in fp32, and converts back to fp16.
Using PTX
I also tried it using inline PTX assembly.
#include <cuda_fp16.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
__global__ void myKernel_half2(__half2* x){
auto block = cg::this_thread_block();
int32_t gid = block.group_index().x;
int32_t tid = block.thread_index().x;
int32_t idx = gid*256 + tid;
half2 input = x[idx];
half2 output;
asm ("ex2.approx.f16x2 %0, %1;" : "=f"(output) : "f"(input));
x[idx] = output;
}
However, this fails with a compile error:
<source>(19): error: an asm operand must have scalar type
asm ("ex2.approx.f16x2 %0, %1;" : "=f"(output) : "f"(input));
^
<source>(19): error: an asm operand must have scalar type
asm ("ex2.approx.f16x2 %0, %1;" : "=f"(output) : "f"(input));
^
2 errors detected in the compilation of "<source>".
Compiler returned: 1
Is there a way that I can get my CUDA kernel to use the ex2.approx.f16x2
instruction?