I am computing 2^x (aka ex2). For the rest of my CUDA kernel, I am using half2 vectorized instructions, so I would like to use the half2 ex2 instruction. I am using an NVIDIA A10 GPU.
In the ptx manual, it says the ex2.approx.f16x2 can run ex2 on a half2 data type.
Using CUDA
I was thinking maybe the h2exp2() cuda function would call the ex2.approx.f16x2 instruction.
See the below code example:
#include <cuda_fp16.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
__global__ void myKernel_half2(__half2* x){
auto block = cg::this_thread_block();
int32_t gid = block.group_index().x;
int32_t tid = block.thread_index().x;
int32_t idx = gid*256 + tid;
half2 input = x[idx];
half2 output;
output = h2exp2(x[idx]);
x[idx] = output;
}
I compile the above with -gencode=arch=compute_86,code=sm_86 and this is the SASS that I see:
MOV R1, c[0x0][0x28]
S2R R2, SR_CTAID.X
MOV R5, 0x4
ULDC.64 UR4, c[0x0][0x118]
S2R R3, SR_TID.X
LEA R2, R2, R3, 0x8
IMAD.WIDE R2, R2, R5, c[0x0][0x160]
LDG.E R0, [R2.64]
HADD2.F32 R4, -RZ, R0.reuse.H0_H0
HADD2.F32 R0, -RZ, R0.H1_H1
MUFU.EX2 R4, R4
MUFU.EX2 R0, R0
FFMA R5, R4, 5.9604644775390625e-08, R4
FFMA R6, R0, 5.9604644775390625e-08, R0
F2FP.PACK_AB R5, R6, R5
STG.E [R2.64], R5
EXIT
Unfortunately, this SASS is not calling ex2.approx.f16x2, and it is doing something more complicated. It converts the numbers to fp32, runs two calls to EX2 in fp32, and converts back to fp16.
Using PTX
I also tried it using inline PTX assembly.
#include <cuda_fp16.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;
__global__ void myKernel_half2(__half2* x){
auto block = cg::this_thread_block();
int32_t gid = block.group_index().x;
int32_t tid = block.thread_index().x;
int32_t idx = gid*256 + tid;
half2 input = x[idx];
half2 output;
asm ("ex2.approx.f16x2 %0, %1;" : "=f"(output) : "f"(input));
x[idx] = output;
}
However, this fails with a compile error:
<source>(19): error: an asm operand must have scalar type
asm ("ex2.approx.f16x2 %0, %1;" : "=f"(output) : "f"(input));
^
<source>(19): error: an asm operand must have scalar type
asm ("ex2.approx.f16x2 %0, %1;" : "=f"(output) : "f"(input));
^
2 errors detected in the compilation of "<source>".
Compiler returned: 1
Is there a way that I can get my CUDA kernel to use the ex2.approx.f16x2 instruction?