I’m able to store fp16 values in a CUDA texture (via a surface mapping the underlying array) and sample them back, as in the following snippet:
// Store a float4 as four packed fp16 values; note the surface x coordinate is in bytes.
__device__ void setTextureValue(const float4 value, int32_t x, int32_t y, int32_t z) {
  surf3Dwrite<ushort4>(
      make_ushort4(
          __half_as_ushort(__float2half_rn(value.x)),
          __half_as_ushort(__float2half_rn(value.y)),
          __half_as_ushort(__float2half_rn(value.z)),
          __half_as_ushort(__float2half_rn(value.w))),
      surface_,
      x * sizeof(ushort4),  // byte offset along x
      y,
      z);
}

// Sample with hardware trilinear filtering; the 0.5f offsets address texel centers.
__device__ float4 sampleTextureValue(float x, float y, float z) const {
  return tex3D<float4>(texture_, x + 0.5f, y + 0.5f, z + 0.5f);
}
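For context, surface_ and texture_ map the same underlying cudaArray. A rough sketch of the host-side setup I have in mind (the member and dimension names here are placeholders, and the address/filter modes may differ in my actual code):

// Sketch of host-side setup: one fp16x4 cudaArray, written through a surface
// object and sampled through a texture object with hardware trilinear filtering.
cudaChannelFormatDesc desc =
    cudaCreateChannelDesc(16, 16, 16, 16, cudaChannelFormatKindFloat);  // half4 texels
cudaArray_t array;
cudaMalloc3DArray(&array, &desc, make_cudaExtent(dimX, dimY, dimZ),
                  cudaArraySurfaceLoadStore);

cudaResourceDesc resDesc = {};
resDesc.resType = cudaResourceTypeArray;
resDesc.res.array.array = array;

cudaSurfaceObject_t surface_;
cudaCreateSurfaceObject(&surface_, &resDesc);

cudaTextureDesc texDesc = {};
texDesc.addressMode[0] = cudaAddressModeClamp;
texDesc.addressMode[1] = cudaAddressModeClamp;
texDesc.addressMode[2] = cudaAddressModeClamp;
texDesc.filterMode = cudaFilterModeLinear;   // hardware interpolation
texDesc.readMode = cudaReadModeElementType;  // fp16 texels come back as float
texDesc.normalizedCoords = 0;

cudaTextureObject_t texture_;
cudaCreateTextureObject(&texture_, &resDesc, &texDesc, nullptr);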
This works, and makes use of the texture unit to perform hardware interpolation. However, I think it’s not as efficient as it could be. The first issue I see is that there is no way to return a __half4 value from the texture sampling, forcing conversion from the float4 back into a __half4 for further downstream processing. This is cheap enough that it’s not a huge deal (at the moment) though.
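Concretely, the kind of repacking I mean downstream looks something like this (consumeSample is just a made-up illustration):

#include <cuda_fp16.h>

// Illustration only: repack the sampled float4 into __half2 pairs for further
// fp16 math. The rounding back to half is the step I'd like to avoid.
__device__ void consumeSample(cudaTextureObject_t texture_, float x, float y, float z) {
  float4 s = tex3D<float4>(texture_, x + 0.5f, y + 0.5f, z + 0.5f);
  __half2 xy = __floats2half2_rn(s.x, s.y);  // float -> fp16 round for lanes x, y
  __half2 zw = __floats2half2_rn(s.z, s.w);  // float -> fp16 round for lanes z, w
  // ... continue with packed __half2 arithmetic on xy / zw ...
}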
The second (arguably more substantial) issue is that I think potential compute is being wasted. In particular, Pascal introduced FPUs that can process either one FP32 value or one __half2 value with a single instruction, i.e. each FPU can process two fp16 values at once. I’m not sure whether the same is true for whatever compute units perform the hardware interpolation, but assuming it is, what code would I need to write to take advantage of it during texture interpolation?
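To be clear, on the ALU side the packed path I mean is the usual __half2 intrinsics (sm_53 and up), e.g.:

#include <cuda_fp16.h>

// Each call maps to a single instruction operating on both fp16 lanes at once.
__device__ __half2 fma2(__half2 a, __half2 b, __half2 c) {
  return __hfma2(a, b, c);  // (a.x * b.x + c.x, a.y * b.y + c.y)
}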
Assuming interpolation is performed on 4x __half2 values, the return type would be either float8 (assuming the upsampling to float needs to happen for output) or, more ideally, uint4, such that each underlying interpolated __half2 could be recovered with a simple reinterpret_cast<const __half2&>(sampled_uint4.x /* or .y, .z, .w */).
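To make that concrete, here is the (purely hypothetical) recovery I have in mind, assuming some filtered fetch handed back the raw fp16 bits packed into a uint4:

#include <cuda_fp16.h>

// Hypothetical: no tex3D overload returns this today. Given the raw interpolated
// fp16 bits packed into a uint4, each 32-bit lane is just a __half2 view.
__device__ __half2 unpackAndSum(uint4 sampled) {
  __half2 h01 = reinterpret_cast<const __half2&>(sampled.x);
  __half2 h23 = reinterpret_cast<const __half2&>(sampled.y);
  __half2 h45 = reinterpret_cast<const __half2&>(sampled.z);
  __half2 h67 = reinterpret_cast<const __half2&>(sampled.w);
  return __hadd2(__hadd2(h01, h23), __hadd2(h45, h67));  // example downstream math
}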