Convert FP32 to FP16 on the CPU and Transfer Only the FP16 Copy to the GPU


How can I convert my FP32 matrix to FP16 on the CPU and transfer only the converted version to the GPU? My CPU is a Xeon(R) Gold 6126 and my GPU is a V100.

I want to reduce memory usage and bandwidth, because I don't have enough space to keep both the FP32 and FP16 copies.

In a .cu file:

#include <cuda_fp16.h>
size_t size = ...;
float  *fp32matrix = new float[size];
__half *fp16matrix = new __half[size];
for (size_t i = 0; i < size; i++) fp16matrix[i] = __float2half(fp32matrix[i]);
delete[] fp32matrix;   // the FP32 copy is no longer needed
__half *dmatrix;
cudaMalloc(&dmatrix, size*sizeof(dmatrix[0]));
cudaMemcpy(dmatrix, fp16matrix, size*sizeof(dmatrix[0]), cudaMemcpyHostToDevice);
delete[] fp16matrix;   // the host FP16 staging copy can go after the transfer

Newer x86-64 CPUs have SIMD instructions for float <-> half conversions (with selectable rounding mode) which should be a lot faster for bulk conversions than calling __float2half(), the host version of which for x86-64 platforms appears to be implemented using integer instructions (per disassembly of the generated code).

To use the SIMD instruction VCVTPS2PH and its counterpart VCVTPH2PS, check the CPUID feature flags for the F16C extension. Intrinsics for these have likely been defined and made available via immintrin.h, but I have not needed this functionality and have not looked for them.

From what I can find on the internet, the Xeon(R) Gold 6126 processor is based on the Skylake architecture and should have support for the F16C extension.

Intel’s intrinsic guide shows the following intrinsics:

__m128 _mm_cvtph_ps (__m128i a) // vcvtph2ps
__m256 _mm256_cvtph_ps (__m128i a) // vcvtph2ps
__m128i _mm_cvtps_ph (__m128 a, int imm8) // vcvtps2ph
__m128i _mm256_cvtps_ph (__m256 a, int imm8) // vcvtps2ph


Thank you. I used this method in my code and it is working fine (a little slow). However, I have a problem with matrices larger than 48000. I am seeing this error:

Segmentation fault (core dumped)

I have modified this line:

for (size_t i = 0; i < size; i++) fp16matrix[i] = __float2half(fp32matrix[i]);

to use a 64-bit integer index, to be able to handle large index values.

My current input is in double:

for (int64_t ic=0; ic<size_hm; ic++){

I tested it; it works for me with large matrix sizes. I am not able to explain why you are getting a seg fault. The usual debugging approach for a seg fault is to first identify the line of code that is causing it.

Note that size_t is a 64-bit unsigned integer type on the platforms relevant to CUDA. On the platforms where it isn't, switching to int64_t won't help for indexing purposes either.

$ cat t2091.cu
#include <cuda_fp16.h>

int main(){
  size_t size = 96000;
  double *fp64matrix = new double[size];
  __half *fp16matrix = new __half[size];
  for (size_t i = 0; i < size; i++) fp16matrix[i] = __double2half(fp64matrix[i]);
  delete[] fp64matrix;
  __half *dmatrix;
  cudaMalloc(&dmatrix, size*sizeof(dmatrix[0]));
  cudaMemcpy(dmatrix, fp16matrix, size*sizeof(dmatrix[0]), cudaMemcpyHostToDevice);
  return 0;
}

$ nvcc -o t2091 t2091.cu
$ compute-sanitizer ./t2091
========= ERROR SUMMARY: 0 errors

Thanks. Your code is for a vector. Does it work for a matrix also?

I think I understand my code's problem. Let's test it.

Any proper n-dimensional matrix that uses contiguous storage (thus, a single allocation) can be treated as a 1D vector for the purpose of data transfer.

If, however, you are referring to a commonly encountered data structure consisting of a vector of row pointers to separately allocated rows (or column pointers to separately allocated columns), that is not possible; such a data structure is not a matrix in the strict sense of the word and should not be called one.