32 bit float compressed to 16 bit in CUDA

Hi,

I want to compress my 32-bit floats to 16 bits or less. The float data are all in the range 0.0 to 1.0.

Of course CUDA has supported 16-bit floats since CUDA 7.5. However, my project only supports CUDA 5.5/6.0. I have tried to convert my project to 7.5/8.0/9.1; the compilation was OK, but the results were wrong, without any error message reported.

In my application, infinity, negative zero, etc. can be ignored.

I have tried the following ways:

  1. Convert the 32-bit float to the IEEE half-precision format following the samples in this link: Convert int to 16bit float (half precision floating point) in c++ - Stack Overflow

But unfortunately, the time consumption is very bad.

  2. Convert the 32-bit float to a 16-bit integer with the following code sample:
    #define SCALE_FACTOR 1024
    float fTest = 0.8f;
    // compress: scale into a 16-bit fixed-point value
    short int nCompress = fTest * SCALE_FACTOR;
    // decompress: divide by the scale factor again
    float fDecompress = __fdividef(nCompress, SCALE_FACTOR);

The time consumption is also not very good.

Is there any better way to compress a 32-bit float to 16 bits with high performance on CUDA 5.5/6.0? Thanks very much.

I assume you are aware that “compression” by simple FP32 → FP16 or FP32 → INT16 conversion will lose information, i.e. this is lossy compression.

(1) What GPU do you have?

(2) Do you need to convert the data on the host, on the device, or both?

(3) I am reasonably sure that CUDA 6.0 already had these device intrinsics: unsigned short __float2half_rn(float f); float __half2float(unsigned short h). A usage sketch follows after (4).

(4) Your current compression/decompression scheme should be very fast, so if this is too slow for your purposes, compression is probably not appropriate for your use case. If you replace ‘__fdividef(nCompress, SCALE_FACTOR)’ with ‘(1.0f/SCALE_FACTOR)*nCompress’ it should be faster still.
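
Regarding (3): if those intrinsics are indeed available in your toolkit, a minimal usage sketch could look like the following. The separate pack/unpack kernels and their names are just for illustration; in practice you would convert in place inside your existing kernels.

// Store FP32 data as 16-bit values in global memory and convert on the fly.
__global__ void packToHalf(const float *src, unsigned short *dst, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        dst[i] = __float2half_rn(src[i]);   // FP32 -> FP16 bit pattern (round to nearest even)
}

__global__ void unpackToFloat(const unsigned short *src, float *dst, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n)
        dst[i] = __half2float(src[i]);      // FP16 bit pattern -> FP32 (exact)
}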

Regarding the wrong results you saw after switching to CUDA 7.5/8.0/9.1: this indicates with high probability that your code contains a latent bug (e.g. a violation of the programming model or a race condition) which you may want to find.

njuffa, thanks for your kind reply.

Since a precision of 0.0001 is sufficient for me, and I will still use 32-bit floats for the actual calculation, the accuracy loss is acceptable for my application.

(1) I have a GTX 1060.

(2) I do not need to convert the data on the host. Since the memory bandwidth of the GTX 1060 is not high, and my project is memory-bandwidth limited, I have to copy data to global memory and back to shared memory/registers regularly. The NVPROFILER tool suggested that I optimize those memory copy statements, so I think it is not a bad idea to compress my large data structure, which holds tens of float values.

(3) In CUDA 6.0 I did not find the header file “cuda_fp16.h” that exists in CUDA 7.5/8.0/9.1, so I’m not sure which header file I should include.

(4) I’ll try that. But why is your formula faster than the __fdividef() intrinsic?

(5) In my programming model (particle transport simulation), each thread processes one particle until it is absorbed by the matter. There is no data exchange among threads. Finding the issue in CUDA 7.5/8.0/9.1 is giving me a headache.

Intrinsics __float2half_rn() and __half2float() used to be in device_functions.h, as I recall, and that file is a CUDA header file automatically included when you compile code in a .cu file with nvcc.

(1.0f/SCALE_FACTOR) should be evaluated at compile-time, resulting in a constant, so ‘(1.0f/SCALE_FACTOR)*nCompress’ should turn into a single multiply instruction, just like the compression step.
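
For illustration, the decompression helper would then boil down to this sketch (reusing SCALE_FACTOR from the snippet earlier in the thread):

#define SCALE_FACTOR 1024   // as in the earlier snippet

// (1.0f/SCALE_FACTOR) folds into a compile-time constant, so this
// compiles to a single multiply, just like the compression step.
__device__ __forceinline__ float decompress(short nCompress)
{
    return (1.0f / SCALE_FACTOR) * nCompress;
}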

Your latest statement that the code is memory bound seems to conflict with the earlier statement that inserting the simple compression / decompression steps shown resulted in a slowdown. Use of half-precision data should be effective in improving memory-bound computation.

Note that effective bandwidth will also depend on access patterns and possibly on access width (with wider accesses favored over narrower accesses). The cheapest loads are those that don’t have to be performed, so try to minimize data movement.
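
As a purely illustrative sketch of a wider access (hypothetical buffer layout with two 16-bit values packed per 32-bit word, using the unsigned-short form of __half2float mentioned above):

__global__ void readPacked(const unsigned int *packed, float *out, int nPairs)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < nPairs) {
        unsigned int w = packed[i];                              // one coalesced 32-bit load
        float a = __half2float((unsigned short)(w & 0xFFFFu));   // low 16 bits
        float b = __half2float((unsigned short)(w >> 16));       // high 16 bits
        out[2 * i]     = a;
        out[2 * i + 1] = b;
    }
}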

Got it, thanks very much. But it seems difficult to minimize data movement in my case. I’ll explain the whole process below, in case you are interested.

My model was divided into two kernel functions.

1> The first kernel function generates 10^8 particles with attributes (a structure) like:

struct ParticleInfo
{
    float energy;
    float weight;
    float velocityX;
    float velocityY;
    float velocityZ;
    float positionX;
    float positionY;
    float positionZ;
    int voxelPositionX;
    int voxelPositionY;
    int voxelPositionZ;
    int particleType;
    int absVoxelPosition;
    float laminMfp; 
    float prob;
    float voxelDensity;
    int voxelMaterial;
    int evtType;
    // ~20 variables ...... 
};

Because the total particle number is very large, I have to split the 10^8 particles into 1526 batches of 65536 (65536 * 1526 ≈ 10^8). The kernel function is invoked as:

generateParticle<<<iNumBlock, iNumThreadPerBlock>>>(pParticleQueue, 1);

where iNumBlock * iNumThreadPerBlock = 65536, which has proven to be efficient in this application.

In some cases 65536 particles will be put into the global memory pointed to by pParticleQueue. HOWEVER, some threads will not generate a valid particle, so in most cases only about 50000 particles are generated (the exact number varies randomly). In order to avoid leaving some threads idle in the next transport kernel, I invoke the above kernel function in the following way:

int nMaxParticle = 0;
static int nInvokeTime = 0;
while(true) {
    generateParticle<<<iNumBlock, iNumThreadPerBlock>>>(pParticleQueue, 1);
    cudaDeviceSynchronize();
    ++nInvokeTime;

    cudaMemcpy(&nMaxParticle, &(pParticleQueue->queueSize), sizeof(int), cudaMemcpyDeviceToHost);
    if(nMaxParticle < 65536) {
        if(nInvokeTime < 1526)
            continue;
        else
            break;
    }
}

In this manner, the global memory behind “pParticleQueue” is not accessed efficiently, because I treat it as a continuous circular deque and a thread only writes to it when it has generated a valid particle.

#define MAX_EVENT_SIZE (65536*2)
struct ParticleDeque
{
    float energy[MAX_EVENT_SIZE];
    float weight[MAX_EVENT_SIZE];
    // the same data as ParticleInfo, abbreviated
    unsigned int queueIndex;
    int queueSize;
    bool queueDirty[MAX_EVENT_SIZE]; 
};

__device__ bool SaveTransportInfoToQueue(ParticleDeque &particleDeque, ParticleInfo const &particleInfo)
{
    unsigned int index = atomicInc(&(particleDeque.queueIndex), MAX_EVENT_SIZE - 1);
    particleDeque.energy[index] = particleInfo.energy;
    // code to copy info from deviceParticleInfo to deviceParticleQueue, abbreviated
    particleDeque.queueDirty[index] = true;
    atomicAdd(&(particleDeque.queueSize), 1);
    return true;
}

__global__ void generateParticle(ParticleDeque *pParticleQueue, int nDequeDim)
{
    const int iThreadID = threadIdx.x + blockDim.x * blockIdx.x;
    // ParticleInfo may not be in register,
    // but in local memory because its size is ~20*4 = 80 bytes
    ParticleInfo particleInfo; 
    memset(&particleInfo, 0, sizeof(ParticleInfo));

    // code generate particle info, abbreviated
    
    int evtType = 0;
    // code to judge evtType, abbreviated

    if(0 != evtType){
        // some particles are invalid
        if(particleInfo.energy > 0.001f && particleInfo.weight > 0.001f) {
            SaveTransportInfoToQueue(pParticleQueue[0], particleInfo);
        }
    }
    return;
}

2> The second kernel function gets 65536 particles from the above circular queue. It is invoked like this:

int nMaxParticle = 0;
static int nInvokeTime = 0;
while(true) {
    generateParticle<<<iNumBlock, iNumThreadPerBlock>>>(pParticleQueue, 1);
    cudaDeviceSynchronize();
    ++nInvokeTime;

    cudaMemcpy(&nMaxParticle, &(pParticleQueue->queueSize), sizeof(int), cudaMemcpyDeviceToHost);
    if(nMaxParticle < 65536) {
        if(nInvokeTime < 1526)
            continue;
        else
            break;
    }
    transport<<<iNumBlock, iNumThreadPerBlock>>>(pParticleQueue, 1);
    cudaDeviceSynchronize();
}

__device__ void CopyQueueToTransportInfo(ParticleDeque &particleQueue, ParticleInfo &particleInfo, int index)
{
    particleInfo.energy = particleQueue.energy[index];
    // copy data from particleQueue to particleInfo, abbreviated
}

__device__ bool GetTransportInfoFromQueue(ParticleDeque &particleQueue, ParticleInfo &particleInfo)
{
    // initialize (let one thread per block reset the shared counters)
    __shared__ int nRequestNum;
    __shared__ int nLock;
    __shared__ unsigned int nThreadIDQueue[THREAD_NUM];
    if(0 == threadIdx.x) {
        nRequestNum = 0;
        nLock = 1;
    }
    nThreadIDQueue[threadIdx.x] = -1;
    __syncthreads();

    // get number required
    nThreadIDQueue[threadIdx.x] = atomicAdd(&nRequestNum, 1);
    __syncthreads();

    // update global memory
    __shared__ unsigned int nQueueIndex;
    if(0 != atomicExch(&nLock, 0)) {
        atomicSub(&particleQueue.queueSize, nRequestNum);
        atomicMax(&particleQueue.queueSize, 0);
        nQueueIndex = atomicSub(&particleQueue.queueIndex, nRequestNum);
    }
    __syncthreads();

    // update local memory
    unsigned int index;
    if(nRequestNum == blockDim.x) {
        index = nQueueIndex - threadIdx.x - 1;
    }
    else {
        index = nQueueIndex - nThreadIDQueue[threadIdx.x] - 1;
    }
    index = index & (MAX_EVENT_SIZE - 1);

    CopyQueueToTransportInfo(particleQueue, particleInfo, index);
    particleQueue.queueDirty[index] = false;
    //particleQueue.evtType[index] = -2;
    particleQueue.energy[index] = 0.f;

    return true;
}

__global__ void transport(ParticleDeque *pDeviceParticleQueue, int nDim)
{
    const int iThreadID = threadIdx.x + blockDim.x * blockIdx.x;
    ParticleInfo pDeviceParticleInfo;

    if(!GetTransportInfoFromQueue(*pDeviceParticleQueue, pDeviceParticleInfo)) {
        return;
    }

    // Each thread does transport individually, abbreviated

    return;
}

I haven’t worked with particle systems in many years, and don’t recall best practices for this use case. Your solution looks like it is designed for a CPU, with all that queuing goodness. I wonder whether you could improve the parallel processing by simply always processing all the particles and marking the ones that should be ignored in a given processing step.

That way, data access patterns should be regular. Some data will be loaded unnecessarily, and there will be a bit of thread divergence, but overall such an approach may still come out ahead if the percentage of ignored entries isn’t too high.
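
A minimal sketch of that idea (just an assumption about how the marking could be done, reusing the queueDirty flag from your structs as the per-slot valid marker):

__global__ void transportAll(ParticleDeque *q)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i >= MAX_EVENT_SIZE)
        return;
    if (!q->queueDirty[i])      // slot holds no valid particle this pass:
        return;                 // skip the work; accesses stay regular

    // ... transport the particle stored at slot i, abbreviated ...

    q->queueDirty[i] = false;   // free the slot once the particle is absorbed
}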

You may want to consider re-structuring your data a bit to potentially better fit the SIMD processing and memory access patterns…

Instead of letting each thread contain one ParticleInfo struct (which takes a lot of registers),

it could be beneficial to flip it:

// example:

// positions contiguous in memory:
x --> [,,,,,,,,,,,,,,,]
y --> [,,,,,,,,,,,,,,,]

// velocity vectors, in contiguous memory:
v_x --> [,,,,,,,,,,,,,,,]
v_y --> [,,,,,,,,,,,,,,,]

// on gpu:

__global__ void someKernel(ParticleInfo partInfo)
{
	int i = threadIdx.x + blockIdx.x*blockDim.x;

	float pos_x = partInfo.x[i];
	float pos_y = partInfo.y[i];

	float v_x = partInfo.vx[i];
	float v_y = partInfo.vy[i];
	// compute great things ...
}

I think what Jimmy Pettersson is advocating is an SOA (structure of arrays) arrangement of the data, which has been a standing recommendation in high-performance computing since before GPUs arrived on the scene. I recall Intel pushing it heavily at Intel Developer Conferences in the late 1990s.

Thanks for all your suggestions.

I recall that in the first version I used ParticleInfo* as the input to generateParticle; later, in the posted version, I used SOA as follows, which really improved the performance of the program.
struct ParticleDeque
{
    float energy[MAX_EVENT_SIZE];
    float weight[MAX_EVENT_SIZE];
    // the same data as ParticleInfo, abbreviated
    unsigned int queueIndex;
    int queueSize;
    bool queueDirty[MAX_EVENT_SIZE];
};