Coalescing memory in short-to-float conversion

Is there a technique I can use to speed up int16_t-to-float conversion?
The memory access is the trouble: 16 bits are loaded per element but 32 bits are stored.
Is there any way around this?

Cast your int16_t pointer to an int32_t pointer, have each thread load one 32-bit int, split it into two 16-bit ints, convert them to float, and write them into a shared memory buffer. Then call __syncthreads() and have a second loop write the floats out in order.

This might be clearer in code (not tested):

#include <stdint.h>

// Assumptions:
//   * input and output contain n elements
//   * n is even
//   * kernel is invoked with sizeof(float)*2*blockDim.x bytes of shared memory
//   * gridDim.x * blockDim.x >= n/2
__global__ void int16_to_float(int n, int16_t *input, float *output)
{
  extern __shared__ float temp[];
  int input_index = blockDim.x * blockIdx.x + threadIdx.x;
  int output_index = 2 * blockDim.x * blockIdx.x;

  // Coalesced 32-bit load: split each word into two int16_t, convert, and
  // stage the floats in shared memory.
  if (input_index*2 < n) {
    int word = ((int *) input)[input_index];
    int16_t hi = word >> 16;
    int16_t lo = word & 0xFFFF;
    temp[2*threadIdx.x]   = (float) lo;
    temp[2*threadIdx.x+1] = (float) hi;
  }

  __syncthreads();

  // Write the staged floats out in order so the 32-bit stores coalesce.
  if (input_index*2 < n) {
    output[output_index+threadIdx.x] = temp[threadIdx.x];
    output[output_index+blockDim.x+threadIdx.x] = temp[blockDim.x+threadIdx.x];
  }
}
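
In case it helps, the launch for this kernel would look something like the following (an untested host-side sketch; d_input and d_output stand in for device pointers that already hold, or will receive, the n elements):

// Hypothetical host wrapper, just to show the launch configuration and the
// dynamic shared memory size the kernel above assumes.
void convert_i16_to_f32(int n, int16_t *d_input, float *d_output)
{
  int threads = 256;
  int blocks  = (n/2 + threads - 1) / threads;   // one thread per packed 32-bit word
  size_t smem = sizeof(float) * 2 * threads;     // matches the shared-memory assumption above
  int16_to_float<<<blocks, threads, smem>>>(n, d_input, d_output);
}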

I probably handled the sign bit wrong when splitting the 32-bit int into two 16-bit ints, but hopefully this demonstrates how you can use shared memory to coalesce accesses to awkwardly sized data types.
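
As an aside, one way to sidestep the sign handling entirely is to load through CUDA's built-in short2 vector type, which keeps both halves as signed 16-bit values, and to store through float2 so the writes stay coalesced too. A minimal, untested sketch (assuming n is even and both pointers come from cudaMalloc, so they are suitably aligned):

// Vector-typed loads/stores instead of manual shifting and masking.
__global__ void int16_to_float_vec(int n, const int16_t *input, float *output)
{
  int i = blockDim.x * blockIdx.x + threadIdx.x;           // index of one 32-bit pair
  if (2*i + 1 < n) {
    short2 pair = ((const short2 *) input)[i];             // one coalesced 32-bit load, signs preserved
    ((float2 *) output)[i] = make_float2(pair.x, pair.y);  // one coalesced 64-bit store
  }
}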

Thanks for the quick reply and great info. That got me going on the right track.

I had a bit of trouble with different blocks stomping on each other’s shared memory, but when I dropped the “extern”, things started working.

I’m still learning about the different memory types in CUDA.
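
For reference, the two ways of declaring shared memory differ in where the size comes from; a minimal illustration (my example, not code from this thread):

// Statically sized shared memory: the size is fixed at compile time and no
// launch parameter is needed.
__shared__ float temp_static[512];

// Dynamically sized ("extern") shared memory: the size must be supplied as the
// third launch parameter, e.g. kernel<<<blocks, threads, 512 * sizeof(float)>>>(...);
// if it is omitted, the array has zero bytes behind it.
extern __shared__ float temp_dynamic[];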

The function I ended up with transfers the 16-bit integers into a block of shared memory, then syncs, then converts the 16-bit values to float and writes the output, then syncs again.

static __global__
void cvt_i16_to_f32(float *outbuf, const int16_t *inbuf, int n)
{
	__shared__ int32_t tmp32[256]; // enough to hold 512 int16_t, i.e. max threads/block
	int16_t *tmp16 = (int16_t *) tmp32;
	int k;

	for (k = threadIdx.x + blockIdx.x * blockDim.x;
	     k < n;
	     k += blockDim.x * gridDim.x)
	{
		if (threadIdx.x < (blockDim.x >> 1)) {
			// if this thread is in the first half of the block, then
			// copy two shorts as an int
			const int32_t *blockstart = (const int32_t *) (inbuf + k - threadIdx.x);
			tmp32[threadIdx.x] = blockstart[threadIdx.x];
		}
		__syncthreads();

		// read int16_t from shared memory
		outbuf[k] = (float) tmp16[threadIdx.x];
		__syncthreads();
	}
}

I still have two questions:

  1. Why did I need to drop “extern” from the shared memory declaration? I thought shared memory was only accessible by threads in the same block.

  2. Am I correct in assuming that I need the second __syncthreads() call if there are fewer threads in the grid than elements to convert (i.e. need to loop)?

Here’s the profiler analysis of three similar functions. The shared memory approach is quite a bit faster than the simple approach!

The profile is for 1e4 calls to FUNC<<<512,64>>> to convert 32768 elements.

Method                  #Calls   GPU usec   CPU usec   %GPU time   gld coalesced   gld uncoalesced   gst coalesced
cvt_i16_to_f32_simple   10000    292177     421562     58.87       0               4.68116e+07       1.17029e+07
demux_i16_to_f32        10000    108166     233376     21.79       1.46285e+06     0                 1.17028e+07
cvt_i16_to_f32          10000    95956      216620     19.33       1.46286e+06     0                 1.17029e+07

cvt_i16_to_f32_simple does a simple element-by-element copy/cast from the int16_t input to the float output (a sketch of what that looks like follows below).

cvt_i16_to_f32 (above) copies a block of 16-bit integers into shared memory and then converts.

demux_i16_to_f32 converts each 16-bit integer to a float, but does it in pairs: the even and odd elements go to different output buffers.
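
For reference, the simple baseline presumably looks something like this (my reconstruction, not the code actually profiled); each thread issues its own 16-bit global load, which is why the profiler counts every load as uncoalesced:

// Assumed shape of the naive baseline: one 16-bit global load and one 32-bit
// global store per element, with no staging in shared memory.
static __global__
void cvt_i16_to_f32_simple(float *outbuf, const int16_t *inbuf, int n)
{
	for (int k = threadIdx.x + blockIdx.x * blockDim.x;
	     k < n;
	     k += blockDim.x * gridDim.x)
	{
		outbuf[k] = (float) inbuf[k]; // 16-bit loads are not coalesced on this generation of hardware
	}
}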