Coalescing memory in short-to-float conversion

Is there a technique I can use to speed up int16_t-to-float conversion?
The memory access is the trouble: 16 bits are loaded per element but 32 bits are stored.
Is there any way around this?

Cast your int16_t pointer to an int32_t pointer, have each thread load one 32-bit int, split it into two 16-bit ints, convert them to float, and write them into a shared memory buffer. Then call __syncthreads() and have a second loop write the floats out in order.

This might be clearer in code (not tested):

#include <stdint.h>

// Assumptions:
//   * input and output contain n elements
//   * n is even
//   * kernel is invoked with sizeof(float)*2*blockDim.x bytes of shared memory
//   * gridDim.x * blockDim.x >= n/2
__global__ void int16_to_float(int n, int16_t *input, float *output)
{
  extern __shared__ float temp[];
  int input_index = blockDim.x * blockIdx.x + threadIdx.x;
  int output_index = 2 * blockDim.x * blockIdx.x;

  // Coalesced 32-bit load: split each word into two int16_t, convert, and
  // stage the floats in shared memory.
  if (input_index*2 < n) {
    int word = ((int *) input)[input_index];
    int16_t hi = word >> 16;
    int16_t lo = word & 0xFFFF;
    temp[2*threadIdx.x]   = (float) lo;
    temp[2*threadIdx.x+1] = (float) hi;
  }

  __syncthreads();

  // Write the staged floats out in order so the 32-bit stores coalesce.
  if (input_index*2 < n) {
    output[output_index+threadIdx.x] = temp[threadIdx.x];
    output[output_index+blockDim.x+threadIdx.x] = temp[blockDim.x+threadIdx.x];
  }
}
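
In case it helps, the launch for this kernel would look something like the following (an untested host-side sketch; d_input and d_output stand in for device pointers that already hold, or will receive, the n elements):

// Hypothetical host wrapper, just to show the launch configuration and the
// dynamic shared memory size the kernel above assumes.
void convert_i16_to_f32(int n, int16_t *d_input, float *d_output)
{
  int threads = 256;
  int blocks  = (n/2 + threads - 1) / threads;   // one thread per packed 32-bit word
  size_t smem = sizeof(float) * 2 * threads;     // matches the shared-memory assumption above
  int16_to_float<<<blocks, threads, smem>>>(n, d_input, d_output);
}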

I probably handled the sign bit wrong when splitting the 32-bit int into two 16-bit ints, but hopefully this demonstrates how you can use shared memory to coalesce accesses to awkwardly sized data types.
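
As an aside, one way to sidestep the sign handling entirely is to load through CUDA's built-in short2 vector type, which keeps both halves as signed 16-bit values, and to store through float2 so the writes stay coalesced too. A minimal, untested sketch (assuming n is even and both pointers come from cudaMalloc, so they are suitably aligned):

// Vector-typed loads/stores instead of manual shifting and masking.
__global__ void int16_to_float_vec(int n, const int16_t *input, float *output)
{
  int i = blockDim.x * blockIdx.x + threadIdx.x;           // index of one 32-bit pair
  if (2*i + 1 < n) {
    short2 pair = ((const short2 *) input)[i];             // one coalesced 32-bit load, signs preserved
    ((float2 *) output)[i] = make_float2(pair.x, pair.y);  // one coalesced 64-bit store
  }
}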

Thanks for the quick reply and great info. That got me going on the right track.

I had a bit of trouble with different blocks stomping on each other’s shared memory, but when I dropped the “extern”, things started working.

I’m still learning about the different memory types in CUDA.
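
For reference, the two ways of declaring shared memory differ in where the size comes from; a minimal illustration (my example, not code from this thread):

// Statically sized shared memory: the size is fixed at compile time and no
// launch parameter is needed.
__shared__ float temp_static[512];

// Dynamically sized ("extern") shared memory: the size must be supplied as the
// third launch parameter, e.g. kernel<<<blocks, threads, 512 * sizeof(float)>>>(...);
// if it is omitted, the array has zero bytes behind it.
extern __shared__ float temp_dynamic[];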

The function I ended up with transfers the 16-bit integers into a block of shared memory, then syncs, then converts the 16-bit values to float and writes the output, then syncs again.

static __global__
void cvt_i16_to_f32(float *outbuf, const int16_t *inbuf, int n)
{
	__shared__ int32_t tmp32[256]; // enough to hold 512 int16_t, i.e. max threads/block
	int16_t *tmp16 = (int16_t *) tmp32;
	int k;

	for (k = threadIdx.x + blockIdx.x * blockDim.x;
	     k < n;
	     k += blockDim.x * gridDim.x)
	{
		if (threadIdx.x < (blockDim.x >> 1)) {
			// if this thread is in the first half of the block, then
			// copy two shorts as an int
			const int32_t *blockstart = (const int32_t *) (inbuf + k - threadIdx.x);
			tmp32[threadIdx.x] = blockstart[threadIdx.x];
		}
		__syncthreads();

		// read int16_t from shared memory
		outbuf[k] = (float) tmp16[threadIdx.x];
		__syncthreads();
	}
}

I still have two questions:

  1. Why did I need to drop “extern” from the shared memory declaration? I thought shared memory was only accessible by threads in the same block.

  2. Am I correct in assuming that I need the second __syncthreads() call if there are fewer threads in the grid than elements to convert (i.e. need to loop)?

Here’s the profiler analysis of three similar functions. The shared memory approach is quite a bit faster than the simple approach!

The profile is for 1e4 calls to FUNC<<<512,64>>> to convert 32768 elements.

Method                  #Calls   GPU usec   CPU usec   %GPU time   gld coalesced   gld uncoalesced   gst coalesced
cvt_i16_to_f32_simple   10000    292177     421562     58.87       0               4.68116e+07       1.17029e+07
demux_i16_to_f32        10000    108166     233376     21.79       1.46285e+06     0                 1.17028e+07
cvt_i16_to_f32          10000    95956      216620     19.33       1.46286e+06     0                 1.17029e+07

cvt_i16_to_f32_simple does a simple element-by-element copy/cast from the int16_t input to the float output (a sketch of what that looks like follows below).

cvt_i16_to_f32 (above) copies a block of 16-bit integers into shared memory and then converts.

demux_i16_to_f32 converts each 16-bit integer to a float, but does it in pairs: the even and odd elements go to different output buffers.
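
For reference, the simple baseline presumably looks something like this (my reconstruction, not the code actually profiled); each thread issues its own 16-bit global load, which is why the profiler counts every load as uncoalesced:

// Assumed shape of the naive baseline: one 16-bit global load and one 32-bit
// global store per element, with no staging in shared memory.
static __global__
void cvt_i16_to_f32_simple(float *outbuf, const int16_t *inbuf, int n)
{
	for (int k = threadIdx.x + blockIdx.x * blockDim.x;
	     k < n;
	     k += blockDim.x * gridDim.x)
	{
		outbuf[k] = (float) inbuf[k]; // 16-bit loads are not coalesced on this generation of hardware
	}
}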