Serialized warp when accessing ushort4 items

In the CUDA profiler (v 4.0) I’m getting serialized warps when accessing different fields of a ushort4 variable. Here’s the general idea. I’m trying to temporally filter four floating-point values per pixel. However, to save bandwidth, I’m storing the floats as ‘half’ values in unsigned shorts, as recommended/supported by CUDA. Since I’m storing 4 values per pixel, I need a ushort4 per pixel.
I then use another unsigned int array (call it ‘lookup’) whose values are 0, 1, 2 or 3, and I want to extract (respectively) the .x, .y, .z or .w field of that ushort4 and convert it to a float, given that lookup number. Since I can’t use the lookup value as a field index, I cast a pointer to the ushort4 to an unsigned short*, and then index through it, like so (I’m glossing over the code that comes before):

	// Pointer to lookup data
	unsigned int *ptr=(unsigned int*)((char*)lookup + UMUL(blockIdx.y,lpitch));
	// Read lookup value (0..3); we'll use it as an index
	unsigned int lookupdata=ptr[ix];
	// Pointer to input luminance data
	ushort4 *us4ptr=(ushort4*)((char*)lum+blockIdx.y*lumpitch);
	// Load luminance data element for this input pixel
	ushort4 lumdata=us4ptr[ix];
	// Pointer to channel chroma x data
	// Write data across all four channels
	ushort4 xdata = us4ptr[ix];
	// Pointer to channel chroma y data
	// Write data across all four channels
	ushort4 ydata = us4ptr[ix];
	// Select channel and convert Lum/x/y to RGBA
	unsigned short *lumptr=(unsigned short*)&lumdata;
	unsigned short *xptr=(unsigned short*)&xdata;
	unsigned short *yptr=(unsigned short*)&ydata;
//	float3 xyY=make_float3(__half2float(xptr[lookupdata]),__half2float(yptr[lookupdata]),__half2float(lumptr[lookupdata]));
	float3 xyY=make_float3(__half2float(xptr[1]),__half2float(yptr[1]),__half2float(lumptr[1]));

When I compile and run as shown above, where every pixel ‘selects’ the second field (index 1), it runs without serialization. However, if I instead uncomment the line above it, where the index can change from pixel to pixel, I get lots of serialization (175000 on 21200 instructions with no branch divergence). Why is this the case? I assume it has something to do with accessing a ‘half’ int (16 bits) instead of the full one. I tried to get around this problem by casting the pointers to unsigned int*, using the lookup value shifted right by 1 bit as the unsigned int index, and using its lowest bit to decide whether to shift the 32-bit (unsigned int) load by 16 bits (see below). That didn’t work either.

Is there anyone out there who knows how the compiler extracts a ushort out of a ushort4 value and can offer some insight into how to avoid serialization of this operation? Many thanks in advance.

	float3 xyY=make_float3(
		__half2float((xptr[lookupdata>>1]>>(16*(lookupdata&1)))&0x0000FFFF),
		__half2float((yptr[lookupdata>>1]>>(16*(lookupdata&1)))&0x0000FFFF),
		__half2float((lumptr[lookupdata>>1]>>(16*(lookupdata&1)))&0x0000FFFF));
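
For comparison, here is a minimal sketch of a third way to pick the field, one that never forms a pointer into the ushort4 at all. Both versions above index through a pointer to a local variable with a per-thread index, and that kind of dynamic indexing may make the compiler keep the variable in local memory rather than registers. That is only a possible explanation and is not confirmed for this kernel or device. The helper name `select_field`, the `HD` macro and the host-only `ushort4` stand-in are hypothetical and do not come from the code above.

```cpp
// Hypothetical helper, not part of the original kernel.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
// Host-only stand-in so the sketch also builds without nvcc;
// under nvcc the built-in ushort4 is used instead.
struct ushort4 { unsigned short x, y, z, w; };
#endif

// Return field idx (0 -> .x, 1 -> .y, 2 -> .z, 3 -> .w) of v.
// Only compares and selects are used, so no address of v is taken.
// Only the two low bits of idx are examined.
HD inline unsigned short select_field(ushort4 v, unsigned int idx)
{
    unsigned short lo = (idx & 1u) ? v.y : v.x;  // candidate for idx 0 or 1
    unsigned short hi = (idx & 1u) ? v.w : v.z;  // candidate for idx 2 or 3
    return (idx & 2u) ? hi : lo;
}
```

In the kernel this would be used as `__half2float(select_field(xdata, lookupdata))`, and likewise for `ydata` and `lumdata`. Whether it actually removes the serialization would have to be checked in the profiler on the target device.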

What compute capability is your device?