Really slow global memory access: can't the compiler hide low bandwidth?

Hello everybody,

I have a fairly small routine that converts an array of unsigned short values to a complex array; the imaginary part is filled with 0.

I read in another thread that someone needed about 2 ms to read 16 MB of unsigned chars.

But my code is really, really slow: it takes about 1 second to convert fewer than 400 blocks of data, each containing 256k elements. All data was allocated via cublasAlloc, so it should be well aligned. I also tried using a texture of ushort2 values, but it didn't help (I thought this might avoid bank conflicts).

But if I don't write the data back to global memory, the time is about 10 ms. So the global memory writes eat all the performance.

Any ideas how to avoid this?

texture<unsigned short, 1, cudaReadModeElementType> usTex;

__global__ void uShortToComplex_Kernel (float *dataOut, int width, int height){

	int offset = blockIdx.x * width + threadIdx.x;

	dataOut[offset << 1] = tex1Dfetch (usTex, offset);	// real part
	dataOut[(offset << 1) + 1] = 0;	// imaginary part

	__syncthreads();
}

extern "C" void uShortToComplex (float *data, float *dataOut, int width, int height){

	cudaBindTexture (0, usTex, data, width * height * sizeof (unsigned short));
	uShortToComplex_Kernel<<<height, width>>> (dataOut, width, height);
	cudaUnbindTexture (usTex);
}

Thank you,

xlro

I bet __syncthreads() is killing your performance. Instead of that, you should restructure your program so that each thread writes to a unique memory address (so multiple threads won't collide writing the same value)…

It doesn't appear that you're "changing" anything in the loop that depends on other outcomes of the loop, so to speed this up you need to:

Put the __syncthreads in your PROGRAM, not in the kernel.

Cast the ushort array to ushort4 and access it as .x, .y, .z, .w. And don't use plain float: make a Complex type that is a float2 as well (roughly like the sketch at the end of this post).

Also, when I got the 2 ms it was between two allocated global memory areas, not from a texture.

Also, you know you don't actually use the "height" you pass in?
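
Untested, but roughly what I mean; the Complex typedef and the indexing here are just for illustration:

typedef float2 Complex;	// interleaved re/im pair, same layout CUFFT uses

__global__ void uShortToComplex_Kernel (ushort4 *dataIn, Complex *dataOut, int width){

	int offset = blockIdx.x * width + threadIdx.x;

	ushort4 tmp = dataIn[offset];	// one 8-byte read covers four input values

	// four complex outputs per thread, imaginary parts zeroed
	dataOut[(offset << 2) + 0] = make_float2 (tmp.x, 0.0f);
	dataOut[(offset << 2) + 1] = make_float2 (tmp.y, 0.0f);
	dataOut[(offset << 2) + 2] = make_float2 (tmp.z, 0.0f);
	dataOut[(offset << 2) + 3] = make_float2 (tmp.w, 0.0f);
}

Launch it with width / 4 threads per row, the same way as your current kernel.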

Thanks for your replies!

I just changed some things and have two working versions of the code…

Switching to float4 and ushort4 brought the times down a little (I also tried the combinations ushort4-float2 and ushort2-float). But another speedup would be really nice, because the time is still much too large (0.75 seconds without the texture and 0.70 seconds with it). If I calculate correctly, 400 × 256k ushorts is about 200 MB in and 800 MB of complex floats out, roughly 1 GB of traffic in total, which should take tens of milliseconds at this card's bandwidth, not hundreds.

The version without textures:

__global__ void uShortToComplex_Kernel (ushort4 *dataIn, float4 *dataOut, int width){

	int offset = blockIdx.x * width + threadIdx.x;

	ushort4 tmp = dataIn[offset];	// one 8-byte read covers four input values

	// expand into two float4s holding (re, 0, re, 0) each
	float4 out1;
	out1.x = tmp.x;
	out1.y = 0;
	out1.z = tmp.y;
	out1.w = 0;

	float4 out2;
	out2.x = tmp.z;
	out2.y = 0;
	out2.z = tmp.w;
	out2.w = 0;

	dataOut[offset << 1] = out1;
	dataOut[(offset << 1) + 1] = out2;
}

extern "C" void uShortToComplex (float *data, float *dataOut, int width, int height){

	// each thread handles four input elements, hence width / 4
	uShortToComplex_Kernel<<<height, width / 4>>> ((ushort4 *) data, (float4 *) dataOut, width / 4);
}

And with textures:

texture<ushort4, 1, cudaReadModeElementType> usTex;

__global__ void uShortToComplex_Kernel (ushort4 *dataIn, float4 *dataOut, int width){

	// dataIn is unused here; the reads go through the texture
	int offset = blockIdx.x * width + threadIdx.x;

	ushort4 tmp = tex1Dfetch (usTex, offset);	// four input values per fetch

	float4 out1;
	out1.x = tmp.x;
	out1.y = 0;
	out1.z = tmp.y;
	out1.w = 0;

	float4 out2;
	out2.x = tmp.z;
	out2.y = 0;
	out2.z = tmp.w;
	out2.w = 0;

	dataOut[offset << 1] = out1;
	dataOut[(offset << 1) + 1] = out2;
}

extern "C" void uShortToComplex (float *data, float *dataOut, int width, int height){

	cudaBindTexture (0, usTex, data, width * height * sizeof (unsigned short));
	uShortToComplex_Kernel<<<height, width / 4>>> ((ushort4 *) data, (float4 *) dataOut, width / 4);
	cudaUnbindTexture (usTex);
}

I replaced the kernel call with

uShortToComplex_Kernel<<<height, (width >> 2)>>> ((ushort4 *) data, (float4 *) dataOut, (width >> 2));

Now both versions are 0.05 seconds faster.

I thought the compiler could optimize this?!
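
It can't turn a signed division into a plain shift: for signed int, division by 4 truncates toward zero while a shift right by 2 rounds toward minus infinity, so the compiler has to add a correction for negative values. A quick host-side check, just to illustrate:

#include <stdio.h>

int main (void){

	int n = -7;
	// signed division truncates toward zero, an arithmetic
	// shift rounds toward minus infinity
	printf ("%d %d\n", n / 4, n >> 2);	// prints: -1 -2
	return 0;
}

If width were declared unsigned, the two would be identical and the compiler could do the shift on its own.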

Coalescing again…
You should hard-code a multiple-of-32 block size, write your result to shared memory, sync, and then write to global memory coalesced; something like the sketch below. It's a 2x to 8x difference, but people overlook it very often.
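
A rough, untested sketch of that pattern; BLOCK and the kernel name are made up here:

#define BLOCK 64	// hard-coded multiple of 32

__global__ void uShortToComplex_Staged (unsigned short *dataIn, float *dataOut){

	__shared__ float s[2 * BLOCK];	// one block's worth of re/im pairs

	// one input element per thread (a ushort2 per thread would let
	// the load coalesce too; kept simple here)
	int in = blockIdx.x * BLOCK + threadIdx.x;
	s[2 * threadIdx.x] = dataIn[in];	// real part
	s[2 * threadIdx.x + 1] = 0.0f;	// imaginary part
	__syncthreads();

	// consecutive threads now write consecutive float addresses,
	// so both stores are coalesced
	int out = 2 * BLOCK * blockIdx.x;
	dataOut[out + threadIdx.x] = s[threadIdx.x];
	dataOut[out + BLOCK + threadIdx.x] = s[BLOCK + threadIdx.x];
}

Launched as uShortToComplex_Staged<<<width * height / BLOCK, BLOCK>>> (in, out), assuming the total element count is a multiple of BLOCK.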

This one should be faster, because you should get both coalesced reads and writes:

__global__ void uShortToComplex_Kernel (ushort2 *dataIn, float4 *dataOut, int width){

	int offset = blockIdx.x * width + threadIdx.x;

	// one 4-byte read and one 16-byte write per thread, at
	// consecutive addresses across the warp
	ushort2 tmp = dataIn[offset];
	dataOut[offset] = make_float4 (tmp.x, 0.0f, tmp.y, 0.0f);
}

extern "C" void uShortToComplex (ushort *dataIn, float *dataOut, int width, int height){

	uShortToComplex_Kernel<<<height, width / 2>>> ((ushort2 *) dataIn, (float4 *) dataOut, width / 2);
}

My 2 cents…