Problems with coalescing memory accesses

Hi!

I’m new to CUDA-programming and just wanted to try something, but i’ve problems with non-coalesced memory accesses. I wanna write a game a little bit like a mix of scrabble and mahjong.

I’ve got a grid of chars and wanna do some computing. I select one char, compare it with the eight surrounding chars and then write a char-value back at a new grid.

Here’s an example:


–ooo-- --a— x: char to compare

–oxo-- ------ o: comparison value

–ooo-- ------ a: output


That’s the Kernel-code I’ve got yet:

[codebox]#include “function.h”

global void function(unsigned char* source,

				unsigned char* target, 

				size_t spitch, 

				size_t tpitch,

				unsigned short sh,

				unsigned short th)

{

// declare variables

unsigned long sp, tp;

unsigned char pos;

unsigned char out[4];

__shared__ unsigned char sector[BLOCK_SIZE_Y][CHARS_PER_THREAD * BLOCK_SIZE_X];

unsigned char col = threadIdx.x;

unsigned char row = threadIdx.y;

// avoid bank conflicts on shared memory	

col *= CHARS_PER_THREAD;

	// -2, cause of over-lapping

unsigned long x = blockIdx.x * (blockDim.x * CHARS_PER_THREAD - 2) + col;

unsigned long y = blockIdx.y * (blockDim.y - 2) + row;

sp = y*spitch + x;

// load a sector of the grid in shared memory

// each thread loads CHARS_PER_THREAD chars



// avoid out of range

if(y < sh)

{

	#pragma unroll

	for(unsigned char i=0; i < CHARS_PER_THREAD; i++)

		// avoid out of range

		if((x + i) < spitch)

			sector[row][col + i] = source[ sp + i];

}



// synchronize to be sure, all chars are loaded before computing

__syncthreads();



// compute output value, each thread computes CHARS_PER_THREAD output values

#pragma unroll

for(unsigned char i=0; i < CHARS_PER_THREAD; i++)

{	

	out[i] = 0;

	unsigned char middle = sector[row+1][col+1 + i];

		   // here goes my algorithm (comparisons and computations of output)

}



// write out values to target, each thread writes CHARS_PER_THREAD chars



// avoid out of range

if(y < th && row < (BLOCK_SIZE_Y - 2))

{

	tp = y*tpitch + x;

	#pragma unroll

	for(unsigned char i=0; i < CHARS_PER_THREAD; i++)

		// avoid out of range

		if((x + i) < tpitch && (col + i) < (CHARS_PER_THREAD * BLOCK_SIZE_X - 2))

			target[ tp + i] = out[i];

}

}[/codebox]

Code-Snippet of allocation and kernel-invocation:

[codebox]// allocate device memory

cutilSafeCall( cudaMallocPitch( (void **)&d_so, (size_t*)&d_spitch, sw, sh ) );

cutilSafeCall( cudaMallocPitch( (void **)&d_to, (size_t*)&d_tpitch, tw, th ) );

// copy host memory to device

cutilSafeCall( cudaMemcpy2D(d_so, d_spitch, so, slf, sw, sh,

							cudaMemcpyHostToDevice) );

// execute the kernel and check if kernel execution generated an error

dimBlock = dim3(BLOCK_SIZE_X, BLOCK_SIZE_Y);

dimGrid = dim3( (tw + (PX_PER_THREAD*dimBlock.x-2) - 1) / (PX_PER_THREAD*dimBlock.x-2),

						 (th + (dimBlock.y-2) - 1) / (dimBlock.y-2));

function<<<dimGrid, dimBlock>>>(d_so, d_to, d_spitch, d_tpitch, sh, th);

cutilCheckMsg("Kernel execution failed");

cudaThreadSynchronize();[/codebox]

If I’m running the program in CUDA visual profiler, I get many non-coalesced loads and stores. I’ve read the Reference Manual, Programming Guide and Best Practise Guide,

but I still don’t get it. I’m using a 9800 GTX+, Windows XP32 and CUDA 2.3. Anybody could help me?

I have some suggestions.
non-coalesced in reading data from global memory
in your case, you should use texture memory
non-coalesced in writting data to global memory
you have better use uchar4 than uchar1 (1 thread processes 4 elements and store in uchar4, finally copy back data to global memory)

Yeah, I know, the problems are here:

sector[row][col + i] = source[ sp + i];

and here:

target[ tp + i] = out[i];

Isn’t there a way to do this without textures?

If you dont want to use texture memory. you can directly read data from global memory without un-coalescing

assume that 1 thread processes 4 elements (instead of 1 thread processes 1 elements).

so you should cast the uchar1 to uchar4, and then copy data to shared memory.

to confirm that all threads finish copying data from global memory, you need to use the __syncthreads() function.

finally read data from shared memory (if not optimized, the bank conflicting will occurs)

I’m now trying it with textures, but I get a unspecified launch failure using the tex2D-function.

Kernel-Code:

[codebox]texture<char, 2, cudaReadModeElementType> tex;

global void kernel(unsigned char* target,

			   unsigned short sw, 

			   unsigned short sh,

			   unsigned short tw,

			   unsigned long tlf,

			   unsigned short th)

{

unsigned long tp;

unsigned char middle, pos, out;

unsigned char col, row;

unsigned long x, y;

__shared__ unsigned char sector[BLOCK_SIZE_Y][BLOCK_SIZE_X];

col = threadIdx.x;

row = threadIdx.y;

x = blockIdx.x * blockDim.x + col;

y = blockIdx.y * blockDim.y + row;

if( x < sw && y < sh)

{

	sector[row][col] = tex2D( tex, x, y);   //In this line, there's the launch failure.

		

	out = 0;

	middle = sector[row+1][col+1];

	// here goes my algorithm (comparisons and computations of output)

	tp = y*tlf + x;

	target[ tp ] = out;

}

}[/codebox]

Host-Code:

[codebox] // allocate device memory

    cudaArray *d_so;

cudaChannelFormatDesc desc = cudaCreateChannelDesc(8, 0, 0, 0, cudaChannelFormatKindUnsigned);

cutilSafeCall( cudaMallocArray( &d_so, &desc, sw, sh) );

// copy host memory to device

cutilSafeCall( cudaMemcpy2DToArray( d_so, 0, 0, so, slf, sw, sh, cudaMemcpyHostToDevice) );

// set texture parameters

tex.addressMode[0] = cudaAddressModeClamp;

tex.addressMode[1] = cudaAddressModeClamp;

tex.filterMode = cudaFilterModePoint;

tex.normalized = false;

// bind array to texture

cutilSafeCall( cudaBindTextureToArray( tex, d_so, desc) );

// invoke kernel and check if kernel execution generated an error

dim3 dimBlock = dim3(BLOCK_SIZE_X, BLOCK_SIZE_Y);

dim3 dimGrid = dim3( (tw + (dimBlock.x-2) - 1) / (dimBlock.x-2), (th + (dimBlock.y-2) - 1) / (dimBlock.y-2));

function<<<dimGrid, dimBlock>>>( d_to, sw, sh, tw, tlf, th);

cutilCheckMsg("Kernel execution failed");

cudaThreadSynchronize();

[/codebox]

Edit: I’ve recognized, the problem is not the call of tex2D, but the asignment sector[row][col] = tex2D( tex, x, y). Look’s like it’s not allowed to fetch textures and store it in shared memory. But that’s absurd!