Copying data to shared memory

filip.tn · March 2, 2018, 11:06am

Hello,

My goal here is to copy data from global memory to shared memory, modify it, and write it back to global memory. The kernel looks as follows:

#define BLOCK_SIZE 32

__global__
void find_defect_coordinates(unsigned char *input, unsigned int width, unsigned int heigth)
{
	unsigned int thread_x = threadIdx.x;
	unsigned int thread_y = threadIdx.y;
	
	unsigned int glob_x = (blockIdx.x * blockDim.x) + thread_x;
	unsigned int glob_y = (blockIdx.y * blockDim.y) + thread_y;

	unsigned char correction = 0;

	__shared__ unsigned char shInput[BLOCK_SIZE][BLOCK_SIZE];

	const int numTiles = width / BLOCK_SIZE;
	for (int t = 0; t < numTiles; t++)
	{
		const int tiledCol = BLOCK_SIZE * t + thread_x;
		shInput[thread_y][thread_x] = input[glob_y*width + (t*BLOCK_SIZE + thread_x)];
		__syncthreads();

		correction = shInput[thread_y][thread_x] - 50;
		__syncthreads();

		input[glob_y*width + (t*BLOCK_SIZE + thread_x)] = shInput[thread_y][thread_x];
		__syncthreads();
	}
}

The input here is:
input = a pointer to the picture data in global memory (its size is 8192 * 512 * sizeof(unsigned char))
width = width of the picture (8192)
height = height of the picture (512)

I am a beginner with CUDA. Could you please let me know what I am doing wrong here? After the kernel executes, I get a noisy image. Thank you very much for your advice.

Robert_Crovella · March 2, 2018, 12:10pm

Are you doing proper CUDA error checking? If you don’t know what that is, please google it and start reading.

Have you run your code with cuda-memcheck?

If so, what is the output in both cases? Are any errors reported in either case?

My suggestion, also, if you want help, is to provide a short, complete code sample that others can inspect or run. This shouldn’t be your full current code, just a simple test program that calls this kernel and shows the problem.
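For readers unsure what "proper CUDA error checking" looks like, here is a minimal sketch of one common pattern. The names CUDA_CHECK and check_last_kernel are hypothetical helpers chosen for illustration, not something from this thread; only the CUDA runtime calls themselves (cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString) are real API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not from the thread): print the error with
// file/line information and abort if a CUDA runtime call does not return
// cudaSuccess.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",          \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Hypothetical wrapper: call it right after a kernel launch.
inline void check_last_kernel()
{
    CUDA_CHECK(cudaGetLastError());       // launch/configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran
}
```

With something like this in place, every cudaMalloc and cudaMemcpy call would be wrapped in CUDA_CHECK(...), and check_last_kernel() would follow the find_defect_coordinates<<<...>>> launch, so an invalid launch configuration or an out-of-bounds access is reported instead of silently producing a corrupted image.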

striker159 · March 2, 2018, 12:12pm

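Line 26 (the write back to input) contains a race condition. The index does not depend on blockIdx.x, so every thread with the same glob_y and the same thread_x, in every block along x, writes to the same position inside input.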

BulatZiganshin · March 2, 2018, 12:19pm

The provided kernel doesn’t have any effect: correction is computed but never stored, so each pixel is written back unchanged. It’s not a CUDA problem, just a simple logic error.
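Putting the two observations above together, a corrected version might look like the sketch below. It rests on assumptions: that the intent is to subtract 50 from every pixel, that one thread handles exactly one pixel (so the tile loop disappears), and that values should clamp at 0 rather than wrap around. The names subtract_offset and run_subtract_offset are hypothetical and not from the original post.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_SIZE 32

// Sketch: one thread per pixel, indexed by glob_x and glob_y, so no two
// threads ever touch the same element of `input`.
__global__ void subtract_offset(unsigned char *input, unsigned int width,
                                unsigned int height)
{
    __shared__ unsigned char shInput[BLOCK_SIZE][BLOCK_SIZE];

    const unsigned int glob_x = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int glob_y = blockIdx.y * blockDim.y + threadIdx.y;
    const bool inside = (glob_x < width) && (glob_y < height);

    // Stage this block's tile in shared memory.
    if (inside)
        shInput[threadIdx.y][threadIdx.x] = input[glob_y * width + glob_x];
    __syncthreads();  // reached by every thread, including out-of-range ones

    if (inside) {
        const unsigned char v = shInput[threadIdx.y][threadIdx.x];
        // Assumption: clamp at 0 instead of letting unsigned char wrap.
        input[glob_y * width + glob_x] = (v > 50) ? (unsigned char)(v - 50) : 0;
    }
}

// Hypothetical host wrapper; returns the first CUDA error encountered.
cudaError_t run_subtract_offset(std::vector<unsigned char> &image,
                                unsigned int width, unsigned int height)
{
    if (image.size() != (size_t)width * height) return cudaErrorInvalidValue;

    unsigned char *d_input = nullptr;
    const size_t bytes = image.size() * sizeof(unsigned char);
    cudaError_t err = cudaMalloc(&d_input, bytes);
    if (err != cudaSuccess) return err;

    err = cudaMemcpy(d_input, image.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const dim3 block(BLOCK_SIZE, BLOCK_SIZE);
        // Round up so the grid covers the whole image.
        const dim3 grid((width + BLOCK_SIZE - 1) / BLOCK_SIZE,
                        (height + BLOCK_SIZE - 1) / BLOCK_SIZE);
        subtract_offset<<<grid, block>>>(d_input, width, height);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)  // this copy also waits for the kernel to finish
        err = cudaMemcpy(image.data(), d_input, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_input);
    return err;
}
```

For the 8192 x 512 image from the question, the call would be run_subtract_offset(image, 8192, 512). Note that in this simple per-pixel operation shared memory brings no real benefit, since each thread only reads back the value it stored itself; staging a tile pays off once threads need to read pixels loaded by their neighbours.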