Copying data to shared memory


My goal is to copy data from global memory to shared memory, modify it, and write it back to global memory. The kernel looks as follows:

#define BLOCK_SIZE 32

__global__ void find_defect_coordinates(unsigned char *input, unsigned int width, unsigned int height)
{
	unsigned int thread_x = threadIdx.x;
	unsigned int thread_y = threadIdx.y;
	unsigned int glob_x = (blockIdx.x * blockDim.x) + thread_x;
	unsigned int glob_y = (blockIdx.y * blockDim.y) + thread_y;

	unsigned char correction = 0;

	__shared__ unsigned char shInput[BLOCK_SIZE][BLOCK_SIZE];

	const int numTiles = width / BLOCK_SIZE;
	for (int t = 0; t < numTiles; t++)
	{
		const int tiledCol = BLOCK_SIZE * t + thread_x;
		shInput[thread_y][thread_x] = input[glob_y*width + (t*BLOCK_SIZE + thread_x)];

		correction = shInput[thread_y][thread_x] - 50;

		input[glob_y*width + (t*BLOCK_SIZE + thread_x)] = shInput[thread_y][thread_x];
	}
}

The input here is:
input = a pointer to the picture data in global memory (its size is 8192 * 512 * sizeof(unsigned char))
width = width of the picture (8192)
height = height of the picture (512)

I am a beginner with CUDA. Could you please let me know what I am doing wrong here? After the kernel runs, I get some kind of noisy image. Thank you very much for your advice.

Are you doing proper CUDA error checking? If you don’t know what that is, please google it and start reading.
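
For reference, here is a minimal sketch of what such error checking usually looks like. The macro name CUDA_CHECK and the helper check_last_launch are made up for illustration; cudaGetLastError, cudaDeviceSynchronize, and cudaGetErrorString are the actual runtime API calls.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with file and line if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: call this right after a kernel launch.
inline void check_last_launch()
{
    CUDA_CHECK(cudaGetLastError());       // launch and configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran
}
```

Wrap every runtime call (cudaMalloc, cudaMemcpy, and so on) in CUDA_CHECK, and call check_last_launch() directly after the find_defect_coordinates<<<...>>>(...) launch.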

Have you run your code with cuda-memcheck?

If so, what is the output in each case? Are any errors reported in either case?

My suggestion, also, if you want help, is to provide a short, complete code sample that others can inspect or run. This shouldn’t be your full current code, just a simple test program that calls this kernel and shows the problem.

The write back to input at the end of the loop body contains a race condition. The index does not depend on blockIdx.x, so every thread with the same glob_y and the same thread_x writes to the same position inside input.
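
One way to avoid this, as a sketch rather than a drop-in fix: let each thread own exactly one pixel by indexing with glob_x and glob_y, and drop the tile loop. The kernel name copy_through_shared and the bounds check are my additions. It assumes a launch with BLOCK_SIZE x BLOCK_SIZE threads per block and a grid that covers the whole image.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 32

// Hypothetical kernel: each thread reads and writes only its own pixel,
// so no two threads ever touch the same element of input.
__global__ void copy_through_shared(unsigned char *input,
                                    unsigned int width, unsigned int height)
{
    __shared__ unsigned char tile[BLOCK_SIZE][BLOCK_SIZE];

    unsigned int glob_x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int glob_y = blockIdx.y * blockDim.y + threadIdx.y;
    bool inside = (glob_x < width) && (glob_y < height);

    if (inside)
        tile[threadIdx.y][threadIdx.x] = input[glob_y * width + glob_x];

    // Only required once a thread reads a tile element loaded by another
    // thread; kept outside the branch so every thread in the block reaches it.
    __syncthreads();

    if (inside)
        input[glob_y * width + glob_x] = tile[threadIdx.y][threadIdx.x];
}
```

For the 8192 x 512 image from the question, this would be launched with dim3 block(BLOCK_SIZE, BLOCK_SIZE) and dim3 grid(width / BLOCK_SIZE, height / BLOCK_SIZE), which is 256 x 16 blocks.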

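The provided kernel doesn’t have any effect: correction is computed but never used, and the value written back to input is the one that was just read from it. It’s not a CUDA problem, just a simple logic error.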
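
If the intent was to store the corrected value, a sketch could look like the following. It assumes that correction is meant to replace the pixel value, and it clamps at 0 because unsigned char arithmetic would otherwise wrap around (20 - 50 would be stored as 226). Both are my assumptions, not something stated in the question. The names subtract_clamped and apply_correction are hypothetical. Shared memory is left out because a purely per-pixel operation never reads a neighbouring pixel.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: subtract an offset, clamping at 0 instead of wrapping.
__host__ __device__ inline unsigned char subtract_clamped(unsigned char value,
                                                          unsigned char offset)
{
    return value > offset ? static_cast<unsigned char>(value - offset) : 0;
}

// Hypothetical kernel: one thread per pixel, and the result is actually stored.
__global__ void apply_correction(unsigned char *input,
                                 unsigned int width, unsigned int height)
{
    unsigned int glob_x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int glob_y = blockIdx.y * blockDim.y + threadIdx.y;
    if (glob_x >= width || glob_y >= height)
        return;  // threads outside the image do nothing

    unsigned int idx = glob_y * width + glob_x;
    input[idx] = subtract_clamped(input[idx], 50);  // 50 is the offset used in the question
}
```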