Trying to understand the memory model

I am currently implementing a handful of image dithering algorithms. I already have working implementations in pure C++ on the CPU, and I'm now trying to port them to CUDA. Unfortunately, I'm getting a ton of access-violation errors in cuda-memcheck that either crash the application or leave the resulting image solid gray. This leads me to believe I need to better understand the memory model and redesign my memory handling: my original code did a lot of raw pointer-passing and a ton of jumping around inside buffers at single-byte granularity, which apparently doesn't work so well with global memory in CUDA. Coming from pure C++, I'm used to memory just being memory, and I feel like I'm getting tripped up on the various caveats of the partitioning.

I don't currently have any even semi-working code. I've ripped it apart and put it back together so many times trying to figure out what was and wasn't allowed that I don't even recognize it anymore, and even when it was partially working, it wasn't reducible to a simple example that would be useful for posting. So right now I want to ask for conceptual help, since there are architectural differences between CPU and GPGPU computing that I need to wrap my head around. Then I'll go back and write a complete barebones test case from scratch, and if that still has issues, I'll come back with that test case and ask more specific questions.

So, with that in mind, let's look at Yliluoma's ordered dithering algorithm 1.

From a memory-allocation standpoint, each color consists of three 8-bit values, one per R, G, and B channel, but storing them as RGBA with A = 255 makes each entry a nice, round, 32-bit value. The palette will ideally be 256 colors, but might be fewer depending on the image. I was originally declaring an array of 256 colors in constant memory, uploading the palette with cudaMemcpyToSymbol once it had been generated, and passing the actual number of colors to the kernel. This array gets accessed at byte granularity many times per pixel. I was trying to access the array directly, but would often hit access violations when reading it. Should I be copying the palette into shared memory and using that instead?
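For concreteness, this is the layout I mean, in plain C++ usable from both host and device code (packRGBA and channel are just illustrative names, not my real code):

```cpp
#include <cstdint>

// Pack R, G, B plus A = 255 into one little-endian 32-bit value.
inline std::uint32_t packRGBA(std::uint8_t r, std::uint8_t g, std::uint8_t b)
{
    return  static_cast<std::uint32_t>(r)
         | (static_cast<std::uint32_t>(g) << 8)
         | (static_cast<std::uint32_t>(b) << 16)
         | (0xFFu << 24);                           // A is always 255
}

// Read a single channel back out at byte granularity (c = 0..3 for R,G,B,A).
inline std::uint8_t channel(std::uint32_t rgba, int c)
{
    return static_cast<std::uint8_t>(rgba >> (8 * c));
}
```

The per-byte reads in `channel` are exactly the kind of access pattern the kernel does against the palette, many times per pixel.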

Same question for the image data itself (allocated by a host call to cudaMalloc and uploaded with cudaMemcpy): should I be copying the pixel data into shared memory one block at a time, operating on that, and then copying it back to global memory at the end of each block?
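To make sure I'm picturing that correctly, is the idea something like the following sketch? (ditherKernel is a placeholder; I'm assuming one thread per pixel, one image row per grid row, and the shared-memory size passed at launch.)

```cuda
__global__ void ditherKernel(unsigned char* image, unsigned height, unsigned width)
{
    extern __shared__ unsigned char tile[];       // blockDim.x pixels * 4 bytes, sized at launch
    unsigned x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned y = blockIdx.y;
    bool inBounds = (x < width && y < height);

    if (inBounds)                                 // stage this thread's RGBA pixel
        for (int c = 0; c < 4; ++c)
            tile[threadIdx.x * 4 + c] = image[(y * width + x) * 4 + c];
    __syncthreads();                              // whole block sees the staged tile

    // ... dither using tile[] ...

    __syncthreads();
    if (inBounds)                                 // copy the result back to global memory
        for (int c = 0; c < 4; ++c)
            image[(y * width + x) * 4 + c] = tile[threadIdx.x * 4 + c];
}
```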

Then there are the MixingPlan structs. For Yliluoma #1, that's two colors and a double per pixel, but for other algorithms like Knoll's, a plan can be as large as 64 colors, again per pixel. I'm currently launching the kernel with the image height and width as the block and thread counts, so blockIdx.x corresponds to the y coordinate and threadIdx.x to the x coordinate. I originally tried declaring these structs as simple local variables within the kernel, i.e.

typedef struct
{
	RGBA colors[256];
} MixingPlan;

__global__ void knollKernel(unsigned char* image, unsigned height, unsigned width)
{
	MixingPlan plan;	// per-thread local variable
	. . .
}

However, I got more access violations when doing byte-wise stores or loads to arbitrary colors in the array. Do I need to allocate this explicitly in shared memory? And if so, does that mean I need an array of structs, one per pixel in the block?
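If shared memory is the answer, I imagine it would look something like the sketch below, with one plan per thread. The sizes are the part I'm least sure about; e.g. 256 threads times a 256-color plan would blow far past the usual 48 KB shared-memory limit, so I've shrunk both counts here purely for illustration.

```cuda
typedef struct
{
	RGBA colors[64];                 // Knoll worst case; Yliluoma #1 only needs 2
} MixingPlan;

__global__ void knollKernel(unsigned char* image, unsigned height, unsigned width)
{
	__shared__ MixingPlan plans[64]; // one per thread in the block (64 * 64 * 4 B = 16 KB)
	MixingPlan& plan = plans[threadIdx.x];
	. . .
}
```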

There's also the threshold matrix, which I'll just treat like the palette: generate it host-side, upload it to constant memory, then copy it to shared memory.
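In other words, something like this (assuming an 8×8 matrix for the sake of the example; mine is generated at runtime):

```cuda
__constant__ unsigned thresholdMatrix[8 * 8];   // filled host-side once per run

// Host side, after generating the matrix:
// cudaMemcpyToSymbol(thresholdMatrix, hostMatrix, sizeof(thresholdMatrix));
```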

I think that should be enough to get me started, if I can understand at least that much. Then I can throw together a proper test case if I'm still having issues.