Warnings on array of pointers assigned to externed shared memory

Trying to make a templated demosaic resampler for all four RGGB configurations (format) that can utilize different resampler methods (class O, op()). Below is front matter. The compiler tells me on the last line below (tile[0][shridx]=…; lot’s more code that isn’t included) "Can’t tell what pointer points to, assuming global memory space (which is wrong). I would have thought that line 22 should make it pretty clear (later it gives me the same warnings when trying to use the other pointers…).

I’ve gotten this error before from bad syntax, but I don’t understand the issue here. I know I can use pointers to shared memory (done it before), and know I can do local fixed arrays. Can I not do both simultaneously?

template<class O>
__global__ void demosaic( unsigned int *RGBA, const size_t RGBApitch, const float *src, const size_t src_pitch,
			float *Y, const size_t Ypitch, ushort2 *UV, const size_t UVpitch,
			const uint2 imgsize, const unsigned int format, float normscale, O op )
{
	// Shared memory for color planes
	extern __shared__ float shrdata[];
	// Number of horizontal color samples/tile
	unsigned int shrwidth = blockDim.x+4;

	const int ix = UMAD(blockIdx.x,blockDim.x,threadIdx.x);
	int iy = INT_DOUBLE(UMAD(blockIdx.y,blockDim.y,threadIdx.y));
	float2 idata, *f2ptr;
	float *tile[4];

	// Offset into each shared memory color plane
	unsigned int temp = threadIdx.x+2;
	unsigned int shridx = UMAD(2+threadIdx.y,shrwidth,temp);
	// Stride (in samples) per color plane
	unsigned int blocksize = UMUL(shrwidth,threadIdx.y+4);
	// Pointers to the four color planes (UL, UR, LL, LR)
	tile[0] = &shrdata[0];
	tile[1] = &tile[0][blocksize];
	tile[2] = &tile[1][blocksize];
	tile[3] = &tile[2][blocksize];
	//////////////////////////////////
	// Load pairs of 'even' row pixels
	//////////////////////////////////
	// Get pointer to input image
	f2ptr=(float2*)((char*)src + UMUL(iy,src_pitch));
	// Are we in the valid range?
	if (ix<imgsize.x) { 
		// Load middle pixel values in pairs
		idata = f2ptr[ix];
		// Save UL pixel to shared memory
		tile[0][shridx]=idata.x;

Also tried

tile[0] = shrdata;
tile[1] = &shrdata[blocksize];

to be more explicit, but this didn’t change anything.

Googled ‘array of pointers shared memory cuda’, but kept getting links to partitioning the allocation of dynamically sized shared memory into different sections, which is obviously what I’m trying to do here (not even different types), but I can’t seem to make an array of pointers to these color planes within the shared memory.

If I make these four local variables (not an array), the compiler seems happy…

float *ULtile, *URtile, *LLtile, *LRtile;
...
	ULtile = shrdata;
	URtile = &ULtile[blocksize];
...
	ULtile[shridx]=idata.x;

…but don’t understand why I should have to do this.

The idea is that I’ll use a variable for the first index to get to the correct pointer, but I’d have to use an if/then block to choose the correct pointer using the method just described.

Are you specifying an architecture (-arch) of at least sm_20? It’s pretty easy to produce the “cannot tell” warning on pre-sm_20 architectures.

As allanmac points out these warnings are specific to sm_1x. The underlying issue is that in C/C++ “a pointer is a pointer is a pointer”, but at the hardware level, there are no generic pointers in sm_1x, only memory-space specific pointers. So the compiler tries to track the memory space under the hood. This works well when there is just a single level of dereferencing, but it can already break down when there are two levels of dereferencing.

Here “tile” is a pointer to an array of pointers to float, that is, two levels of de-referencing. Once the compiler loses track of the memory space, it defaults to the global memory space, and warns about the fact that it is operating on that assumption (which could well be wrong, in which case the machine code would be wrong and the program doesn’t work correctly).

In sm_20 and later architectures, the hardware was enhanced to support generic pointers, also adding conversion instructions for converting pointers between generic and memory-space specific pointers as the latter may have performance advantages. So now there is a good match between the C/C++ world view and the hardware world view. If your GPU is compute capability 2.0 or higher, use the appropriate -arch or -codegen flag during compilation, and these warnings should disappear.

The other issue with creating a dynamically accessed fixed-size array is it’s going to reside in local memory instead of in registers. An STL/LDL operation is probably not what you want. Check the verbose output of your compile to see how many bytes of “lmem” are being used.

I would suggest either:

  1. Keep it simple and always recalc your "offset" index with a multiply or multiply-add of your const blocksize.
  2. Get medieval and use the PRMT opcode to dynamically select one of 4 precalculated 16-bit tile offsets in shared.
  3. (what else?)

Of course, if the array is not dynamically accessed then you have nothing to worry about. :)

I compiling for 1.1 and 2.0 (thought I was also building for 1.2 and 3.0, but forgot to copy that line in the Release properties to the Debug side). And yes, when I saw this before, it would have been back on a compute 1.1 card (back when I was stylin’ with my 8600 GT) when I started out. We still have a C1060 that we run/test on.

Many thanks to both of you for explaining this. Surprised I haven’t run into this before. The PRMT does indeed look medieval; I’d probably take #1 before that, but I’ll play around with it.