Pointer analysis workaround

So I’m trying to write a device function that uses an explicit stack. This requires me to occasionally read/write pointers to global memory as in:

#include <iostream>
#include <cstdio>

class Stack
{
    public:
        unsigned int* in;
        unsigned int size;
};

__device__ void myMemcpy( unsigned int* out, const unsigned int* in,
    unsigned int size )
{
    for( unsigned int i = 0; i < size; ++i )
    {
        out[i] = in[i];
    }
}

__global__ void foo( Stack* stack, unsigned int* input, unsigned int size )
{
    extern __shared__ unsigned int shared[];

    unsigned int stackPointer = 1;

    myMemcpy( shared, input, size );

    // store a pointer to shared memory in a structure that lives in global memory
    stack[0].in = shared;
    stack[0].size = size;

    Stack entry;

    unsigned int result = 0xFFFFFFFF;

    do
    {
        --stackPointer;
        entry = stack[ stackPointer ];

        // this is the load discussed below: the compiler cannot tell that
        // entry.in points to shared memory and assumes global memory
        unsigned int value = entry.in[0];

        if( value < 132 && entry.size > 1 )
        {
            stack[ stackPointer ].in = entry.in + entry.size / 2;
            stack[ stackPointer ].size = entry.size / 2;
            ++stackPointer;
        }

        if( value > 132 && entry.size > 1 )
        {
            stack[ stackPointer ].in = entry.in - entry.size / 2;
            stack[ stackPointer ].size = entry.size / 2;
            ++stackPointer;
        }

        if( value == 132 )
        {
            result = ( entry.in - shared );
        }
    }
    while( stackPointer > 0 );

    input[0] = result;
}

unsigned int fooHost()
{
    unsigned int input[1024];

    for( unsigned int i = 0; i < 1024; ++i )
    {
        input[i] = (unsigned int) i;
    }

    unsigned int* deviceIn;
    Stack* deviceStack;
    cudaMalloc( (void**) &deviceIn, sizeof( unsigned int ) * 1024 );
    cudaMalloc( (void**) &deviceStack, sizeof( Stack ) * 10 );

    cudaMemcpy( deviceIn, input,
        sizeof( unsigned int ) * 1024, cudaMemcpyHostToDevice );

    foo<<< 1, 1, 1024 * sizeof( unsigned int ) >>>( deviceStack, deviceIn, 1024 );
    cudaMemcpy( input, deviceIn,
        sizeof( unsigned int ), cudaMemcpyDeviceToHost );
    cudaFree( deviceIn );
    cudaFree( deviceStack );

    return input[0];
}

int main()
{
    unsigned int position = fooHost();
    if( position == 0xFFFFFFFF )
    {
        std::cout << "Not found.\n";
    }
    else
    {
        std::cout << "Found item at location " << position << ".\n";
    }
}

Note that this code doesn’t do anything useful; it’s just meant to illustrate my point.

This code works fine when compiling in emulation mode. However, it generates the following warning when compiling for the device:

/tmp/tmpxft_00002c8e_00000000-7_testStack.cpp3.i(49): Advisory: Cannot tell what pointer points to, assuming global memory space

So it seems that reading pointers from memory invalidates the points-to analysis that the compiler is doing, and it can no longer tell which memory space a pointer refers to. That’s fine; points-to analysis is really hard. Looking at the PTX code, the statement:

unsigned int value = entry.in[0];

is compiled into

ld.global.u32 %r5, [%rd3+8]; // id:64

That is, a load from global memory (ld.global) instead of the expected load from shared memory (ld.shared).

Can anyone think of a workaround for this that still allows me to use shared memory? The actual application (quicksort) requires a stack and sees a significant speedup from using shared memory. A great long-term solution, which has probably been suggested before, is to allow a user-supplied “memory space tag” on all pointers, so it would be possible to manually tell the compiler which memory space you were referring to. But that doesn’t help me now.

My only recourse at this point is to manually edit the PTX, which works but is far from a clean and portable solution.

We’ve just run into the same issue when using jagged arrays on the GPU; luckily, however, our data is in global memory.

Would typecasting the pointer to a shared pointer work, similar to the example given below?

((shared int)*)someunknownpointer

Not sure if that extra bracket would be needed, but it is supposed to force an operator precedence such that the compiler understands that it is a pointer to a shared int, and not a pointer that is itself in shared memory but points to an int in global memory.

Christian

I tried your suggestion as in the following code:

unsigned int* temp = ((__shared__ unsigned int)*) entry.in;
unsigned int value = *temp;

and got the following error message:

testStack.cu(49): error: attributes may not appear here
testStack.cu(49): error: expected an expression
testStack.cu(49): error: expected a ";"

I agree that that would be the right way to fix the problem, but I don’t think the compiler understands the concept of explicitly specified memory spaces. I’ll add it to the wish list if it isn’t already there…

Try this with 2.1; the behavior has been improved (a number of cases I had that previously failed now work in 2.1).

Such a feature is just not available. I’ve requested it before, but it fell on deaf ears. Btw, I think the syntax would be more like (shared int *), meaning a pointer to shared that is itself local; (shared int * shared) would be a shared pointer to shared. Just like with const.
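For reference, this is how the two positions work with const in plain C++ (this is only the analogy; no such placement of a memory-space qualifier is accepted by the compiler today):

void constExamples()
{
    int x = 0;
    const int* p = &x;        // pointer to const int: *p cannot be written through, p can be reassigned
    int* const q = &x;        // const pointer to int: q cannot be reassigned, *q can be written
    const int* const r = &x;  // const pointer to const int
}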

Btw, Gregory, your code is fundamentally flawed. You have to remember that all your threads, from all your blocks, are accessing the same stack. If this is really the behavior you want, you have to use global memory atomics when manipulating the stack pointer. Of course, this will be very slow.
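For the push side it would look something like this (just a sketch; stackTop is a made-up counter that is not in your code, and it needs a GPU with global atomics, i.e. compute capability 1.1 or higher):

// hypothetical global stack top; assumed to be zeroed from the host
// (e.g. via cudaMemcpyToSymbol) before the kernel launches
__device__ unsigned int stackTop;

__device__ void pushEntry( Stack* stack, Stack entry )
{
    // atomically reserve a slot; threads from every block contend for this counter
    unsigned int slot = atomicAdd( &stackTop, 1 );
    stack[ slot ] = entry;
}

Popping safely is even harder, since a thread can grab a slot index before the corresponding entry has actually been written, so you need extra synchronization on top of the atomics.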

Also your memcpy is similarly ignoring parallelism. It does 100x the work. Each thread should only be copying a small part of the array, and doing it in a coalesced manner.
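Something like this, for instance (a rough sketch of what I mean, untested):

__device__ void myMemcpyParallel( unsigned int* out, const unsigned int* in,
    unsigned int size )
{
    // each thread copies every blockDim.x-th element, so neighbouring threads
    // touch neighbouring addresses and the transfers can coalesce
    for( unsigned int i = threadIdx.x; i < size; i += blockDim.x )
    {
        out[i] = in[i];
    }
}

In the kernel you would then call __syncthreads() after the copy, before any thread reads the shared data.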

Hopefully someone will listen and implement this in a future release of CUDA.

Alex, as for the example posted, I realize that it is not high-performance at all; it was simply meant to reproduce the problem I was referring to. The complete app is too large to post here. It uses a separate stack for each thread; updates are made to each stack by a single thread in each block and then broadcast to all other threads.
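In outline the pattern is roughly this (a simplified sketch of the idea only, not code from the actual application; Entry is just a stand-in for whatever a stack element holds):

struct Entry
{
    unsigned int offset;
    unsigned int size;
};

__global__ void perBlockStack( unsigned int* out, unsigned int size )
{
    __shared__ Entry blockStack[ 32 ];      // a small per-block stack in shared memory
    __shared__ unsigned int blockStackTop;

    // a single thread in each block manipulates the stack...
    if( threadIdx.x == 0 )
    {
        blockStackTop = 0;
        blockStack[ blockStackTop ].offset = 0;
        blockStack[ blockStackTop ].size = size;
        ++blockStackTop;
    }

    // ...and the update is then visible ("broadcast") to every thread in the block
    __syncthreads();

    Entry top = blockStack[ blockStackTop - 1 ];
    out[ blockIdx.x * blockDim.x + threadIdx.x ] = top.size;
}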

I did find a workaround for this problem that isn’t too hard to implement. The basic idea is for each thread to keep a base pointer to all data structures and then only save offsets from the base pointer on the stack. For the example I gave before, it would look like this:

#include <iostream>
#include <cstdio>

class Stack
{
    public:
        unsigned int offset;   // changed: store an offset into shared memory instead of a pointer
        unsigned int size;
};

__device__ void myMemcpy( unsigned int* out, const unsigned int* in,
    unsigned int size )
{
    for( unsigned int i = 0; i < size; ++i )
    {
        out[i] = in[i];
    }
}

__global__ void foo( Stack* stack, unsigned int* input, unsigned int size )
{
    extern __shared__ unsigned int shared[];

    unsigned int stackPointer = 1;

    myMemcpy( shared, input, size );

    stack[0].offset = 0;   // changed
    stack[0].size = size;

    Stack entry;

    unsigned int result = 0xFFFFFFFF;

    do
    {
        --stackPointer;
        entry = stack[ stackPointer ];

        unsigned int value = shared[ entry.offset ];   // changed: index off the shared base pointer

        if( value < 132 && entry.size > 1 )
        {
            stack[ stackPointer ].offset = entry.offset + entry.size / 2;   // changed
            stack[ stackPointer ].size = entry.size / 2;
            ++stackPointer;
        }

        if( value > 132 && entry.size > 1 )
        {
            stack[ stackPointer ].offset = entry.offset - entry.size / 2;   // changed
            stack[ stackPointer ].size = entry.size / 2;
            ++stackPointer;
        }

        if( value == 132 )
        {
            result = entry.offset;   // changed
        }
    }
    while( stackPointer > 0 );

    input[0] = result;
}

unsigned int fooHost()
{
    unsigned int input[1024];

    for( unsigned int i = 0; i < 1024; ++i )
    {
        input[i] = (unsigned int) i;
    }

    unsigned int* deviceIn;
    Stack* deviceStack;
    cudaMalloc( (void**) &deviceIn, sizeof( unsigned int ) * 1024 );
    cudaMalloc( (void**) &deviceStack, sizeof( Stack ) * 10 );

    cudaMemcpy( deviceIn, input,
        sizeof( unsigned int ) * 1024, cudaMemcpyHostToDevice );

    foo<<< 1, 1, 1024 * sizeof( unsigned int ) >>>( deviceStack, deviceIn, 1024 );
    cudaMemcpy( input, deviceIn,
        sizeof( unsigned int ), cudaMemcpyDeviceToHost );
    cudaFree( deviceIn );
    cudaFree( deviceStack );

    return input[0];
}

int main()
{
    unsigned int position = fooHost();
    if( position == 0xFFFFFFFF )
    {
        std::cout << "Not found.\n";
    }
    else
    {
        std::cout << "Found item at location " << position << ".\n";
    }
}

The changed lines are marked with comments. EDIT: I originally tried marking them in bold, but it seems that bold doesn’t work in code segments…

It was a bit cumbersome to rewrite all of the previous code, but it seems to work like this and does not generate any warnings. I guess the main thing to take away from this is to never store a pointer in global memory and read it back, at least until NVIDIA implements explicit memory spaces for pointers…