Array referencing causing freeze works in emulation mode

Hello,

I’m trying to understand how CUDA handles arrays passed in as parameters to device functions. I have a kernel like this:

__global__ void myKernel(int *tt){

TableEntry_t myTable[10][10]

TableEntry_t *otherTable[100]

createEntry(myTable,otherTable);

*(tt) = otherTable[0]->someNum;

}

__device__ void createEntry(TableEntry_t myTable[][10], TableEntry_t* otherTable[100])

{

myTable[0][0].someNum = 7;

...

otherTable[0] = &myTable[0][0];

}

When I run this in emulation mode and use CudaMemcpy, tt has 7. When I run it without emulation, I get unspecified launch failures and garbage values.

__global__ void myKernel(int *tt){

TableEntry_t myTable[10][10]

TableEntry_t *otherTable[100]

myTable[0][0].someNum = 7;

...

otherTable[0] = &myTable[0][0];

*(tt) = otherTable[0]->someNum;

}

That code will return 7 without emulation mode. So what happens to the array when I pass it in in CUDA? I looked through the programming guide, but didn’t see anything specifically on this. Any help?

Thanks,

skiz

Well, an array is really a pointer. And your declaration of myTable on the stack of myKernel makes myTable a pointer to host memory. When you pass it to the kernel, the kernel then tries to dereference host memory => bad things happen.

To handle those kinds of tables as parameters to kernels, you can cudaMalloc, cudaMemcpy and handle index calculation yourself inside the kernel. You want to do this anyways so that you can control the memory access pattern into the table so that it is coalesced. Other options involve using constant memory or a texture fetch, which may be better options depending on your usage pattern, though they cannot be passed as parameters to a kernel.

Hmm … now I’m really confused. I have a main function in a different .cu file calls myKernel:

int main()

{

int *tt;

int *CUDAtt;

cudaMalloc(CUDAtt);

myKernel<<<grid,block>>>(CUDAtt);

cudaMemcpy(tt,CUDAtt);

}

Are you saying that the tables I declare in the myKernel function are still using host memory?

The problem with cudaMalloc’ing a table is that I really want each thread to have its own table to perform a part of a search. I’m not saying this is a good idea, it’s just something I’m trying to do :). I guess I could cudaMalloc one huge table and index it based on thread, but I was trying to not do that.

-skiz

In the first example, when you pass the array via a parameter, it passes a pointer to the host array to every thread. Ignoring the host<->device pointer issue, this doesn’t seem to be what you want since you mention needing one table for each thread.

Your second example which declares the table as a variable of the kernel will allocate one table per thread in device memory (“local memory”). This seems to be what you want. Though I would highly advise allocating the memory yourself with cudaMalloc and doing all the array indexing by hand in the kernel to make sure you get coalesced accesses. Non-coalesced memory reads/writes are very slow.

First, thanks a lot, MisterAnderson, for your help with this. It is greatly appreciated.

Second, I know that the memory issues will become a serious bottleneck in the near future, but right now I’m trying to do a proof of concept and just want to get it working. So I’m just putting my head in the sand for that part right now, and will deal with that later.

Assuming we’re using my second example, where the myKernel function is called from main and a table is created in device memory for each thread, what happens then when I pass the tables in to the createEntry device function? That’s where I’m still having the addressing issue. Basically:

main

{

int* tt, *CUDAtt;

cudaMalloc(CUDAtt)

myKernel<<<>>>(CUDAtt);

cudaMemcpy(tt,CUDAtt)

}

__global__ myKernel(int* CUDAtt)

{

Table1;

Table2;

createEntry(Table1, Table2);

*(CUDAtt) = Table1[0]->someNum; //breaks on this line

}

__device__ createEntry(Table1, Table2)

{

//assigns some values to Table2[0][0]

Table1[0] = &Table2[0][0];

}

Table1 and Table2 are going to be in device memory, then when I pass them into createEntry, they should still be in device memory, right? So then when they get passed into createEntry, does something tricky happen to them (like a copy gets made or something)? Because when I call the code like this, it breaks on the assignment of *(CUDAtt) but when I call it like this:

main()

{ //exactly as above}

__global__ myKernel(int* CUDAtt)

{

Table1[10][10];

Table2[100];

createEntry(Table1, Table2);

Table1[0] = &Table2[0][0];

*(CUDAtt) = Table1[0]->someNum; //no problem

}

__device__ createEntry(Table1, Table2) //exactly as above

{}

it has no problem. That’s what I can’t figure out. Something about the device addressing of Table2 is invalid after the call to createEntry.

-skiz

Ack. I’m sorry, I’ve been reading your kernel wrong all this time… I was thinking that myKernel was a host function…

device function calls are inlined, so there should be no implied copies or anything. Calling the device function should reproduce just what you are doing.

I can think of a few things that might be happening, though. What is the size of TableEntry_t in bytes? Better yet, what is the lmem used reported when you compile with the -cubin opiton? And what is the grid size are you using? And what hardware are you running on?

What I’m getting at is that your table is using a large amount of memory per thread. Maybe you are exhausting the amount of device memory available… If not, I have no idea why you are seeing the behavior you are.

OK, looking over the .cubin stuff, I have the following:

code  {

	name = _Z9GPUSearchPiS_Pt

	lmem = 8072

	smem = 40

	reg = 19

	bar = 0

}

The grid/block size are both the same (1,1,1). I’m just trying to get this working for one instance. There are a ton of lines of code being generated which I have not included. I haven’t really looked at one of these files before, but now is a good time to get started! Is that lmem figure too high? Could that be causing a problem?

I’m using a QuadroFX5600.

-skiz

Well, that amount of lmem will be allocated for each thread. Since you are testing this with a 1,1,1 grid: 8k of lmem isn’t going to max out all the memory on the GPU. I have no idea what is causing the behavior you are seeing.

I don’t think this is valid C / CUDA syntax:

Table1;

Table2;

Unless Table1 and Table2 have been previously declared…

If you post the real code it might help us help you.

Mark

Well, thanks anyway. If anybody else has any input on this it would be much appreciated. Still can’t figure it out.

Sorry, that was just some pseudocode. I was trying to avoid posting the actual code, but I guess I will at this point since I’m not getting anywhere. The

__device__ EnvInterpSpaceHashEntry_t* CreateNewHashEntry(EnvInterpSpaceHashEntry_t StateTable[][25], int* BinTable, EnvInterpSpaceHashEntry_t* HashEntryIn, int* entries, EnvInterpSpaceHashEntry_t* IDTable[MAXSTATES])

{

	int i;

	// don't think that this is needed

	HashEntryIn->stateID = *(entries);

	i = GETHASHBIN(HashEntryIn);

	

	int numElemBin = BinTable[i];

	StateTable[i][numElemBin].stateID = HashEntryIn->stateID;

	StateTable[i][numElemBin].time = HashEntryIn->time;

	StateTable[i][numElemBin].rootPosX = HashEntryIn->rootPosX;

	StateTable[i][numElemBin].rootPosZ = HashEntryIn->rootPosZ;

	StateTable[i][numElemBin].rootOrientY = HashEntryIn->rootOrientY;

	StateTable[i][numElemBin].interp_weight = HashEntryIn->interp_weight;

	StateTable[i][numElemBin].graphNode = HashEntryIn->graphNode;

	StateTable[i][numElemBin].curveCorresp = HashEntryIn->curveCorresp;

	StateTable[i][numElemBin].closestPathPt = HashEntryIn->closestPathPt;

	StateTable[i][numElemBin].constrCounter = HashEntryIn->constrCounter;

	StateTable[i][numElemBin].interpCounter = HashEntryIn->interpCounter;

	BinTable[i]++;

	(*entries)++;

	**********IDTable[HashEntryIn->stateID] = &StateTable[i][numElemBin];***********

	return HashEntryIn;

}

__device__ int addHashTable(EnvInterpSpaceHashEntry_t StateTable[][25], int* BinTable, EnvInterpSpaceHashEntry_t* HashEntryIn, int* entries, EnvInterpSpaceHashEntry_t* IDTable[MAXSTATES])

{

	EnvInterpSpaceHashEntry_t* OutHashEntry; 

	//add the outcome

	if((OutHashEntry = GetHashEntry(StateTable, BinTable, HashEntryIn)) == NULL)

	{

  //have to create a new entry

  OutHashEntry = CreateNewHashEntry(StateTable, BinTable, HashEntryIn, entries,IDTable);  

	}

	return OutHashEntry->stateID;

}

__global__ void GPUSearch(int *entries, int *lastID, unsigned short int *tt)

{

	SearchStateSpace_t pSearchStateSpace;

	CreateSearchStateSpace(&pSearchStateSpace);

	int rootPosX, rootPosZ, rootOrientY;

	discretizeRootPos(WALK_START_ROOT_POS_X, rootPosX);

	discretizeRootPos(WALK_START_ROOT_POS_Z, rootPosZ);

	discretizeAngles(WALK_START_ROOT_ORIENT_Y, rootOrientY);

	gNode startNode;

	gNode goalNode;

	startNode.motNum = 0;

	startNode.posNum = 0;

	goalNode.motNum = 0;

	goalNode.posNum = 0;

	//int entries;

	*(entries) = 0;

	typedef EnvInterpSpaceHashEntry_t  EnvInterpHashTable_t [HASHTABLESIZE][BINSIZE];

	EnvInterpHashTable_t Coord2IDTable;

	int BinTable[HASHTABLESIZE];

	EnvInterpSpaceHashEntry_t *ID2CoordTable[MAXSTATES];

	initBinTable(BinTable);

	EnvInterpSpaceHashEntry_t hashEntryStart;

	SetHashEntry(hashEntryStart, nMotionEndTime+1000, 

      	rootPosX, rootPosZ, rootOrientY, 

      	0, startNode, 0, 0, 0, 0);

	EnvInterpSpaceHashEntry_t hashEntryGoal;

	SetHashEntry(hashEntryGoal, nMotionEndTime + 2000, 

      	0, 0, 0, 

      	0, goalNode, 0, 0, 0, 0);

	int stateIDStart = addHashTable(Coord2IDTable,BinTable,&hashEntryStart,entries,ID2CoordTable);

	int stateIDGoal  = addHashTable(Coord2IDTable,BinTable,&hashEntryGoal,entries, ID2CoordTable);

	*************  *(tt) = Coord2IDTable[1][0].time;  //this line works fine

	//tt = &(ID2CoordTable[0]->time); //this line causes the crash **************

[B]/*  if i do this, the code works fine...

      ID2CoordTable[0] = &Coord2IDTable[1][0];

      *(tt) = ID2CoordTable[0]->time;  */[/B]

	

	*(lastID) = stateIDGoal;

	//*(lastID) = myTemp;

	//*(lastID) = 8;

	

}

So what I can’t figure out is what happens to the array when it gets passed in to addHashTable. For some reason, the addresses are no longer valid after the call returns. Sorry if this is sloppy. I really have to run out right now, but figured I’d post this. Thanks Mark!

EDIT: I didn’t realize that I couldn’t put bold within the CODE tags, so I put asterisks on the lines I’m having problems with.

Sorry, that’s too complex for me to visually debug…

In general when you get a hang, it’s because you are overstepping either shared or global device memory bounds in your kernel. You could pretty easily put in some asserts for emulation mode to narrow it down.

Mark

The problem is it works fine in emulation mode. I get the correct result and am able to cudaMemcpy it out. So I don’t think an assert statement would catch anything because it’s working properly. Is there some way I can check device memory or shared memory while in emulation mode?

The fact that it works OK in emulation really doesn’t mean anything. I have this kind of bug all the time.

An assert or other test could work. For example, if you have an array a in global memory and it is allocated to a size of 4096 floats, and you have a statement like this:

a[i] = foo();

You could precede it with

assert(i >= 0 && i < 4096);

Similarly for array reads.

Mark