Problems caused by doing very intesive calculations generating 479001600 of permutations and calcula

Hey all,
I have a program that works. Unfortunately, I’m trying to scale it up and unfortunately now I have reached the 12th node in the travelling salesperson problem, I run into a unique problem.

My computer will stop displaying for a second and show me that the display driver has stopped working etc and the cuda program will never return a result as you can see in the bottom right of this printscreen:

Chances are I have programed this with an obscure bug and I will go hunting with some RAID, but just wanted to know, if you try to run 65535 blocks each with 512 threads, and try to run those threads x number of times (like I am doing) should the system just say ‘no way, I’m not doing this its too much’ and give up due to it having clever memory checks or something? At the moment i can get the app to work with x being 2, but when set to 3 it stops working, should be able to go up to 15. I will just keep checking for OOB possibilities in the meantime.

For those of you truly dedicated to problemsolving this is my kernel. Yes i already know its not optimized, I’m trying to keep everything simple and working before I fully optimize it with constant memory and minification techniques.

#define NUMBERELEMENTS 12
#define BLOCKS 62370
#define THREADSPERBLOCK 512
//#define PERMUTATIONSPERTHREAD 15 //This is the number that i need to get working
#define PERMUTATIONSPERTHREAD 2 //This is the number I can get working so far.
#define PERMUTATIONS 479001600

global void IntegerToPermutation(short int* lowestBlocks, int* distanceMap)
{
shared unsigned short int cache[THREADSPERBLOCK * PERMUTATIONSPERTHREAD];

int identity[NUMBERELEMENTS];
int permutation[NUMBERELEMENTS];
int fn[NUMBERELEMENTS];


unsigned long int startingID = (threadIdx.x + blockIdx.x * blockDim.x) * PERMUTATIONSPERTHREAD;
int n = NUMBERELEMENTS;


unsigned long int id;
int j, k;

for(unsigned int s=0; s< PERMUTATIONSPERTHREAD; s++)
{
	id = startingID + s;
	// DecimalToFactoradic
	for (int i = 0; i < NUMBERELEMENTS; i++)
	{
		fn[NUMBERELEMENTS-1-i] = id % (i+1);
		id = id / (i+1);
	}

	// Create the identity
	for (int i = 0; i < n; i++) 
		identity[i] = i+1;

	// Calculate this threads permutation
	

	for (int i = 0; i < n; i++)
	{
		j = -1;
		k = -1;
		
		do
		{
			k++;

			while(identity[k] == -1)
			{
				k++;
			}

			j++;
		} while(j < fn[i]);

		
			
		permutation[i] = identity[k];
		identity[k] = -1;
	}
	


	// Calculate the cost of this permutation
	int cost = 0;
	for (unsigned int i=0; i<8; i++)
	{  
		int index2 = (i+1)%8;
		cost += distanceMap[permutation[i]*12+permutation[index2]];
	}
	cache[threadIdx.x * PERMUTATIONSPERTHREAD + s] = cost; // We dont need to store the permutation number as this can be derived from the thread id.

	__syncthreads();

	//perform this last bit the naive and slow way for ease of understanding
	
	if (threadIdx.x == 0)
	{
		int lowestSum = 36000;

		for (unsigned int s=0; s< THREADSPERBLOCK * PERMUTATIONSPERTHREAD; s++)
		{
			if (cache[s] < lowestSum)
				lowestSum = cache[s];
		}
		lowestBlocks[blockIdx.x] = lowestSum;
	}

	// CANNOT find lowest value in lowestblocks here list because cannot sync blocks
	// Must run another kernal in order to do that.

	__syncthreads();
}

}

Thanks in advance!
Stu

You are probably hitting the watchdog timer that aborts kernels running on a GPU with attached display for too long.

aha! thanks for that, never knew about the watchdog timer will look into that. Maybe i can get around it by calling the kernal twice under different circumstances to divide the problem up.

If i disconnect the display from that gfx card will it work again then? i have 2 gfx cards one nvidia one ati so i could just use the ati one.

Disconnected display from nvidia card and proved it with this screenshot.

That allowed me to go from 2 to 3 which is still an improvement but need to get to 15.
Still says the display driver stopped working after about 1-2 seconds.

Stu

Yes, those are the two best strategies: If you can afford having a dedicated GPU for CUDA, that is the way to go.
Otherwise it makes sense to split kernel invocations into multiple smaller ones, usually by having each work on only part of the grid. This gives the GPU a chance to update the screen between kernels and also resets the watchdog timer.

I’m not sure its the watchdog timer as I have tried not having any display plugged into the gfx card. I will continue down the splitting it up route for now.

Stu