Problems caused by doing very intensive calculations generating 479001600 permutations and calcula

Stu2000 · March 5, 2011, 8:05pm

Hey all,
I have a program that works. Unfortunately, now that I’m scaling it up and have reached the 12th node in the travelling salesperson problem, I have run into an unusual problem.

My computer stops displaying for a second and then tells me that the display driver has stopped working, and the CUDA program never returns a result, as you can see in the bottom right of this printscreen:
External Media

Chances are I have programmed this with an obscure bug and I will go hunting with some Raid, but I just wanted to know: if you try to run 65535 blocks, each with 512 threads, and have each thread loop x times (like I am doing), should the system just say ‘no way, I’m not doing this, it’s too much’ and give up because of some clever memory checks or something? At the moment I can get the app to work with x being 2, but when set to 3 it stops working; it should be able to go up to 15. I will just keep checking for out-of-bounds (OOB) possibilities in the meantime.
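As a rough check on the sizes involved, the `#define` values from the kernel listing can be multiplied out on the host. This is only a sketch: it assumes the kernel is launched with `BLOCKS` blocks of `THREADSPERBLOCK` threads (the launch line is not shown in the post), and the names `gridCoverage` and `cacheBytes` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Values copied from the #define lines in the kernel listing.
constexpr std::uint64_t kElements = 12;         // NUMBERELEMENTS
constexpr std::uint64_t kBlocks = 62370;        // BLOCKS
constexpr std::uint64_t kThreadsPerBlock = 512; // THREADSPERBLOCK

// n! in 64-bit arithmetic.
constexpr std::uint64_t factorial(std::uint64_t n) {
    return n <= 1 ? 1 : n * factorial(n - 1);
}

// Hypothetical helper: how many permutation indices the whole grid visits
// when every thread handles `permutationsPerThread` consecutive indices.
constexpr std::uint64_t gridCoverage(std::uint64_t permutationsPerThread) {
    return kBlocks * kThreadsPerBlock * permutationsPerThread;
}

// Hypothetical helper: bytes of shared memory taken by the `cache` array,
// which holds one unsigned short per thread per permutation.
constexpr std::size_t cacheBytes(std::size_t permutationsPerThread) {
    return kThreadsPerBlock * permutationsPerThread * sizeof(unsigned short);
}

static_assert(factorial(kElements) == 479001600ULL, "12! matches PERMUTATIONS");
static_assert(gridCoverage(15) == factorial(kElements),
              "15 permutations per thread cover exactly 12!");
```

Under that assumption, a setting of 15 covers the full 12! search space, while a setting of 2 visits only `gridCoverage(2)` = 63866880 indices. `cacheBytes(15)` comes to 15360 bytes with a 2-byte `unsigned short`, which is worth comparing against the per-block shared memory limit of the card in use.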

For those of you truly dedicated to problem solving, this is my kernel. Yes, I already know it’s not optimized; I’m trying to keep everything simple and working before I fully optimize it with constant memory and minification techniques.

#define NUMBERELEMENTS 12
#define BLOCKS 62370
#define THREADSPERBLOCK 512
//#define PERMUTATIONSPERTHREAD 15 //This is the number that i need to get working
#define PERMUTATIONSPERTHREAD 2 //This is the number I can get working so far.
#define PERMUTATIONS 479001600

__global__ void IntegerToPermutation(short int* lowestBlocks, int* distanceMap)
{
__shared__ unsigned short int cache[THREADSPERBLOCK * PERMUTATIONSPERTHREAD];

int identity[NUMBERELEMENTS];
int permutation[NUMBERELEMENTS];
int fn[NUMBERELEMENTS];


unsigned long int startingID = (threadIdx.x + blockIdx.x * blockDim.x) * PERMUTATIONSPERTHREAD;
int n = NUMBERELEMENTS;


unsigned long int id;
int j, k;

for(unsigned int s=0; s< PERMUTATIONSPERTHREAD; s++)
{
	id = startingID + s;
	// DecimalToFactoradic
	for (int i = 0; i < NUMBERELEMENTS; i++)
	{
		fn[NUMBERELEMENTS-1-i] = id % (i+1);
		id = id / (i+1);
	}

	// Create the identity
	for (int i = 0; i < n; i++) 
		identity[i] = i+1;

	// Calculate this thread's permutation
	

	for (int i = 0; i < n; i++)
	{
		j = -1;
		k = -1;
		
		do
		{
			k++;

			while(identity[k] == -1)
			{
				k++;
			}

			j++;
		} while(j < fn[i]);

		
			
		permutation[i] = identity[k];
		identity[k] = -1;
	}
	


	// Calculate the cost of this permutation
	int cost = 0;
	for (unsigned int i=0; i<8; i++)
	{  
		int index2 = (i+1)%8;
		cost += distanceMap[permutation[i]*12+permutation[index2]];
	}
	cache[threadIdx.x * PERMUTATIONSPERTHREAD + s] = cost; // We don't need to store the permutation number as this can be derived from the thread id.

	__syncthreads();

	//perform this last bit the naive and slow way for ease of understanding
	
	if (threadIdx.x == 0)
	{
		int lowestSum = 36000;

		for (unsigned int s=0; s< THREADSPERBLOCK * PERMUTATIONSPERTHREAD; s++)
		{
			if (cache[s] < lowestSum)
				lowestSum = cache[s];
		}
		lowestBlocks[blockIdx.x] = lowestSum;
	}

	// CANNOT find the lowest value in the lowestBlocks list here because blocks cannot be synced.
	// Must run another kernel in order to do that.

	__syncthreads();
}

}
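For hunting out-of-bounds accesses, it can help to run the index-to-permutation step on the host and compare it with what the kernel produces. The following is a sketch of the same factoradic steps in plain C++; `indexToPermutation` is a hypothetical helper name and is not part of the original program.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kElements = 12; // same value as NUMBERELEMENTS

// Hypothetical host-side helper: maps an index in [0, 12!) to a permutation
// of the values 1..12, following the same steps as the kernel.
std::array<int, kElements> indexToPermutation(std::uint64_t id) {
    assert(id < 479001600ULL); // 12!

    // Decimal to factoradic: digit i is taken modulo (i + 1).
    std::array<int, kElements> fn{};
    for (int i = 0; i < kElements; ++i) {
        fn[kElements - 1 - i] = static_cast<int>(id % (i + 1));
        id /= (i + 1);
    }

    // Identity permutation 1..12; used entries are overwritten with -1.
    std::array<int, kElements> identity{};
    for (int i = 0; i < kElements; ++i) identity[i] = i + 1;

    // Position i receives the (fn[i] + 1)-th element that is still unused.
    std::array<int, kElements> permutation{};
    for (int i = 0; i < kElements; ++i) {
        int k = -1;
        for (int j = 0; j <= fn[i]; ++j) {
            ++k;
            while (identity[k] == -1) ++k;
        }
        permutation[i] = identity[k];
        identity[k] = -1;
    }
    return permutation;
}
```

Index 0 gives 1, 2, ..., 12 and index 12! - 1 gives 12, 11, ..., 1. Because the values run from 1 to 12 rather than 0 to 11, an expression such as `permutation[i]*12+permutation[index2]` can reach 156; whether that is in bounds depends on how `distanceMap` was allocated, which the post does not show.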

Thanks in advance!
Stu

tera · March 5, 2011, 8:16pm

You are probably hitting the watchdog timer that aborts kernels running on a GPU with attached display for too long.

Stu2000 · March 5, 2011, 8:32pm

Aha! Thanks for that, I never knew about the watchdog timer; I will look into it. Maybe I can get around it by calling the kernel twice under different circumstances to divide the problem up.

If I disconnect the display from that gfx card, will it work then? I have 2 gfx cards, one NVIDIA and one ATI, so I could just use the ATI one for the display.

Stu2000 · March 5, 2011, 8:46pm

Disconnected the display from the NVIDIA card, as this screenshot shows.
External Media

That allowed me to go from 2 to 3, which is an improvement, but I need to get to 15.
It still says the display driver stopped working after about 1-2 seconds.

Stu

tera · March 5, 2011, 8:48pm

Yes, those are the two best strategies: If you can afford having a dedicated GPU for CUDA, that is the way to go.
Otherwise it makes sense to split kernel invocations into multiple smaller ones, usually by having each work on only part of the grid. This gives the GPU a chance to update the screen between kernels and also resets the watchdog timer.
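The splitting approach can be sketched on the host side roughly as follows. This is plain C++ so it can be tried without a GPU; `forEachGridChunk` and its callback are hypothetical names, and the chunk size is something to tune for the card in question.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical helper: walks a grid of `totalBlocks` blocks in pieces of at
// most `blocksPerLaunch`, calling `launch(firstBlock, blockCount)` once per
// piece. Returns the number of launches made.
inline int forEachGridChunk(
    std::uint64_t totalBlocks, std::uint64_t blocksPerLaunch,
    const std::function<void(std::uint64_t, std::uint64_t)>& launch) {
    assert(blocksPerLaunch > 0);
    int launches = 0;
    for (std::uint64_t first = 0; first < totalBlocks; first += blocksPerLaunch) {
        // The last piece may be smaller than the others.
        const std::uint64_t count = std::min(blocksPerLaunch, totalBlocks - first);
        launch(first, count);
        ++launches;
    }
    return launches;
}
```

In a CUDA program the callback would launch the kernel with `count` blocks, pass `first` as an extra kernel argument (an assumption: the kernel in this thread has no such parameter yet) so that `blockIdx.x + first` replaces `blockIdx.x` when computing the starting index and the output slot, and then wait for the launch to finish before starting the next one. With 62370 blocks and 4096 blocks per launch, for example, this makes 16 launches, the last one with 930 blocks.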

Stu2000 · March 5, 2011, 9:41pm

I’m not sure it’s the watchdog timer, as I have tried with no display plugged into the gfx card. I will continue down the splitting-it-up route for now.

Stu
