Hey all,
I have a program that works. Unfortunately, I’m trying to scale it up and unfortunately now I have reached the 12th node in the travelling salesperson problem, I run into a unique problem.
My computer will stop displaying for a second and show me that the display driver has stopped working etc and the cuda program will never return a result as you can see in the bottom right of this printscreen:
External Media
Chances are I have programed this with an obscure bug and I will go hunting with some RAID, but just wanted to know, if you try to run 65535 blocks each with 512 threads, and try to run those threads x number of times (like I am doing) should the system just say ‘no way, I’m not doing this its too much’ and give up due to it having clever memory checks or something? At the moment i can get the app to work with x being 2, but when set to 3 it stops working, should be able to go up to 15. I will just keep checking for OOB possibilities in the meantime.
For those of you truly dedicated to problemsolving this is my kernel. Yes i already know its not optimized, I’m trying to keep everything simple and working before I fully optimize it with constant memory and minification techniques.
#define NUMBERELEMENTS 12
#define BLOCKS 62370
#define THREADSPERBLOCK 512
//#define PERMUTATIONSPERTHREAD 15 //This is the number that i need to get working
#define PERMUTATIONSPERTHREAD 2 //This is the number I can get working so far.
#define PERMUTATIONS 479001600
global void IntegerToPermutation(short int* lowestBlocks, int* distanceMap)
{
shared unsigned short int cache[THREADSPERBLOCK * PERMUTATIONSPERTHREAD];
int identity[NUMBERELEMENTS];
int permutation[NUMBERELEMENTS];
int fn[NUMBERELEMENTS];
unsigned long int startingID = (threadIdx.x + blockIdx.x * blockDim.x) * PERMUTATIONSPERTHREAD;
int n = NUMBERELEMENTS;
unsigned long int id;
int j, k;
for(unsigned int s=0; s< PERMUTATIONSPERTHREAD; s++)
{
id = startingID + s;
// DecimalToFactoradic
for (int i = 0; i < NUMBERELEMENTS; i++)
{
fn[NUMBERELEMENTS-1-i] = id % (i+1);
id = id / (i+1);
}
// Create the identity
for (int i = 0; i < n; i++)
identity[i] = i+1;
// Calculate this threads permutation
for (int i = 0; i < n; i++)
{
j = -1;
k = -1;
do
{
k++;
while(identity[k] == -1)
{
k++;
}
j++;
} while(j < fn[i]);
permutation[i] = identity[k];
identity[k] = -1;
}
// Calculate the cost of this permutation
int cost = 0;
for (unsigned int i=0; i<8; i++)
{
int index2 = (i+1)%8;
cost += distanceMap[permutation[i]*12+permutation[index2]];
}
cache[threadIdx.x * PERMUTATIONSPERTHREAD + s] = cost; // We dont need to store the permutation number as this can be derived from the thread id.
__syncthreads();
//perform this last bit the naive and slow way for ease of understanding
if (threadIdx.x == 0)
{
int lowestSum = 36000;
for (unsigned int s=0; s< THREADSPERBLOCK * PERMUTATIONSPERTHREAD; s++)
{
if (cache[s] < lowestSum)
lowestSum = cache[s];
}
lowestBlocks[blockIdx.x] = lowestSum;
}
// CANNOT find lowest value in lowestblocks here list because cannot sync blocks
// Must run another kernal in order to do that.
__syncthreads();
}
}
Thanks in advance!
Stu