Works with -deviceemu but problem on the GPU

Dear all,

I am a Belgian MSc student in computer science. My MSc thesis is about machine learning algorithms and their implementation on GPGPUs using CUDA.

I'm working on a __global__ function for the implementation of "decision trees" on the GPU.

I have a problem with the following code:

__device__ float getEntropyOnDevice(unsigned int* matrix, unsigned int numberOfY)
{
    // matrix holds the class counts; the total N is stored at index numberOfY
    float entropy = 0, Na, N = *(matrix + numberOfY);

    for (int i = 0; i < numberOfY - 1; i++)
    {
        Na = *(matrix + i);
        if (Na != 0) entropy -= (Na / N * log2f(Na / N));
    }

    return entropy;
}

__global__ void scoreOnDevice(DeviceRecord* attToCompute, uint numberOfY, uint* total_matrix_left, uint* total_matrix_right, float entropy_estimate, uint nbScoreToCompute, Test* d_scores)
{
    uint id = blockIdx.x * blockDim.x + threadIdx.x;

    float Na, Nb, N, entropy_split;
    entropy_split = Na = Nb = N = 1;

    if (id < nbScoreToCompute)
    {
        // each matrix row has numberOfY + 1 entries; the total is stored last
        Na = d_scores[id].total_left  = *(total_matrix_left  + (id * (numberOfY + 1)) + numberOfY);
        Nb = d_scores[id].total_right = *(total_matrix_right + (id * (numberOfY + 1)) + numberOfY);
        N  = Na + Nb;

        entropy_split = -(Na / N * log2f(Na / N)) - (Nb / N * log2f(Nb / N));

        float entropy_left  = getEntropyOnDevice(total_matrix_left  + id * (numberOfY + 1), numberOfY);
        float entropy_right = getEntropyOnDevice(total_matrix_right + id * (numberOfY + 1), numberOfY);

        if (entropy_estimate == 0 || entropy_split == 0)
            d_scores[id].score = 0;
        else
        {
            float information_gain = entropy_estimate - ((Na / N) * entropy_left) - ((Nb / N) * entropy_right);

            // printf("\n %f ", 2 * information_gain / (entropy_split + entropy_estimate)); // -deviceemu
            // PROBLEM HERE: if I keep "information_gain", the line is not executed...
            // d_scores[id].score = 2 / (entropy_split + entropy_estimate); // ----> WORKS but doesn't do what I want
            d_scores[id].score = 2 * information_gain / (entropy_split + entropy_estimate); // WORKS ONLY with -deviceemu
        }

        d_scores[id].test = attToCompute[id].test;
    }
}

When I compile the code with -deviceemu, the result is exactly what I expect, but with a standard compilation, if I keep the line "d_scores[id].score = 2 * information_gain / (entropy_split + entropy_estimate);", the results go totally wrong (d_scores[id].test and d_scores[id].score contain random values)!

Once this bug is fixed, I will be able to post my results about "decision trees" on the GPU… I'm sure some of you will find my problem obvious, but please, I really need your help. :argh:

Thank you in advance,

D1mmu.

Is it completely random or semi-random?

Hmm… is d_scores a GPU variable that's properly accessible? I'd try to work with some very simple examples first, such as a memcpy, or maybe taking the additive inverses of an array, to make sure your memory-management code is working correctly.

You said the result is random; are the random values new every time you run the program?

Another possibility is that you're writing past array bounds, or writing to the same element from multiple threads without proper use of __syncthreads() or atomic instructions.

What are your thread and block sizes? Does it work if you try 1 block / 1 thread?

At the moment I test my code with 9 elements; I call the kernel with 1 block and 512 threads. Only 9 threads do any work, because of the condition (id < nbScoreToCompute) with nbScoreToCompute = 9.

My host code:

uint threads = 512;
uint blocks  = (nbScoreToCompute / 512) + 1;

Test *d_scores, *h_scores;
deviceMalloc((void**)&d_scores, sizeof(Test) * nbScoreToCompute);
h_scores = (Test*) malloc(sizeof(Test) * nbScoreToCompute);

// initialize every score to zero on the host, then copy to the device
Test init;
init.score         = 0;
init.test          = 0;
init.total_left    = 0;
init.total_right   = 0;
init.y_value       = 0;
init.num_attribute = 0;

for (int i = 0; i < nbScoreToCompute; i++)
    h_scores[i] = init;

hostToDevice(d_scores, h_scores, sizeof(Test) * nbScoreToCompute);

scoreOnDevice<<< blocks, threads >>>(d_attToCompute, numberOfY, d_total_matrix_left, d_total_matrix_right, entropy_estimate, nbScoreToCompute, d_scores);

// int deviceToHost(void *to, void *from, int size){ return CUDA_SAFE_CALL(cudaMemcpy(to, from, size, cudaMemcpyDeviceToHost)); }
deviceToHost(h_scores, d_scores, sizeof(Test) * nbScoreToCompute);

// print the results and keep the best (highest) score
Test score;
score.score = 0;

for (int i = 0; i < nbScoreToCompute; i++)
{
    printf("\n Score : %f", h_scores[i].score);
    printf("\n Test : %f", h_scores[i].test);
    printf("\n Tot Left : %d", h_scores[i].total_left);

    if (h_scores[i].score >= score.score)
    {
        score = h_scores[i];
    }
}

If I use the line d_scores[id].score = 2 / (entropy_split + entropy_estimate); <— here I don't use the variable information_gain

I get a correct result:

Score : 2.142747

Test : 901.613525

Tot Left : 1

Score : 1.685895

Test : 950.668457

Tot Left : 2

Score : 1.486241

Test : 977.135498

Tot Left : 3

Score : 1.393402

Test : 989.608032

Tot Left : 4

Score : 1.365760

Test : 1034.102051

Tot Left : 5

Score : 1.393402

Test : 1090.000000

Tot Left : 6

Score : 1.486241

Test : 1150.040039

Tot Left : 7

Score : 1.685895

Test : 1190.895020

Tot Left : 8

Score : 2.142747

Test : 1217.234985

Tot Left : 9

If I use the line d_scores[id].score = 2 * information_gain / (entropy_split + entropy_estimate);

the values previously initialized don't change… (I was wrong when I said the values were random; I had just forgotten to initialize them.)

Score : 0.000000

Test : 0.000000

Tot Left : 0

(the same three lines are repeated for all 9 records)

I did the following tests:

If I call the function with <<< 1, 9 >>> instead of <<< 1, 512 >>>, the code works fine (for 9 elements).
If I call the function with <<< blocks, maxthread >>> and maxthread <= 320 (256 + 64), the code works fine.
If I call the function with <<< blocks, maxthread >>> and maxthread > 320 (256 + 64), the bug appears.

The code is running on a GeForce 8600M GS; do you think the maximum number of threads per block could be different on that kind of graphics card?

I’m going to do the same test on my GeForce 8800 GTS.

do you check for errors after your kernel? (CUT_CHECK_ERROR)

Most likely your register usage does not allow more than 320 threads.

The maximum number of threads per block is always 512. The maximum number of blocks per SM varies with the application and GPU model, but blocks are independent, so that shouldn't matter. It's unlikely to be a hardware problem or variation, though.

I'm not really sure what your code is computing. I'm interested, though – if you want to give me a brief explanation of this entropy calculation, maybe I can be of more help.

You are storing some "N" in the last position of each matrix row? And then you compute a pointer offset with numberOfY + 1 – what's the + 1 for?

Please comment your code – my code style is rather different – it would help in reading it.

I thought there was some sort of register overflow mechanism.

Nope, only at compile time (you can steer it with -maxrregcount).
When you try to launch a kernel with too many threads, CUT_CHECK_ERROR will return a "too many resources requested for launch" error or something like it. Anyway, always perform a CUT_CHECK_ERROR :) The fact that the result is all zeros indicates that the kernel did not run at all. Which is quite likely, since 512 threads per block is only possible when the kernel uses 16 registers or less on pre-GT200 hardware (8192 registers per multiprocessor / 512 threads).

Thank you very much for your help! I will use CUT_CHECK_ERROR in the future.

The code works with 320 threads, but it's very slow, so I think I will keep the CPU version of this function. The GPU version of my "decision tree" is therefore pretty simple: I use a GPU radix sort instead of a CPU quicksort when I have more than 7000 elements to sort. With 1 million elements, the decision tree can be up to 5 times faster!

My program receives a database (a learnSet) with attributes and one Y per record. A Y can be, for example, a pathology such as prostate cancer. The program scans the database and tries to build the most efficient decision tree possible to determine Y from the attributes.

Once the decision tree is built, you can load a testSet (a database that does not contain Y) and the program will assign to each record "the most probable Y" by running the records through the decision tree.

To summarize the implementation: you load the database and you build a tree. To build the most efficient tree, you have to find the best test per node. If you find the best test for each node, your tree is minimal and optimal. A test is the best value at which to split the database in two.

example:

DB:

Att1 Att2 Att3
0    0    1
0    0    2
0    0    2

best test on Att3 = (1 + 2) / 2 = 1.5
split < 1.5  —> { 0 0 1 }
split >= 1.5 —> { 0 0 2 ; 0 0 2 }

To find the best test, you search the database for the best "score" for each attribute and record. In the code I've posted, I'm trying to find the best score for each record of one attribute.

(Next, the program compares the best score of each attribute and keeps the best.)

My English is not good enough to be more explicit, but if you want more information, you can read this:

http://www.montefiore.ulg.ac.be/~lwh/AIA/d…e-induction.pdf

And this for the score computation:

http://www.montefiore.ulg.ac.be/~lwh/AIA/aia-26-9-05.pdf