Works with -deviceemu but problem on the GPU

Dear all,

I am a Belgian MSc student in computer science. My MSc thesis is about machine learning algorithms and their implementation on GPGPUs using CUDA.

I'm working on a __global__ function for the implementation of "decision trees" on the GPU.

I have a problem with the following code:

__device__ float getEntropyOnDevice(unsigned int* matrix, unsigned int numberOfY)
{
    // matrix holds the class counts; the total N is stored at index numberOfY
    float entropy = 0, Na, N = *(matrix + numberOfY);

    for (int i = 0; i < numberOfY - 1; i++)
    {
        Na = *(matrix + i);
        if (Na != 0) entropy -= (Na / N * log2f(Na / N));
    }

    return entropy;
}

__global__ void scoreOnDevice(DeviceRecord* attToCompute, uint numberOfY, uint* total_matrix_left, uint* total_matrix_right, float entropy_estimate, uint nbScoreToCompute, Test* d_scores)
{
    uint id = blockIdx.x * blockDim.x + threadIdx.x;

    float Na, Nb, N, entropy_split;
    entropy_split = Na = Nb = N = 1;

    if (id < nbScoreToCompute)
    {
        // each matrix row has numberOfY + 1 entries; the total is stored last
        Na = d_scores[id].total_left  = *(total_matrix_left  + (id * (numberOfY + 1)) + numberOfY);
        Nb = d_scores[id].total_right = *(total_matrix_right + (id * (numberOfY + 1)) + numberOfY);
        N  = Na + Nb;

        entropy_split = -(Na / N * log2f(Na / N)) - (Nb / N * log2f(Nb / N));

        float entropy_left  = getEntropyOnDevice(total_matrix_left  + id * (numberOfY + 1), numberOfY);
        float entropy_right = getEntropyOnDevice(total_matrix_right + id * (numberOfY + 1), numberOfY);

        if (entropy_estimate == 0 || entropy_split == 0)
            d_scores[id].score = 0;
        else
        {
            float information_gain = entropy_estimate - ((Na / N) * entropy_left) - ((Nb / N) * entropy_right);

            // printf("\n %f ", 2 * information_gain / (entropy_split + entropy_estimate)); // -deviceemu
            // PROBLEM HERE: if I keep "information_gain", the line is not executed...
            // d_scores[id].score = 2 / (entropy_split + entropy_estimate); // ----> WORKS but doesn't do what I want
            d_scores[id].score = 2 * information_gain / (entropy_split + entropy_estimate); // WORKS ONLY with -deviceemu
        }

        d_scores[id].test = attToCompute[id].test;
    }
}

When I compile the code with -deviceemu, the result is exactly what I expect, but with a standard compilation, if I keep the line "d_scores[id].score = 2 * information_gain / (entropy_split + entropy_estimate);", the results go totally wrong (d_scores[id].test and d_scores[id].score contain random values)!

Once this bug is fixed, I will be able to post my results about "decision trees" on the GPU… I'm sure some of you will find my problem obvious, but please, I really need your help. :argh:

Thank you in advance,

D1mmu.

Is it completely random or semi-random?

Hmm… is d_scores a GPU variable that's properly accessible? I'd try to work with some very simple examples first, such as a memcpy, or maybe taking the additive inverses of an array, to make sure your memory-management code is working correctly.

You said the result is random; are the random values new every time you run the program?

Another possibility is that you're writing past array bounds, or writing to the same element from multiple threads without proper use of __syncthreads() or atomic instructions.

What are your thread and block sizes? Does it work if you try 1 block / 1 thread?

At the moment I test my code with 9 elements; I call the kernel with 1 block and 512 threads. Only 9 threads do any work, because of the condition (id < nbScoreToCompute) with nbScoreToCompute = 9.

My host code:

uint threads = 512;
uint blocks  = (nbScoreToCompute / 512) + 1;

Test *d_scores, *h_scores;
deviceMalloc((void**)&d_scores, sizeof(Test) * nbScoreToCompute);
h_scores = (Test*) malloc(sizeof(Test) * nbScoreToCompute);

// initialize every score to zero on the host, then copy to the device
Test init;
init.score         = 0;
init.test          = 0;
init.total_left    = 0;
init.total_right   = 0;
init.y_value       = 0;
init.num_attribute = 0;

for (int i = 0; i < nbScoreToCompute; i++)
    h_scores[i] = init;

hostToDevice(d_scores, h_scores, sizeof(Test) * nbScoreToCompute);

scoreOnDevice<<< blocks, threads >>>(d_attToCompute, numberOfY, d_total_matrix_left, d_total_matrix_right, entropy_estimate, nbScoreToCompute, d_scores);

// int deviceToHost(void *to, void *from, int size){ return CUDA_SAFE_CALL(cudaMemcpy(to, from, size, cudaMemcpyDeviceToHost)); }
deviceToHost(h_scores, d_scores, sizeof(Test) * nbScoreToCompute);

// print the results and keep the best (highest) score
Test score;
score.score = 0;

for (int i = 0; i < nbScoreToCompute; i++)
{
    printf("\n Score : %f", h_scores[i].score);
    printf("\n Test : %f", h_scores[i].test);
    printf("\n Tot Left : %d", h_scores[i].total_left);

    if (h_scores[i].score >= score.score)
    {
        score = h_scores[i];
    }
}

If I use the line d_scores[id].score = 2 / (entropy_split + entropy_estimate); <— here I don't use the variable information_gain

I get a correct result:

Score : 2.142747

Test : 901.613525

Tot Left : 1

Score : 1.685895

Test : 950.668457

Tot Left : 2

Score : 1.486241

Test : 977.135498

Tot Left : 3

Score : 1.393402

Test : 989.608032

Tot Left : 4

Score : 1.365760

Test : 1034.102051

Tot Left : 5

Score : 1.393402

Test : 1090.000000

Tot Left : 6

Score : 1.486241

Test : 1150.040039

Tot Left : 7

Score : 1.685895

Test : 1190.895020

Tot Left : 8

Score : 2.142747

Test : 1217.234985

Tot Left : 9

If I use the line d_scores[id].score = 2 * information_gain / (entropy_split + entropy_estimate);

the values previously initialized don't change… (I was wrong when I said the values were random; I had just forgotten to initialize them.)

Score : 0.000000

Test : 0.000000

Tot Left : 0

(the same three lines are repeated for all 9 records)

I did the following tests:

If I call the function with <<< 1, 9 >>> instead of <<< 1, 512 >>>, the code works fine (for 9 elements).
If I call the function with <<< blocks, maxthread >>> and maxthread <= 320 (256 + 64), the code works fine.
If I call the function with <<< blocks, maxthread >>> and maxthread > 320 (256 + 64), the bug appears.

The code is running on a GeForce 8600M GS; do you think the maximum number of threads per block could be different on that kind of graphics card?

I’m going to do the same test on my GeForce 8800 GTS.

do you check for errors after your kernel? (CUT_CHECK_ERROR)

Most likely your register usage does not allow more than 320 threads.

The maximum number of threads per block is always 512. The maximum number of blocks per SM varies with the application and GPU model, but blocks are independent, so that shouldn't matter. It's unlikely to be a hardware problem or variation, though.

I'm not really sure what your code is computing. I'm interested, though – if you want to give me a brief explanation of this entropy calculation, maybe I can be of more help.

You are storing some "N" in the last position of each matrix row? And then you compute a pointer offset with numberOfY + 1 – what's the + 1 for?

Please comment your code – my code style is rather different – it would help in reading it.

I thought there was some sort of register overflow mechanism.

Nope, only at compile time (you can steer it with -maxrregcount).
When you try to launch a kernel with too many threads, CUT_CHECK_ERROR will return a "too many resources requested for launch" error or something like it. Anyway, always perform a CUT_CHECK_ERROR :) The fact that the result is all zeros indicates that the kernel did not run at all. Which is quite likely, since 512 threads per block is only possible when the kernel uses 16 registers or less on pre-GT200 hardware (8192 registers per multiprocessor / 512 threads).

Thank you very much for your help! I will use CUT_CHECK_ERROR in the future.

The code works with 320 threads, but it's very slow, so I think I will keep the CPU version of this function. The GPU version of my "decision tree" is therefore pretty simple: I use a GPU radix sort instead of a CPU quicksort when I have more than 7000 elements to sort. With 1 million elements, the decision tree can be up to 5 times faster!

My program receives a database (a learnSet) with attributes and one Y per record. A Y can be, for example, a pathology such as prostate cancer. The program scans the database and tries to build the most efficient decision tree possible to determine Y from the attributes.

Once the decision tree is built, you can load a testSet (a database that does not contain Y) and the program will assign to each record "the most probable Y" by running the records through the decision tree.

To summarize the implementation: you load the database and you build a tree. To build the most efficient tree, you have to find the best test per node. If you find the best test for each node, your tree is minimal and optimal. A test is the best value at which to split the database in two.

example:

DB:

Att1 Att2 Att3
0    0    1
0    0    2
0    0    2

best test on Att3 = (1 + 2) / 2 = 1.5
split < 1.5  —> { 0 0 1 }
split >= 1.5 —> { 0 0 2 ; 0 0 2 }

To find the best test, you search the database for the best "score" for each attribute and record. In the code I've posted, I'm trying to find the best score for each record of one attribute.

(Next, the program compares the best score of each attribute and keeps the best.)

My English is not good enough to be more explicit, but if you want more information, you can read this:

http://www.montefiore.ulg.ac.be/~lwh/AIA/d…e-induction.pdf

And this for the score computation:

http://www.montefiore.ulg.ac.be/~lwh/AIA/aia-26-9-05.pdf