How to make the host wait?

Hello everyone.

I’m a beginner in CUDA programming.

I am having a weird problem. When I call the kernel with a limited number of threads, my code works fine.

But when the number of threads is higher than that limit, I have an if to check the condition so the extra threads do no processing. That's when the problem occurs. I think the blocks that contain only threads with ids above my limit finish so quickly that the host prints wrong results for me.

Here is my kernel code:

[codebox]__global__ void test(Noh* cabeca, Noh* FPTree) {
	int id = blockIdx.x * blockDim.x + threadIdx.x;

	if(id <= ID_MAXIMO) {
		int j = 0;

		// count the length of the sibling ("irmao") chain starting at cabeca[id]
		for(int i = cabeca[id].irmao; i != -1; i = FPTree[i].irmao)
			++j;

		cabeca[id].frequencia = j;
	}
}[/codebox]

So my question is: Is there a way to force the host to only continue execution when all the threads from all blocks have finished processing?

Thanks for the help!

cudaThreadSynchronize()

You can look this function up in the Reference Manual:

3.2.2.2 cudaError_t cudaThreadSynchronize (void)

Blocks until the device has completed all preceding requested tasks. cudaThreadSynchronize() returns an error if one of the preceding tasks has failed.
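Something like this should work (just a sketch; numBlocks and threadsPerBlock stand in for whatever launch configuration you use, and on newer toolkits the same call is named cudaDeviceSynchronize()):

[codebox]test<<<numBlocks, threadsPerBlock>>>(dCabeca, dFPTree);

// block the host until the kernel (and all earlier device work) has finished
cudaError_t err = cudaThreadSynchronize();
if(err != cudaSuccess) {
	fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
	exit(1);
}

// only now is it safe to copy the results back and print them
cudaMemcpy(hCabeca, dCabeca, ID_MAXIMO * sizeof(Noh), cudaMemcpyDeviceToHost);[/codebox]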

Well, I’ve tried this function, but it still doesn’t work…

Here is the code where I print my results:

[codebox]for(int i = 0; i <= 10; ++i) {
	int k = 0;
	for(int j = hCabeca[i].irmao; j != -1; j = hFPTree[j].irmao)
		++k;
	printf("(%d, %d)\n", i, k);
}
puts("");

puts("2: ");
for(int i = 0; i <= 10; ++i)
	printf("(%d, %d)\n", i, hCabeca[i].frequencia);
puts("");

test<<<33, 512>>>(dCabeca, dFPTree);
cudaThreadSynchronize();
// test<<<65, 256>>>(dCabeca, dFPTree);
// test<<<43, 383>>>(dCabeca, dFPTree);
// test<<<15, 100>>>(dCabeca, dFPTree);
// recuperaFrequentItemsets();

cudaMemcpy(hCabeca, dCabeca, ID_MAXIMO*sizeof(Noh), cudaMemcpyDeviceToHost);

puts("3: ");
for(int i = 0; i <= 10; ++i)
	printf("(%d, %d)\n", i, hCabeca[i].frequencia);
puts("");[/codebox]

This way the result is different from what I get with the <<<43, 383>>> launch uncommented instead. I think it’s because with <<<33, 512>>> the total number of threads is higher than the limit I defined in the function test.

I only suspect this; I don’t know the real reason…

In general, the number of threads per block should be a multiple of 32, since the basic scheduling unit is a warp (32 threads).
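For example, a common pattern (just a sketch, assuming you need one thread per element up to ID_MAXIMO) is to pick a block size that is a multiple of 32 and round the grid size up:

[codebox]int threadsPerBlock = 256;  // multiple of 32
int numBlocks = (ID_MAXIMO + threadsPerBlock - 1) / threadsPerBlock;  // round up

// extra threads in the last block are filtered out by the bounds check in the kernel
test<<<numBlocks, threadsPerBlock>>>(dCabeca, dFPTree);[/codebox]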

question 1: do you copy data from host memory to device memory, dCabeca and dFPTree ?
(you don’t have this code segment in your post)

question 2: how about “test<<<65, 256>>>(dCabeca, dFPTree)”? Is the result correct?

Hummm, I didn’t know about the 32.

question 1: do you copy data from host memory to device memory, dCabeca and dFPTree ?

Yes, I do.

question 2: how about “test<<<65, 256>>>(dCabeca, dFPTree)”? Is the result correct?

No, the result is incorrect. Every time the number of threads is higher than ID_MAXIMO, I get a wrong result. I just don’t know why…

@rodrigomb:

Are you sure that the variable ‘Noh* FPTree’ you are using inside the kernel has been copied to device memory before the kernel is called?

One way to make your code easier to debug would be to use the following piece of code after EVERY call to ‘cuda*’…

cudaError_t check123 = cudaGetLastError();
if(check123 != cudaSuccess) {
	fprintf(stderr, "Cuda-runtime failure! Reason: %s\n", cudaGetErrorString(check123));
	exit(1);
}
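If you end up pasting that after every call, wrapping it in a small helper macro keeps the code readable (a sketch; CUDA_CHECK is just a name I made up):

[codebox]#define CUDA_CHECK(call)                                              \
	do {                                                              \
		cudaError_t err123 = (call);                                  \
		if(err123 != cudaSuccess) {                                   \
			fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
			        __FILE__, __LINE__, cudaGetErrorString(err123));  \
			exit(1);                                                  \
		}                                                             \
	} while(0)

// usage:
CUDA_CHECK(cudaMemcpy(hCabeca, dCabeca, ID_MAXIMO*sizeof(Noh), cudaMemcpyDeviceToHost));
CUDA_CHECK(cudaThreadSynchronize());  // also reports errors from the preceding kernel launch[/codebox]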

From your code:

cudaMemcpy(hCabeca, dCabeca, ID_MAXIMO*sizeof(Noh), cudaMemcpyDeviceToHost);

I suppose that you allocate device memory dCabeca with ID_MAXIMO elements, so you must modify your kernel:

[codebox]__global__ void test(Noh* cabeca, Noh* FPTree)
{
	int id = blockIdx.x * blockDim.x + threadIdx.x;

	// valid indices are 0 .. ID_MAXIMO-1, so the guard must be <, not <=
	if(id < ID_MAXIMO) {
		int j = 0;
		for(int i = cabeca[id].irmao; i != -1; i = FPTree[i].irmao)
			++j;
		cabeca[id].frequencia = j;
	}
}[/codebox]

since id starts from 0 and cabeca and FPTree have ID_MAXIMO elements.
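The allocation, the launch and the copy should all agree on ID_MAXIMO elements; something like this (just a sketch, the block size is only an example):

[codebox]Noh* dCabeca;
cudaMalloc((void**)&dCabeca, ID_MAXIMO * sizeof(Noh));  // ID_MAXIMO elements
cudaMemcpy(dCabeca, hCabeca, ID_MAXIMO * sizeof(Noh), cudaMemcpyHostToDevice);

int threadsPerBlock = 256;
int numBlocks = (ID_MAXIMO + threadsPerBlock - 1) / threadsPerBlock;  // enough threads, extras guarded by id < ID_MAXIMO
test<<<numBlocks, threadsPerBlock>>>(dCabeca, dFPTree);
cudaThreadSynchronize();

cudaMemcpy(hCabeca, dCabeca, ID_MAXIMO * sizeof(Noh), cudaMemcpyDeviceToHost);[/codebox]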

If it still does not work, you may post the whole code and we can check what happens.

Oh god, that’s it, I didn’t realize that.

I should have been getting an out-of-bounds access there, but I didn’t catch the CUDA error.

Thanks guys, I think I can proceed now!