I’ve been racking my brain over this problem, but I still can’t figure out how to solve it.
I have two kernels that feed their results to each other.
In the first kernel, BP1, the tmp values are updated (tmp_i = f(msg_i)). Then, in the second kernel, the messages msg are computed (msg_i = f(tmp_i)).
I had to split the implementation into two kernels because the two operations put together in a single kernel didn’t give the expected results, whereas separate kernels did (even though I had a “__syncthreads()” call between the two).
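For what it’s worth, __syncthreads() only synchronizes the threads within one block, whereas the boundary between two kernel launches acts as a barrier for the whole grid. Here is a minimal serial sketch in plain C of why that can matter. It assumes that each msg value reads a neighbouring tmp value, which the post does not confirm, and update_tmp, update_msg, bp_split and bp_fused are hypothetical stand-ins, not the real update rules.

```c
#include <stddef.h>

/* Hypothetical stand-ins: the real update rules are not shown in the post. */
static float update_tmp(float msg) { return msg + 1.0f; }
static float update_msg(float tmp_neighbor) { return 2.0f * tmp_neighbor; }

/* Split version, like two kernel launches:
   every tmp value is finished before any msg value is computed. */
void bp_split(float *msg, float *tmp, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        tmp[i] = update_tmp(msg[i]);
    for (size_t i = 0; i < n; ++i)
        msg[i] = update_msg(tmp[(i + 1) % n]); /* assumed neighbour dependency */
}

/* Fused version without a global barrier:
   element i may read tmp[i + 1] before it has been refreshed. */
void bp_fused(float *msg, float *tmp, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        tmp[i] = update_tmp(msg[i]);
        msg[i] = update_msg(tmp[(i + 1) % n]); /* may still hold a stale value */
    }
}
```

Starting from the same msg and a zeroed tmp, the two versions produce different msg arrays, which would be consistent with a fused kernel misbehaving when the neighbour lives in another block.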
Here nb_iter is the number of iterations.
The problem:
[b]Whenever nb_iter goes over 12 iterations, I don’t get the expected results (I compare against a CPU function).
Has anybody already run into this kind of problem?[/b]
If nb_iter==11, I have the same results as the CPU…
I don’t think this is a precision problem, since I only work with integers and floats.
It is pretty hard to say anything without at least some indication of what the kernels actually do, how they use the variables they “share” in global memory, and how you are determining that the results are “wrong”.
I have an explicit convection-diffusion solver which repeats three kernels (ghost cell calculation, spatial discretization, and Runge-Kutta stage update) sequentially, in conjunction with some device to device memcpy() calls, literally millions of times in a loop while keeping the working data set on the device in global memory without incident, so I very much doubt it is the high level structure of the code which is causing you problems.
I can give the code of the kernel, but I was really surprised that it would be wrong only after the 12th iteration and not before.
I know the results are wrong because I am doing a pixel-by-pixel comparison with a CPU version (the final result is stored in d_disp, the disparity map). I was wondering whether this was a general mistake that shows up with loops and that somebody had already run into.
The code that I’m trying to implement is a belief propagation computation.
Here are the kernels that I use, in the order in which they are called:
Do you get a progressive deterioration in the agreement between the CPU and GPU versions, or do you get perfect agreement (or at least within the expected accuracy range of single precision floating point) until a certain point, and then it breaks spectacularly?
Oh and I forgot to say, that if I run the code in emulation mode, then the problem doesn’t show up… as usual…
The only data shared in global memory are the float arrays.
tmp_l, tmp_r, tmp_u, tmp_d,
msg_l, msg_r, msg_u, msg_d
All of these arrays are of size nbcol2*nbrow2*nblabel, because for each pixel of the map (of size nbcol*nbrow) we need to record the data cost corresponding to each label value. To avoid reading outside the map, we add a 2 pixel apron: nbcol2 = nbcol+2 and nbrow2 = nbrow+2. nblabel is the number of labels that we are working with.
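To make the padded array size described above concrete, here is a small sketch of how such an array could be sized and indexed. The memory layout is an assumption (one padded row-major image per label, with the interior shifted by one pixel in each direction); the post does not say which layout the kernels actually use, and bp_array_size and bp_index are hypothetical helper names.

```c
#include <stddef.h>

/* Number of floats in one padded array: (nbcol+2) * (nbrow+2) * nblabel. */
size_t bp_array_size(int nbcol, int nbrow, int nblabel)
{
    return (size_t)(nbcol + 2) * (size_t)(nbrow + 2) * (size_t)nblabel;
}

/* Flat index of interior pixel (x, y) for a given label.
   ASSUMED layout: label-major planes, each plane stored row by row,
   with the interior offset by 1 so that x = -1 or y = -1 lands in the apron. */
size_t bp_index(int x, int y, int label, int nbcol, int nbrow)
{
    size_t nbcol2 = (size_t)(nbcol + 2);
    size_t nbrow2 = (size_t)(nbrow + 2);
    return ((size_t)label * nbrow2 + (size_t)(y + 1)) * nbcol2 + (size_t)(x + 1);
}
```

Checking that every index computed in a kernel stays below bp_array_size is a cheap way to rule out out-of-bounds accesses for the idx and idy values that go wrong.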
I always get the errors for the same idx and idy values.
I’ve tested in emulation mode to see what happens with those values (knowing that the problem does not appear in emulation), and I can’t find what distinguishes them from the other values that are computed correctly.
Yeah, you’re right, I only need the extern “C” for the functions that I call from the main program, not for the kernels (which are called from a dedicated function). But I had lots of trouble linking everything correctly in my first program, so I put extern “C” before every function (even the kernels). I can get rid of those now.
Concerning the cudaThreadSynchronize() call, this was just an attempt to see whether the error came from a lack of synchronization. I thought that after 12 successive launches from the CPU, the device did not handle thread termination very well. But it didn’t have any impact on the final result.
Your comparison function is broken. abs() is an integer-valued function; you want to use fabs() to compare floats. As it is, you have implicit float->integer truncation happening in several places.
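A minimal sketch of the difference, in plain C. The function names and the tolerance parameter are made up for illustration (the original comparison function is not shown here), and the explicit (int) cast only spells out the truncation that happens implicitly when a float is passed to abs().

```c
#include <math.h>
#include <stdlib.h>

/* Broken variant: the float difference is truncated to int before abs(),
   so any difference smaller than 1.0 in magnitude becomes 0. */
int differs_truncating(float a, float b, float tol)
{
    return abs((int)(a - b)) > tol;
}

/* Fixed variant: fabsf() keeps the fractional part of the difference. */
int differs_fabs(float a, float b, float tol)
{
    return fabsf(a - b) > tol;
}
```

With a = 1.0f, b = 1.9f and tol = 0.5f, the truncating version reports no difference while the fabsf() version does, so a comparison built on abs() can hide mismatches smaller than 1.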
I have already had problems with type mismatches (but between double and float) when executing on the device, in another kernel. I’m porting CPU code that I did not write myself. Even though this kind of comparison/truncation/… works on the CPU, it definitely doesn’t behave the same way on the device.
I’m going to correct the code; I’ll let you know if this changes anything.
I got rid of the float-integer truncation and spent some more time looking closely at the code. But I still don’t know what’s going on, or why it produces errors so abruptly at the 12th iteration…
I tested it with CUDA 2.2 and a GTX280. After a few adaptations (my colleague was using Visual Express 2008), I finally got the same error again…
If anybody has another hint, I would really appreciate it. Thank you!
I checked that d_cost_data_BP, d_msg_l, d_msg_r, d_msg_u and d_msg_d are exactly the same as those produced by the CPU.
The error occurs somewhere during the sum. If I check
cost = d_cost_data_BP [ind] + d_msg_l[ind] ;
or
cost= d_cost_data_BP [ind] + d_msg_r[ind] ;
and so on…
Then I get exactly the same result as the CPU (provided the CPU function has been changed in the same way).
It is only when I use the whole expression that I get mismatches.
I don’t know where this comes from; I don’t see any variable that the threads might be sharing or using at the same time (and thereby producing the mistake).
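This is not a diagnosis from the thread, only one thing that might be worth ruling out: a sum of several floats depends on the order of the additions and on how each intermediate result is rounded, whereas a sum of just two terms is a single rounded addition. The sketch below, in plain C with hypothetical function names, adds the same three values in two different orders; the volatile accumulator is there to force every intermediate result to be rounded to single precision.

```c
#include <stddef.h>

/* Adds v[0], v[1], ... in that order, rounding to float after every step. */
float sum_forward(const float *v, size_t n)
{
    volatile float acc = 0.0f; /* volatile: no extended-precision intermediates */
    for (size_t i = 0; i < n; ++i)
        acc = acc + v[i];
    return acc;
}

/* Adds the same values in the opposite order. */
float sum_backward(const float *v, size_t n)
{
    volatile float acc = 0.0f;
    for (size_t i = n; i > 0; --i)
        acc = acc + v[i - 1];
    return acc;
}
```

For the values {1e8f, -1e8f, 1.0f}, the two functions return different results because the 1.0f is lost when it is added to a value of magnitude 1e8. If the CPU and the device evaluate the whole expression in a different order, or with different intermediate precision, small mismatches of this kind could appear even though every two-term sum agrees.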