Hello Everyone,
I have been having some problems programming a Tesla C870. My task is to take a segment of serial CPU code that essentially looks like this:
[codebox]for(iter = iter_start; iter <= iter_end; iter++)
{
    Assignment1=1;
    Assignment2=2;
    .
    .
    .
    for(i = 0; i < imx-1; i++) {
        forassignment1=1;
        .
        .
        .
    }
    Assignment3=3;
    .
    .
    .
    for(i = 0; i < imx; i++) {
        forassignment2=2;
        .
        .
        .
    }
    .
    .
    .
}
[/codebox]
My idea for speeding up the code with C for CUDA was to take each of the for loops inside the outer one and convert it into its own kernel call. The code then looks something like this:
[codebox]for(iter = iter_start; iter <= iter_end; iter++)
{
    Assignment1=1;
    Assignment2=2;
    .
    .
    .
    dim3 dimBlock(BLOCK_SIZE);
    dim3 dimGrid ( (for1ThCt/dimBlock.x) + (!(for1ThCt%dimBlock.x)?0:1) );
    cudaMemcpy(ExyzG, Exyz, SIZE_float*nByteExyz, cudaMemcpyHostToDevice);
    for1<<<dimGrid, dimBlock>>>(ExyzG, BxyzG, wxG);
    cudaMemcpy(BxyzG, Bxyz, SIZE_float*nByteBxyz, cudaMemcpyDeviceToHost);
    .
    .
    .
    for2<<<dimGrid, dimBlock>>>(JxyzG, dtabxG, dtabyG, dtabzG, b0G, b1G, b2G, dtfG, a00G, a01G, a02G, a10G, a11G, a12G, a20G, a21G, a22G, ExyzG);
    .
    .
    .
}
[/codebox]
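To make the loop-to-kernel mapping described above concrete, here is a minimal CPU-only sketch in plain C. It is an illustration under assumptions, not the real code: `for1_body`, `blocks_for`, `launch_for1`, the `BLOCK_SIZE` value of 64, and the doubling arithmetic are all hypothetical stand-ins. The nested loops only emulate the block and thread indices that a launch would generate. Because the block count is rounded up, the body checks its index against the loop bound before writing.

```c
#define BLOCK_SIZE 64   /* assumed value; the post does not state it */

/* One loop iteration, written the way a kernel thread would see it:
   a single index in, a single element out. The arithmetic is a placeholder. */
static void for1_body(int i, int n, const float *in, float *out)
{
    if (i >= n) return;      /* guard: the rounded-up grid has spare threads */
    out[i] = 2.0f * in[i];
}

/* Number of blocks needed to cover n items (ceiling division). */
static int blocks_for(int n, int block_size)
{
    return (n + block_size - 1) / block_size;
}

/* CPU stand-in for a kernel launch: visits every (block, thread) pair
   and hands the resulting global index to the loop body. */
static void launch_for1(int n, const float *in, float *out)
{
    int grid = blocks_for(n, BLOCK_SIZE);
    for (int b = 0; b < grid; b++)
        for (int t = 0; t < BLOCK_SIZE; t++)
            for1_body(b * BLOCK_SIZE + t, n, in, out);
}
```

With these assumed values, `launch_for1(100, in, out)` covers the 100 elements with 2 blocks of 64, and the guard skips indices 100 to 127.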
The modified code compiles and runs quickly, but the output is wrong. This doesn't make sense to me: each of the inner for loops should be parallelizable, because an iteration in a given loop does not depend on previous iterations of the same loop. I also simplified the code so that only a couple of the most basic loops run in parallel, so I am sure that I am using cudaMemcpy to communicate with the host correctly. My question is whether my basic approach to parallelizing the code is wrong. I am calling many kernels in succession in the same scope, and the outer for loop repeats those calls several times. Is there anything I need to do with streams or synchronization?
I know I have been vague about the specifics of my code, but I have attached a screenshot of the output: the top graph shows the output of the code I tried to parallelize, and the bottom graph shows what it should look like. I include it because the difference between the two might give some idea of what is going wrong. Both graphs show a sinusoidal oscillation, but the starting point, amplitude, and equilibrium displacement differ between them, and the scale is wrong. The actual assignments I perform in the for loops have not been changed, and I have made sure that the thread allocation before each kernel is sensible. Any help would be appreciated. Thank you!
-Brendon