Are my kernel calls performing the calculation correctly?

Hello Everyone,

I have been having some problems programming a Tesla C870. Basically, my task is to take a segment of serial CPU code that essentially looks like this:

[codebox]for(iter = iter_start; iter <= iter_end; iter++)
{
    Assignment1=1;
    Assignment2=2;
    ...
    for(i = 0; i < imx-1; i++) {
        forAssignment1=1;
        ...
    }
    Assignment3=3;
    ...
    for(i = 0; i < imx; i++) {
        forAssignment2=2;
        ...
    }
    ...
}
[/codebox]

My idea for speeding up the code with C for CUDA was to convert each of the for loops inside the outer one into its own kernel call, so the code looks something like this:

[codebox]for(iter = iter_start; iter <= iter_end; iter++)
{
    Assignment1=1;
    Assignment2=2;
    ...
    dim3 dimBlock(BLOCK_SIZE);
    dim3 dimGrid ( (for1ThCt/dimBlock.x) + (!(for1ThCt%dimBlock.x)?0:1) );
    cudaMemcpy(ExyzG, Exyz, SIZE_float*nByteExyz, cudaMemcpyHostToDevice);
    for1<<<dimGrid, dimBlock>>>(ExyzG, BxyzG, wxG);
    cudaMemcpy(BxyzG, Bxyz, SIZE_float*nByteBxyz, cudaMemcpyDeviceToHost);
    ...
    for2<<<dimGrid, dimBlock>>>(JxyzG, dtabxG, dtabyG, dtabzG, b0G, b1G, b2G, dtfG, a00G, a01G, a02G, a10G, a11G, a12G, a20G, a21G, a22G, ExyzG);
    ...
}
[/codebox]
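
For reference, `cudaMemcpy` takes its arguments in the order (destination, source, byte count, direction). In the snippet above, the copy after `for1` lists `BxyzG` first while using `cudaMemcpyDeviceToHost`, so it may be worth double-checking which pointer is meant to receive the data, and whether `SIZE_float*nByteBxyz` is really the intended byte count. Below is a minimal sketch of a host-to-device and device-to-host round trip with error checking. The names `CHECK`, `blocksFor`, `scale`, `h_E`, `h_B`, `d_E`, and `d_B` are hypothetical and are not taken from the original program; the toy kernel only stands in for `for1`.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                 \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Number of blocks needed to cover n threads (ceiling division).
int blocksFor(int n, int block) { return (n + block - 1) / block; }

// Toy kernel standing in for for1: B[i] = 2 * E[i].
__global__ void scale(const float *E, float *B, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) B[i] = 2.0f * E[i];  // guard: the last block may overshoot n
}

int main()
{
    const int n = 1000, block = 256;
    const size_t bytes = n * sizeof(float);

    float *h_E = (float *)malloc(bytes);
    float *h_B = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h_E[i] = (float)i;

    float *d_E, *d_B;
    CHECK(cudaMalloc((void **)&d_E, bytes));
    CHECK(cudaMalloc((void **)&d_B, bytes));

    // Host -> device: the DEVICE pointer is the destination (first argument).
    CHECK(cudaMemcpy(d_E, h_E, bytes, cudaMemcpyHostToDevice));

    scale<<<blocksFor(n, block), block>>>(d_E, d_B, n);
    CHECK(cudaGetLastError());  // reports launch-configuration errors

    // Device -> host: the HOST pointer is the destination (first argument).
    CHECK(cudaMemcpy(h_B, d_B, bytes, cudaMemcpyDeviceToHost));

    printf("h_B[10] = %f\n", h_B[10]);

    CHECK(cudaFree(d_E));
    CHECK(cudaFree(d_B));
    free(h_E);
    free(h_B);
    return 0;
}
```

Wrapping every runtime call this way tends to surface mismatched pointers or sizes as an explicit error rather than as silently wrong output.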

The modified code compiles and runs quickly, but the output is wrong. This doesn't make sense to me, because each of the inner for loops can be parallelized: an iteration in a given for loop does not depend on previous iterations of the same loop. I also simplified the code so that only a couple of the most basic loops run in parallel, so I am sure that I am using memcpy to communicate with the host correctly. My question is whether my basic approach to parallelizing the code is wrong. I am calling many kernels in succession in the same scope, and the outer for loop repeats the calls several times. Is there anything I need to do with streams or synchronization?
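
On the streams and synchronization question: kernels launched into the same stream (including the default one) run in the order they were issued, and a blocking `cudaMemcpy` waits for the kernels launched before it, so back-to-back launches inside a host loop should not need explicit streams. The sketch below illustrates that pattern under stated assumptions: `step1`, `step2`, and `runIterations` are hypothetical stand-ins for the real loops, and the two kernels cover `n - 1` and `n` elements, like the `imx-1` and `imx` loops. Since the grid is rounded up to whole blocks, each kernel guards its own bound. `cudaDeviceSynchronize` is used here; toolkits older than CUDA 4.0 provide `cudaThreadSynchronize` instead.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Stand-in for the first inner loop: covers n - 1 elements.
__global__ void step1(float *a, const float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n - 1) a[i] = b[i] + b[i + 1];  // guard against extra threads
}

// Stand-in for the second inner loop: covers n elements.
__global__ void step2(float *b, const float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = 0.5f * a[i];
}

// Runs `iters` outer iterations on h_b (length n). Returns 0 on success.
int runIterations(float *h_b, int n, int iters)
{
    const int block = 256;
    const int grid = (n + block - 1) / block;
    const size_t bytes = n * sizeof(float);

    float *d_a, *d_b;
    if (cudaMalloc((void **)&d_a, bytes) != cudaSuccess) return 1;
    if (cudaMalloc((void **)&d_b, bytes) != cudaSuccess) return 1;
    cudaMemset(d_a, 0, bytes);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    for (int iter = 0; iter < iters; iter++) {
        // Same stream: step2 starts only after step1 has finished,
        // and the next step1 starts only after this step2 has finished.
        step1<<<grid, block>>>(d_a, d_b, n);
        step2<<<grid, block>>>(d_b, d_a, n);
    }

    // Surface launch errors, then wait and surface execution errors.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Data stayed on the device for all iterations; copy back once.
    cudaMemcpy(h_b, d_b, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}

int main()
{
    const int n = 1000;
    float *h_b = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) h_b[i] = 1.0f;

    int status = runIterations(h_b, n, 4);
    if (status == 0) printf("b[0] = %f\n", h_b[0]);

    free(h_b);
    return status;
}
```

Keeping the arrays on the device across iterations and copying back only when the host needs them also avoids per-iteration transfers, which tend to dominate the run time for small kernels.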

I know I have been vague about the specifics of my code, but I have attached a screenshot showing the output: the top graph shows the output of the code I tried to parallelize, and the bottom graph shows what the output should look like. I include it because the difference between the two might give some idea of the problem. Both graphs show a sinusoidal oscillation, but the starting point, amplitude, and equilibrium displacement differ between the two graphs, and the scale is wrong. The actual assignments I perform in the for loops have not been changed, and I have made sure that the thread allocation before each kernel launch is sensible. Any help would be appreciated. Thank you!

-Brendon
Screenshot.png