Help in CUDA on a simple matrix problem

Hi,

I intend to implement singular value decomposition in CUDA. I have completed the C code, but I am having a lot of trouble implementing it in CUDA. To start off with, I wish to do the following:

for(i=k;i<=Row-1;i++)
{
    p=i*Row+j;
    Q[p]=Q[p]-2.0*t*R[i*Col+k];
}

Just as a test, I called a kernel to calculate Q[p]=Q[p]-2.0*t*R[i*Col+k]; on the device and then performed a cudaMemcpy from the device to the host. But it doesn't work: the values do not change. I even tampered with my idx values, but that doesn't seem to help either.

Could someone please help me out here?

It looks like your kernel did not even launch. Add some error checking after your kernel call and, even better, after your memory allocations and copies as well.

Often people specify incorrect block/grid sizes or more shared memory than is available.
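To make that suggestion concrete, here is a minimal sketch of what such error checking could look like. The macro name CUDA_CHECK, the kernel add_one and the function checked_launch_demo are made up for illustration and do not come from the original poster's code. The sketch assumes a CUDA toolkit recent enough to provide cudaDeviceSynchronize.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the thread): print the CUDA error and
// make the enclosing bool function return false. For brevity, device
// memory is not freed on the error path.
#define CUDA_CHECK(call)                                                 \
    do {                                                                 \
        cudaError_t err_ = (call);                                       \
        if (err_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__);  \
            return false;                                                \
        }                                                                \
    } while (0)

// Trivial stand-in kernel: each thread increments one element.
__global__ void add_one(double *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] += 1.0;
}

// Returns true only if every CUDA call succeeded and the data changed
// as expected.
bool checked_launch_demo()
{
    const int n = 9;
    double host[n];
    for (int i = 0; i < n; ++i) host[i] = i;

    double *dev = nullptr;
    CUDA_CHECK(cudaMalloc(&dev, n * sizeof(double)));
    CUDA_CHECK(cudaMemcpy(dev, host, n * sizeof(double),
                          cudaMemcpyHostToDevice));

    add_one<<<1, n>>>(dev, n);
    CUDA_CHECK(cudaGetLastError());       // bad grid/block/shared-memory sizes show up here
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel was running

    CUDA_CHECK(cudaMemcpy(host, dev, n * sizeof(double),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dev));

    for (int i = 0; i < n; ++i)
        if (host[i] != i + 1.0) return false;
    return true;
}
```

Calling checked_launch_demo() from main and printing its result is enough to try it out. If a launch fails silently, the cudaGetLastError check right after the <<<...>>> call is usually where the problem becomes visible.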

It might help if you show the kernel and the surrounding CUDA code that sets it up and fires it off.

MMB

Wow, thank you so much for replying so quickly. I didn't expect it.

Back to the problem: say, for instance, my Q matrix has 9 elements. Thus I have 9 threads and one block. I perform a cudaMemcpy from host to device and then call the following kernel:

Parallel1<<<dimGrid, dimBlock>>>(Row, Col, i, k, q_d, p, t, r_d, j);   // with dimGrid being 1, and dimBlock being 9

and my kernel does the following…

__global__ void Parallel1(int Row, int Col, int i, int k, double *q_d, int p, double t, double *r_d, int j)
{
    for(i=k;i<=Row-1;i++)
    {
        p=i*Row+j;
        q_d[p]=q_d[p]-2.0*t*r_d[i*Col+k];
    }
}

Shouldn't it do the same thing as my serial code?

Sorry, I forgot to mention: I am aware that the blocks and grids don't make a difference in this case, as I am not using idx or idy. I am just confused, because the Parallel1 function should return the same values for matrix q.

Excuse the double post; I just discovered the edit button.

My kernel is supposed to perform the following…

for(i=k;i<=Row-1;i++)
{
    tempp=i*Row+j;
    Q[tempp]=Q[tempp]-2.0*tempt*R[i*Col+k];
}

I’ve tried idx and idy in place of Row and Col, but that doesn't work. Could you please advise me on how to run the above code in parallel?

I would really, really appreciate it.

Thanks in advance…
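One possible way to parallelize the loop, sketched under a few assumptions: each thread takes one value of i, so the for loop disappears from the kernel and the thread index replaces the loop counter (rather than replacing Row or Col). The names update_column and update_column_demo, the 3x3 test data and the block size of 128 are all invented for this illustration. The indexing Q[i*Row+j] and R[i*Col+k] is copied from the serial code above. The sketch also assumes a GPU with double-precision support.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// One thread per loop iteration: thread number n handles i = k + n.
__global__ void update_column(int Row, int Col, int k, int j, double t,
                              double *q_d, const double *r_d)
{
    int i = k + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < Row) {                       // surplus threads do nothing
        int p = i * Row + j;
        q_d[p] = q_d[p] - 2.0 * t * r_d[i * Col + k];
    }
}

// Runs the kernel on small made-up data and returns the largest absolute
// difference from the serial loop, or -1.0 if a CUDA call failed.
double update_column_demo()
{
    const int Row = 3, Col = 3, k = 1, j = 2;
    const double t = 0.5;
    std::vector<double> Q(Row * Row), R(Row * Col);
    for (int n = 0; n < Row * Row; ++n) Q[n] = n + 1.0;
    for (int n = 0; n < Row * Col; ++n) R[n] = 0.25 * n;

    // Serial reference: the loop from the post, applied to a copy of Q.
    std::vector<double> ref = Q;
    for (int i = k; i <= Row - 1; i++) {
        int p = i * Row + j;
        ref[p] = ref[p] - 2.0 * t * R[i * Col + k];
    }

    const size_t q_bytes = Q.size() * sizeof(double);
    const size_t r_bytes = R.size() * sizeof(double);
    double *q_d = nullptr, *r_d = nullptr;
    if (cudaMalloc(&q_d, q_bytes) != cudaSuccess) return -1.0;
    if (cudaMalloc(&r_d, r_bytes) != cudaSuccess) { cudaFree(q_d); return -1.0; }

    bool ok =
        cudaMemcpy(q_d, Q.data(), q_bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(r_d, R.data(), r_bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        const int count = Row - k;       // number of loop iterations (2 here)
        const int threads = 128;
        const int blocks = (count + threads - 1) / threads;
        update_column<<<blocks, threads>>>(Row, Col, k, j, t, q_d, r_d);
        // The blocking copy back also waits for the kernel to finish.
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(Q.data(), q_d, q_bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(q_d);
    cudaFree(r_d);
    if (!ok) return -1.0;

    double max_diff = 0.0;
    for (size_t n = 0; n < Q.size(); ++n)
        max_diff = std::fmax(max_diff, std::fabs(Q[n] - ref[n]));
    return max_diff;
}
```

Note that i and p are no longer kernel parameters: each thread computes its own values locally. In the posted kernel every one of the 9 threads runs the whole loop, so all threads repeat the same work on the same elements. The launch here uses Row-k threads' worth of work instead of one thread per matrix element, since only Row-k elements of Q are touched.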