I intend to implement singular value decomposition in CUDA. I have completed the C code, but I am having a lot of trouble porting it to CUDA. To start off with, I wish to do the foll…
for (i = k; i <= Row - 1; i++)
{
    p = i * Row + j;
    Q[p] = Q[p] - 2.0 * t * R[i * Col + k];
}
Just as a test, I called a kernel to calculate Q[p]=Q[p]-2.0* t *R[i*Col+k]; on the device and then performed a cudaMemcpy from the device to the host. But it doesn't work: the values do not change. I even tampered with my idx values, but it still doesn't work.
Looks like your kernel did not even launch. Add some error checking after your kernel call and, even better, after your memory allocations and copies as well.
Often people specify incorrect block/grid sizes or more shared memory than is available.
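To make the error-checking advice concrete, here is a minimal sketch. CUDA_CHECK and the Scale kernel are hypothetical names made up for illustration (they are not part of the CUDA toolkit or of the original code); the runtime calls themselves (cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString) are standard. It assumes a device that supports double precision.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a readable message if a runtime call fails.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",               \
                    cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Hypothetical test kernel: subtracts 2*t from each of the n elements.
__global__ void Scale(double *q, int n, double t)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        q[idx] = q[idx] - 2.0 * t;
    }
}

// Runs the kernel on 9 elements with every step checked; returns element 0.
double run_checked_demo()
{
    const int n = 9;
    double q_h[n];
    for (int i = 0; i < n; ++i) q_h[i] = 1.0;

    double *q_d = NULL;
    CUDA_CHECK(cudaMalloc((void **)&q_d, n * sizeof(double)));
    CUDA_CHECK(cudaMemcpy(q_d, q_h, n * sizeof(double), cudaMemcpyHostToDevice));

    Scale<<<1, n>>>(q_d, n, 0.25);
    CUDA_CHECK(cudaGetLastError());       // catches a bad launch configuration
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(q_h, q_d, n * sizeof(double), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(q_d));
    return q_h[0];
}

int main()
{
    printf("q[0] after kernel = %f\n", run_checked_demo());
    return 0;
}
```

If the launch fails, the check right after the <<<...>>> call reports it instead of silently leaving the device buffer unchanged, which is exactly the symptom of values that "do not change".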
Wow, thank you so much for replying so quickly … didn't expect it …
Back to the problem … say, for instance, my Q matrix has 9 elements. Thus I have 9 threads and one block. I perform a cudaMemcpy from host to device and then call the following kernel:
Parallel1<<<dimGrid, dimBlock>>>(Row, Col, i , k , q_d, p, t, r_d, j); // with dimGrid being 1, and dimBlock being 9
and my kernel does the following…
__global__ void Parallel1(int Row, int Col, int i, int k, double *q_d, int p, double t, double *r_d, int j)
Sorry, I forgot to mention that I am aware the blocks and grids in this case don't make a difference, as I am not using idx or idy, but I am just confused about the fact that the Parallel1 function should return the same values for matrix q…
excuse the double post, I just discovered the edit button
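For reference, when i and p are passed to the kernel by value, every thread sees the same i and p, so all 9 threads update the same single element. One possible way to map the host loop onto threads is to let each thread derive its own i from its index. The following is only a sketch: Parallel1Sketch and run_sketch are hypothetical names, the signature differs from the Parallel1 call shown above (i and p are computed inside the kernel), and the 3x3 test values are made up. It keeps the p = i*Row + j indexing of the host loop as written and assumes a device with double-precision support.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical variant of the kernel: one thread per row i in [k, Row-1].
__global__ void Parallel1Sketch(int Row, int Col, int k, int j, double t,
                                double *q_d, const double *r_d)
{
    int i = k + blockIdx.x * blockDim.x + threadIdx.x;  // replaces the loop counter
    if (i < Row) {                                      // extra threads do nothing
        int p = i * Row + j;                            // same indexing as the host loop
        q_d[p] = q_d[p] - 2.0 * t * r_d[i * Col + k];
    }
}

// Runs the host loop and the kernel on the same made-up data;
// returns the number of elements that differ.
int run_sketch()
{
    const int Row = 3, Col = 3, n = Row * Col, k = 0, j = 1;
    const double t = 0.5;
    double q_ref[n], q_h[n], r_h[n];
    for (int x = 0; x < n; ++x) { q_ref[x] = 1.0; q_h[x] = 1.0; r_h[x] = x + 1.0; }

    // Host reference: the loop from the original post.
    for (int i = k; i <= Row - 1; i++) {
        int p = i * Row + j;
        q_ref[p] = q_ref[p] - 2.0 * t * r_h[i * Col + k];
    }

    double *q_d = NULL, *r_d = NULL;
    cudaMalloc((void **)&q_d, n * sizeof(double));
    cudaMalloc((void **)&r_d, n * sizeof(double));
    cudaMemcpy(q_d, q_h, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(r_d, r_h, n * sizeof(double), cudaMemcpyHostToDevice);

    Parallel1Sketch<<<1, n>>>(Row, Col, k, j, t, q_d, r_d);  // 1 block, 9 threads
    if (cudaDeviceSynchronize() != cudaSuccess) return -1;

    cudaMemcpy(q_h, q_d, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(q_d);
    cudaFree(r_d);

    int mismatches = 0;
    for (int x = 0; x < n; ++x) {
        if (fabs(q_h[x] - q_ref[x]) > 1e-12) ++mismatches;
    }
    return mismatches;
}

int main()
{
    printf("mismatches vs host loop: %d\n", run_sketch());
    return 0;
}
```

Comparing against the plain C loop on the host, as run_sketch does, is a cheap way to confirm that the device result actually changed and changed correctly.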