I’m a bit busy and haven’t looked at the code in depth, but I think you should review the indexes: in ‘C[i*wC+j]=C[(i-1)*wC+j-1]+1;’, when i = 0 the statement becomes C[j] = C[-wC+j-1]+1, which reads a negative index, and that is wrong.
When the problem has the shape of a matrix, it is better to make your kernel grid two-dimensional too, because the indexing becomes simpler.
If you want, have a look at the CUDA SDK examples, such as vectorAdd, which is an easy one.
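To make the two points above concrete, here is a minimal sketch of a two-dimensional launch with guarded indexes. It is not your kernel: the names shiftDiagonal and runShiftDiagonal are made up for the illustration, the input and output are separate buffers so that threads do not race on C, and treating a missing up-left neighbour as 0 is purely an assumption.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per cell of an hC x wC row-major matrix.
__global__ void shiftDiagonal(const int* in, int* out, int hC, int wC)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    if (i >= hC || j >= wC) return;                 // surplus threads do nothing

    // The up-left neighbour only exists when i > 0 and j > 0.
    // Assumption: a missing neighbour counts as 0.
    int upLeft = (i > 0 && j > 0) ? in[(i - 1) * wC + (j - 1)] : 0;
    out[i * wC + j] = upLeft + 1;
}

// Host helper (hypothetical): copies the data, launches a 2D grid, returns the result.
std::vector<int> runShiftDiagonal(const std::vector<int>& in, int hC, int wC)
{
    size_t bytes = in.size() * sizeof(int);
    int *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, in.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    // Round up so that the grid covers the whole matrix.
    dim3 grid((wC + block.x - 1) / block.x, (hC + block.y - 1) / block.y);
    shiftDiagonal<<<grid, block>>>(dIn, dOut, hC, wC);

    std::vector<int> out(in.size());
    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}

int main()
{
    const int hC = 3, wC = 4;
    std::vector<int> in(hC * wC, 5);
    std::vector<int> out = runShiftDiagonal(in, hC, wC);
    for (int i = 0; i < hC; ++i) {
        for (int j = 0; j < wC; ++j) printf("%d ", out[i * wC + j]);
        printf("\n");
    }
    return 0;
}
```

With a 2D grid the row and the column come straight from blockIdx and threadIdx, so the i = 0 and j = 0 cases are easy to spot and guard.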
currentIndex goes from 0 to MaxNumberOfThreads - 1 (MaxNumberOfThreads being gridDim.x*blockDim.x), so in ‘if (A[current_Index-1]==B[current_Index-1])’
thread 1 accesses A[0] and B[0], but the threads with currentIndex > 16 will access A and B out of bounds. You have to add a condition.
You may already know this, but currentIndex is the unique identifier of a thread in the grid. When we have more
threads than elements in an array or matrix, we have to check whether the thread identifier exceeds the size of that array or matrix.
Again, look at vectorAdd (CUDA SDK): in that example two arrays are added and the result is stored in a third vector.
There, i plays the same role as your currentIndex, and the programmer checks that i does not exceed N, the size of A and B. Have a look!
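The condition described above could look roughly like this. It is only a sketch in the style of vectorAdd, not your code: compareElements, runCompare and the match array are hypothetical names, and the two inputs are assumed to have the same length.

```cuda
#include <cstdio>
#include <string>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: match[k] = 1 when A[k] == B[k], else 0.
__global__ void compareElements(const char* A, const char* B, int* match, int n)
{
    int currentIndex = blockIdx.x * blockDim.x + threadIdx.x;
    // The launch may create more threads than elements, so guard the access.
    if (currentIndex < n) {
        match[currentIndex] = (A[currentIndex] == B[currentIndex]) ? 1 : 0;
    }
}

// Host helper (hypothetical); assumes a and b have the same length.
std::vector<int> runCompare(const std::string& a, const std::string& b)
{
    int n = static_cast<int>(a.size());
    char *dA = nullptr, *dB = nullptr;
    int *dMatch = nullptr;
    cudaMalloc(&dA, n);
    cudaMalloc(&dB, n);
    cudaMalloc(&dMatch, n * sizeof(int));
    cudaMemcpy(dA, a.data(), n, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), n, cudaMemcpyHostToDevice);

    int threadsPerBlock = 32;  // with n = 16, half of these threads are surplus
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    compareElements<<<blocks, threadsPerBlock>>>(dA, dB, dMatch, n);

    std::vector<int> match(n);
    cudaMemcpy(match.data(), dMatch, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dMatch);
    return match;
}

int main()
{
    std::vector<int> m = runCompare("ACGTACGTACGTACGT", "ACGTTCGTACGAACGT");
    for (int v : m) printf("%d", v);
    printf("\n");
    return 0;
}
```

The number of threads is rounded up to a whole number of blocks, which is exactly why the `currentIndex < n` test is needed.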
I have checked the vectorAdd and matrixmultiply examples in the CUDA SDK,
but the difference in my algorithm is that I don’t fill the matrix cell by cell in order; instead, the anti-diagonals are filled one after another
for example
in the first iteration I fill
C[1,1]
__syncthreads();
then
C[1,2],C[2,1]
__syncthreads();
then
C[1,3],C[2,2],C[3,1]
__syncthreads();
…
until I fill C[N,M], the last anti-diagonal.
So the difficulty lies in how to change the index to reach the next cell.
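One possible way to step through the anti-diagonals, as a rough sketch only: each thread owns one row i, and on anti-diagonal d (where d = i + j) it works on column j = d - i if that column exists. Since __syncthreads() only synchronizes the threads of a single block, the sketch launches exactly one block, so N must not exceed the block size limit (1024 on current GPUs). The recurrence used here is the longest-common-subsequence table, which is an assumption because the full kernel was not posted; lcsWavefront and lcsLength are hypothetical names.

```cuda
#include <cstdio>
#include <string>
#include <cuda_runtime.h>

// Hypothetical kernel: C is (N+1) x (M+1), row-major, with row 0 and column 0 set to 0.
__global__ void lcsWavefront(const char* A, const char* B, int* C, int N, int M)
{
    int wC = M + 1;
    int i = threadIdx.x + 1;               // this thread owns row i (1-based)
    for (int d = 2; d <= N + M; ++d) {     // d = i + j identifies the anti-diagonal
        int j = d - i;                     // column of this thread on diagonal d
        if (i <= N && j >= 1 && j <= M) {
            if (A[i - 1] == B[j - 1])
                C[i * wC + j] = C[(i - 1) * wC + (j - 1)] + 1;
            else
                C[i * wC + j] = max(C[(i - 1) * wC + j], C[i * wC + (j - 1)]);
        }
        // Every thread reaches the barrier, including idle ones,
        // so diagonal d is complete before diagonal d + 1 starts.
        __syncthreads();
    }
}

// Host helper (hypothetical): returns C[N,M], or -1 if N does not fit in one block.
int lcsLength(const std::string& a, const std::string& b)
{
    int N = static_cast<int>(a.size()), M = static_cast<int>(b.size());
    if (N == 0 || M == 0) return 0;
    if (N > 1024) return -1;               // single-block sketch only

    char *dA = nullptr, *dB = nullptr;
    int *dC = nullptr;
    size_t cBytes = static_cast<size_t>(N + 1) * (M + 1) * sizeof(int);
    cudaMalloc(&dA, N);
    cudaMalloc(&dB, M);
    cudaMalloc(&dC, cBytes);
    cudaMemcpy(dA, a.data(), N, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), M, cudaMemcpyHostToDevice);
    cudaMemset(dC, 0, cBytes);             // zero border row and column

    lcsWavefront<<<1, N>>>(dA, dB, dC, N, M);

    int result = 0;
    cudaMemcpy(&result, dC + static_cast<size_t>(N) * (M + 1) + M,
               sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return result;
}

int main()
{
    printf("%d\n", lcsLength("ABCBDAB", "BDCABA"));
    return 0;
}
```

For matrices with more rows than one block can hold, a common alternative is to launch one kernel per anti-diagonal, because consecutive launches on the same stream run in order.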