In my test kernel I am declaring three big LOCAL arrays, like double x; double y; double z… it's bad, I know… >.<
My first query:
This is because of the following operations per thread:
(all data types are DOUBLE, and I can have 128 or 64 threads/block)
1. Initialize the 0th column of a 42 by 13 matrix (matrix x above) in local memory, then
for (i = 1 to 12)
- a matrix-vector product per thread of type -> 42 by 13 (currently in local memory) times 13 by (i-1) (this (i-1)th vector is in constant memory)
- from this I get a vector y of length 42
- this vector y updates column i of the above 42 by 13 matrix
2. A matrix-vector product per thread of type -> 42 by 13 (UPDATED MATRIX FROM OPERATION 1) times 13 by 1 (this vector is also in constant memory)
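A rough sketch of the per-thread work described above, written as plain host-side C++ rather than CUDA so it can be compiled anywhere. It rests on several assumptions that the post does not spell out: the name `per_thread_work` and the parameters `col0`, `c` and `d` are hypothetical, the (i-1)th constant-memory vector is taken to have 13 entries, and only the columns already written (0 to i-1) are assumed to contribute to the product. Operation 2 is accumulated into `z` as soon as each column is finished, which is one possible reading of the interleaving.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N_ROWS = 42;  // rows of the per-thread matrix
constexpr std::size_t N_COLS = 13;  // columns of the per-thread matrix

using Matrix = std::array<std::array<double, N_COLS>, N_ROWS>;
using Column = std::array<double, N_ROWS>;
using Coeffs = std::array<double, N_COLS>;

// Hypothetical stand-in for what ONE thread computes.
//   col0 : initial values of column 0
//   c    : the 12 vectors of operation 1 (constant memory in the kernel)
//   d    : the vector of operation 2 (constant memory in the kernel)
Column per_thread_work(const Column& col0,
                       const std::array<Coeffs, N_COLS - 1>& c,
                       const Coeffs& d)
{
    Matrix x{};  // the big per-thread matrix, zero-initialised
    Column z{};  // result of operation 2, accumulated column by column

    // Initialise column 0 and start operation 2 with it.
    for (std::size_t r = 0; r < N_ROWS; ++r) {
        x[r][0] = col0[r];
        z[r] = x[r][0] * d[0];
    }

    for (std::size_t i = 1; i < N_COLS; ++i) {
        for (std::size_t r = 0; r < N_ROWS; ++r) {
            // Operation 1: y = x * c[i-1], assuming only columns 0..i-1 count.
            double y = 0.0;
            for (std::size_t k = 0; k < i; ++k) {
                y += x[r][k] * c[i - 1][k];
            }
            x[r][i] = y;       // y updates column i
            z[r] += y * d[i];  // operation 2, interleaved with operation 1
        }
    }
    return z;
}
```

Calling `per_thread_work(col0, c, d)` with vectors `c` that simply copy the previous column and a `d` of all ones should return 13 times `col0`, which is a quick way to check the loop structure.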
I am already interleaving both of them in one set of for loops, as the 2nd operation can be performed immediately after one column of the first operation gets done, but I still need three big local arrays :( .
I have only 10 KB of shared memory left, which is not sufficient to store even one column of those arrays, as they are double precision…
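The arithmetic behind that claim can be checked with a small sketch. It assumes 8-byte doubles and takes "10 KB" to mean 10 * 1024 bytes; the function and constant names are made up for illustration.

```cpp
#include <cstddef>

constexpr std::size_t kRows = 42;               // rows of the per-thread matrix
constexpr std::size_t kCols = 13;               // columns of the per-thread matrix
constexpr std::size_t kFreeShared = 10 * 1024;  // assumed: 10 KB of free shared memory

// Bytes one thread needs for a full 42 x 13 matrix of doubles.
constexpr std::size_t matrix_bytes_per_thread() {
    return kRows * kCols * sizeof(double);
}

// Bytes a whole block needs to hold ONE 42-element column per thread.
constexpr std::size_t column_bytes_per_block(std::size_t threads_per_block) {
    return kRows * sizeof(double) * threads_per_block;
}

// Even the smaller block size does not fit one column per thread in 10 KB.
static_assert(column_bytes_per_block(64) > kFreeShared,
              "one column per thread exceeds the free shared memory at 64 threads");
static_assert(column_bytes_per_block(128) > kFreeShared,
              "one column per thread exceeds the free shared memory at 128 threads");
```

With 8-byte doubles, one column per thread comes to 21504 bytes for 64 threads and 43008 bytes for 128 threads, against 10240 bytes available. The full matrix is 4368 bytes per thread.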
I have spent > 10 hrs but I can't find a shared memory solution to the above… are there any algorithms which can help me achieve the above with the help of the free shared memory I have?
There is no need for synchronization at any level of memory access, as everything is on a per-thread basis.
My second STRANGE query:
The local arrays are huge, almost equivalent to 4.5 KB/thread of local memory. But as soon as I try to read from them after they are written, I get “unspecified launch failures” at random parts of my code. If I comment out the code which is accessing (reading) some of them, then the kernel runs, but the answers are wrong (obviously).
I can store these arrays in global memory and try to access them, but it would be very difficult, if not impossible, to coalesce the accesses for those.
So I am curious to know if there is any limit on the local memory per thread?
I guess I have to redesign my algorithm… in that case.
Thanks all… I know it's a long post