Two questions: a shared memory algorithm, and local memory limitations. Is there a limit on local memory per thread?

In my test kernel I am declaring three big LOCAL arrays: double x[546], double y[42], and double z[42]… it's bad, I know… >.<

My first query:

This is because of the following operations per thread :

(all data is DOUBLE precision, and I can have 64 or 128 threads per block)

operation 1)

initialize the 0th column of a 42 by 13 matrix (the x[546] array above) in local memory, then:
for(i = 1 to 12)
{

- a matrix-vector product per thread: the 42 by 13 matrix (currently in local memory) times the (i-1)th 13 by 1 vector (this vector is in constant memory)

- from this I get a vector y of length 42 (the y[42] above)

- this vector y updates column i of the above 42 by 13 matrix
}

operation 2)
a matrix-vector product per thread: the 42 by 13 matrix (UPDATED BY OPERATION 1) times a 13 by 1 vector (this vector is also in constant memory)
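For concreteness, the two operations can be sketched per thread in plain C (all names here are illustrative, not from the original code; `cvec` stands in for the constant-memory vectors, and I zero-initialize the matrix so the first products are well defined):

```c
#define ROWS 42
#define COLS 13

/* Per-thread computation sketch. Column-major storage: x[c*ROWS + r].
   cvec[j] plays the role of the j-th 13-element constant-memory vector,
   v is the vector from operation 2, col0 is the initial 0th column. */
static void per_thread(const double cvec[COLS][COLS], const double v[COLS],
                       const double col0[ROWS], double z[ROWS])
{
    double x[ROWS * COLS];   /* the 42x13 matrix (x[546] in the post) */
    double y[ROWS];          /* y[42] in the post */

    /* Operation 1: zero the matrix, fill column 0, build columns 1..12. */
    for (int c = 0; c < COLS; ++c)
        for (int r = 0; r < ROWS; ++r)
            x[c * ROWS + r] = 0.0;
    for (int r = 0; r < ROWS; ++r)
        x[r] = col0[r];
    for (int i = 1; i < COLS; ++i) {
        for (int r = 0; r < ROWS; ++r) {     /* y = X * cvec[i-1] */
            double s = 0.0;
            for (int k = 0; k < COLS; ++k)
                s += x[k * ROWS + r] * cvec[i - 1][k];
            y[r] = s;
        }
        for (int r = 0; r < ROWS; ++r)       /* column i of X = y */
            x[i * ROWS + r] = y[r];
    }

    /* Operation 2: z = X * v on the updated matrix. */
    for (int r = 0; r < ROWS; ++r) {
        double s = 0.0;
        for (int k = 0; k < COLS; ++k)
            s += x[k * ROWS + r] * v[k];
        z[r] = s;
    }
}
```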

I am already interleaving the two operations in one set of for loops, since operation 2 can proceed as soon as a column from operation 1 is done, but I still need three big local arrays :( .

I have only 10 KB of shared memory left, which is not sufficient to store even one column of those arrays (they are double precision)…

I have spent more than 10 hours on this, but I can't find a shared memory solution. Are there any algorithms that could help me achieve the above with the free shared memory I have?

There is no need for synchronization at any level of memory access, as everything is on a per-thread basis.

My second STRANGE query:

The local arrays are huge, almost 4.5 KB of local memory per thread. As soon as I try to read from them after they are written, I get "unspecified launch failure" at random parts of my code. If I comment out the code that reads some of them, the kernel runs, but the answers are wrong (obviously).

One solution:

I can store these arrays in global memory and access them there, but it would be very difficult, if not impossible, to coalesce those accesses.

So I am curious to know: is there any limit on local memory per thread?

I guess I have to redesign my algorithm in that case.

Thanks all… I know it's a long post.

Such huge arrays declared in local memory, even though there is no failure when launching the program, can't stay in shared memory…
They will be offloaded to a specific part of global memory…

One way to look for a solution is to figure out whether you can work on small parts of your arrays: transfer each part (no more than 16384 bytes per block) to its thread block with coalesced access, work on it, then write the small part of the results back, coalesced as well.
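Roughly like this (a sketch with illustrative names; the chunk size and 64 threads per block come from the numbers in this thread, and the "work" step is a placeholder). Consecutive threads read consecutive global addresses, so loads and stores coalesce, and since each thread only touches its own slice of shared memory, no __syncthreads() is needed, matching the per-thread nature of the problem:

```cuda
#define THREADS 64
#define CHUNK   14   // doubles per thread per batch, ~7 KB shared per block

__global__ void staged_kernel(const double *g_in, double *g_out, int nChunks)
{
    __shared__ double s_buf[THREADS * CHUNK];

    for (int c = 0; c < nChunks; ++c) {
        // Coalesced load: for each i, threads of a block read
        // consecutive addresses of the current chunk.
        for (int i = 0; i < CHUNK; ++i) {
            int idx = (c * CHUNK + i) * gridDim.x * THREADS
                    + blockIdx.x * THREADS + threadIdx.x;
            s_buf[i * THREADS + threadIdx.x] = g_in[idx];
        }

        // Per-thread work on the staged slice (placeholder: scale by 2).
        for (int i = 0; i < CHUNK; ++i)
            s_buf[i * THREADS + threadIdx.x] *= 2.0;

        // Coalesced store of the partial results back to global memory.
        for (int i = 0; i < CHUNK; ++i) {
            int idx = (c * CHUNK + i) * gridDim.x * THREADS
                    + blockIdx.x * THREADS + threadIdx.x;
            g_out[idx] = s_buf[i * THREADS + threadIdx.x];
        }
    }
}
```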


As Electro says, consider staging the data in shared memory in batches to get the job done.

Thanks for the prompt replies…

I am doing the above operations on a per-thread basis, but yes, I can move data in small chunks to and from shared memory, e.g. working on 14 doubles per thread and writing back to global memory, which is ~7 KB of shared memory for 64 threads per block.

I will do this today and see how it goes. Thanks again…

Also, any idea why I am getting the "UNSPECIFIED LAUNCH FAILURE"? Have I reached some local memory limit?

Thanks,

NA

I don’t see why you should not be able to get coalesced memory access using global memory if it is possible using local memory.

About ULF I can only speculate without having read the code. Usually it means an out-of-bounds memory access (like a segfault). When do you really get the ULF? Always make sure you check the error from cudaThreadSynchronize() after your kernel launch, otherwise you might get some previous error.
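A minimal host-side check could look like this (kernel and pointer names are placeholders; cudaThreadSynchronize() is the runtime call of that CUDA era, later renamed cudaDeviceSynchronize()):

```cuda
// Check the error of THIS kernel rather than a stale one.
// cudaGetLastError() picks up launch-time errors and clears the error state;
// cudaThreadSynchronize() waits for the kernel to finish and reports
// execution errors such as an out-of-bounds access (the usual ULF cause).
myKernel<<<grid, block>>>(d_in, d_out);

cudaError_t err = cudaGetLastError();          // launch errors
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

err = cudaThreadSynchronize();                 // execution errors
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
```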
