Two questions: a shared memory algorithm, and local memory limitations. Is there a limit on local memory per thread?

In my test kernel I am declaring three big LOCAL arrays: double x[546], double y[42], and double z[42]… it's bad, I know… >.<

My first query:

This is because of the following operations per thread :

(all data is DOUBLE precision, and I can have 64 or 128 threads per block)

operation 1)

initialize the 0th column of a 42 by 13 matrix (the x[546] array above) in local memory, then:
for(i = 1 to 12)
{

- a matrix-vector product per thread: the 42 by 13 matrix (currently in local memory) times the (i-1)th 13 by 1 vector (this vector is in constant memory)

- from this I get a vector y of length 42 (the y[42] above)

- this vector y updates column i of the above 42 by 13 matrix
}

operation 2)
a matrix-vector product per thread: the 42 by 13 matrix (UPDATED BY OPERATION 1) times a 13 by 1 vector (this vector is also in constant memory)
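For concreteness, the two operations can be sketched per thread in plain C (all names here are illustrative, not from the original code; `cvec` stands in for the constant-memory vectors, and I zero-initialize the matrix so the first products are well defined):

```c
#define ROWS 42
#define COLS 13

/* Per-thread computation sketch. Column-major storage: x[c*ROWS + r].
   cvec[j] plays the role of the j-th 13-element constant-memory vector,
   v is the vector from operation 2, col0 is the initial 0th column. */
static void per_thread(const double cvec[COLS][COLS], const double v[COLS],
                       const double col0[ROWS], double z[ROWS])
{
    double x[ROWS * COLS];   /* the 42x13 matrix (x[546] in the post) */
    double y[ROWS];          /* y[42] in the post */

    /* Operation 1: zero the matrix, fill column 0, build columns 1..12. */
    for (int c = 0; c < COLS; ++c)
        for (int r = 0; r < ROWS; ++r)
            x[c * ROWS + r] = 0.0;
    for (int r = 0; r < ROWS; ++r)
        x[r] = col0[r];
    for (int i = 1; i < COLS; ++i) {
        for (int r = 0; r < ROWS; ++r) {     /* y = X * cvec[i-1] */
            double s = 0.0;
            for (int k = 0; k < COLS; ++k)
                s += x[k * ROWS + r] * cvec[i - 1][k];
            y[r] = s;
        }
        for (int r = 0; r < ROWS; ++r)       /* column i of X = y */
            x[i * ROWS + r] = y[r];
    }

    /* Operation 2: z = X * v on the updated matrix. */
    for (int r = 0; r < ROWS; ++r) {
        double s = 0.0;
        for (int k = 0; k < COLS; ++k)
            s += x[k * ROWS + r] * v[k];
        z[r] = s;
    }
}
```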

I am already interleaving the two operations in one set of for loops, since operation 2 can proceed as soon as a column from operation 1 is done, but I still need three big local arrays :( .

I have only 10 KB of shared memory left, which is not sufficient to store even one column of those arrays (they are double precision)…

I have spent more than 10 hours on this, but I can't find a shared memory solution. Are there any algorithms that could help me achieve the above with the free shared memory I have?

There is no need for synchronization at any level of memory access, as everything is on a per-thread basis.

My second STRANGE query:

The local arrays are huge, almost 4.5 KB of local memory per thread. As soon as I try to read from them after they are written, I get "unspecified launch failure" at random parts of my code. If I comment out the code that reads some of them, the kernel runs, but the answers are wrong (obviously).

One solution:

I can store these arrays in global memory and access them there, but it would be very difficult, if not impossible, to coalesce those accesses.

So I am curious to know: is there any limit on local memory per thread?

I guess I have to redesign my algorithm in that case.

Thanks all… I know it's a long post.

Such huge arrays declared in local memory, even though there is no failure when launching the program, can't stay in shared memory…
They will be offloaded to a specific part of global memory…

One way to look for a solution is to figure out whether you can work on small parts of your arrays: transfer each part (no more than 16384 bytes per block) to its thread block with coalesced access, work on it, then write the small part of the results back, coalesced as well.
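Roughly like this (a sketch with illustrative names; the chunk size and 64 threads per block come from the numbers in this thread, and the "work" step is a placeholder). Consecutive threads read consecutive global addresses, so loads and stores coalesce, and since each thread only touches its own slice of shared memory, no __syncthreads() is needed, matching the per-thread nature of the problem:

```cuda
#define THREADS 64
#define CHUNK   14   // doubles per thread per batch, ~7 KB shared per block

__global__ void staged_kernel(const double *g_in, double *g_out, int nChunks)
{
    __shared__ double s_buf[THREADS * CHUNK];

    for (int c = 0; c < nChunks; ++c) {
        // Coalesced load: for each i, threads of a block read
        // consecutive addresses of the current chunk.
        for (int i = 0; i < CHUNK; ++i) {
            int idx = (c * CHUNK + i) * gridDim.x * THREADS
                    + blockIdx.x * THREADS + threadIdx.x;
            s_buf[i * THREADS + threadIdx.x] = g_in[idx];
        }

        // Per-thread work on the staged slice (placeholder: scale by 2).
        for (int i = 0; i < CHUNK; ++i)
            s_buf[i * THREADS + threadIdx.x] *= 2.0;

        // Coalesced store of the partial results back to global memory.
        for (int i = 0; i < CHUNK; ++i) {
            int idx = (c * CHUNK + i) * gridDim.x * THREADS
                    + blockIdx.x * THREADS + threadIdx.x;
            g_out[idx] = s_buf[i * THREADS + threadIdx.x];
        }
    }
}
```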


As Electro says, consider staging the data in shared memory in batches to get the job done.

Thanks for the prompt replies…

I am doing the above operations on a per-thread basis, but yes, I can move data in small chunks to and from shared memory, e.g. working on 14 doubles per thread and writing back to global memory, which is ~7 KB of shared memory for 64 threads per block.

I will do this today and see how it goes. Thanks again…

Also, any idea why I am getting the "UNSPECIFIED LAUNCH FAILURE"? Have I reached some local memory limit?

Thanks,

NA

I don’t see why you should not be able to get coalesced memory access using global memory if it is possible using local memory.

About ULF I can only speculate without having read the code. Usually it means an out-of-bounds memory access (like a segfault). When do you really get the ULF? Always make sure you check the error from cudaThreadSynchronize() after your kernel launch, otherwise you might get some previous error.
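A minimal host-side check could look like this (kernel and pointer names are placeholders; cudaThreadSynchronize() is the runtime call of that CUDA era, later renamed cudaDeviceSynchronize()):

```cuda
// Check the error of THIS kernel rather than a stale one.
// cudaGetLastError() picks up launch-time errors and clears the error state;
// cudaThreadSynchronize() waits for the kernel to finish and reports
// execution errors such as an out-of-bounds access (the usual ULF cause).
myKernel<<<grid, block>>>(d_in, d_out);

cudaError_t err = cudaGetLastError();          // launch errors
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

err = cudaThreadSynchronize();                 // execution errors
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
```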
