Scope of shared memory in CUDA

Dear All

I had a program working with variable "a" in the video memory and it works fine. Then I want to pass that variable as shared in the below way. The scope of "a" must be in a single block. Then if I run multiple blocks as the example below, "a" must be allocated multiple times one in each block. But each block needs almost all shared memory of a smx. Then despite each smx has 192 cores it runs only one block of 32 threads because it has only shared memory for that block. Is that what happen? 

I asking that because it is given bad results. It seems that it has someway race problems and it is not happening like I said above.

Thanks

Luis Gonçalves

Compute Capability 3.0
CUDA 6.5

__global__ void process1(int nusersvirt,complex1 *codigo,complex1 *out1,
 complex1 *fft_in,int nredund,float *eqin,complex1 *RO,complex1 *FR,complex1 *tran1)
{
	__device__ __shared__ complex1 a[5472];

}

	  dim3 diblock(32,1,1);
	  dim3 grid5(16,14,1);

process1<<<grid5,diblock,171*32*8>>>(z7,codigo,out1,comp1,br,rrceqdevfi,RO,FR,tran27);

line 4: you likely do not need device together with shared

it is not entirely clear what you are saying/ asking

“I want to pass that variable as shared”

i presume “a” originates from global memory, and is read into shared

“must be allocated multiple times one in each block”

i.e. “a” is allocated on a block basis, it is allocated for each block, and its size is not necessarily dependent on the total number of blocks

“But each block needs almost all shared memory of a smx”

perhaps.
an alternative is to loop over the variable, in order to fully cover it
if “a” has length of x, you can cover it with a smaller sized block, by looping within the block
it depends on the reason why you wish to use shared memory in the first place

“Then despite each smx has 192 cores it runs only one block of 32 threads because it has only shared memory for that block”

A: each smx has 192 cores
B: it runs only one block of 32 threads because it has only shared memory for that block

i am not sure there is a direct link between A and B

“it has someway race problems”

you could run racecheck

Before, in a previous program I declared “a” in video memory. In a new upgraded program I want that variable be of type shared (and I remove it from video memory (global memory)). That variable is transitory.

My question is: if I have only one smx, in the program above, what happens?

A: In 192 cores, run 192/32 blocks at the same time and the shared memory “a” is written by all concurrently. In that case how I run each block at the time in order each block have exclusive access to “a”

B: or it is ran only on block at time because there is only shared memory for one block at a time

if “a” was in global memory, and if “a” is transitory, how did you prevent races at global memory level between thread blocks?

the potential for and treatment of races on the variable are more or less the same, regardless of whether the variable is in shared, or in global memory

what is the depth (array length) of a?
do you initialize it at the very beginning with input data?
do all thread blocks access the entire variable array, or only parts thereof?

with regards to B:
to view it differently, the gpu would look at the local and shared memory requirements per block, among other things, and then decide how many blocks can be seated per sm and for a particular sm, at the same time

-“a” in global memory is a[54721614] (with CudaMalloc). The program runs fine

-“a” shared memory

“a” is accessed in blocks of 32(threads)x8 consecutive bytes (each thread accesses 4 bytes of each 8 (real or imaginary part, floating) a time). a

struct
{
float r;
float i;
}complex1;

complex1 a[171*32]

thread[0] accesses a[0+i32].r
thread[1] accesses a[1+i
32].r

-“a” in global memory is a[54721614] (with CudaMalloc). The program runs fine

i presume this is followed by a h2d memory copy soon afterwards?

if you use shared memory, you still need the initiating h2d memory copy, and a read from global to shared

also, i am not sure about your ‘coupling’
essentially, i would think that, if you use a structure, you must read in a structure; otherwise, you need to deep copy at some point

(i can not believe i am about to reference the debugger)
you could also use the debugger and peak at the data in shared memory, to see whether it is valid/ garbage

The shared (or global memory in the first program) memory is initiated within the threads, it is transitory memory. I think I do not need to copy from anywhere.

The shared memory is organized in 32 x complex1,32 x complex1,(x171)
Each thread (of the block) have a complex1 in each group of 32 x complex1

“The shared (or global memory in the first program) memory is initiated within the thread”

understood

process1<<<grid5,diblock,171328>>>

you are both allocating statically and dynamically

either extern or do not extern

"Each thread (of the block) have a complex1 in each group of 32 x complex1 "

i am still thinking whether threads are not stepping one each others’ toes

“171328”

where does the 171 come from?

you generally allocate shared, according to the requirements per block, regardless of the number of blocks (grid dimensions)

dim3 diblock(32,1,1);

shared complex1 a[32 * 8]; // assuming sizeof(complex1) == 8

I put

process1<<<grid5,diblock>>>

instead of

process1<<<grid5,diblock,171328>>>

and worked.

I do not reach the performance I want but that is another problem.

Thanks

make sure you are not over allocating shared memory within blocks

shared complex1 a[32 * 8];

versus

complex1 a[171*32]

etc

otherwise, it would degrade the number of thread blocks running per sm

taking advantage of the topic… a variable declared in the shared memory has same content for all blocks? For example:

__global__ void kernel(){
    __shared__ int var;
    var = -1;
    if(blockIdx.x == 0 && threadIdx.x == 0){
        var = 0;
    }
    else if(blockIdx.x == 1 && threadIdx.x == 0){
        var = 1;
    }
    printf("Block: %d thread %d valor %d\n", blockIdx.x, threadIdx.x, var);
}
int main(){
    kernel <<< 2, 1024 >>> ();
    cudaDeviceSynchronize();
    return 0;
}

If blockIdx.x == 0 then all threads of block 0 var is 0 at same time that blockIdx.x == 1 then all threads of block 1 var is 1 ? or the content var is changed for both blocks?

I really need

share complex1 a[171*32]

Each thread needs a array of complex1 with 171 lenght

I have a version with very good results but I was trying to improve more