malloc shared memory to 1.1 device and cudaDeviceMapHost

I have two problems with cuda.

The first one is how to allocate an array in kernel shared memory. I declared it `extern`, but maybe I didn't do it right, because I get wrong results. For example, I want an array with 10 elements, and I did

extern __shared__ float a[];

a[idx] = idx;
global[idx] = a[idx];

After that I printed the global array and the first 4 elements were correct, but the other 6 were zero. Without __syncthreads() I got the right result.
On the other hand, if I did

__shared__ float a[10];

a[idx] = idx;
global[idx] = a[idx];

the printout had the right results.
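For reference, the static case described above as a complete kernel might look like this (a sketch based on the snippets in this post; the kernel name and launch configuration are my assumptions):

```cuda
// Sketch of the static-shared-memory case: the array size is fixed
// at compile time, so no extra launch parameter is needed.
__global__ void static_shared(float *global)
{
    __shared__ float a[10];     // statically sized shared array

    unsigned int idx = threadIdx.x;
    a[idx] = idx;               // each thread writes its own slot
    global[idx] = a[idx];       // and reads the same slot back
}

// launch with one block of 10 threads, no third launch parameter:
// static_shared<<<1, 10>>>(dev_global);
```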

The other problem is with passing cudaHostAllocMapped to cudaHostAlloc. I mapped a float variable with cudaHostAlloc and then used cudaHostGetDevicePointer(). Inside the kernel I assigned a number to the float variable; after the kernel I printed the variable and the result was 0, but on the second run of the same kernel the result was that number. When I put a sleep(1) between the kernel launch and the printf, the result was the number from the first kernel run, not zero.

What is happening?

Sorry for my English.

For `extern __shared__` declarations you have to modify the way you call the kernel, and pass the shared memory size to be allocated as the third parameter after the grid size and block size. Your code could look like this:

__global__ void mykernel(float *global, unsigned int n)
{
    extern __shared__ float a[];

    unsigned int idx = threadIdx.x;

    a[n-1-idx] = n-1-idx;
    __syncthreads();

    global[idx] = a[idx];
}

const unsigned int n = 50;
mykernel<<<1, n, n*sizeof(float)>>>(global, n);

I've changed the assignment to `n-1-idx` to make the `__syncthreads()` necessary and thus the kernel a little more interesting.

Why is the kernel's third argument useful? I don't understand exactly how the shared memory is dynamic now. And if you had written `a[idx] = idx`, would the `__syncthreads()` have been useless?

Without the third argument, no memory will actually be allocated for the variable declared `extern __shared__`. With the third argument, it will be allocated dynamically at kernel launch time (see appendix B.2.3 of the Programming Guide).

In the previous example where `a[idx] = idx` was used, each thread was just reading back the value it had written itself, so no `__syncthreads()` was necessary.

OK, I will run it like this and see the results.

Do you know what is happening in my second problem?

I'm not sure I understand your problem. However it seems you need to call `cudaThreadSynchronize()` before using the mapped memory from the host side, to make sure the kernel has executed before you try to access the result.
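A sketch of that flow, assuming a zero-copy mapped allocation as described in the question (the kernel name, the value 42, and the launch configuration are made up for illustration):

```cuda
#include <cstdio>

__global__ void write_num(float *dev_ptr)
{
    *dev_ptr = 42.0f;   // kernel writes into the mapped host allocation
}

int main()
{
    float *host_ptr, *dev_ptr;

    // must be called before any mapped allocation is made
    cudaSetDeviceFlags(cudaDeviceMapHost);

    cudaHostAlloc((void**)&host_ptr, sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&dev_ptr, host_ptr, 0);

    write_num<<<1, 1>>>(dev_ptr);

    // Kernel launches are asynchronous: without this synchronization the
    // printf below can run before the kernel has written anything, which
    // is why the first run printed 0 and only the second run (or a
    // sleep(1)) showed the value.
    cudaThreadSynchronize();

    printf("%f\n", *host_ptr);

    cudaFreeHost(host_ptr);
    return 0;
}
```

(`cudaThreadSynchronize()` is the call from this era of the runtime API; later CUDA versions deprecate it in favor of `cudaDeviceSynchronize()`.)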

Thank you for your help.

For the second problem I started a new thread and someone answered me there.

If I want to have more than one shared array inside a kernel, can I malloc them?

No, there is no malloc for shared memory. You have to manually split one array into multiple ones. Appendix B.2.3 of the Programming Guide shows how to do this.

I don't understand exactly how `extern` works if you want more than one dynamic array. If I want two dynamic arrays (an array of structs and an integer array), how do I do it?

Like this:

__global__ void mykernel(int N, int M)
{
    // one dynamic shared block, split manually into two arrays
    // (mystruct stands in for whatever struct type you use)
    extern __shared__ struct mystruct array0[];   // first N structs
    int *array1 = (int*)&array0[N];               // then M ints

    // ... use array0[0..N-1] and array1[0..M-1] here ...
}

// host side; N and M are the sizes of the arrays
size_t shared_size = N*sizeof(struct mystruct) + M*sizeof(int);
mykernel<<<1, 10, shared_size>>>(N, M);
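One caveat worth adding (my note, not from the reply above): when packing arrays of different types into one dynamic shared block, put the type with the larger alignment requirement first, otherwise the pointer to the second array can end up misaligned. A sketch, assuming a double array and a short array:

```cuda
// Splitting one dynamic shared block into a double array and a short
// array. Placing the doubles first keeps array1's address valid for
// shorts; the reverse order could leave the doubles misaligned.
__global__ void split_kernel(int N, int M)
{
    extern __shared__ double array0[];    // N doubles, 8-byte aligned
    short *array1 = (short*)&array0[N];   // M shorts after them

    // ... use array0[0..N-1] and array1[0..M-1] here ...
}

// host side:
// size_t shared_size = N*sizeof(double) + M*sizeof(short);
// split_kernel<<<1, threads, shared_size>>>(N, M);
```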