malloc shared memory to 1.1 device and cudaDeviceMapHost

pagratios · March 20, 2011, 10:07pm

I have two problems with CUDA.

The first is how to allocate an array in shared memory inside a kernel. I declared it as extern, but maybe I did it wrong, because I get wrong results. For example, I want an array of 10 elements, so I wrote

extern __shared__ float a[];

a[idx] = idx;
__syncthreads();
global[idx] = a[idx];

Afterwards I printed the global array: the first 4 elements were correct and the other 6 were zero. Without __syncthreads() I got the right results.
On the other hand, if I wrote
__shared__ float a[10];

a[idx] = idx;
__syncthreads();
global[idx] = a[idx];

the printed results were correct.

The other problem is with cudaHostAlloc and cudaHostAllocMapped. I mapped a float variable with cudaHostAlloc and then called cudaHostGetDevicePointer(). Inside the kernel I assign a number to the float variable, and after the kernel I print it. The result is 0, and on the second run of the same kernel the result is the number. When I put a sleep(1) between the kernel and the printf, the result was the number from the first kernel run and not zero.

What is happening?

Sorry for my English.

tera · March 21, 2011, 1:42am

For extern __shared__ declarations you have to modify the way you call the kernel and give the shared memory size to be allocated as the third launch parameter, after the grid size and block size. Your code could look like this:

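__global__ void mykernel(float *global, unsigned int n)
{
    extern __shared__ float a[];   // size is set by the third launch parameter
    unsigned int idx = threadIdx.x;

    a[n-1-idx] = n-1-idx;          // write a slot that another thread will read
    __syncthreads();               // wait until every thread has written
    global[idx] = a[idx];
}

const unsigned int n = 50;

mykernel<<<1, n, n*sizeof(float)>>>(global, n);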

I’ve changed the assignment to n-1-idx to make the __syncthreads() necessary and thus the kernel a little more interesting.

pagratios · March 21, 2011, 1:53am

Why is the kernel’s third launch argument useful? I don’t understand exactly how the shared memory is now dynamic. If you had used a[idx] = idx, would the __syncthreads() have been unnecessary?

tera · March 21, 2011, 2:27am

Without the third argument, no memory will actually be allocated for the variable declared extern __shared__. With the third argument, it will be allocated dynamically at kernel launch time (see appendix B.2.3 of the Programming Guide).

In the previous example where a[idx] = idx was used, each thread was just reading back the value it had written itself, so no __syncthreads() was necessary.
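
To make the fragment above easy to try, here is a minimal self-contained sketch of the same idea. The kernel name reverseFill, the host-side checking loop, and the error handling are assumptions added for illustration and are not part of the original reply. Because each thread reads a slot that a different thread wrote, out[idx] should end up equal to idx only if the barrier and the third launch argument are both in place.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread writes slot n-1-idx and later reads slot idx, so it reads a
// value written by a different thread; the barrier is therefore required.
__global__ void reverseFill(float *out, unsigned int n)
{
    extern __shared__ float a[];   // sized by the third launch argument
    unsigned int idx = threadIdx.x;

    a[n - 1 - idx] = (float)(n - 1 - idx);
    __syncthreads();
    out[idx] = a[idx];
}

int main()
{
    const unsigned int n = 50;
    float host[n];
    float *dev = NULL;

    if (cudaMalloc((void **)&dev, n * sizeof(float)) != cudaSuccess) return 1;

    // One block of n threads, n floats of dynamic shared memory.
    reverseFill<<<1, n, n * sizeof(float)>>>(dev, n);
    if (cudaGetLastError() != cudaSuccess) return 1;   // launch failure

    // cudaMemcpy waits for the kernel to finish before copying.
    if (cudaMemcpy(host, dev, n * sizeof(float),
                   cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
    cudaFree(dev);

    for (unsigned int i = 0; i < n; ++i) {
        if (host[i] != (float)i) {
            printf("mismatch at %u\n", i);
            return 1;
        }
    }
    printf("all %u values correct\n", n);
    return 0;
}
```

Launching the same kernel without the third argument leaves a[] with no storage behind it, which is consistent with the partly wrong output described in the first post.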

pagratios · March 21, 2011, 9:50am

OK, I will run it this way and see the results.

Do you know what is happening with my second problem?

tera · March 22, 2011, 1:09pm

I’m not sure I understand your problem. However it seems you need to put cudaThreadSynchronize() before using the mapped memory from the host side to make sure the kernel has executed before you try to access the result.
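
A minimal sketch of that advice, assuming a device that supports mapped host memory. The kernel name writeValue and the value 42 are made up for illustration. The sketch uses cudaDeviceSynchronize(), which replaced the now deprecated cudaThreadSynchronize() starting with CUDA 4.0; on an older toolkit the older call would take its place.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: stores one value through the mapped pointer.
__global__ void writeValue(float *p, float v)
{
    *p = v;   // the store goes to mapped (zero-copy) host memory
}

int main()
{
    float *hostPtr = NULL;
    float *devPtr = NULL;

    // Must be called before the CUDA context is created to enable mapping.
    if (cudaSetDeviceFlags(cudaDeviceMapHost) != cudaSuccess) return 1;

    if (cudaHostAlloc((void **)&hostPtr, sizeof(float),
                      cudaHostAllocMapped) != cudaSuccess) return 1;
    *hostPtr = 0.0f;
    if (cudaHostGetDevicePointer((void **)&devPtr, hostPtr, 0)
            != cudaSuccess) return 1;

    writeValue<<<1, 1>>>(devPtr, 42.0f);

    // Kernel launches are asynchronous: without this wait the host may
    // read *hostPtr before the kernel has run and still see 0.
    if (cudaDeviceSynchronize() != cudaSuccess) return 1;

    printf("value = %f\n", *hostPtr);
    int ok = (*hostPtr == 42.0f);
    cudaFreeHost(hostPtr);
    return ok ? 0 : 1;
}
```

This also matches the symptom in the first post: sleep(1) only happened to give the kernel enough time to finish, whereas the explicit synchronization guarantees it.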

pagratios · March 22, 2011, 1:16pm

Thank you for your help.

For the second problem I started a new thread (The Official NVIDIA Forums | NVIDIA) and someone answered me there.

pagratios · March 27, 2011, 11:49am

If I want more than one shared array inside a kernel, can I malloc them?

tera · March 27, 2011, 11:55am

No, there is no malloc for shared memory. You have to manually split one array into multiple ones. Appendix B.2.3 of the Programming Guide shows how to do this.
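
As a rough sketch of that manual split, assuming two arrays of different types: the struct Pair, the kernel name twoArrays, and the sizes N and M are hypothetical names chosen for illustration. The extern array is declared with the type that has the strictest alignment, and the second array simply starts where the first one ends.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct Pair { float x; float y; };   // hypothetical struct, illustration only

__global__ void twoArrays(int *out, int N, int M)
{
    // One dynamic allocation; both arrays are carved out of it.
    extern __shared__ Pair pairs[];    // first N elements
    int *ints = (int *)&pairs[N];      // next M elements, right after pairs

    int idx = threadIdx.x;
    if (idx < N) { pairs[idx].x = (float)idx; pairs[idx].y = 2.0f * idx; }
    if (idx < M) ints[idx] = 10 * idx;
    __syncthreads();                   // other threads' writes are read below

    if (idx < M) out[idx] = ints[M - 1 - idx] + (int)pairs[idx % N].x;
}

int main()
{
    const int N = 4, M = 8;
    int host[M];
    int *dev = NULL;
    if (cudaMalloc((void **)&dev, M * sizeof(int)) != cudaSuccess) return 1;

    // The launch must request enough bytes for both arrays together.
    size_t sharedBytes = N * sizeof(Pair) + M * sizeof(int);
    twoArrays<<<1, M, sharedBytes>>>(dev, N, M);
    if (cudaGetLastError() != cudaSuccess) return 1;
    if (cudaMemcpy(host, dev, M * sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess) return 1;
    cudaFree(dev);

    for (int i = 0; i < M; ++i) {
        int expected = 10 * (M - 1 - i) + (i % N);
        if (host[i] != expected) { printf("mismatch at %d\n", i); return 1; }
    }
    printf("ok\n");
    return 0;
}
```

If the second array needed stricter alignment than the first (for example doubles after chars), the order would have to be swapped or the offset rounded up, since the second pointer inherits whatever alignment the end of the first array happens to have.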

pagratios · April 15, 2011, 4:53pm

I don’t understand exactly how extern works if you want more than one dynamic array. If I want two dynamic arrays (one an array of structs and the other an integer array), how do I do it?

Like this?

__global__ void mykernel (int N, int M)
{
  extern __shared__ struct *array0;
  extern __shared__ int *array1;

  int* array1 = (int*)&array0[N];
  ...
}

main()
{
  int M,N; //size of arrays
  ....

  size_t shared_size = N*sizeof(struct) + M*sizeof(int);

  mykernel <<<1,10,shared_size>>>(N,M);
  ...
}
