How can I free (delete) arrays in shared memory?

Hi ghyun,

How can I free (delete) arrays in shared memory?

Can you please be more specific? Do you mean CUDA Fortran shared memory? CUDA-x86? OpenMP? Something else? More detail as to what you are doing would help.

Best Regards,
Mat

I mean CUDA Fortran shared memory.

At first, I thought shared memory would be freed when the subroutine ends.
But the shared memory was not freed; it was overwritten.
I solved the problem in my code by refining the indexing, but I am worried about memory capacity.
I tried

real,allocatable,shared :: A(:)
allocate(A(n)) ; deallocate(A)

but it did not work.
How should I use allocatable arrays in shared memory?


And i have another question.


module kernel
	contains
	attributes(global) subroutine reduction_thread(array,blocksum)
		real,device:: blocksum(:),array(:)
		real,shared:: temp(512)
		integer :: i,id,tid,bid,nn,n,m,mm
		id = (blockidx%x-1)*blockdim%x + threadidx%x
		bid=blockidx%x  ;  tid=threadidx%x
		n=size(array)  ;  m=griddim%x*blockdim%x
		temp(tid)=0. ;  i=0
     ! nn=blockdim%x/2   !! **4

		do while(i.le.10)
			if(id+i*m.le.n) temp(tid)=temp(tid)+array(id+i*m)
			i=i+1
			call syncthreads()
		enddo
		nn=blockdim%x/2   !! **3
		do while(nn.ge.1)
			if(tid.le.nn) temp(tid)=temp(tid)+temp(tid+nn)
			nn=nn/2
			call syncthreads()
		enddo
		if(tid.eq.1) blocksum(bid)=temp(1)	
	end subroutine

	attributes(global) subroutine timecheck(blocksum)    !! **1
		real,device :: blocksum(:)       
	end subroutine

	subroutine reduction(array)
		real,device :: array(:)
		real,allocatable,device :: blocksum(:)
		integer :: n
		real :: t1,t2
		allocate(blocksum(512))
		n=(size(array)+511)/512
		if(n.gt.512) n=512
		blocksum=0.
		call cpu_time(t1)
		call reduction_thread<<<n,512>>>(array,blocksum)
		call cpu_time(t2);print*,t2-t1,'111'
		call cpu_time(t1)
		call timecheck<<<1,1>>>(blocksum)
		call cpu_time(t2);print*,t2-t1,'222'     !! **2
		deallocate(blocksum)
	end subroutine
end module

My question is about subroutine timecheck (**1).
The only statement in (**1) is the declaration “real,device :: blocksum(:)”.
But a relatively large amount of elapsed time is spent in this subroutine.

  1. Why does it take so long? (I thought this time was spent reading device memory.)
    I then tried several ways to reduce the time, and got two more questions.
  2. real → real*8
    At first, I expected it to take longer because real*8 uses more memory than real, but it was faster than before, which confused me.
    (My GPU is a GeForce 9600 GT.)
  3. Moving the line “nn=blockdim%x/2” (**3)
    I moved it from (**3) to (**4), but then the code doesn’t work. (This is not the entire code; the entire code is a reduction. Frankly, question 3 is not really important, just curiosity.)

Every problem in my previous post was due to “call cpu_time(t1)”.
I changed cpu_time to cudaEventRecord, and then all the results were reasonable.
Everything I wanted to know has been answered.

Now I have a new question.

The hotspot is:

	attributes(global) subroutine reduction_thread(array,blocksum)
		real, intent(in) :: array(:)
		real :: blocksum(:)
		real, shared :: temp(512)
		integer :: i,id,tid,bid,nn,n,m
		id = (blockidx%x-1)*blockdim%x + threadidx%x
		bid=blockidx%x  ;  tid=threadidx%x
		n=size(array)  ;  m=griddim%x*blockdim%x
		temp(tid)=0.
		i=0
		do while(i<=10)
			if(id+i*m.le.n) temp(tid)=temp(tid)+array(id+i*m) !!!! ***1
			i=i+1
			call syncthreads()
		enddo
		nn=blockdim%x/2
		do while(nn.ge.1)
			if(tid<=nn) temp(tid)=temp(tid)+temp(tid+nn)
			nn=nn/2
			call syncthreads()
		enddo
		if(tid==1) blocksum(bid)=temp(1)
	end subroutine

The array “array” is in device memory.
The array “temp” is in shared memory.
Copying from device memory to shared memory is faster than “host to device”, isn’t it?
What’s wrong here?

Hi ghyun,

Every problem in my previous post was due to “call cpu_time(t1)”.
I changed cpu_time to cudaEventRecord, and then all the results were reasonable.

Correct. Since kernel calls are non-blocking, your host code continues executing, which makes it appear that no time was spent in the kernel. As you discovered, you need to use CUDA events to perform timing.
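For reference, event-based timing of the reduction above might look something like this (a minimal sketch, assuming the `n`, `array`, and `blocksum` variables from the earlier host subroutine; not tested here):

```fortran
! Time a kernel launch with CUDA events instead of cpu_time.
! Events are recorded into the GPU stream, so they bracket the
! actual kernel execution rather than the asynchronous launch.
use cudafor
type(cudaEvent) :: startEv, stopEv
real :: ms
integer :: istat
istat = cudaEventCreate(startEv)
istat = cudaEventCreate(stopEv)
istat = cudaEventRecord(startEv, 0)
call reduction_thread<<<n,512>>>(array, blocksum)
istat = cudaEventRecord(stopEv, 0)
istat = cudaEventSynchronize(stopEv)        ! wait for the kernel to finish
istat = cudaEventElapsedTime(ms, startEv, stopEv)
print *, 'kernel time (ms):', ms
istat = cudaEventDestroy(startEv)
istat = cudaEventDestroy(stopEv)
```

The `cudaEventSynchronize` call is what `cpu_time` was missing: without it, the host reads the clock before the kernel has actually run.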

How should I use allocatable arrays in shared memory?

While you can’t dynamically allocate memory within a kernel, you can set the size of the shared memory at launch time. Add a third argument to the chevron syntax giving the number of bytes of shared memory per block. Then in your kernel you can use assumed-size or automatic shared arrays. For details, please see the section titled “Shared data” in the CUDA Fortran Reference Manual (https://www.pgroup.com/doc/pgicudafortug.pdf).
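For illustration, a sketch of the assumed-size form (the name `reduce_dyn` is hypothetical; the body would follow the reduction kernel shown earlier):

```fortran
! Kernel side: declare the shared array assumed-size; its actual
! size comes from the third chevron argument at launch time.
attributes(global) subroutine reduce_dyn(array, blocksum)
    real, device :: array(:), blocksum(:)
    real, shared :: temp(*)       ! size fixed at launch, not here
    ! ... reduction body as in reduction_thread above ...
end subroutine

! Host side: third chevron argument = bytes of shared memory per
! block.  512 threads * 4 bytes per default real = 2048 bytes.
call reduce_dyn<<<nblocks, 512, 512*4>>>(array, blocksum)
```

This way the same kernel can be launched with different block sizes without recompiling, at the cost of computing the byte count on the host.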

Copying from device memory to shared memory is faster than “host to device”, isn’t it?
What’s wrong here?

Sorry, I’m not clear on what you are asking. Are you asking why using shared memory is faster than using device memory directly?

  • Mat

Thank you so much, Mat

When I asked this:

Copying from device memory to shared memory is faster than “host to device”, isn’t it?
What’s wrong here?

I thought this part was the hotspot:

temp(tid)=temp(tid)+array(id+i*m)
(“temp” is shared memory and “array” is device memory)

I guessed that using shared memory was slower than using device memory.
But every article and manual says using shared memory is better.
So I was embarrassed, and asked about it.


But now I have figured out that the problem isn’t due to shared memory; it is a result of “<<<512>>>”.
It was overhead from 512*512 threads accessing one array.
When I adjusted the block dimension to be smaller, the results were better.

Thanks again!

Hi ghyun,

I guessed that using shared memory was slower than using device memory.
But every article and manual says using shared memory is better.

Yes, shared memory is typically faster since it’s much closer to the individual thread processors and therefore faster to access. The caveat is that the amount of shared memory is limited, so the more you use, the fewer threads and blocks can run on a multiprocessor.
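You can see the per-block limit for your card by querying the device properties (a small sketch, assuming device 0; not tested here):

```fortran
! Query the shared-memory and thread limits that bound how many
! blocks can be resident on a multiprocessor at once.
use cudafor
type(cudaDeviceProp) :: prop
integer :: istat
istat = cudaGetDeviceProperties(prop, 0)
print *, 'shared memory per block (bytes):', prop%sharedMemPerBlock
print *, 'max threads per block:         ', prop%maxThreadsPerBlock
```

A kernel whose blocks each use a large fraction of `sharedMemPerBlock` can only keep a few blocks resident per multiprocessor, which reduces the hardware's ability to hide memory latency.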

  • Mat