how to avoid using local memory

Hi everyone,

My configuration is: Win7 64bit + VS2008 SP1 + 2 Tesla C2050 + CUDA 4.0 + Nsight 2.0

I want to use the following kernel to convert a float array from log-version to exp-version.

__global__ void log2exp(float *d_Le)
{
    const unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;

    d_Le[xIndex] = 2.0/(1.0+__expf(d_Le[xIndex]))-1.0;
}

It runs very slowly, so I used Nsight and found that the data are stored in local memory.
Can anyone tell me why the data are in local memory, and how I can avoid putting them there?
Thanks a lot.

sdtougao

It is probably “very slow” because you have mixed single- and double-precision arithmetic in the kernel, and double precision runs at half the peak FLOPS of single precision on your C2050. The local memory usage probably comes from the exp function, and there is nothing you can do about that. A check of the PTX output from the compiler will confirm this.
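For reference, the mixed-precision issue described above can be removed by making every literal single precision; the `f` suffixes are the only change from the posted kernel (a sketch, not a tested drop-in replacement):

```cuda
// Single-precision-only version of the posted kernel: the 2.0f/1.0f
// literals keep the whole expression in float, so the compiler emits
// no double-precision arithmetic or float<->double conversions.
__global__ void log2exp(float *d_Le)
{
    const unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;

    d_Le[xIndex] = 2.0f / (1.0f + __expf(d_Le[xIndex])) - 1.0f;
}
```

With double literals such as `2.0`, the C promotion rules force the division and addition to be done in double and the result converted back to float, which is much slower on the C2050.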

thank you very much, avidday.

Yes, I checked the PTX and found that it really is in local memory.

Can you explain in more detail why the exp function is in local memory? Also, I want to use shared memory as follows:

__shared__ float d_Le_sm[BLOCK_SIZE_1];

for (int m = 0; m < gridDim.x; m++)
{
    d_Le_sm[threadIdx.x] = d_Le[threadIdx.x + m*blockDim.x];
    __syncthreads();

    d_Le_sm[threadIdx.x] = __fdividef(2.0,(1.0+__expf(d_Le_sm[threadIdx.x])))-1.0;
    __syncthreads();

    d_Le[threadIdx.x + m*blockDim.x] = d_Le_sm[threadIdx.x];
    __syncthreads();
}

Do you think this will improve the performance?

Thanks,

sdtougao

I don’t see any local memory:

nvcc -Xptxas -v -arch=sm_20 source.cu

ptxas info : Function properties for _Z7log2expPf

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Used 18 registers, 40 bytes cmem[0], 4 bytes cmem[16]

Are you sure that local memory appears in this simple kernel?

I would not expect any local memory usage when this kernel is compiled with compiler default, and I see that Lung Sheng has already confirmed this. What compiler switches are you using, and what is the compiler output after adding -Xptxas -v to the nvcc commandline?

If memory serves, the only math library functions that use a bit of local memory are the trigonometric functions [this is documented in the Programming Guide], and they use local memory only in a “slow path” that is extremely unlikely to be taken in real-life code so there is no performance impact from this limited use of local memory.

This is the result from the Nsight analysis:
1.jpg

Sorry, I don’t know why the picture is so small. Please double-click on it to see the detail.

Here the kernel LLR2q() is just the same as the kernel in the example above. The local memory value reported is 45481984, so I think the data are stored in local memory.

Which kernel is it? You have posted two. If it is the second kernel, are you building it for debugging or release?

The kernel LLR2q() refers to:

__global__ void log2exp(float *d_Le)
{
    const unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;

    d_Le[xIndex] = 2.0/(1.0+__expf(d_Le[xIndex]))-1.0;
}

When I use the shared memory version, the runtime is longer than the original log2exp() kernel. Can anyone tell me why? Is there something wrong with my shared-memory kernel?

Thanks