how to avoid using local memory

Hi everyone,

My configuration is: Win7 64bit + VS2008 SP1 + 2 Tesla C2050 + CUDA 4.0 + Nsight 2.0

I want to use the following kernel to convert a float array from log-version to exp-version.

__global__ void log2exp(float *d_Le)
{
    const unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;

    d_Le[xIndex] = 2.0/(1.0+__expf(d_Le[xIndex]))-1.0;
}

It runs very slowly, so I used Nsight and found that the data are stored in local memory.
Can anyone tell me why the data are in local memory, and how I can avoid putting them there?
Thanks a lot.

sdtougao

It is probably “very slow” because you have mixed single- and double-precision arithmetic in the kernel, and double precision runs at half the peak FLOPS of single precision on your C2050. The local memory usage probably comes from the exp function, and there is nothing you can do about that. A check of the PTX output from the compiler will confirm this.
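For reference, the mixed-precision issue described above can be removed by making every literal single precision; the `f` suffixes are the only change from the posted kernel (a sketch, not a tested drop-in replacement):

```cuda
// Single-precision-only version of the posted kernel: the 2.0f/1.0f
// literals keep the whole expression in float, so the compiler emits
// no double-precision arithmetic or float<->double conversions.
__global__ void log2exp(float *d_Le)
{
    const unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;

    d_Le[xIndex] = 2.0f / (1.0f + __expf(d_Le[xIndex])) - 1.0f;
}
```

With double literals such as `2.0`, the C promotion rules force the division and addition to be done in double and the result converted back to float, which is much slower on the C2050.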

thank you very much, avidday.

Yes, I checked the PTX and found that it really is in local memory.

Can you explain in more detail why the exp function is in local memory? Also, I want to use shared memory as follows:

__shared__ float d_Le_sm[BLOCK_SIZE_1];

for (int m = 0; m < gridDim.x; m++)
{
    d_Le_sm[threadIdx.x] = d_Le[threadIdx.x + m*blockDim.x];
    __syncthreads();

    d_Le_sm[threadIdx.x] = __fdividef(2.0,(1.0+__expf(d_Le_sm[threadIdx.x])))-1.0;
    __syncthreads();

    d_Le[threadIdx.x + m*blockDim.x] = d_Le_sm[threadIdx.x];
    __syncthreads();
}

Do you think this will improve the performance?

Thanks,

sdtougao

I don’t see any local memory:

nvcc -Xptxas -v -arch=sm_20 source.cu

ptxas info : Function properties for _Z7log2expPf

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Used 18 registers, 40 bytes cmem[0], 4 bytes cmem[16]

Are you sure that local memory appears in this simple kernel?

I would not expect any local memory usage when this kernel is compiled with compiler default, and I see that Lung Sheng has already confirmed this. What compiler switches are you using, and what is the compiler output after adding -Xptxas -v to the nvcc commandline?

If memory serves, the only math library functions that use a bit of local memory are the trigonometric functions [this is documented in the Programming Guide], and they use local memory only in a “slow path” that is extremely unlikely to be taken in real-life code so there is no performance impact from this limited use of local memory.

This is the result from the Nsight analysis:
1.jpg

Sorry, I don’t know why the picture is so small. Please double-click on it to see the detail.

Here the kernel LLR2q() is just the same as the kernel in the example above. The local memory value reported is 45481984, so I think the data are stored in local memory.

Which kernel is it? You have posted two. If it is the second kernel, are you building it for debugging or release?

The kernel LLR2q() refers to:

__global__ void log2exp(float *d_Le)
{
    const unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;

    d_Le[xIndex] = 2.0/(1.0+__expf(d_Le[xIndex]))-1.0;
}

When I use the shared memory version, the runtime is longer than the original log2exp() kernel. Can anyone tell me why? Is there something wrong with my shared-memory kernel?

Thanks