shared memory vs local memory

I have a block size of 980 threads = 14x14x5, and a grid size of 414720 blocks = 2592x160.
Local variables: double A, float B, float C, double D.
For each thread, I want to compute A = B*C*D.
If A is shared (an array of size 980, one element per thread), the computation is slower than if A is local.
The reason I want A to be shared is that I need to combine all the As later.
Am I doing something wrong? Why is the computation slower with a shared variable than with a local one?
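For reference, here is a minimal sketch of the two variants I mean (index arithmetic and variable names are placeholders, not my real code):

```cuda
// Variant 1: A is a per-thread local variable (lives in a register).
__global__ void withLocal(const float *B, const float *C,
                          const double *D, double *out)
{
    int tid = threadIdx.x + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;
    int i   = (blockIdx.x + blockIdx.y * gridDim.x)
            * (blockDim.x * blockDim.y * blockDim.z) + tid;

    double A = (double)B[i] * C[i] * D[i];  // per-thread result
    out[i] = A;                             // written back to global memory
}

// Variant 2: A is a shared array, one slot per thread in the block.
__global__ void withShared(const float *B, const float *C,
                           const double *D, double *out)
{
    __shared__ double A[980];               // 980 threads per block
    int tid = threadIdx.x + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;
    int i   = (blockIdx.x + blockIdx.y * gridDim.x)
            * (blockDim.x * blockDim.y * blockDim.z) + tid;

    A[tid] = (double)B[i] * C[i] * D[i];
    __syncthreads();                        // needed before other threads read A
    out[i] = A[tid];
}
```

The shared array costs 980 * 8 = 7840 bytes per block, well under the 48 KB limit reported above.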

Here is my card

Device 0: "GeForce GT 430"
CUDA Driver Version / Runtime Version 4.0 / 4.0
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1023 MBytes (1072889856 bytes)
( 2) Multiprocessors x (48) CUDA Cores/MP: 96 CUDA Cores
GPU Clock Speed: 1.40 GHz
Memory Clock rate: 600.00 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 131072 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 1, Device = GeForce GT 430
[deviceQuery] test results…

A = B*C*D is far too little work to be a reliable benchmark.

Having said that, the compiler completely optimizes away a calculation whose result is only stored in a local variable, because that result cannot possibly be observed from anywhere else. So you are probably comparing an empty kernel against one that does a little work.
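To illustrate (this is a hypothetical sketch, not your actual kernels): in the first kernel below, A has no observable side effect, so dead-code elimination removes the multiply and the loads entirely, and the kernel body ends up empty. Adding a store to global memory keeps the computation alive and makes the timing comparison meaningful. You can verify this yourself by inspecting the generated code.

```cuda
// Result is never visible outside the thread: the compiler deletes the
// loads and the multiply, leaving an empty kernel.
__global__ void deadCode(const float *B, const float *C, const double *D)
{
    double A = (double)B[threadIdx.x] * C[threadIdx.x] * D[threadIdx.x];
    // A is unused -> entire computation eliminated
}

// A store to global memory is an observable side effect, so the
// computation must actually be performed.
__global__ void liveCode(const float *B, const float *C,
                         const double *D, double *out)
{
    double A = (double)B[threadIdx.x] * C[threadIdx.x] * D[threadIdx.x];
    out[threadIdx.x] = A;
}

// To check what survives compilation, dump the intermediate code, e.g.:
//   nvcc -arch=sm_21 -ptx kernels.cu
// and look at the body of each kernel in the resulting .ptx file.
```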