Access speed of shared memory vs. global memory

Hi everyone,
I’ve just run a test comparing the access speed of shared memory and global memory, and the result really surprised me. The CUDA kernels are below. On the host, I launched a single block containing a single thread.

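kernel1: fetch data from global memory and send it back to the host
__global__ void fun(char *data, char *result)
{
    // Linear block and thread indices; gridDim and blockDim are the device-side built-ins.
    int bid = blockIdx.x + blockIdx.y * gridDim.x;
    int tid = threadIdx.x + threadIdx.y * blockDim.x;
    int index = 0;
    char block;
    while (index < 10000) {
        block = data[tid];    // read from global memory
        result[tid] = block;  // write back to global memory
        index++;
    }
}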
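kernel2: fetch data from shared memory and send it back to the host
__global__ void fun(char *data, char *result)
{
    // Linear block and thread indices; gridDim and blockDim are the device-side built-ins.
    int bid = blockIdx.x + blockIdx.y * gridDim.x;
    int tid = threadIdx.x + threadIdx.y * blockDim.x;
    int index = 0;
    char block;
    __shared__ char sub[1];
    sub[0] = 'K';
    while (index < 10000) {
        block = sub[0];       // read from shared memory
        result[tid] = block;  // write back to global memory
        index++;
    }
}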
Why is kernel1 faster than kernel2? Can anybody help me understand this?

I can’t speak to the specifics of the example, but if you only have one block with one thread in it, you’re pretty much missing the point of the GPU. Shared memory is there so that threads within a block can collaborate; one thread per block means no collaboration.
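
To make the collaboration point concrete, here is a minimal sketch (not the original poster's code) in which each block stages a tile of the input in shared memory and every thread then reads an element that a different thread of the same block loaded. The kernel name reverseTiles, the TILE size of 256, and the input length of 1000 are all made up for illustration; treat it as one possible way to exercise shared memory across many threads, not as a benchmark.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define TILE 256  // threads per block; an illustrative choice, not a tuned value

// Hypothetical example kernel: reverses the input within each TILE-sized chunk.
// Each thread loads one element into shared memory, then reads back an
// element that was loaded by another thread of the same block.
__global__ void reverseTiles(const char *in, char *out, int n)
{
    __shared__ char tile[TILE];
    int base = blockIdx.x * blockDim.x;  // first element of this block's chunk
    int t = threadIdx.x;

    if (base + t < n)
        tile[t] = in[base + t];          // one global read per thread
    __syncthreads();                     // all loads are visible to the block after this

    int count = min(TILE, n - base);     // number of valid elements in this chunk
    if (t < count)
        out[base + t] = tile[count - 1 - t];  // read a neighbour's element
}

int main()
{
    const int n = 1000;                  // assumed problem size for the sketch
    char h_in[n], h_out[n];
    for (int i = 0; i < n; ++i)
        h_in[i] = 'A' + (i % 26);

    char *d_in = NULL, *d_out = NULL;
    cudaMalloc(&d_in, n);
    cudaMalloc(&d_out, n);
    cudaMemcpy(d_in, h_in, n, cudaMemcpyHostToDevice);

    int blocks = (n + TILE - 1) / TILE;  // enough blocks to cover all n elements
    reverseTiles<<<blocks, TILE>>>(d_in, d_out, n);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }
    cudaMemcpy(h_out, d_out, n, cudaMemcpyDeviceToHost);  // also waits for the kernel

    // Verify on the host: each chunk of the output is the reversed input chunk.
    int mismatches = 0;
    for (int base = 0; base < n; base += TILE) {
        int count = (n - base < TILE) ? (n - base) : TILE;
        for (int t = 0; t < count; ++t)
            if (h_out[base + t] != h_in[base + count - 1 - t])
                ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches == 0 ? EXIT_SUCCESS : EXIT_FAILURE;
}
```

The kernel must be launched with exactly TILE threads per block, since the shared array is sized by TILE. The __syncthreads() call sits outside the conditionals so that every thread in the block reaches it.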