Hi everyone,
I just ran a test comparing the access speed of shared memory and global memory, and the result surprised me. The two CUDA kernels are below. On the host, I launch each one with a single block containing a single thread.
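kernel1: fetch data from global memory and write it to the result buffer that is sent back to the host
__global__ void fun(char *data, char *result)
{
    // Flattened block and thread indices (bid is not used below).
    int bid = blockIdx.x + blockIdx.y * gridDim.x;
    int tid = threadIdx.x + threadIdx.y * blockDim.x;
    int index = 0;
    char block;
    while (index < 10000) {
        block = data[tid];      // read from global memory
        result[tid] = block;    // write to global memory
        index++;
    }
}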
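kernel2: fetch data from shared memory and write it to the result buffer that is sent back to the host
__global__ void fun(char *data, char *result)
{
    // Flattened block and thread indices (bid and data are not used below).
    int bid = blockIdx.x + blockIdx.y * gridDim.x;
    int tid = threadIdx.x + threadIdx.y * blockDim.x;
    int index = 0;
    char block;
    __shared__ char sub[1];
    sub[0] = 'K';
    while (index < 10000) {
        block = sub[0];         // read from shared memory
        result[tid] = block;    // write to global memory
        index++;
    }
}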
Why is kernel1 faster than kernel2? Can anybody help explain this?
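For reference, a host-side harness for this kind of comparison might look like the sketch below. It is only an illustration under stated assumptions: the kernel names `readGlobal` and `readShared`, the helper `timeKernel`, the one-byte buffers, and the use of CUDA events for timing are all hypothetical and are not taken from the original test. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical name for kernel1: repeatedly read one byte from global memory.
__global__ void readGlobal(const char *data, char *result)
{
    int tid = threadIdx.x + threadIdx.y * blockDim.x;
    char block;
    for (int index = 0; index < 10000; ++index) {
        block = data[tid];
        result[tid] = block;
    }
}

// Hypothetical name for kernel2: repeatedly read one byte from shared memory.
__global__ void readShared(const char *data, char *result)
{
    int tid = threadIdx.x + threadIdx.y * blockDim.x;
    __shared__ char sub[1];
    sub[0] = 'K';
    char block;
    for (int index = 0; index < 10000; ++index) {
        block = sub[0];
        result[tid] = block;
    }
}

// Hypothetical helper: time one launch with one block of one thread,
// using CUDA events. Returns elapsed time in milliseconds.
static float timeKernel(void (*kernel)(const char *, char *),
                        const char *d_data, char *d_result)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    kernel<<<1, 1>>>(d_data, d_result);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    char h_data = 'A';
    char h_result = 0;
    char *d_data = nullptr;
    char *d_result = nullptr;

    cudaMalloc(&d_data, 1);
    cudaMalloc(&d_result, 1);
    cudaMemcpy(d_data, &h_data, 1, cudaMemcpyHostToDevice);

    // Warm-up launch so one-time initialization cost is not
    // attributed to whichever kernel happens to be timed first.
    readGlobal<<<1, 1>>>(d_data, d_result);
    cudaDeviceSynchronize();

    float msGlobal = timeKernel(readGlobal, d_data, d_result);
    cudaMemcpy(&h_result, d_result, 1, cudaMemcpyDeviceToHost);
    printf("global memory kernel: %f ms (result '%c')\n", msGlobal, h_result);

    float msShared = timeKernel(readShared, d_data, d_result);
    cudaMemcpy(&h_result, d_result, 1, cudaMemcpyDeviceToHost);
    printf("shared memory kernel: %f ms (result '%c')\n", msShared, h_result);

    cudaFree(d_data);
    cudaFree(d_result);
    return 0;
}
```

The warm-up launch and the event synchronization are design choices of this sketch. Kernel launches are asynchronous, so timing without waiting for completion would mostly measure launch overhead rather than the memory accesses themselves.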