Hello!
My GPU is a GeForce 425M (compute capability 2.1).
I have a 1D array of chars in global memory:
unsigned char* cdata;
cudaMalloc(&cdata,csize);
I launch a kernel with as many threads as cdata has elements.
Every thread writes one char to cdata[i],
where i is the global index of the thread (in my case: blockIdx.x*blockDim.x+threadIdx.x).
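A minimal sketch of the setup described above, to make the access pattern concrete. The names writeBytes and runWriteBytes, the block size of 256, and the value each thread writes are assumptions for illustration only; they are not taken from the original code.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: thread i writes one byte to cdata[i].
// The value written (low byte of the index) is an arbitrary placeholder.
__global__ void writeBytes(unsigned char* cdata, size_t csize)
{
    size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    if (i < csize)  // guard the last, possibly partially filled block
        cdata[i] = static_cast<unsigned char>(i & 0xFF);
}

// Hypothetical host helper: allocate csize bytes, launch one thread per
// element, and copy the result back into 'out'. Returns true on success.
// Note: with a 1D grid, csize is limited by the device's maximum grid size.
bool runWriteBytes(size_t csize, std::vector<unsigned char>& out)
{
    if (csize == 0) return false;

    unsigned char* cdata = NULL;
    if (cudaMalloc(&cdata, csize) != cudaSuccess) return false;

    // Assumed block size; a multiple of the warp size (32) so that every
    // warp starts at an index that is a multiple of 32.
    const unsigned int threadsPerBlock = 256;
    const unsigned int blocks =
        static_cast<unsigned int>((csize + threadsPerBlock - 1) / threadsPerBlock);

    writeBytes<<<blocks, threadsPerBlock>>>(cdata, csize);

    out.resize(csize);
    cudaError_t err = cudaGetLastError();  // catches launch configuration errors
    if (err == cudaSuccess)
        err = cudaMemcpy(&out[0], cdata, csize, cudaMemcpyDeviceToHost);

    cudaFree(cdata);
    return err == cudaSuccess;
}
```

Calling runWriteBytes(csize, out) from main and checking the returned flag is enough to exercise the pattern: the 32 threads of a warp write 32 consecutive bytes.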
Is global memory access coalesced for the char type?
Each warp will have 32 * 1 byte = 32 bytes to write to global memory.
Therefore I configured global memory accesses to use only the L2 cache, which works with 32-byte transactions.
Am I right in thinking that this should be faster than using both the L1 and L2 caches?
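The 32-bytes-per-warp reasoning can be checked with a small counting sketch. It only counts how many aligned segments a contiguous byte range touches; it is not a model of how the hardware actually schedules stores. The function name segmentsTouched is an assumption, and the segment sizes used in the comments (128 bytes with L1, 32 bytes with L2 only) are the figures usually quoted for compute capability 2.x.

```cuda
#include <cstddef>

// Number of aligned segments of size segBytes touched by a contiguous
// range of 'bytes' bytes that starts at byte offset 'first'.
__host__ __device__ inline size_t segmentsTouched(size_t first, size_t bytes,
                                                  size_t segBytes)
{
    if (bytes == 0) return 0;
    size_t firstSeg = first / segBytes;
    size_t lastSeg  = (first + bytes - 1) / segBytes;
    return lastSeg - firstSeg + 1;
}

// Examples for one warp writing 32 consecutive chars:
//   segmentsTouched(0, 32, 32)   == 1   (aligned, 32-byte segments)
//   segmentsTouched(0, 32, 128)  == 1   (aligned, 128-byte lines)
//   segmentsTouched(16, 32, 32)  == 2   (misaligned start straddles two segments)
```

Since cudaMalloc returns pointers aligned to at least 256 bytes, a warp whose first index is a multiple of 32 stays inside a single segment at either granularity; the difference is how much of that segment the warp actually uses (32 of 32 bytes versus 32 of 128 bytes).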
Thanks,
Gaszton