In CUDA, suppose all 32 threads in one warp access exactly the same address.
Does the core have to wait for the entire 128-byte cache line to be filled before it can access the data? If so, would disabling the L1 cache be faster?
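
To make the access pattern concrete, here is a minimal sketch of a way to compare the two load paths. The kernel names, sizes, and the `stride` trick are made up for illustration. One kernel uses an ordinary global load, which may or may not go through L1 depending on the architecture and compile flags. The other uses the `__ldcg()` intrinsic, which caches at the L2 level only. It compares repeated loads of one address and does not by itself isolate the cost of a cache-line fill on a miss.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Ordinary global load: every thread reads the same word on each iteration.
// stride is passed as 0 at run time so the compiler cannot hoist the load.
__global__ void same_addr_default(const float* in, float* out, int iters, int stride) {
    float acc = 0.0f;
    for (int i = 0; i < iters; ++i)
        acc += in[i * stride];
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

// Same access pattern, but __ldcg() asks for caching in L2 only (bypassing L1).
__global__ void same_addr_ldcg(const float* in, float* out, int iters, int stride) {
    float acc = 0.0f;
    for (int i = 0; i < iters; ++i)
        acc += __ldcg(in + i * stride);
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

using Kernel = void (*)(const float*, float*, int, int);

// Times one launch with CUDA events, after a warm-up launch.
static float time_kernel(Kernel k, const float* in, float* out,
                         int blocks, int threads, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    k<<<blocks, threads>>>(in, out, iters, 0);  // warm-up
    cudaEventRecord(start);
    k<<<blocks, threads>>>(in, out, iters, 0);  // timed run
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int blocks = 256, threads = 256, iters = 1024;  // arbitrary sizes
    const int n = blocks * threads;

    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    const float one = 1.0f;
    cudaMemcpy(in, &one, sizeof(float), cudaMemcpyHostToDevice);

    float ms_default = time_kernel(same_addr_default, in, out, blocks, threads, iters);
    float ms_ldcg = time_kernel(same_addr_ldcg, in, out, blocks, threads, iters);

    // Sanity check: each thread summed 1.0f exactly iters times.
    float first = 0.0f;
    cudaMemcpy(&first, out, sizeof(float), cudaMemcpyDeviceToHost);
    bool ok = (cudaGetLastError() == cudaSuccess) && (first == (float)iters);

    printf("default load: %.3f ms, __ldcg load: %.3f ms, check %s\n",
           ms_default, ms_ldcg, ok ? "passed" : "FAILED");

    cudaFree(in);
    cudaFree(out);
    return ok ? 0 : 1;
}
```

Assuming the file is saved as `same_addr.cu`, it can be built with `nvcc -O2 same_addr.cu -o same_addr`. Compiling with `-Xptxas -dlcm=cg` is another way to request L2-only caching for global loads without changing the source, though its effect depends on the target architecture.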