I test l1tex__data_pipe_lsu_wavefronts_cmd_read
on Ampere GPU and make each SM has only one active warp (each warp has a coalesced memory access for continuous 128 Bytes float elements).
Output of metrics:
l1tex__data_pipe_lsu_wavefronts_cmd_read.avg 1
l1tex__data_pipe_lsu_wavefronts_cmd_read.max 1
l1tex__data_pipe_lsu_wavefronts_cmd_read.min 1
Then, I test l1tex__data_pipe_tex_wavefronts_mem_surface
in the program at cuda c programming guide.
l1tex__data_pipe_tex_wavefronts_mem_surface.avg 4
l1tex__data_pipe_tex_wavefronts_mem_surface.max 4
l1tex__data_pipe_tex_wavefronts_mem_surface.min 4
Then, I chanege the channelDest
and kernel implementation as follows, which has a same metrics value.
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
__global__ void copyKernel(cudaSurfaceObject_t inputSurfObj,
cudaSurfaceObject_t outputSurfObj,
int width, int height)
{
// Calculate surface coordinates
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x < width && y < height) {
float data;
// Read from input surface
surf2Dread(&data, inputSurfObj, x * 4, y);
// Write to output surface
surf2Dwrite(data, outputSurfObj, x * 4, y);
}
}
I can understand wavefronts in ncu and transaction per request in nvprof.
Here, I want to know why surface read give a value 4 instead of 1, even if surface read lie on 32 continuous texels.
Thanks in advance.