Different wavefront between global and surface read

I test l1tex__data_pipe_lsu_wavefronts_cmd_read on Ampere GPU and make each SM has only one active warp (each warp has a coalesced memory access for continuous 128 Bytes float elements).

Output of metrics:

l1tex__data_pipe_lsu_wavefronts_cmd_read.avg            1
l1tex__data_pipe_lsu_wavefronts_cmd_read.max            1
l1tex__data_pipe_lsu_wavefronts_cmd_read.min            1

Then, I test l1tex__data_pipe_tex_wavefronts_mem_surface in the program at cuda c programming guide.

l1tex__data_pipe_tex_wavefronts_mem_surface.avg            4
l1tex__data_pipe_tex_wavefronts_mem_surface.max            4
l1tex__data_pipe_tex_wavefronts_mem_surface.min            4

Then, I chanege the channelDest and kernel implementation as follows, which has a same metrics value.

cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

__global__ void copyKernel(cudaSurfaceObject_t inputSurfObj,
 cudaSurfaceObject_t outputSurfObj,
 int width, int height) 
{
 // Calculate surface coordinates
 unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
 unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
 if (x < width && y < height) {
 float data;
 // Read from input surface
 surf2Dread(&data, inputSurfObj, x * 4, y);
 // Write to output surface
 surf2Dwrite(data, outputSurfObj, x * 4, y);
 }
}

I can understand wavefronts in ncu and transaction per request in nvprof.
Here, I want to know why surface read give a value 4 instead of 1, even if surface read lie on 32 continuous texels.

Thanks in advance.

1 Like