Different wavefront between global and surface read

user145234 · January 22, 2022, 6:49am

I measured l1tex__data_pipe_lsu_wavefronts_cmd_read on an Ampere GPU, with only one active warp per SM (each warp performs one coalesced access to 128 contiguous bytes of float elements).

Output of metrics:

l1tex__data_pipe_lsu_wavefronts_cmd_read.avg            1
l1tex__data_pipe_lsu_wavefronts_cmd_read.max            1
l1tex__data_pipe_lsu_wavefronts_cmd_read.min            1

Then I measured l1tex__data_pipe_tex_wavefronts_mem_surface for the sample program from the CUDA C Programming Guide.

l1tex__data_pipe_tex_wavefronts_mem_surface.avg            4
l1tex__data_pipe_tex_wavefronts_mem_surface.max            4
l1tex__data_pipe_tex_wavefronts_mem_surface.min            4

Then I changed the channelDesc and the kernel implementation as follows, which gives the same metric values.

cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);

__global__ void copyKernel(cudaSurfaceObject_t inputSurfObj,
                           cudaSurfaceObject_t outputSurfObj,
                           int width, int height)
{
    // Calculate surface coordinates
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float data;
        // Read from input surface
        surf2Dread(&data, inputSurfObj, x * 4, y);
        // Write to output surface
        surf2Dwrite(data, outputSurfObj, x * 4, y);
    }
}
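
Whether the 32 reads of one warp really land on contiguous texels of a single row depends on the block shape, which is not shown here. The following host-side sketch is one way to check that assumption. The name warpFootprint is hypothetical (not a CUDA API), and the code assumes a warp size of 32, non-zero block dimensions, and the usual linearization of thread IDs as threadIdx.x + threadIdx.y * blockDim.x.

```cpp
#include <algorithm>
#include <set>

// Hypothetical helper, not a CUDA API: describes what the first warp of a
// thread block touches when each thread reads one element at (x, y).
struct WarpFootprint {
    unsigned rows;       // number of distinct y values covered by the warp
    unsigned spanBytes;  // bytes from the lowest to the highest x read, inclusive
};

// Assumes blockDimX and blockDimY are both non-zero.
WarpFootprint warpFootprint(unsigned blockDimX, unsigned blockDimY,
                            unsigned elemSizeBytes) {
    const unsigned warpSize = 32;  // assumed warp size
    const unsigned threads = std::min(warpSize, blockDimX * blockDimY);
    std::set<unsigned> rows;
    unsigned minX = blockDimX, maxX = 0;
    for (unsigned lane = 0; lane < threads; ++lane) {
        // A warp holds 32 consecutive linear thread IDs, where
        // linear ID = threadIdx.x + threadIdx.y * blockDim.x.
        const unsigned tx = lane % blockDimX;
        const unsigned ty = lane / blockDimX;
        rows.insert(ty);
        minX = std::min(minX, tx);
        maxX = std::max(maxX, tx);
    }
    return {static_cast<unsigned>(rows.size()),
            (maxX - minX + 1) * elemSizeBytes};
}
```

For float texels, warpFootprint(32, 1, 4) reports 1 row spanning 128 bytes, while warpFootprint(16, 16, 4) reports 2 rows spanning 64 bytes each. These block shapes are only illustrative. The sketch checks the access pattern alone; it does not model how the texture pipeline counts wavefronts.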

I understand what wavefronts mean in ncu and what transactions per request mean in nvprof.
What I want to know is why the surface read reports 4 wavefronts instead of 1, even though each warp's surface read covers 32 contiguous texels.

Thanks in advance.
