Warp serialization might not come from shared memory access alone. If I am not mistaken, warp-divergent branches may increase that counter too.
I am also not sure how a block of dimension 16x16 will be mapped onto warps. I remember that in some old CUDA compiler & driver versions, different values of threadIdx.y meant different warps; in that case half of all your warps would be underutilised. However, this may already have been corrected/changed. Just check it out!
My code should diverge only in very rare cases, and the profiler confirms this: it reports divergent branches separately, and there are only a very few (4 divergent branches compared to over 100000 warp serializes reported for one kernel call).
When I run the code in emulation mode and print the thread IDs, they increase as expected in row-major order (i.e. (x0,y0), … (x15,y0), (x0,y1), …, (x15,y1), …).
I think there are 16 shared memory bank conflicts per warp. Take a look: your thread blocks are 16x16, all devices with compute capability 1.x have 32 threads per warp and 16 shared memory banks, your warps are formed by 2 rows of 16 threads, and your data in shared memory is aligned to 16, so elements 16 apart (data[i], data[i+16], …) fall in the same bank. Your access pattern (btid = threadIdx.y * blockDim.x + threadIdx.x) forces threads (x0,y0) and (x0,y1), which belong to the same warp, to access data[btid] and data[btid+16] — same warp, same bank — so you've got 16 conflicts per warp for each such pattern.