It seems rather odd that the Runtime API doesn’t provide stream-ordered variants of any memory allocation functions besides cudaMalloc. Perhaps I am missing something? In order to better understand the behavior of cudaMallocPitch, I wrote a program that makes thousands of calls to the function with ra…

What is the stream-ordered equivalent of cudaMallocPitch?


striker159 September 18, 2021, 6:46am 2

The reasoning behind 512 is simple.

For best performance, warps have to perform coalesced memory accesses.
A thread can read a 16-byte word in a single instruction if the address is 16-byte aligned (e.g. loading an int4).

A warp of 32 threads could therefore theoretically access 32 * 16 = 512 bytes in one instruction. The pitch is chosen as a multiple of 512 so that every row of pitched memory starts at an address where this kind of access is valid.
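To make the arithmetic concrete, here is a minimal host-side sketch in plain C++ (no CUDA calls). It assumes the 512-byte pitch alignment described above. The names `kAssumedPitchAlign`, `assumed_pitch`, `row_offset` and `rows_are_16_byte_aligned` are made up for illustration and are not part of the CUDA API. The pitch that cudaMallocPitch actually returns is whatever the driver reports and may differ between devices.

```cpp
#include <cassert>   // for quick runtime checks by the caller
#include <cstddef>

// Values taken from the reasoning above. These are assumptions for the
// sketch; the real pitch is whatever cudaMallocPitch reports.
constexpr std::size_t kWarpSize = 32;        // threads per warp
constexpr std::size_t kBytesPerThread = 16;  // e.g. one int4 load per thread
constexpr std::size_t kAssumedPitchAlign = kWarpSize * kBytesPerThread;

// Round a requested row width (in bytes) up to the next multiple of the
// assumed pitch alignment.
constexpr std::size_t assumed_pitch(std::size_t width_bytes) {
    return (width_bytes + kAssumedPitchAlign - 1) / kAssumedPitchAlign
           * kAssumedPitchAlign;
}

// Byte offset of the start of a row inside a pitched allocation.
constexpr std::size_t row_offset(std::size_t row, std::size_t pitch) {
    return row * pitch;
}

// True if every row start is a multiple of 16 bytes, assuming the base
// pointer itself is at least 16-byte aligned.
inline bool rows_are_16_byte_aligned(std::size_t rows, std::size_t pitch) {
    for (std::size_t r = 0; r < rows; ++r) {
        if (row_offset(r, pitch) % kBytesPerThread != 0) return false;
    }
    return true;
}

static_assert(kAssumedPitchAlign == 512, "32 threads * 16 bytes");
static_assert(assumed_pitch(1000) == 1024, "1000 bytes rounds up to 1024");
```

For example, with unpadded rows of 1000 bytes, row 1 would start at byte offset 1000, which is not a multiple of 16, so a 16-byte load at the start of that row would be misaligned. Padding the row to 1024 bytes keeps every row start 16-byte aligned.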
