Does cudaMallocPitch zero-pad arrays on the device, or should I be doing that?

Hello,

I’ve been studying CUDA for the last few days, and I have a question about how most people work with 2D arrays.
I am currently just attempting to implement some simple kernels for practice.

Let’s say I would like to multiply two non-square matrices in a style similar to the example on page 27 of the CUDA_C_Programming_Guide. That example assumes that the matrices’ width and height are both multiples of the block size (using a square block).

I would like to generalize this and multiply matrices that have width and height that are not multiples of the block size. My first attempt was to expand the matrices and pad them with zeros in order to make the dimensions multiples of the block size.
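
For reference, this is roughly how I’m rounding the dimensions up (BLOCK_SIZE is my own constant, matching the square block from the guide’s example):

```cpp
#define BLOCK_SIZE 16  // my own choice, matching the guide's square block

// Round a dimension up to the next multiple of BLOCK_SIZE; the extra
// rows/columns are then filled with zeros on the host before copying.
int roundUpToBlock(int n)
{
    return ((n + BLOCK_SIZE - 1) / BLOCK_SIZE) * BLOCK_SIZE;
}
```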

I also noticed in the CUDA_C_Programming_Guide that there is a cudaMallocPitch function that returns a pitch chosen to meet the alignment requirements for coalesced memory reads.
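
For concreteness, here is the allocation and indexing pattern as I currently understand it (the width argument and the returned pitch are both in bytes, which is why rows are addressed through a char* cast; allocPitched and elementAt are just my own helper names):

```cpp
#include <cuda_runtime.h>

// Allocate a pitched 2D array of floats. The returned pitch (in
// bytes) may be larger than width * sizeof(float) so that each row
// starts at a suitably aligned address.
float* allocPitched(int width, int height, size_t* pitch)
{
    float* d = NULL;
    cudaMallocPitch((void**)&d, pitch, width * sizeof(float), height);
    return d;
}

// In a kernel, a row is then found by stepping 'pitch' bytes at a time.
__device__ float elementAt(const float* base, size_t pitch,
                           int row, int col)
{
    const float* rowPtr = (const float*)((const char*)base + row * pitch);
    return rowPtr[col];
}
```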

I suppose my questions are:

  1. For 2D arrays, I just “flatten” them and use a 1D array. Is this good practice?

  2. Is it good practice to zero-pad my arrays so that the height and width are multiples of the block size, and then use cudaMallocPitch? Or can I just use cudaMallocPitch and have it take care of this for me?

  3. In my simple kernels (matrix add, matrix multiply), often the first thing I do is use the block and thread variables to calculate an (x, y) coordinate. Then I check whether this (x, y) is within the bounds of my output matrix (see the sketch after this list). This check can cause the blocks on the edges of the matrix to have divergent flow paths; is it standard practice to skip it and just assume the data dimensions are multiples of the block size?
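
To make question 3 concrete, here is roughly what one of my practice kernels looks like (matAdd is just a toy element-wise add; it also shows the flattened 1D indexing from question 1):

```cpp
#include <cuda_runtime.h>

#define BLOCK_SIZE 16  // my own choice of (square) block size

// Toy element-wise add over a flattened 2D array. w and h are the
// real (unpadded) matrix dimensions.
__global__ void matAdd(const float* A, const float* B, float* C,
                       int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row

    // This is the branch I'd like to avoid: threads in edge blocks
    // that fall outside the matrix do nothing.
    if (x < w && y < h)
        C[y * w + x] = A[y * w + x] + B[y * w + x];
}

// Launch with enough blocks to cover dimensions that are not
// multiples of BLOCK_SIZE (dA, dB, dC are device pointers).
void launchMatAdd(const float* dA, const float* dB, float* dC,
                  int w, int h)
{
    dim3 block(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid((w + block.x - 1) / block.x,
              (h + block.y - 1) / block.y);
    matAdd<<<grid, block>>>(dA, dB, dC, w, h);
}
```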

Hopefully you can see what I am trying to get at. Thanks in advance for your responses!

EDIT:

I am fairly certain I understand how pitch works, but I am still padding my matrices so that their dimensions are multiples of the block size in order to avoid conditional logic in my kernels. Is this better than having the conditional logic in the kernel code?
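
For comparison, this is the padded variant I ended up with: since the rounded-up dimensions (paddedW is my own name for the rounded-up width) are exact multiples of the block size, every launched thread maps to a valid element and the check disappears:

```cpp
// Padded variant of the toy add: the dimensions are exact multiples
// of the block size, so no bounds check is needed; the padding
// elements are zeros, and computing on them is harmless.
__global__ void matAddPadded(const float* A, const float* B, float* C,
                             int paddedW)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    C[y * paddedW + x] = A[y * paddedW + x] + B[y * paddedW + x];
}
```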