I’m performing matrix multiplication using the cublasSgemm function from cuBLAS. I allocate device memory with cudaMalloc. My code runs fine, but I wonder whether using cudaMallocPitch could make it even faster. So my questions are:
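For context, here is a minimal sketch of my current setup (names and sizes are just placeholders; error checking omitted for brevity):

```c
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
    const int m = 256, n = 256, k = 256;   // arbitrary example sizes
    const float alpha = 1.0f, beta = 0.0f;
    float *dA, *dB, *dC;

    // Plain linear allocations; matrices stored column-major as cuBLAS expects
    cudaMalloc((void **)&dA, (size_t)m * k * sizeof(float));
    cudaMalloc((void **)&dB, (size_t)k * n * sizeof(float));
    cudaMalloc((void **)&dC, (size_t)m * n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    // C = alpha * A * B + beta * C; each leading dimension equals the
    // matrix's number of rows, since there is no padding between columns
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, dA, m,
                        dB, k,
                &beta,  dC, m);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```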
Can I expect a performance speedup from using cudaMallocPitch instead of cudaMalloc? Or is cublasSgemm already written to be sophisticated enough to handle the suboptimal data alignment that can result from plain cudaMalloc?
If the answer to the first question is yes, could someone show me example code that calls cublasSgemm on a memory block allocated via cudaMallocPitch? It’s not obvious to me whether the padded area should be zero-filled and also passed to cublasSgemm, or how the pitch information should be passed to cublasSgemm.
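In case it helps, here is my (unverified) guess at what the pitched version might look like for the A matrix; I’m not sure the pitch-to-leading-dimension mapping is right:

```c
// Guess: allocate A as an m x k column-major matrix with cudaMallocPitch,
// where the "width" is one column (m floats) and the "height" is the
// number of columns k, so each column gets padded to the pitch.
size_t pitchA;   // pitch in BYTES, chosen by the runtime
float *dA;
cudaMallocPitch((void **)&dA, &pitchA, m * sizeof(float), k);

// cublasSgemm's lda is in elements, not bytes, so presumably the pitch
// maps to the leading dimension like this:
int lda = (int)(pitchA / sizeof(float));
// cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
//             &alpha, dA, lda, dB, ldb, &beta, dC, ldc);
```

Is that the intended way to connect the two APIs, or am I misunderstanding how the pitch relates to lda?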