Passing matrices to a kernel: only allocate memory on the device?

If you are trying to speed up a program by using a GPU kernel, what does one do in the following situation?

I am working on a problem very similar to the MRI case study (Chapter 7) in the Kirk and Hwu book. I want to pass several matrices and arrays to a GPU kernel; these are currently sent to a routine that takes up a huge percentage of the execution time, on the order of 10-20% of the program. This seems a very appropriate place to use the GPU and its associated programming. But these matrices that I am passing to a kernel (my program, like the MRI example, will have several kernels) already exist and are populated. I assume I allocate nothing for them on the host; that was already done somewhere else in this program.

However, I must allocate the arrays or matrices for the device in the host code, and then copy the existing, populated arrays or matrices to the newly allocated ones on the device. Assuming that is correct (tell me if it is not), I must know the array or matrix dimensions for the memory allocation. That is the case in my situation: I can allocate in the host code with hard values for the dimensions (not just a guess) and then copy the array or matrix to the new matrix on the device.

Assuming that is correct, the question is this: each time I send the already populated arrays to the kernel they will have different dimensions. That should not make a difference, as long as I correctly allocate before I copy the arrays and matrices to the device. Does it?
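In code, I assume the basic pattern looks something like this (names and dimensions are just illustrative, and error checking is abbreviated):

```cuda
// Sketch of the workflow described above. The host matrix h_A stands in for
// a matrix that already exists and is populated elsewhere in the program.
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const int rows = 256, cols = 256;            // dimensions known at allocation time
    size_t bytes = rows * cols * sizeof(float);

    float *h_A = (float *)malloc(bytes);         // existing, populated host matrix
    /* ... h_A filled in by the rest of the program ... */

    float *d_A = NULL;
    cudaMalloc((void **)&d_A, bytes);            // device-side allocation
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);  // copy existing data over

    /* ... launch kernel(s) that read/write d_A ... */

    cudaMemcpy(h_A, d_A, bytes, cudaMemcpyDeviceToHost);  // copy results back if needed
    cudaFree(d_A);
    free(h_A);
    return 0;
}
```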


That is one of the most convoluted questions I think I have ever seen. I read the entire post several times and I am still not sure what it is you are really asking.

If you are trying to ask about strategies for managing device memory when you are repeatedly running kernels that manipulate matrices on the device, then the best strategy is probably to allocate as much memory as necessary during initialization to cover the largest possible case, rather than repeatedly allocating and freeing storage for the matrices. You might also want to arrange the code so that the device functions can re-use as much data already on the device as possible (so, for example, if you have a common transformation matrix or permutation or something, keep it in device memory for the life of the application rather than copying it over during every call).

Copying overhead can and will severely reduce performance, unless the matrices are very large or the FLOP count of the kernel is very high. In my experience, for double precision O(N^3) operations, square matrices need to be on the order of 500x500 or larger before enough of the transfer overhead is amortized to give good speed-up over the classes of CPUs I usually use.
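The allocate-once strategy described above might be sketched like this (all names here are hypothetical, and error checking is omitted for brevity):

```cuda
// Hypothetical sketch: allocate one device buffer at startup, sized for the
// largest matrix the application will ever process, and reuse it every call.
#include <cuda_runtime.h>

static float *d_work = NULL;       // device buffer reused across calls
static size_t d_capacity = 0;      // bytes allocated on the device

// One-time initialization: size the buffer for the worst case.
void init_device_storage(size_t max_rows, size_t max_cols)
{
    d_capacity = max_rows * max_cols * sizeof(float);
    cudaMalloc((void **)&d_work, d_capacity);
}

// Per-call: copy the current (possibly smaller) matrix into the existing
// buffer; no cudaMalloc/cudaFree on this path.
void run_step(const float *h_mat, size_t rows, size_t cols)
{
    size_t bytes = rows * cols * sizeof(float);   // bytes <= d_capacity by design
    cudaMemcpy(d_work, h_mat, bytes, cudaMemcpyHostToDevice);
    /* ... launch the kernel, passing the actual rows/cols as arguments ... */
}

// Teardown at application exit.
void shutdown_device_storage(void)
{
    cudaFree(d_work);
    d_work = NULL;
    d_capacity = 0;
}
```

The point of this arrangement is that allocation cost is paid once, while the unavoidable per-call cost is only the host-to-device copy.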

Okay, let me break it down. This was what might be called a leading question; there are really several questions here.

In order:

When speeding up an existing program whose matrices already exist, one must allocate on the device and copy the existing matrices to the device. There is no need to allocate again for the existing host matrices?

Since I am allocating for the file, must I allocate for each matrix on the device the same amount of memory that the host matrix has? Can it be a larger amount, to save allocating each time, i.e. choose the largest matrix size and allocate for that? You will never run short of memory that way. Is that efficient, or should I allocate the exact amount of memory each time the program enters the GPU kernel?

I certainly cannot allocate less memory than the host matrix requires?


Obviously not. If you have source memory on the host, there is no need to allocate additional memory for CUDA (if you are working with standard pageable memory, which I guess you are).

I have no idea what “allocating for the file” means, but you can obviously copy a small matrix or array into a larger allocation on the device. Which is precisely what I suggested in my previous reply to you. Allocate as much memory as you will ever need (or all of the available memory on the GPU, if you won’t need it for anything else) as an initialization step, then just reuse the same allocations for the lifetime of the application.
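Concretely, copying a small matrix into a larger allocation just means the byte count passed to `cudaMemcpy` is smaller than the allocation size (a hypothetical sketch, with illustrative sizes and no error checking):

```cuda
// Sketch: one oversized device allocation at init, reused for matrices of
// varying (smaller) dimensions. Sizes here are purely illustrative.
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    // One big allocation at initialization, covering the worst case (1024x1024).
    float *d_buf = NULL;
    cudaMalloc((void **)&d_buf, 1024 * 1024 * sizeof(float));

    // Later, a smaller 300x300 host matrix arrives; it fits in the allocation.
    size_t small_bytes = 300 * 300 * sizeof(float);
    float *h_small = (float *)malloc(small_bytes);
    /* ... h_small populated elsewhere ... */

    // Only small_bytes are transferred; the rest of d_buf is simply unused.
    cudaMemcpy(d_buf, h_small, small_bytes, cudaMemcpyHostToDevice);
    /* ... the kernel is told the actual dimensions (300x300),
       not the allocation size ... */

    free(h_small);
    cudaFree(d_buf);
    return 0;
}
```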

Obviously not.