Hello:
I am fairly new to CUDA, and new to C++. I plan to write a program in C++ using the Matrix Template Library (MTL) that calls functions on the device. Specifically, I want to be able to pass a matrix object created with MTL to the device. I understand that this is not possible now. Correct? Is there any workaround? Anything at all? I understand that full C++ integration is planned for the future, but does anyone know whether that means 6 months or 6 years?
Has anyone used MTL and CUDA together?
Thanks,
John
I am not familiar with MTL, but I assume it is a pure template library, i.e. it exists only as source files containing various template classes.
I think one of the difficulties is the lack of dynamic memory allocation on the device. To call a member function in kernel code, you first need an object to call it on, and without dynamic memory allocation you usually cannot create one there. However, you can always allocate memory in host code and then "construct" that memory in device code by calling a dummy init function that does what a constructor would do. It works, but programming this way is quite frustrating.
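To make that concrete, here is a minimal sketch of the pattern. Vec3 and its init function are made-up names, not from MTL or any library:

#include <cuda_runtime.h>

struct Vec3 {
    double x, y, z;
    // dummy init function standing in for a constructor
    __device__ void init(double v) { x = v; y = v; z = v; }
};

// "construct" the raw memory that the host allocated
__global__ void construct(Vec3* p, double v)
{
    p->init(v);
}

int main()
{
    Vec3* d_v = 0;
    cudaMalloc((void**)&d_v, sizeof(Vec3)); // allocate in host code
    construct<<<1, 1>>>(d_v, 1.0);          // "construct" in device code
    cudaDeviceSynchronize();
    cudaFree(d_v);
    return 0;
}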
Another problem is that most code written for a CPU is not well suited to a GPU. A GPU has a great many cores and needs even more threads in flight to work efficiently. You may need to tune the code, or even rewrite it, to make it efficient on the GPU.
Yes, MTL is a template library. For example, MTL has a dense, row-major matrix container called dense2D. In a .cpp file, after some include statements, one would have something like:
using namespace mtl;

// M is a 10x10 row-major matrix of doubles
dense2D<double> M(10, 10);
M[0][2] = 4;
All I want to do is copy the data in matrix M over to the device so I can access its elements there. After allocating device memory for d_M, I assume that I could not simply use something like:
cublasSetVector(10*10, sizeof(double), M, 1, d_M, 1);
Is that correct? If so, how could I copy the data in M over to the device without first copying it into a regular array and then copying that array to the device? Any ideas?
To copy data from CPU memory to GPU memory, you should store the data in CPU memory in a contiguous address space. I would guess the class dense2D also stores the elements of M in a one-dimensional array, so what you actually need to do is copy that array to GPU memory.
Assume that the implementation of dense2D is like this:
template <class T>
class dense2D {
public:
    T* data;       // elements stored contiguously in a single 1-D array
    size_t dim_x;  // number of rows
    size_t dim_y;  // number of columns
};
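If that guess is right, and assuming &M[0][0] really gives the address of contiguous row-major storage (worth confirming in the MTL documentation), the copy could look roughly like this:

#include <cuda_runtime.h>
#include <cublas.h>
#include <boost/numeric/mtl/mtl.hpp>

int main()
{
    cublasInit();

    mtl::dense2D<double> M(10, 10);
    M[0][2] = 4;

    // assumption: &M[0][0] points at the start of M's contiguous storage
    double* h_data = &M[0][0];

    double* d_M = 0;
    cudaMalloc((void**)&d_M, 10 * 10 * sizeof(double));

    // copy the raw array, not the dense2D object itself
    cublasSetVector(10 * 10, sizeof(double), h_data, 1, d_M, 1);

    // ... launch kernels that work on d_M ...

    cudaFree(d_M);
    cublasShutdown();
    return 0;
}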
Note that you can use only the raw array in kernel code; you will not be able to invoke any member functions of M there.
After performing the required operations on the GPU, you can copy the data back to CPU memory with cublasGetVector. Again, you should copy into M.data, not into &M.
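Continuing the sketch above, the kernel only ever sees a raw double array, and the copy back targets that same raw storage:

// the kernel works on a plain double array; it knows nothing about dense2D
__global__ void scale(double* a, int n, double s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= s;
}

// launch over the 100 elements, e.g. one block of 128 threads
scale<<<1, 128>>>(d_M, 100, 2.0);

// copy back into M's storage (h_data == &M[0][0]), not into &M
cublasGetVector(10 * 10, sizeof(double), d_M, 1, h_data, 1);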