2D matrix in device memory

Here’s the problem:

I want to create a 2D matrix in device memory to avoid frequent memory transfers, but I haven’t found a proper way to do so. The reason I need a 2D structure is that it provides something a 1D structure cannot, for example:

if M is a 2D matrix, then I can use M[i] directly as a vector; this can’t be done with a 1D matrix, where it WILL cost an additional memory copy.

I also can’t use shared memory because it’s too small.

So, is there any way to do this?
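To make the M[i] usage concrete, here is roughly what I have in mind: an array of device row pointers, so that a kernel could index M[i][j] directly. This is just a sketch; ROWS, COLS, and all the names are placeholders I made up.

#include <cuda_runtime.h>

#define ROWS 4        // placeholder sizes
#define COLS 1024

int main(void)
{
    float *h_rows[ROWS];   // host-side copy of the device row pointers
    float **d_M;           // device-side array of row pointers

    for (int i = 0; i < ROWS; ++i)
        cudaMalloc((void **)&h_rows[i], COLS * sizeof(float));   // one device buffer per row

    cudaMalloc((void **)&d_M, ROWS * sizeof(float *));
    cudaMemcpy(d_M, h_rows, ROWS * sizeof(float *), cudaMemcpyHostToDevice);

    // A kernel taking float **M could now use M[i] as a vector.

    for (int i = 0; i < ROWS; ++i)
        cudaFree(h_rows[i]);
    cudaFree(d_M);
    return 0;
}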

If you’re talking about memory exchange between the host and the device, this isn’t a problem, since you can just load everything into device global memory and, when you want to operate on a vector, load that vector into shared memory first.

But this approach requires repeatedly loading a vector from global memory into shared memory, operating on the vector, and then writing the result back to global memory, which means a lot of reads and writes between global and shared memory. I’m wondering about this as well. Would this slow the application down at all, or do threads read shared memory so much faster than global memory that it’s worth all the extra copies?
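To make the pattern concrete, here is a minimal sketch of the kind of staging I mean. It assumes one block processes one vector, the vector fits in shared memory, and the launch uses blockDim.x == VECTOR_LEN; the names and sizes are placeholders.

// Each block copies one vector from global into shared memory,
// operates on it, and writes the result back to global memory.
#define VECTOR_LEN 256

__global__ void scale_vectors(float *g_data, float factor)
{
    __shared__ float s_vec[VECTOR_LEN];

    float *g_vec = g_data + blockIdx.x * VECTOR_LEN;  // this block's vector
    int tid = threadIdx.x;

    s_vec[tid] = g_vec[tid];   // global -> shared
    __syncthreads();           // only needed if threads read each other's elements

    s_vec[tid] *= factor;      // work entirely in shared memory

    g_vec[tid] = s_vec[tid];   // shared -> global
}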

No, it won’t. While you might imagine that you must form a “sliceable” matrix like this:

int vector0[5] = {0, 1, 2, 3, 4};
int vector1[5] = {5, 6, 7, 8, 9};

int *matrix[2];           // array of row pointers
matrix[0] = vector0;      // arrays decay to int * here
matrix[1] = vector1;

int *slice1 = matrix[1];

you can just as easily do this with a chunk of linear memory:

int matrix[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};

int *slice1 = &matrix[5];

There is no additional overhead associated with obtaining the slice in the second case compared to the first.
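In CUDA terms, the same trick works with a single flat device allocation: the row “slice” is just pointer arithmetic inside the kernel. Below is a sketch; the sizes, names, and the trivial operation are placeholders.

#include <cuda_runtime.h>

#define ROWS 4        // placeholder sizes
#define COLS 1024

// Treat row i of a flat buffer as a vector; no extra copy is involved.
__global__ void add_one_to_row(float *M, int cols, int i)
{
    float *row = M + i * cols;                       // the "slice"
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < cols)
        row[tid] += 1.0f;
}

int main(void)
{
    float *d_M;
    cudaMalloc((void **)&d_M, ROWS * COLS * sizeof(float));   // one linear chunk
    cudaMemset(d_M, 0, ROWS * COLS * sizeof(float));
    add_one_to_row<<<(COLS + 255) / 256, 256>>>(d_M, COLS, 1);
    cudaDeviceSynchronize();
    cudaFree(d_M);
    return 0;
}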
