I want to create a 2D matrix in device memory to avoid frequent memory transfers, but I haven't found a proper way to do so. I need a 2D structure because it offers something a 1D structure cannot. For example:
if M is a 2D matrix, I can use M[i] directly as a vector; as far as I can tell this isn't possible with a 1D matrix, where it would cost an additional memory copy.
I also can't use shared memory because it's too small.
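For what it's worth, here is a minimal sketch of one way to get M[i]-style row access in global memory without a copy, assuming a row-major matrix allocated with cudaMallocPitch. The kernel name scale_row, the helper run_pitched_row_demo, and the sizes are made up for illustration; they are not from the posts above. The row pointer is just the base address plus row * pitch (pitch is in bytes), so no data is moved to obtain it.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Scale one row of a pitched matrix in place. `pitch` is the row stride in bytes.
__global__ void scale_row(float* M, size_t pitch, int width, int row, float factor)
{
    // Pointer to the start of the row: plays the role of M[row], no copy involved.
    float* v = reinterpret_cast<float*>(reinterpret_cast<char*>(M) + row * pitch);
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < width) v[col] *= factor;
}

// Hypothetical demo helper: fills a matrix with 1.0f, doubles row 3 on the
// device, copies it back, and returns true if the result is as expected.
bool run_pitched_row_demo()
{
    const int width = 100, height = 8, row = 3;
    std::vector<float> host(width * height, 1.0f);

    float* d_M = nullptr;
    size_t pitch = 0;
    if (cudaMallocPitch(reinterpret_cast<void**>(&d_M), &pitch,
                        width * sizeof(float), height) != cudaSuccess)
        return false;

    // Host rows are tightly packed; device rows are `pitch` bytes apart.
    cudaMemcpy2D(d_M, pitch, host.data(), width * sizeof(float),
                 width * sizeof(float), height, cudaMemcpyHostToDevice);

    scale_row<<<(width + 127) / 128, 128>>>(d_M, pitch, width, row, 2.0f);
    bool ok = (cudaGetLastError() == cudaSuccess);

    // cudaMemcpy2D on the default stream waits for the kernel to finish.
    ok = ok && (cudaMemcpy2D(host.data(), width * sizeof(float), d_M, pitch,
                             width * sizeof(float), height,
                             cudaMemcpyDeviceToHost) == cudaSuccess);
    cudaFree(d_M);

    // Only the selected row should have been doubled.
    for (int r = 0; r < height && ok; ++r)
        for (int c = 0; c < width; ++c)
            if (host[r * width + c] != (r == row ? 2.0f : 1.0f)) { ok = false; break; }
    return ok;
}
```

Calling run_pitched_row_demo() from main is enough to try it. The same idea works with a plain cudaMalloc allocation, where the row pointer is M + row * width; the pitched version only adds padding so that each row starts at an aligned address.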
If you're talking about memory exchange between host and device, this isn't a problem, since you can just load everything into device global memory once,
and when you want to operate on a vector, load it into shared memory before operating on it.
But this approach requires repeatedly loading a vector from global memory into shared memory, operating on it, and then writing the result back to global memory.
This results in a lot of reads and writes between global and shared memory. I'm wondering about this as well. Would it slow down the application at all? Or do threads read from shared memory so much faster than from global memory that it's worth all the extra copies?
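To make the trade-off concrete, here is a rough sketch of the staging pattern being discussed, assuming one thread block per row and a row small enough to fit in shared memory. The kernel normalize_rows and the helper run_normalize_demo are hypothetical names, not something from this thread. Each element is read from global memory once, then re-read from on-chip shared memory many times (every thread sums the whole row), and written back once. Staging tends to pay off when there is reuse like this; if each element is only touched once, the extra copy buys nothing.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// One block per row: stage the row in shared memory, then divide every
// element by the row sum. Expects `width * sizeof(float)` bytes of dynamic
// shared memory at launch.
__global__ void normalize_rows(const float* in, float* out, int width)
{
    extern __shared__ float row[];
    const float* src = in + blockIdx.x * width;

    // Single pass over global memory: each element is loaded exactly once.
    for (int c = threadIdx.x; c < width; c += blockDim.x) row[c] = src[c];
    __syncthreads();

    // Heavy reuse: every thread reads the entire row from shared memory.
    float sum = 0.0f;
    for (int c = 0; c < width; ++c) sum += row[c];

    // Single pass of writes back to global memory.
    for (int c = threadIdx.x; c < width; c += blockDim.x)
        out[blockIdx.x * width + c] = row[c] / sum;
}

// Hypothetical demo helper: a 4 x 256 matrix of ones should normalize to
// 1/256 everywhere. Returns true on success.
bool run_normalize_demo()
{
    const int width = 256, height = 4;
    const size_t bytes = width * height * sizeof(float);
    std::vector<float> host(width * height, 1.0f);

    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(reinterpret_cast<void**>(&d_in), bytes) != cudaSuccess) return false;
    if (cudaMalloc(reinterpret_cast<void**>(&d_out), bytes) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, host.data(), bytes, cudaMemcpyHostToDevice);

    normalize_rows<<<height, 128, width * sizeof(float)>>>(d_in, d_out, width);
    bool ok = (cudaGetLastError() == cudaSuccess);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    ok = ok && (cudaMemcpy(host.data(), d_out, bytes,
                           cudaMemcpyDeviceToHost) == cudaSuccess);
    cudaFree(d_in);
    cudaFree(d_out);

    for (size_t i = 0; i < host.size() && ok; ++i)
        if (std::fabs(host[i] - 1.0f / width) > 1e-6f) ok = false;
    return ok;
}
```

Whether this is actually faster than reading global memory directly depends on the access pattern and the hardware, so it is worth profiling both versions rather than assuming.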