Changing the order(permuting) matrices

If you have lets say 400 matrices in contiguous column major format all with the same dimensions, for example 800x1000, what would be the smallest memory footprint method of changing the order based on an integer array which maps each current index to another index?

The inputs would be an array of matrices [800][1000][400] and an integer array of size 400 which maps index i(destination) to index j(source) (it is possible that they may be the same index).

I tried two approaches, one kernel which somewhat vectorizes the loads ( I used float4),

and a host side loop like this:

int copy_from=0;

for(int i=0;i<num_matrices;i++){
    copy_from=indices[i];
    err=cudaMemcpy((float*)(&D_out[i*plane_big_dim]),(float*)(&D_Matrices[copy_from*plane_big_dim]),single_matrix_bytes,cudaMemcpyDeviceToDevice);
    if(err!=cudaSuccess){printf("%s in %s at line %d\n",cudaGetErrorString(err),__FILE__,__LINE__);}

			
}

It ends up the above host controlled device memcpy is fastest at 3ms on the GTX 980, with my device-only kernel taking about 12 ms, and the equivalent MATLAB CPU version taking 300 ms.

Even though the loop approach is fast enough, I am not loving the fact I need two large buffers, when I would prefer to use less device memory (each buffer is over 1GB and I have other device allocations).

This is not a swap situation, where i->j and j->i, rather a reordering to a custom permutation which is that input integer array of indices.

Any other approach ideas which may reduce my memory requirements?

nothing for nothing - meaning you would likely conduct a trade-off

you now likely attain speed, at the cost of memory (footprint)

you could exchange speed for memory footprint

an elementary example would be to swap, rather than to move

if matrix i needs to go to order/ position j, rather swap j and i, instead of moving i and everything else to a new memory allocation - the buffer

would likely double the time spent on transfers, at a fraction of memory footprint you require now to buffer transfers

then again you might use 2 matrix host buffers, and 2 streams

move current matrix at location a to host buffer, move new matrix at location b into current location at location a, move matrix in host buffer back to location b

if you use streams, you may ‘hide’ the double move, because the direction differs

Yes, that is an idea with exploring. Using streams might work as you suggest, but I wonder if the data size of the copies might be too large to get any parallel operations. Most of my experience with streams has been in the context of small compute kernels using cuBLAS or cuSPARSE, and I noticed that only if the size(launch configuration) of the kernels were relatively small I would get the desired compute overlap.

I will give that strategy a go, and report if I get good results. Thanks for the suggestion.

The other option, while slower, would be to pinned (page locked) CPU memory buffer, and then copy directly from that buffer to the device buffer at 12 GBs via a cudaMemcpyHostToDevice call each iteration of the CPU loop.

But since I am getting my data via a MATLAB mex call, that pointer will be to slow pageable memory, so I will have to allocate another pinned buffer in host memory, copy the MATLAB memory to that host pinned buffer and then work with that.

I wonder if there is any way from the MATLAB side to get a pinned host buffer which CUDA will recognize as such? Somehow I doubt that is possible…

Many times when I have to work implementing/converting a MATLAB script into a CUDA mex, I have to deal with the GPU device memory limitations. Some of the MATLAB scripts I get can use as much as 14 GB of memory in the course of a script, so that leaves me with the issue of how to get that to work with 1/3 of the memory on the GPU.

the idea behind streams was to get the underlying memory copies parallel, such that you can exploit both directions of the pci bus, to pay less of a penalty for exchanging speed for memory

while stream 1 is moving a matrix to the host to buffer it, to free the slot the matrix currently occupies, stream 2 is moving its host buffered matrix back to its new slot on the device

the presumption here is of course that the final destination is on the device

“each iteration of the CPU loop”

you may exploit the asynchronous nature of streams such that you need far less pinned memory - you copy (a block) away while the next matlab mex call is writing to the other block

"Many times when I have to work implementing/converting a MATLAB script into a CUDA mex, I have to deal with the GPU device memory limitations. Some of the MATLAB scripts I get can use as much as 14 GB of memory in the course of a script, so that leaves me with the issue of how to get that to work with 1/3 of the memory on the GPU. "

it is phenomena like this that is pushing me to leave high end ssds on the design table as an option to consider - the option to park some data closeby

there may actually be a better approach

you are copying sequentially:

for(int i=0;i<num_matrices;i++){
copy_from=indices[i];

you could equally copy according to destination; that way you only have 1 redundant matrix copy, and only need a buffer the size of 1 matrix

consider the reordering

1->3
3->2
2->1

now:

allocate a buffer the size of matrix

move 3 into the buffer
move 1 into 3
move 2 into 1
move buffer into 2

essentially, the 1st destination you move into the buffer
thereafter, you move according to newly opened destination
at the end, you move the buffer into the last opened destination

Don’t move the matrices, and instead just use a set of mapping indexes and pointers. Since the arrays are contiguous, this mapping/pointer creation process could be done once, and a set of pointers used to refer to the relevant matrices, in whatever order you wish.

Having said that, your number of 3ms sounds amazingly fast. For float matrices of the dimensions given, you are moving 1.2GB of data in 3ms, which is on the order of 400GB/s throughput, and ~800GB/s if I consider that each copy operation requires a read and a write ( if I have done my arithmetic correctly).

Yes, you are right.
I added a cudaDeviceSynchronize() before the end timer, and that showed a copy time of 30 ms.
The device kernel I wrote takes 12-14 ms, so that method will be the one I use.

I always add a cudaDeviceSynchronize() after all kernel calls, but this time forgot to do so with the cudaMemcpy() calls in the host loop.

Yes, it’s needed in the cudaMemcpy D->D case:

[url]CUDA Runtime API :: CUDA Toolkit Documentation

“For transfers from device memory to device memory, no host-side synchronization is performed.”