Changing the order of (permuting) matrices

CudaaduC · January 19, 2015, 2:46am

If you have, let's say, 400 matrices in contiguous column-major format, all with the same dimensions (for example 800x1000), what would be the method with the smallest memory footprint for changing their order based on an integer array which maps each current index to another index?

The inputs would be an array of matrices [800][1000][400] and an integer array of size 400 which maps index i (destination) to index j (source). It is possible that they may be the same index.

I tried two approaches: a kernel which somewhat vectorizes the loads (I used float4),

and a host-side loop like this:

int copy_from = 0;

for (int i = 0; i < num_matrices; i++) {
    // indices[i] is the source matrix for destination slot i
    copy_from = indices[i];
    // device-to-device copy of one whole matrix into its new position
    err = cudaMemcpy((float*)(&D_out[i * plane_big_dim]), (float*)(&D_Matrices[copy_from * plane_big_dim]), single_matrix_bytes, cudaMemcpyDeviceToDevice);
    if (err != cudaSuccess) { printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__); }
}

It turns out that the host-controlled device memcpy above is the fastest at 3 ms on the GTX 980, with my device-only kernel taking about 12 ms and the equivalent MATLAB CPU version taking 300 ms.

Even though the loop approach is fast enough, I am not loving the fact that I need two large buffers when I would prefer to use less device memory (each buffer is over 1 GB, and I have other device allocations).

This is not a swap situation, where i->j and j->i, but rather a reordering to a custom permutation given by the input integer array of indices.

Any other approach ideas which may reduce my memory requirements?

little_jimmy · January 19, 2015, 4:47am

nothing for nothing - meaning you would likely conduct a trade-off

you now likely attain speed, at the cost of memory (footprint)

you could exchange speed for memory footprint

an elementary example would be to swap, rather than to move

if matrix i needs to go to order/ position j, rather swap j and i, instead of moving i and everything else to a new memory allocation - the buffer

would likely double the time spent on transfers, at a fraction of memory footprint you require now to buffer transfers

then again you might use 2 matrix host buffers, and 2 streams

move current matrix at location a to host buffer, move new matrix at location b into current location at location a, move matrix in host buffer back to location b

if you use streams, you may ‘hide’ the double move, because the direction differs

CudaaduC · January 19, 2015, 5:22am

Yes, that is an idea worth exploring. Using streams might work as you suggest, but I wonder if the copies might be too large to get any parallel operation. Most of my experience with streams has been in the context of small compute kernels using cuBLAS or cuSPARSE, and I noticed that I would only get the desired compute overlap if the size (launch configuration) of the kernels was relatively small.

I will give that strategy a go, and report if I get good results. Thanks for the suggestion.

The other option, while slower, would be to use a pinned (page-locked) CPU memory buffer, and then copy directly from that buffer to the device buffer at 12 GB/s via a cudaMemcpyHostToDevice call in each iteration of the CPU loop.

But since I am getting my data via a MATLAB mex call, that pointer will be to slow pageable memory, so I will have to allocate another pinned buffer in host memory, copy the MATLAB memory to that host pinned buffer and then work with that.

I wonder if there is any way from the MATLAB side to get a pinned host buffer which CUDA will recognize as such? Somehow I doubt that is possible…

Many times when I have to work implementing/converting a MATLAB script into a CUDA mex, I have to deal with the GPU device memory limitations. Some of the MATLAB scripts I get can use as much as 14 GB of memory in the course of a script, so that leaves me with the issue of how to get that to work with 1/3 of the memory on the GPU.

little_jimmy · January 19, 2015, 5:41am

the idea behind streams was to get the underlying memory copies parallel, such that you can exploit both directions of the pci bus, to pay less of a penalty for exchanging speed for memory

while stream 1 is moving a matrix to the host to buffer it, to free the slot the matrix currently occupies, stream 2 is moving its host buffered matrix back to its new slot on the device

the presumption here is of course that the final destination is on the device

“each iteration of the CPU loop”

you may exploit the asynchronous nature of streams such that you need far less pinned memory - you copy (a block) away while the next matlab mex call is writing to the other block

"Many times when I have to work implementing/converting a MATLAB script into a CUDA mex, I have to deal with the GPU device memory limitations. Some of the MATLAB scripts I get can use as much as 14 GB of memory in the course of a script, so that leaves me with the issue of how to get that to work with 1/3 of the memory on the GPU."

it is phenomena like this that are pushing me to leave high end ssds on the design table as an option to consider - the option to park some data close by

little_jimmy · January 19, 2015, 1:51pm

there may actually be a better approach

you are copying sequentially:

for(int i=0;i<num_matrices;i++){
copy_from=indices[i];

you could equally copy according to destination; that way you only have 1 redundant matrix copy, and only need a buffer the size of 1 matrix

consider the reordering

1->3
3->2
2->1

now:

allocate a buffer the size of one matrix

move 3 into the buffer
move 1 into 3
move 2 into 1
move buffer into 2

essentially, the 1st destination you move into the buffer
thereafter, you move according to newly opened destination
at the end, you move the buffer into the last opened destination
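
The cycle walk described above could be sketched on the host roughly as follows. This is a minimal sketch, not code from the thread: it uses plain host memory and std::memcpy in place of the device copies, it follows the original poster's convention that indices[i] is the source plane for destination slot i (so it saves the first slot of each cycle and pulls from sources, rather than buffering the first destination), and it assumes the index array is a true permutation with no repeated sources. The function name permute_planes_in_place is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Reorders equally sized planes in place so that, afterwards, plane i holds
// what plane indices[i] held before the call. Only one spare plane is used.
// Assumption: indices is a permutation of 0..count-1 (no duplicates).
void permute_planes_in_place(float* data, std::size_t plane_elems,
                             const std::vector<int>& indices) {
    const std::size_t count = indices.size();
    const std::size_t plane_bytes = plane_elems * sizeof(float);
    std::vector<float> spare(plane_elems);   // the single extra buffer
    std::vector<bool> done(count, false);    // slots already in final state

    for (std::size_t i = 0; i < count; ++i) {
        assert(indices[i] >= 0 && static_cast<std::size_t>(indices[i]) < count);
    }

    for (std::size_t start = 0; start < count; ++start) {
        if (done[start]) continue;
        if (static_cast<std::size_t>(indices[start]) == start) {
            done[start] = true;              // plane stays where it is
            continue;
        }
        // Save the plane at the head of this cycle; its slot is now free.
        std::memcpy(spare.data(), data + start * plane_elems, plane_bytes);
        std::size_t dst = start;
        // Keep filling the most recently freed slot from its source.
        while (static_cast<std::size_t>(indices[dst]) != start) {
            const std::size_t src = static_cast<std::size_t>(indices[dst]);
            std::memcpy(data + dst * plane_elems, data + src * plane_elems,
                        plane_bytes);
            done[dst] = true;
            dst = src;
        }
        // The last freed slot receives the saved plane.
        std::memcpy(data + dst * plane_elems, spare.data(), plane_bytes);
        done[dst] = true;
    }
}
```

On the device, the three memcpy calls would presumably become device-to-device copies and the spare vector a single device allocation of one matrix. Each cycle costs one extra copy, which matches the "1 redundant matrix copy" noted above when the permutation is a single cycle.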

Robert_Crovella · January 19, 2015, 3:05pm

Don’t move the matrices, and instead just use a set of mapping indexes and pointers. Since the arrays are contiguous, this mapping/pointer creation process could be done once, and a set of pointers used to refer to the relevant matrices, in whatever order you wish.
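
One way to picture the pointer mapping is the host-side sketch below. It is an illustration only: the helper name build_plane_pointers is hypothetical, and it assumes the same convention as the original question, where indices[i] names the matrix that logical position i should refer to.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Builds one pointer per logical position. Entry i points at plane indices[i]
// inside the contiguous block starting at base; no matrix data is moved.
std::vector<const float*> build_plane_pointers(const float* base,
                                               std::size_t plane_elems,
                                               const std::vector<int>& indices) {
    std::vector<const float*> planes(indices.size());
    for (std::size_t i = 0; i < indices.size(); ++i) {
        assert(indices[i] >= 0 &&
               static_cast<std::size_t>(indices[i]) < indices.size());
        planes[i] = base + static_cast<std::size_t>(indices[i]) * plane_elems;
    }
    return planes;
}
```

The extra memory is one pointer per matrix (400 pointers here) instead of a second 1 GB buffer. If kernels are to index through the table, it would presumably have to be copied to device memory first, with base being the device pointer of the matrix block.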

Having said that, your number of 3 ms sounds amazingly fast. For float matrices of the dimensions given, you are moving about 1.28 GB of data in 3 ms, which is on the order of 400 GB/s throughput, and ~800 GB/s if I consider that each copy operation requires a read and a write (if I have done my arithmetic correctly).
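
The arithmetic behind that estimate can be checked with a few lines. The helper names below are hypothetical, and 1 GB is taken as 1e9 bytes.

```cpp
#include <cassert>
#include <cstddef>

// Total payload in bytes for `count` matrices of rows x cols elements.
double payload_bytes(std::size_t rows, std::size_t cols, std::size_t count,
                     std::size_t elem_bytes) {
    return static_cast<double>(rows) * cols * count * elem_bytes;
}

// Effective one-way throughput in GB/s (1 GB = 1e9 bytes).
double effective_gb_per_s(double bytes, double milliseconds) {
    assert(milliseconds > 0.0);
    return (bytes / 1e9) / (milliseconds / 1e3);
}
```

With 800 x 1000 x 400 floats the payload is 1.28e9 bytes, so 3 ms works out to about 427 GB/s one way and about 853 GB/s counting both the read and the write. That is well above the roughly 224 GB/s theoretical memory bandwidth of a GTX 980, which is why the figure looks suspicious.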

CudaaduC · January 20, 2015, 4:46am

Yes, you are right.
I added a cudaDeviceSynchronize() before the end timer, and that showed a copy time of 30 ms.
The device kernel I wrote takes 12-14 ms, so that method will be the one I use.

I always add a cudaDeviceSynchronize() after all kernel calls, but this time forgot to do so with the cudaMemcpy() calls in the host loop.

Robert_Crovella · January 20, 2015, 5:21am

Yes, it’s needed in the cudaMemcpy D->D case:

CUDA Runtime API :: CUDA Toolkit Documentation

“For transfers from device memory to device memory, no host-side synchronization is performed.”
