concatenate using cuDNN

I am implementing a layer for depth concatenation, i.e. concatenating multiple inputs in NCHW format along the C dimension. Is there a way to do this using the cuDNN library? I don't find any API that does concatenation like this. I wrote my own CUDA kernel that does this for two inputs, which works, but I am looking for a faster implementation. Has anyone used the existing cuDNN APIs cleverly to do the same?

/* Device code */
__global__ void concat2Impl(float* in1,
                            size_t numElems1,     // c*h*w for input1
                            float* in2,
                            size_t numElems2,     // c*h*w for input2
                            float* out,
                            size_t maxElems,      // n*c*h*w for output
                            int numElemsPerBatch  // c*h*w for output
                            )
{
    /* Calculate the global linear index, assuming a 1-d grid. */
    size_t i = blockDim.x * blockIdx.x + threadIdx.x;
    for (; i < maxElems; i += size_t(blockDim.x * gridDim.x))
    {
        // compute batchIndex and batchOffset
        size_t batchIdx = i / numElemsPerBatch;
        int batchOffset = i - batchIdx * numElemsPerBatch;
        out[i] = (batchOffset < numElems1)
                     ? in1[batchOffset + batchIdx * numElems1]
                     : in2[(batchOffset - numElems1) + batchIdx * numElems2];
    }
}
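For sanity-checking the kernel's indexing, the same loop can be expressed as a host-side C++ reference (a plain CPU sketch of the two-input depth concat above, not a cuDNN call; the function name is made up):

```cpp
#include <cstddef>
#include <vector>

// CPU reference mirroring the kernel: concatenate two NCHW tensors
// along C. numElems1/numElems2 are c*h*w per input; n is the shared batch.
std::vector<float> concat2Ref(const std::vector<float>& in1, size_t numElems1,
                              const std::vector<float>& in2, size_t numElems2,
                              size_t n)
{
    size_t perBatch = numElems1 + numElems2;  // c*h*w of the output
    std::vector<float> out(n * perBatch);
    for (size_t i = 0; i < out.size(); ++i) {
        size_t batchIdx = i / perBatch;
        size_t off = i - batchIdx * perBatch;
        out[i] = (off < numElems1)
                     ? in1[off + batchIdx * numElems1]
                     : in2[(off - numElems1) + batchIdx * numElems2];
    }
    return out;
}
```

Running the device kernel and this reference on the same inputs and comparing the outputs is a quick way to validate changes to the indexing.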

Maybe you can put the outputs into a bigger memory space. You would have to map/test it out to make sure it works.

Example: an NCHW tensor of [1,5,5,5] and another of [1,5,5,5] can be concatenated to [1,10,5,5] for the input descriptor of the next layer. You would pass the memory, let's say, in the activation:

cudnnActivationForward(adesc, alpha, x1desc, x1, beta, ydesc, y)
cudnnActivationForward(adesc, alpha, x2desc, x2, beta, ydesc, &y[125])

ydesc would probably still need to be [1,5,5,5], but a new descriptor for the next layer's input would read [1,10,5,5]. y would also have to be twice the size of x1 and x2.
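The &y[125] offset above is just the first input's c*h*w element count; a tiny helper makes that explicit (hypothetical name, N == 1 case):

```cpp
#include <cstddef>

// Element offset in y where the second input's channels start,
// for NCHW tensors with N == 1 (hypothetical helper name).
size_t secondInputOffset(size_t c1, size_t h, size_t w)
{
    return c1 * h * w;  // skip past input 1's channels
}
```

For the [1,5,5,5] example, secondInputOffset(5, 5, 5) gives 125, matching the pointer arithmetic in the second call.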

If you want to do this in NHWC, you would have to transform it to NCHW (cudnnTransformTensor? I've never used it, so I am not sure), concat it, and then either change it back to NHWC or keep it NCHW (probably the better choice if you are concatenating a lot).

If N > 1, things change, because the batches stay interleaved in memory and can't stack like channels. You would have to split, like X1[2,5,5,5] and X2[2,5,5,5] -> X1A[1,5,5,5], X1B[1,5,5,5], X2A[1,5,5,5], X2B[1,5,5,5], then put them into 4 activation calls.

//xdesc[1,5,5,5]  
//ydesc[1,5,5,5]

cudnnActivationForward(adesc, alpha, xdesc, X1,       beta, ydesc, y)        // batch 1, channels 0 to 4
cudnnActivationForward(adesc, alpha, xdesc, &X1[125], beta, ydesc, &y[250])  // batch 2, channels 0 to 4
cudnnActivationForward(adesc, alpha, xdesc, X2,       beta, ydesc, &y[125])  // batch 1, channels 5 to 9
cudnnActivationForward(adesc, alpha, xdesc, &X2[125], beta, ydesc, &y[375])  // batch 2, channels 5 to 9

Then, the next input tensor should be [2,10,5,5].
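The per-batch splitting generalizes: for every batch index and every input you need one copy, with the destination offset advancing by the output's per-batch size. A hedged sketch of that bookkeeping in plain C++ (no cuDNN; struct and function names are made up):

```cpp
#include <cstddef>
#include <vector>

struct CopyPlan {        // pointer math for one "activation" call
    size_t inputIdx;     // which source tensor
    size_t srcOffset;    // element offset into that source
    size_t dstOffset;    // element offset into y
    size_t count;        // c*h*w elements to copy
};

// perInputElems[k] = c_k * h * w for input k; n is the shared batch size.
std::vector<CopyPlan> concatPlan(const std::vector<size_t>& perInputElems,
                                 size_t n)
{
    size_t outPerBatch = 0;
    for (size_t e : perInputElems) outPerBatch += e;

    std::vector<CopyPlan> plan;
    for (size_t b = 0; b < n; ++b) {
        size_t dst = b * outPerBatch;   // start of this batch in y
        for (size_t k = 0; k < perInputElems.size(); ++k) {
            plan.push_back({k, b * perInputElems[k], dst, perInputElems[k]});
            dst += perInputElems[k];
        }
    }
    return plan;
}
```

For two [2,5,5,5] inputs (125 elements each per batch), concatPlan({125, 125}, 2) reproduces the four calls above: destination offsets 0, 125, 250, and 375.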