Converting a 2D matrix from short to float takes a long time

Hello,

The input to the following kernel is a 100 (ny) x 2700 (nx) matrix of complex short elements.
All it does is convert the 2D matrix into a complex float one.

typedef struct cuShortComplex
{
short x;
short y;
}cuShortComplex;

__global__ void short_to_float (cuShortComplex *pSrc, cufftComplex *pDest, int nx, int ny)
{
  int ix = threadIdx.x + blockIdx.x * blockDim.x;
  int iy = threadIdx.y + blockIdx.y * blockDim.y;
  int idx = iy*X0 + ix;

  pDest[idx].x = (float)pSrc[idx].x;
  pDest[idx].y = (float)pSrc[idx].y;
}

dim3 block (32,32);
dim3 grid ((nx+block.x-1)/block.x, (ny+block.y-1)/block.y);
short_to_float <<<grid, block>>> (pSrc, pDest, nx, ny);

I checked the time required to run the kernel.
It takes 0.6 - 1.5 msec on Tegra TX2.

This does not make sense.
Why does it take so long?

Thank you,
Zvika

This might run faster:

__global__ void short_to_float (cuShortComplex *pSrc, cufftComplex *pDest, int nx, int ny)
{
  int ix = threadIdx.x + blockIdx.x * blockDim.x;
  int iy = threadIdx.y + blockIdx.y * blockDim.y;
  int idx = iy*X0 + ix;

  cuShortComplex i = pSrc[idx];
  cufftComplex o;
  o.x = (float)i.x;
  o.y = (float)i.y;
  pDest[idx]=o;

}
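
For anyone who wants something compilable, here is a fuller sketch of the same idea. The `pitch` parameter is an assumption standing in for `X0`, which is not defined in this thread. The bounds check is also an addition: with a 32x32 block the grid is rounded up to 2720 x 128 threads, so if the rows are exactly 2700 elements long, some threads would otherwise index past a 2700 x 100 matrix. The host code and test data are illustrative only.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cufft.h>   // used only for the cufftComplex type; no cuFFT calls are made

typedef struct cuShortComplex { short x; short y; } cuShortComplex;

// pitch = elements per row (assumed to play the role of X0 in the posts above)
__global__ void short_to_float(const cuShortComplex *pSrc, cufftComplex *pDest,
                               int nx, int ny, int pitch)
{
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    int iy = threadIdx.y + blockIdx.y * blockDim.y;
    if (ix >= nx || iy >= ny) return;   // grid is rounded up, so guard the edges

    int idx = iy * pitch + ix;
    cuShortComplex i = pSrc[idx];       // read the input element once
    cufftComplex o;
    o.x = (float)i.x;
    o.y = (float)i.y;
    pDest[idx] = o;                     // write the whole output element at once
}

int main()
{
    const int nx = 2700, ny = 100, pitch = nx;   // sizes from the question
    const size_t n = (size_t)pitch * ny;

    // Arbitrary test data that fits in a short
    std::vector<cuShortComplex> hSrc(n);
    for (size_t k = 0; k < n; ++k) {
        hSrc[k].x = (short)(k % 1000);
        hSrc[k].y = (short)(-(int)(k % 777));
    }

    cuShortComplex *dSrc = nullptr;
    cufftComplex *dDst = nullptr;
    cudaMalloc(&dSrc, n * sizeof(cuShortComplex));
    cudaMalloc(&dDst, n * sizeof(cufftComplex));
    cudaMemcpy(dSrc, hSrc.data(), n * sizeof(cuShortComplex), cudaMemcpyHostToDevice);

    dim3 block(32, 32);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    short_to_float<<<grid, block>>>(dSrc, dDst, nx, ny, pitch);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Compare against a host-side conversion (short -> float is exact)
    std::vector<cufftComplex> hDst(n);
    cudaMemcpy(hDst.data(), dDst, n * sizeof(cufftComplex), cudaMemcpyDeviceToHost);
    size_t mismatches = 0;
    for (size_t k = 0; k < n; ++k) {
        if (hDst[k].x != (float)hSrc[k].x || hDst[k].y != (float)hSrc[k].y)
            ++mismatches;
    }
    printf("mismatches: %zu\n", mismatches);

    cudaFree(dSrc);
    cudaFree(dDst);
    return mismatches == 0 ? 0 : 1;
}
```
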

Please study and learn to use the text formatting tools in the toolbar at the top of the box where you are editing your post. For example, the button that looks like this:

</>

on the far right of the toolbar can be used to identify text that should be presented as code (like mine is, above). Select the text, then press that button.

Hello Robert,

Thank you very much !

It seems the runtime is much better than on the CPU.
I measured the same scenario on one core of a Core i7-7500 @ 2.7 GHz.
There it takes ~15 msec.
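
For readers comparing numbers like these, one common way to time a kernel on the GPU side is with CUDA events, after a warm-up launch, averaging over several runs. The sketch below assumes a row length equal to `nx` (in place of the undefined `X0`), adds a bounds check, and uses a hypothetical iteration count of 100. It is not the measurement method used in this thread, which was not described.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>   // used only for the cufftComplex type; no cuFFT calls are made

typedef struct cuShortComplex { short x; short y; } cuShortComplex;

// Same conversion as above; pitch (elements per row) is an assumed stand-in for X0
__global__ void short_to_float(const cuShortComplex *pSrc, cufftComplex *pDest,
                               int nx, int ny, int pitch)
{
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    int iy = threadIdx.y + blockIdx.y * blockDim.y;
    if (ix >= nx || iy >= ny) return;
    int idx = iy * pitch + ix;
    cuShortComplex i = pSrc[idx];
    cufftComplex o;
    o.x = (float)i.x;
    o.y = (float)i.y;
    pDest[idx] = o;
}

int main()
{
    const int nx = 2700, ny = 100, pitch = nx;
    const size_t n = (size_t)pitch * ny;
    const int iters = 100;               // hypothetical repeat count for averaging

    cuShortComplex *dSrc = nullptr;
    cufftComplex *dDst = nullptr;
    cudaMalloc(&dSrc, n * sizeof(cuShortComplex));
    cudaMalloc(&dDst, n * sizeof(cufftComplex));
    cudaMemset(dSrc, 0, n * sizeof(cuShortComplex));   // contents do not matter for timing

    dim3 block(32, 32);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);

    // Warm-up launch so one-time startup cost is not counted in the measurement
    short_to_float<<<grid, block>>>(dSrc, dDst, nx, ny, pitch);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int it = 0; it < iters; ++it)
        short_to_float<<<grid, block>>>(dSrc, dDst, nx, ny, pitch);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);          // wait until all timed launches have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average kernel time: %.3f ms\n", ms / iters);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dSrc);
    cudaFree(dDst);
    return 0;
}
```

Timing only the first launch, or including host-side setup in the interval, can give a noticeably larger and more variable figure than the averaged value printed here.
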

Best regards,
Zvika