If you have time, I have a performance problem and would appreciate some help:

I have two pointers and want to relocate the data behind the first one into the second one. For example, the first pointer holds {1,2,3,4,5,6,7,8,9,10} and the second holds {0,0,0,0,0,0,0,0,0,0,0,0,0,0}. After relocating, the second one becomes {1,2,3,4,5,0,0,6,7,8,9,10,0,0}.

I tried relocating the data with pointer-based copies, including `cublasScopy` and `cudaMemcpy`, but that was too slow. In the end I chose to write a `__global__` function (a kernel) instead.
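For the plain relocation (without the window multiply), a single strided copy may be worth measuring against a loop of `cudaMemcpy` calls. The sketch below uses `cudaMemcpy2D`, treating each group of `Unit` values as one row; `stridedCopy` is a hypothetical helper name, and it assumes both buffers are already on the device and that `Interval >= Unit`. Whether it is actually faster than a kernel has not been measured here.

```cuda
#include <cuComplex.h>
#include <cuda_runtime.h>

// Hypothetical helper (not from the original post): copies num rows of Unit
// elements from a densely packed device buffer into a device buffer whose rows
// start every Interval elements. The gap elements between Unit and Interval in
// each output row are left untouched. Assumes Interval >= Unit.
cudaError_t stridedCopy(cuFloatComplex* dOut, const cuFloatComplex* dIn,
                        int Unit, int Interval, int num)
{
    return cudaMemcpy2D(dOut, Interval * sizeof(cuFloatComplex),  // destination pitch in bytes
                        dIn,  Unit * sizeof(cuFloatComplex),      // source pitch in bytes
                        Unit * sizeof(cuFloatComplex),            // bytes copied per row
                        num,                                      // number of rows
                        cudaMemcpyDeviceToDevice);
}
```

This only covers the copy itself; multiplying by `tukey` still needs a kernel or a separate pass.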

I first wrote my code like this:

```
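#include <cuComplex.h>

// Scale a complex number by a real constant.
// Defined before the kernel so it is declared at the point of use.
__device__ cuFloatComplex complexMul(cuFloatComplex a, float constant)
{
    cuFloatComplex result;
    result.x = a.x * constant;
    result.y = a.y * constant;
    return result;
}

// One thread per position i inside a group of Unit values;
// each thread loops over all num groups.
__global__ void GetOverlapData(cuFloatComplex* Input, cuFloatComplex* Output, float* tukey, int Unit, int Interval, int num)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < Unit)
    {
        for (int j = 0; j < num; j++)
        {
            // Group j is read with stride Unit and written with stride Interval.
            Output[i + j * Interval] = complexMul(Input[i + j * Unit], tukey[i]);
        }
    }
}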
```

That is, I used a one-dimensional thread index, and the result was correct. `complexMul` is a function I use to multiply a complex number by a real constant. But I still thought it was not fast enough, so I rewrote the kernel with a two-dimensional index:

```
__global__ void GetOverlapData(cuFloatComplex* Input, cuFloatComplex* Output, float* tukey, int Unit, int Interval, int num)
{
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    if (row < num && col < Unit)
    {
        Output[row * Interval + col] = complexMul(Input[row * Unit + col], tukey[Unit]);
    }
}
```

Then the problems started: I can't get the right answer. I copied the data from `device` to `host` and printed it; some of the values are right, but most of them are zeros.
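Comparing the two kernels, one difference stands out: the 1D version reads `tukey[i]`, while the 2D version reads `tukey[Unit]`. If `tukey` holds `Unit` window values, that index is one past the end, and `tukey[col]` would be the matching index. The sketch below is written under that assumption; `GetOverlapData2D` and `runOverlap2D` are hypothetical names, the 32x8 block shape is an arbitrary choice, and the helper assumes the input holds at least `num * Unit` values and the output `num * Interval`.

```cuda
#include <cuComplex.h>
#include <cuda_runtime.h>
#include <vector>

// Scale a complex number by a real constant (same role as complexMul in the post).
__device__ cuFloatComplex complexMul(cuFloatComplex a, float constant)
{
    return make_cuFloatComplex(a.x * constant, a.y * constant);
}

// 2D variant: row selects the group, col the position inside it.
// Assumption: tukey has Unit entries, so it is indexed by col, not by Unit.
__global__ void GetOverlapData2D(const cuFloatComplex* Input, cuFloatComplex* Output,
                                 const float* tukey, int Unit, int Interval, int num)
{
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    if (row < num && col < Unit)
    {
        Output[row * Interval + col] = complexMul(Input[row * Unit + col], tukey[col]);
    }
}

// Hypothetical host helper: uploads the inputs, launches the kernel, downloads
// the result. Returns an empty vector if any CUDA call fails.
std::vector<cuFloatComplex> runOverlap2D(const std::vector<cuFloatComplex>& in,
                                         const std::vector<float>& window,
                                         int Unit, int Interval, int num)
{
    std::vector<cuFloatComplex> out(static_cast<size_t>(num) * Interval);
    cuFloatComplex *dIn = nullptr, *dOut = nullptr;
    float* dWin = nullptr;
    size_t inBytes = in.size() * sizeof(cuFloatComplex);
    size_t outBytes = out.size() * sizeof(cuFloatComplex);
    size_t winBytes = window.size() * sizeof(float);

    bool ok = cudaMalloc(&dIn, inBytes) == cudaSuccess
           && cudaMalloc(&dOut, outBytes) == cudaSuccess
           && cudaMalloc(&dWin, winBytes) == cudaSuccess
           && cudaMemset(dOut, 0, outBytes) == cudaSuccess  // gaps stay zero
           && cudaMemcpy(dIn, in.data(), inBytes, cudaMemcpyHostToDevice) == cudaSuccess
           && cudaMemcpy(dWin, window.data(), winBytes, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok)
    {
        dim3 block(32, 8);  // 256 threads per block
        // Round up so the grid covers all Unit columns and all num rows.
        dim3 grid((Unit + block.x - 1) / block.x, (num + block.y - 1) / block.y);
        GetOverlapData2D<<<grid, block>>>(dIn, dOut, dWin, Unit, Interval, num);
        ok = cudaGetLastError() == cudaSuccess
          && cudaMemcpy(out.data(), dOut, outBytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn);
    cudaFree(dOut);
    cudaFree(dWin);
    if (!ok) out.clear();
    return out;
}
```

The launch configuration is the other thing worth checking. A launch that fails (for example, a block with more than 1024 threads) leaves the output buffer at whatever it held before, which can look like "mostly zeros", so testing `cudaGetLastError()` right after the launch is a cheap way to rule that out.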