Hi everyone,

I’m looking for a suitable cuBLAS function which perform (double complex) element-wise vector multiplication. I have try to search for an answer and people said that there is no such function like it in cuBLAS. However, there are someone here https://stackoverflow.com/questions/7621520/element-wise-vector-vector-multiplication-in-blas said that using sbmv function and treat the vector as diagonal matrix would work. But cuBLAS lib only provide Dsbmv and Csbmv, therefore I cannot use it.

I wrote a naive custom kernel to perform this operation but the time it need to complete didn’t satisfy me. For example: my kernel took twice the amount of time compared to Zdotc() kernel (which not only do the element-wise multiplication but also do the element sum after that). To be honest, it greatly discourages me to use my own kernel! :))

On the other hand, can anyone suggest me a better algorithm for swapping half spaces within each dimension of a matrix? I’m writting a FFTShift function in CUDA environment which behave similar to matlab fftshift() function. What it does is “swaps half-spaces of matrix X along each dimension”. To simplify the task, I’ve narrowed down the scope to swapping spaces for 3D matrix with even number of elements on each dimension (NX*NY*NZ with NX%2 = NY%2 = NZ%2 = 0). And my strategy to do it is executing 3 kernels consecutively: cuFFTShiftZ() -> cuFFTShiftY() -> cuFFTShiftX() which perform the task on each dimension. Here are my kernels:

```
// ShiftZ simply do the swapping elements of 2 vector with each other (same as cublasZswap)
__global__ void cuFFTShiftZ(cuDoubleComplex *a, cuDoubleComplex *b){
int index = blockIdx.x*blockDim.x + threadIdx.x;
cuDoubleComplex temp;
temp = a[index];
a[index] = b[index];
b[index] = temp;
}
// ShiftY locates elements belong to the first half of dimension Y and swap them
__global__ void cuFFTShiftY(cuDoubleComplex *a){
int index = blockIdx.x*blockDim.x + threadIdx.x;
int total = NX*NY;
int yindex = index % total;
cuDoubleComplex temp;
if (yindex < (total / 2)){
temp = a[index];
a[index] = a[index + (total / 2)];
a[index + (total / 2)] = temp;
}
}
// ShiftX locates elements belong to the first half of dimension X and swap them
__global__ void cuFFTShiftX(cuDoubleComplex *a){
int index = blockIdx.x*blockDim.x + threadIdx.x;
int xindex = index % NX;
cuDoubleComplex temp;
if (xindex < (NX / 2)){
temp = a[index];
a[index] = a[index + (NX / 2)];
a[index + (NX / 2)] = temp;
}
}
```

And I’m going execute them as below:

```
// d_a is the input matrix which is stored with NX as leading dimension
// NX, NY, NZ in this example is 30, 64, 64
dim3 numThreads = 128;
dim3 numBlocks = NX*NY*NZ / numThreads;
cuFFTShiftZ <<<numBlocks / 2, numThreads >>>(d_a, d_a + (NX*NY*NZ / 2)); // half-length of the array
cuFFTShiftY <<<numBlocks, numThreads >>>(d_a); // perform on full length array
cuFFTShiftX <<<numBlocks, numThreads >>>(d_a); // perform on full length array
```

It works. But there are 3 things which greatly concern me:

- Why with such a simple task like interchanges the elements of 2 vectors, my kernel take twice the amount of time than cublasZswap()? (I used visual profiler to monitor them) How could it, the cuBLAS kernel, be so fast?!
- The ShiftY and ShiftZ kernels waste to much resource. Half of the threads do nothing! Can anyone show me how or what can I do to improve this?
- With this approach, I have to read-write to the array located in global memory 3 times! This is so waste of bandwidth and highly redundancy.

I’ve tried to implement another method: instead of letting one thread update value for 2 elements, I make each thread update value for only one element. Here is my alternate kernel:

```
__global__ void cuFFTShiftX(cuDoubleComplex *a){
int tid = blockIdx.x*blockDim.x*blockDim.y + threadIdx.y*blockDim.x + threadIdx.x;
int tidB = threadIdx.y*blockDim.x + threadIdx.x;
__shared__ cuDoubleComplex sdata[960];
sdata[tidB] = a[tid];
__syncthreads();
if (threadIdx.x < (NX / 2)){
a[tid] = sdata[tidB + (NX / 2)];
}
else{
a[tid] = sdata[tidB - (NX / 2)];
}
}
```

And parameters for kernel execution

```
dim3 numT(30, 32); // NX = 30
cuFFTShiftX <<<128, numT >>>(d_a);
```

However, it runs even slower than the resource-wasted version! I doubt that it could be because of 3 things:

- The call on __syncthreads()
- Bank conflict
- And each warp now only have 30 threads running instead of 32 -> tried
```
dim3 numT(32,30);
```
and change the code above to adapt to it but the result is still the same

Thanks everyone so much for helping me out with this! :)