CUDA and colwise matrix operations

Hello,

I’m trying to optimize my program with CUDA. The program uses a lot of Eigen structures, and only parts of it should run on the GPU. Some calculations on a Matrix3Xf are column-wise (colwise), and I’m having trouble implementing them on the GPU together with other array operations.

__global__ void test(float* x, float* y, float param, int colSize) {  // x holds the data of a Matrix3Xf
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    if ((id % 3 == 0) && (id < colSize * 3)) // stay within the limits of x
    {
        // Eigen equivalent: x = x.colwise().normalized();
        float norm = sqrtf(x[id] * x[id] + x[id + 1] * x[id + 1] + x[id + 2] * x[id + 2]);

        x[id]     = x[id] / norm;
        x[id + 1] = x[id + 1] / norm;
        x[id + 2] = x[id + 2] / norm;

        float var = y[id] * param;

        // do something with var and x
        ...
    }
}

If I only allow threads with (id % 3 == 0) to run, I don’t have access to all elements of y. When I try to use float** x instead, I have trouble transferring the Matrix3Xf data from host memory to device memory. Any ideas?
My environment is Visual Studio 2017 and CUDA 9.2.
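For what it’s worth, a float** shouldn’t be needed here: Eigen stores a Matrix3Xf column-major and contiguously, so mat.data() already points at a flat array laid out one column after another, which can be moved with a single cudaMemcpy. A minimal host-side sketch (the names mat, d_x, nCols and the kernel signature test(float*, int) are illustrative, not from the original code):

```cpp
// Copy an Eigen::Matrix3Xf to the device as a flat float array.
Eigen::Matrix3Xf mat = Eigen::Matrix3Xf::Random(3, nCols);

float* d_x = nullptr;
size_t bytes = sizeof(float) * 3 * mat.cols();
cudaMalloc(&d_x, bytes);

// Matrix3Xf is column-major and contiguous: mat.data() points at
// [x0, y0, z0, x1, y1, z1, ...], i.e. one 3-float column after another.
cudaMemcpy(d_x, mat.data(), bytes, cudaMemcpyHostToDevice);

int threads = 256;
int blocks = (nCols + threads - 1) / threads;  // one thread per column
test<<<blocks, threads>>>(d_x, nCols);         // hypothetical kernel signature

// Copy the result back into the same Eigen matrix.
cudaMemcpy(mat.data(), d_x, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_x);
```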

Thanks in advance.

I don’t understand that statement. (I’m not suggesting only allowing every 3rd thread to operate is a good idea.)

Yes, only allowing every 3rd thread to operate isn’t a good idea… I changed that part:

__global__ void test(float* x, int colSize) {  // x holds the data of a Matrix3Xf
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    if (id < colSize) // stay within the limits of x
    {
        int col = id * 3;
        float invNorm = 1.0f / norm3df(x[col], x[col + 1], x[col + 2]);

        x[col]     *= invNorm;
        x[col + 1] *= invNorm;
        x[col + 2] *= invNorm;
    }
}

Any ideas how to make it faster?
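One thing that might help: each thread currently issues three separate 4-byte global loads and three separate stores. Reinterpreting the buffer as float3 lets each thread move a whole column in one 12-byte transaction, and the CUDA math API’s rnorm3df computes the reciprocal norm directly. A sketch under those assumptions (the name normalizeCols is mine, and float3 loads are not perfectly coalesced since 12 bytes is not a power-of-two size, so profiling would be needed to confirm the gain):

```cpp
// Variant using float3 loads/stores and rnorm3df (CUDA math API).
// One thread per column; x must point at 3 * colSize floats.
__global__ void normalizeCols(float3* x, int colSize) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < colSize)
    {
        float3 v = x[id];                        // one 12-byte load per column
        float invNorm = rnorm3df(v.x, v.y, v.z); // 1 / sqrt(x^2 + y^2 + z^2)
        v.x *= invNorm;
        v.y *= invNorm;
        v.z *= invNorm;
        x[id] = v;                               // one 12-byte store per column
    }
}
```

On the host side the same flat float* from cudaMemcpy can be passed after a reinterpret_cast<float3*>, since the column layout of a Matrix3Xf matches float3 exactly.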