[beginner] indexing an array in a non continous way Accessing every 3rd element in the array

I am trying to work around with Maya API and CUDA, and my operation involves using individual points on a mesh and doing “stuff” to them. This stuff varies across what my operations are, but in essence I need each point individually, and then there may be function lookups based on that point, matrix multiplications and so on. I looked into cuBLAS and it seems that it is not at all suited for what I need to do since I will be working with millions of small 3x1 vectors. Now, Maya returns me the points as a const float* , so as you would guess the number of elements here are 3 x number of points. This is the fastest method that the API gives me. I am currently able to accomplish my task if I convert this float* to a custom structure array Vector*, whose size is (number of points). I end up looping through the whole float array and copying data, and this is a kind of a performance hit in my case, so I am looking to avoid it. Is there any way I can pass the float* to my kernel, and use the indexing in an intelligent way to access every 3rd element? I could access the element, element+1, element+2 in the same thread and this would still keep the total number of threads at (number of points).

ArrayFire has really smart indexing to optimize situations like this, of random or unique access patterns.

Thanks for the extremely quick reply melonakos. This would surely be beneficial, but if I could also understand the principle behind indexing and dividing arrays for kernels, that would help me code better, right from when I make the decisions on the size of kernels.

I am not entirely sure I understand what you are trying to accomplish. It seems to me you would like a gather operation that extracts every third element from a vector of floats, for processing on the GPU.

You can use a 2D copy when copying from host to device to grab only every third element of the source vector, and compress those into a contiguous vector on the device. You can apply the corresponding scatter when you copy the data back to the host. Your pitch would be the element size on the device, and three times the element size on the host.

cudaMemcpy2D (device_ptr, 1 * sizeof(float), host_ptr, 3 * sizeof(float), sizeof(float), N, cudaMemcpyHostToDevice); // gather
cudaMemcpy2D (host_ptr, 3 * sizeof(float), device_ptr, 1 * sizeof(float), sizeof(float), N, cudaMemcpyDeviceToHost); // scatter

That is very useful. My problem was a little different as in I wanted the whole array to be in the device memory, but I wanted to make a indexing(using block ids and thread ids) scheme that would allow me to access every 3rd element. I figured it out this way:

// Kernel that executes on the CUDA device

__global__ void square_array(float *a, long int N, float amplitudeD)


  int idx = blockIdx.x * blockDim.x + threadIdx.x;

  if (idx<N)


      //a[3*idx] = a[3*idx];

        a[3*idx+1] = amplitudeD * sin (2.0 * 3.1415926 * (a[3*idx] + 10.0)/10.0);

      //a[3*idx+2] = a[3*idx+2];



// main routine that executes on the host

int cudaTest(const float *vertArray, float *outArray, int N, float matrix[4][4], float amplitudeD)       // N is the number of points


float *vertArray_d;  // Pointer to host & device arrays

  size_t size = 3*N * sizeof(float);

  cudaMalloc((void **) &vertArray_d, size);   // Allocate array on device

  cudaMemcpy(vertArray_d, vertArray, size, cudaMemcpyHostToDevice);

// Do calculation on device:

  int block_size = 512;

  int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);   // threads per block

square_array <<< n_blocks, block_size >>> (vertArray_d, N, amplitudeD);

// Retrieve result from device and store it in host array

  cudaMemcpy(outArray, vertArray_d, size, cudaMemcpyDeviceToHost);

// Cleanup


return 0;


Now that the first hurdle is cleared, I would want to implement the matrix/vector algebra part for my kernels. Is there a better way to do that instead of this: