Can anyone give some advice to optimize my kernel?

I have an array with many points belonging to different objects. The data is arranged in the order [x,x,x,x,x…, y,y,y,y,y…, z,z,z,z,z…]. I also know which object each point belongs to. Now I want to calculate the center of gravity of each object, i.e. the average of the x, y, and z coordinates of all points belonging to that object.
The attached code is my own attempt, but it is relatively slow. Can anyone suggest how to make it more efficient?

#include <iostream>
#include <stdio.h>
#include <math.h>
#include <stddef.h>
#include <vector>
#include <Eigen/Dense>
#include <cuda.h>
#include <cuda_runtime.h>
#include <chrono>

static void HandleError(cudaError_t err, const char *file, int line) {
  if (err != cudaSuccess) {
    printf("%s in %s at line %d\n", cudaGetErrorString(err), file, line);
  }
}
#define HANDLE_ERROR(err) (HandleError(err, __FILE__, __LINE__))

// one thread for each object
__global__ void getMean(const float *matched_srcs, const int num_src_total, const int *index_srcs,
                        const int *step_srcs, const int num_objs, float *means) {
  size_t tid = blockDim.x * blockIdx.x + threadIdx.x;
  if (tid >= num_objs) return;

  int start, end;
  if (tid == 0) {
    start = 0;
    end = step_srcs[0];
  } else {
    start = step_srcs[tid - 1];
    end = step_srcs[tid];
  }

  float src_x = 0.0f;
  float src_y = 0.0f;
  float src_z = 0.0f;
  int match_num = end - start;
  for (int i = start; i < end; ++i) {
    src_x += matched_srcs[i];
    src_y += matched_srcs[i + num_src_total];
    src_z += matched_srcs[i + num_src_total * 2];
  }
  src_x = __fdividef(src_x, match_num);
  src_y = __fdividef(src_y, match_num);
  src_z = __fdividef(src_z, match_num);

  means[tid * 3 + 0] = src_x;
  means[tid * 3 + 1] = src_y;
  means[tid * 3 + 2] = src_z;
}

int main(int argc, char *argv[]) {

  // prepare cpu data
  int num_objs = 50;
  int num_src = 1000;
  int num_src_total = num_objs * num_src;

  std::vector<float> host_src_point(num_src_total * 3, 1);

  std::vector<int> host_step_srcs, host_index_srcs;
  for (int i = 0; i < num_objs; ++i) {
    host_step_srcs.push_back(num_src * (i + 1));
    for (int j = 0; j < num_src; ++j) {
      host_index_srcs.push_back(i);  // each point records the object it belongs to
    }
  }
  // prepare gpu data
  float *source_points = nullptr;
  cudaMalloc((void **)&source_points, sizeof(float) * num_src_total * 3);
  cudaMemcpy(source_points, host_src_point.data(), sizeof(float) * num_src_total * 3, cudaMemcpyHostToDevice);
  float *original_sources = nullptr;
  cudaMalloc((void **)&original_sources, sizeof(float) * num_src_total * 3);
  cudaMemcpy(original_sources, host_src_point.data(), sizeof(float) * num_src_total * 3, cudaMemcpyHostToDevice);
  int *index_srcs = nullptr;
  cudaMalloc((void **)&index_srcs, sizeof(int) * num_src_total);
  cudaMemcpy(index_srcs, host_index_srcs.data(), sizeof(int) * num_src_total, cudaMemcpyHostToDevice);
  int *step_srcs = nullptr;
  cudaMalloc((void **)&step_srcs, sizeof(int) * num_objs);
  cudaMemcpy(step_srcs, host_step_srcs.data(), sizeof(int) * num_objs, cudaMemcpyHostToDevice);

  float *matched_srcs = nullptr;
  cudaMalloc((void **)&matched_srcs, sizeof(float) * num_src_total * 3);

  float max_dist = 10.0;

  cudaStream_t stream;
  cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

  float elapsed = 0;
  cudaEvent_t start, stop;
  HANDLE_ERROR(cudaEventCreate(&start));
  HANDLE_ERROR(cudaEventCreate(&stop));
  HANDLE_ERROR(cudaEventRecord(start, stream));

  int *num_points = nullptr;
  cudaMalloc((void **)&num_points, sizeof(int) * num_objs);
  cudaMemset(num_points, 0, sizeof(int) * num_objs);

  float *means = nullptr;
  cudaMalloc((void **)&means, sizeof(float) * num_objs * 6);

  int thre_per_block = 64;
  int block = 1;

  getMean<<<block, thre_per_block, 0, stream>>>(source_points, num_src_total, index_srcs,
                                                step_srcs, num_objs, means);

  HANDLE_ERROR(cudaEventRecord(stop, stream));
  HANDLE_ERROR(cudaEventSynchronize(stop));
  HANDLE_ERROR(cudaEventElapsedTime(&elapsed, start, stop));
  std::cout << "the gpu time cost is: " << elapsed << std::endl;

  std::vector<float> ans_srcs(num_objs * 6, 0);

  cudaMemcpy(ans_srcs.data(), means, sizeof(float) * num_objs * 6, cudaMemcpyDeviceToHost);

  return 0;
}

The main thing I see is inefficient memory access patterns. This has to do with data organization in memory and how each thread accesses data. You may wish to read this blog and see if you can visualize the difference in data organization that will lead to efficient access.

What you want is data organized like this:

--> num objects
|     x00 x01 x02 ...
|     x10 x11 x12 ...
v    x20 x21 x22 ...
points per object

and repeat for y and z data.

the subscripting means: x12 corresponds to x coordinate of point 1 belonging to object 2.

Combined with a one-thread-per object strategy, this will give efficient memory access. Your code needs to be refactored to make the loads from data organized this way as well. Each thread will add the width of the row on each access, to select the next point, in the for-loop.

Thanks for your reply; I had considered this approach before. On the other hand, the number of points per object is very inconsistent: the largest object may have 8000 points while the smallest has only 4 or 5. If I use cudaMallocPitch or a two-dimensional array, then, as I understand it, the alignment padding will waste a lot of memory once there are fifty or sixty objects.

Those inconsistent sizes make the task more difficult to parallelize, both when each thread works on a separate object and when one object is processed cooperatively by a warp (4 or 5 points would be much too small to do the latter effectively).

When do you know the number of points for an object? Would it make sense to sort the objects (directly in memory, or by indirection by storing indices) or to put them into different buckets, e.g. one bucket for objects with more than 128 points and one for those with fewer? Or is this information only available very late?

I had missed that indication in your code. I personally don’t consider 60 objects times 8000 points per object to be a lot of memory usage, but of course if you are out of memory you are out of memory.

However the bigger issue from my perspective is that one thread per object with 50 or 60 objects will be a miserable parallelization strategy.

Taking these objections into account, we would seek to increase the number of deployed threads per object. The next logical size would be the warp. If we assign one warp per object, we can do a segmented reduction. And of course we could easily go up to one threadblock per object with a similar strategy.

A benefit of this is that we no longer have to rearrange data. You can use your same data organization scheme, approximately, and have each warp or threadblock start where each thread starts in your current code.

Given 50 or 60 objects I would go up to a threadblock, not a warp.

You can find other posts on this forum where I have described and demonstrated threadblock-level and warp-level segmented parallel reductions.
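For illustration, a minimal sketch of the threadblock-per-object variant (my own outline, not the posted code; it assumes a power-of-two block size of 256 and a launch of one block per object, and reuses the poster's [x…y…z…] layout and `step_srcs` prefix sums):

```cuda
// One threadblock per object: block b accumulates the x/y/z sums of
// object b's points, reduces them with a shared-memory tree reduction,
// and thread 0 divides by the point count.
// Launch: getMeanBlock<<<num_objs, 256, 0, stream>>>(...);
__global__ void getMeanBlock(const float *src, int num_src_total,
                             const int *step_srcs, float *means) {
  __shared__ float sx[256], sy[256], sz[256];
  int obj = blockIdx.x;
  int start = (obj == 0) ? 0 : step_srcs[obj - 1];
  int end = step_srcs[obj];

  // strided loop over the segment: adjacent threads read adjacent
  // points, so the loads coalesce without rearranging the data
  float x = 0.f, y = 0.f, z = 0.f;
  for (int i = start + threadIdx.x; i < end; i += blockDim.x) {
    x += src[i];
    y += src[i + num_src_total];
    z += src[i + 2 * num_src_total];
  }
  sx[threadIdx.x] = x; sy[threadIdx.x] = y; sz[threadIdx.x] = z;
  __syncthreads();

  // tree reduction in shared memory (blockDim.x must be a power of two)
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) {
      sx[threadIdx.x] += sx[threadIdx.x + s];
      sy[threadIdx.x] += sy[threadIdx.x + s];
      sz[threadIdx.x] += sz[threadIdx.x + s];
    }
    __syncthreads();
  }
  if (threadIdx.x == 0) {
    float n = (float)(end - start);
    means[obj * 3 + 0] = sx[0] / n;
    means[obj * 3 + 1] = sy[0] / n;
    means[obj * 3 + 2] = sz[0] / n;
  }
}
```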

Note that this still doesn't address the concern that different objects will have different workloads depending on their point counts. However, retiring threadblocks is a far better strategy in this case than retiring threads.