How to add pointer array values

Hello everyone, it’s siri. I have a question and need your help, thanks.
Suppose I have a pointer array, T **in, which holds many GPU pointers, and each pointer points to a vector. I want to add all of the pointed-to vectors into one vector. I wrote the following kernel code, but it doesn’t seem right. So my question is: can I add these pointer vectors in parallel?

template <typename T>
__global__ void sum_gpu_array(T **in, T *out, int64_t N, size_t in_size, bool read_dst) {
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  while (col < N) {
    T total(0);
    for (int i = row; i < in_size; i += blockDim.y * gridDim.y) {
      const T *tmp = in[i];
      if (tmp != nullptr)
        total += tmp[col];
    }
    if (read_dst)
      out[col] += total;
    else
      out[col] = total;
    col += blockDim.x * gridDim.x;
  }
}

No, this won’t work correctly as written.
This is effectively summing the columns of a matrix.

Your realization possibly has multiple threads working on a single column. That won’t work when you get to the output stage:

if (read_dst)
  out[col] += total;
else
  out[col] = total;

because you have multiple threads (in y) all working on the same column, so you have multiple threads attempting to update out[col], and therefore a race condition.

If the width of your matrix (length of vectors) is large enough, I would just do this with a 1D array of threads, each thread working down a column. That is very simple to write, fast, and efficient. Basically optimal.
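Here is a rough sketch of that (untested, keeping the parameter meanings from your kernel; launch it with a 1D grid covering N, e.g. <<<(N+255)/256, 256>>>):

template <typename T>
__global__ void sum_columns_1d(T **in, T *out, int64_t N, size_t in_size, bool read_dst) {
  int64_t col = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
  if (col < N) {
    T total(0);
    // each thread walks down its own column
    for (size_t i = 0; i < in_size; i++) {
      const T *row = in[i];
      if (row != nullptr)
        total += row[col];
    }
    // no race: this thread is the only one writing out[col]
    out[col] = read_dst ? out[col] + total : total;
  }
}

Each thread keeps its running sum in a register and writes out[col] exactly once, so there is nothing to synchronize.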

If the width of the matrix is too small, then you will need to arrange for a parallel reduction to take place in each column. A simplistic approach would be to just use atomics when writing the output. That should at least give you the right answer. For optimal performance, you would need a classical parallel reduction in each column; google “mark harris parallel reduction” to get started on learning how, although that will not give you the exact implementation for this case.
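For example, a rough sketch of the atomic variant (untested, shown for float so that atomicAdd is available on any current GPU; out must be zeroed first, or pre-loaded with whatever you want included in the sum):

__global__ void sum_columns_atomic(float **in, float *out, int64_t N, size_t in_size) {
  int64_t col = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
  size_t row = blockIdx.y * blockDim.y + threadIdx.y;
  for (; col < N; col += (int64_t)blockDim.x * gridDim.x) {
    float partial = 0;
    // each thread in y accumulates a partial sum over its share of the rows
    for (size_t i = row; i < in_size; i += blockDim.y * gridDim.y) {
      const float *r = in[i];
      if (r != nullptr)
        partial += r[col];
    }
    // the atomic serializes the conflicting updates to out[col]
    atomicAdd(&out[col], partial);
  }
}

This gives the right answer, but the atomics serialize the per-column updates, which is why the classical reduction is the better-performing option.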

Furthermore, trying to handle the case where the output vector may or may not be included in the sum will complicate the handling of the parallel reduction. I would simplify that by treating the output as another row of the input matrix.
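A host-side sketch of that simplification (the d_in0..d_in3 and d_out names here are just placeholders for your device allocations, each of length N):

// append the output vector as one more "row" of the input pointer array
float *h_ptrs[5] = {d_in0, d_in1, d_in2, d_in3, d_out};
float **d_ptrs;
cudaMalloc(&d_ptrs, 5 * sizeof(float *));
cudaMemcpy(d_ptrs, h_ptrs, 5 * sizeof(float *), cudaMemcpyHostToDevice);
// then launch with in_size = 5 and always write out[col] = total

That way the kernel has no read_dst special case to worry about during the reduction.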

Thanks Robert, I really appreciate your detailed explanation.
I still have some questions:

This method is easy to understand. The width of my matrix is 100,000 and the height is 10,000, so when a thread works down a column it needs to do 10,000 adds. Is that efficient?

I am also trying to do a reduction, but it’s not finished. Based on my understanding, I need a total[ROW] register array to save each row’s result and then reduce over it, but I am not sure.

If the width of the matrix is 100,000, then you can spin up 100,000 threads in your kernel launch, each thread will iterate down a single column. 100,000 threads should be efficient on most GPUs. This is probably the best option in that case.

For the second approach, the reduction needed would be per column, not per row. That would be a more complicated code.

but each thread needs to do 100,00 elements per column, maybe that’s not fast enough?

Sure, thanks. I did try a reduction per column; it seems not an easy job.

If I am reading your previous post correctly, each thread would do 10,000 summations per column. But no matter.

It will still be efficient. To understand why, you would need to have a good handle on how GPUs work.

Thanks Robert.
I understand how GPUs work, so if I increase the column size to 100,000, will it perform the same as with 10,000? I think the former has more instructions, so it would be slower than the latter. If that’s not right, please correct me.

No, if you increase column size to 100,000, that is more work to do, so the computation will take longer. Of course it will be slower. That is not the same as asking whether it will be efficient or not. The one-thread-per-column method will still be equally efficient, whether the column size (number of rows) is 10,000 or 100,000, but the 100,000 computation will take longer, probably about 10x longer.

Ooh, now it’s clear… but I want to decrease the per-column processing time. Can I achieve that by using 2D threads and doing a reduction along the column? Based on my understanding, more threads doing the summations on a column in parallel can reduce the processing time…

We seem to be going in circles. More threads don’t always reduce processing time. More threads have a tendency to reduce processing time if the current number of threads is not enough to saturate the machine. Once you have enough threads to saturate the machine, adding more threads doesn’t necessarily/typically reduce processing time.

Because the column summing method is relatively easy to implement if you don’t need a parallel reduction per column, I suggested that if you have enough columns, with one thread per column, to saturate the machine, you should probably just do it that way. 100,000 columns, i.e. 100,000 threads, is enough to saturate almost any GPU except Tesla V100. You need about 170,000 threads to saturate a Tesla V100. So unless you are wanting peak performance on a Tesla V100, you might simply want to stop at 100,000 threads with a relatively simple coding approach as already discussed.
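If you want to check the saturation point for whatever GPU you are actually running on, here is a small sketch using the runtime API (the ~170,000 figure above is roughly 80 SMs times 2048 resident threads per SM on V100):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  // maximum number of threads that can be resident on the GPU at once
  int max_resident = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
  printf("SMs: %d, max threads per SM: %d, ~%d threads to saturate\n",
         prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor, max_resident);
  return 0;
}

Anything at or above that thread count is enough to fill the machine; beyond that point, extra threads don’t help much.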

If you want to get the most possible performance on a large GPU like Tesla V100, it might be necessary to perform a parallel reduction per column. It is definitely more complicated code. I already pointed out a problem with your first attempt/realization.

      if (tmp != nullptr)
        total += tmp[col];
    }
    if (read_dst)
      out[col] += total;

In this code, read_dst is false, which does not work properly when run.

Sure, I just want to confirm that the reduction can be faster on V100, so I have the confidence to go on…

It’s been a long time… I used a reduction to do the summation in y, but it’s not right, so I need your help to point out my error… thanks!

#include <cuda.h>
#include <cuda_runtime.h>

template<class T, int tile>
__global__ void sum_gpu_array(T **in, T *out, int64_t N, size_t in_size) {
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  T total[4];

  while (col < N) {
    for (int i = row; i < in_size; i += blockDim.y * gridDim.y) {
      const T *tmp = in[i];
      if (tmp)
        total[i] = tmp[col];
    }
    col += blockDim.x * gridDim.x;
  }

  while (col < N) {
    // reduction in y. final result in total[0]
    for (int s = 2; s > 0; s >>= 1) {
      if (row < s) {
        total[row] += total[s + row];
      }
    }
    if (row == 0)
      out[col] = total[0];
    col += blockDim.x * gridDim.x;
  }
}

int main()
{
  size_t fr_size = 200;
  float *hin = new float[fr_size];
  float *d_in0, *d_in1, *d_in2, *d_in3, *d_out;
  cudaMalloc((void**)&d_in0, fr_size*sizeof(float));
  cudaMalloc((void**)&d_in1, fr_size*sizeof(float));
  cudaMalloc((void**)&d_in2, fr_size*sizeof(float));
  cudaMalloc((void**)&d_in3, fr_size*sizeof(float));
  cudaMalloc((void**)&d_out, fr_size*sizeof(float));

  auto fillh = [&] (){
    for(int i=0; i < fr_size; ++i) {
      hin[i] = 1.;
    }
  };
  auto filld = [&](void *mem, void *src, int size) {
    cudaMemcpy(mem, src, sizeof(float)*size, cudaMemcpyHostToDevice);
  };
  fillh();
  filld(d_in0, hin, fr_size);
  filld(d_in1, hin, fr_size);
  filld(d_in2, hin, fr_size);
  filld(d_in3, hin, fr_size);

  float *harray[4] = {d_in0, d_in1, d_in2, d_in3};
  float **darray;

  cudaMalloc((void**)&darray, 4*sizeof(float*));
  cudaMemcpy(darray, harray, sizeof(float*)*4, cudaMemcpyHostToDevice);

#define TILE_X 128
#define TILE_Y 2
#define CEIL(x, y) ((x) + (y) -1)/(y)

  dim3 blocks = dim3(TILE_X, TILE_Y);
  dim3 grids = dim3(CEIL(fr_size, TILE_X), CEIL(4, TILE_Y));
  sum_gpu_array<float, TILE_Y><<<grids, blocks>>>(darray, d_out, fr_size, 4);
}

As already mentioned, for the case where the matrix width is large enough, it should be sufficient performance-wise to simply have each thread handle a matrix column, with a running sum per thread.

If the matrix width is too narrow, this approach may not fully saturate the GPU. In that case, we could possibly seek additional parallelism (more threads) by having multiple threads assigned per column. This would necessitate some form of classical parallel reduction per column.

In the general case (an arbitrary number of threads per column) this becomes somewhat complicated to code up. However we can take a step in that direction by launching a 1D grid of 2D blocks, and have each block assign multiple threads per column. For a number of reasons we want to choose a 32x32 threadblock configuration for this. With this approach, we can use a grid (width) striding loop in the width direction, and a block (height) striding loop in the height direction of the matrix.

The following code gives an example of that. Each threadblock handles 32 columns, and assigns 32 threads (in y) per column. The block-striding loop strides down the column. At the completion of the striding, we perform a warp-shuffle based classical parallel reduction on the partial sums in the column. When the threadblock has finished striding down the assigned columns and computed the final column sums, it writes out column sum results and strides (if needed) in the width direction, to handle the next group of columns.

$ cat t461.cu
#include <iostream>

//these values cannot be changed
const unsigned block_x = 32;
const unsigned block_y = 32;
const dim3 nTPB(block_x,block_y);
const unsigned vertical_stride = block_y;
// above values cannot be changed

template <typename T>
__global__ void column_sums_reduce(const T * __restrict__ in, T * __restrict__ out, size_t width, size_t height){

  __shared__ T sdata[block_y][block_x+1];
  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  size_t width_stride = gridDim.x*blockDim.x;
  size_t full_width = (width&(~((unsigned long long)(block_x-1)))) +((width&(block_x-1))?block_x:0); // round up to next block
  for (size_t w = idx; w < full_width; w+=width_stride){          // grid-stride loop across matrix width
    sdata[threadIdx.y][threadIdx.x] = 0;
    size_t in_ptr = w+threadIdx.y*width;
    for (size_t h = threadIdx.y; h < height; h+=vertical_stride){ // block-stride loop across matrix height
      sdata[threadIdx.y][threadIdx.x] += (w < width)?in[in_ptr]:0;
      in_ptr += width*vertical_stride;}
    __syncthreads();
    T my_val = sdata[threadIdx.x][threadIdx.y];
    for (int i = warpSize>>1; i > 0; i>>=1)                       // warp-wise parallel sum reduction
      my_val += __shfl_xor_sync(0xFFFFFFFFU, my_val, i);
    __syncthreads();
    if (threadIdx.x == 0) sdata[0][threadIdx.y] = my_val;
    __syncthreads();
    if ((threadIdx.y == 0) && ((w) < width)) out[w] = sdata[0][threadIdx.x];
  }
}

// test matrix configuration
typedef float mt;
const size_t dsw = 974;
const size_t dsh = 388; // for test validity below, this should be an even number, but not required for general usage

int main(){

  mt *h_d, *d_d, *h_r, *d_r;
  h_d = new mt[dsw*dsh];
  h_r = new mt[dsw];
  cudaMalloc(&d_d, dsw*dsh*sizeof(mt));
  cudaMalloc(&d_r, dsw*    sizeof(mt));
  for (size_t i = 0; i < dsw*dsh; i++) h_d[i] = i/dsw;
  for (size_t i = 0; i < dsw; i++) h_d[i] = i;
  cudaMemcpy(d_d, h_d, dsw*dsh*sizeof(mt), cudaMemcpyHostToDevice);
  // any number can be used for the number of blocks in the kernel launch config, for correctness
  // for performance, the number of blocks should be sized according to matrix width/32 (dsw/32), except
  // when that number of blocks exceeds twice the number of SMs on the GPU, the upper limit for performance
  column_sums_reduce<<<(dsw+block_x-1)/block_x,nTPB>>>(d_d, d_r, dsw, dsh);
  cudaMemcpy(h_r, d_r, dsw*sizeof(mt), cudaMemcpyDeviceToHost);
  mt check_val = (dsh-1)*(dsh/2); // we assume dsh is even for easy calculation of check_val
  for (size_t i = 0; i < dsw; i++) if (h_r[i] != (check_val+i)) {std::cout << "mismatch at: " << i << " was: " << h_r[i] << " should be: " << check_val << std::endl; return 0;}
  std::cout << "Success" << std::endl;
}
$ nvcc -o t461 t461.cu
$ cuda-memcheck ./t461
========= CUDA-MEMCHECK
Success
========= ERROR SUMMARY: 0 errors
$

This approach allows us to bring more (32*number of columns) threads to bear on the problem (compared to the trivial one-thread-per-column method), substantially increasing our ability to saturate the GPU for narrower matrix widths.

(Edit: I made a minor change to the tail end of the kernel. It makes no noticeable difference from a performance perspective but makes me feel better because the final write of results data is more efficient.)