Executing a kernel only for specific thread indices

Hi,
I have a large 1D array a. I also have another 1D device array b which contains a list of specific thread indices into array a. I want to run the kernel only for the thread indices listed in array b. How can I achieve this without sacrificing performance?
Note: the size of array b is not small.

Thanks,

One possible approach: just launch the kernel and use the indices from array b in place of the ones you would normally use.

Suppose array b has length len_b, and array a has length len_a. Suppose the name of your kernel is my_kernel.

At the beginning of my_kernel you probably do something like this:

size_t idx = threadIdx.x+blockDim.x*blockIdx.x;  // calculate index
if (idx < len_a){
    float my_val = a[idx];  // retrieve value from a
    // do some work on the value from a
}

Create a new kernel called my_kernel_b, at the beginning of it do:

size_t idx = threadIdx.x+blockDim.x*blockIdx.x;  // calculate index
if (idx < len_b){
    size_t b_idx = b[idx];
    if (b_idx < len_a){
        float my_val = a[b_idx];  // retrieve value from a
        // do some work on the value from a
    }
}

launch like:

my_kernel_b<<<(len_b+255)/256, 256>>>(...);

Is this every bit as efficient as launching my_kernel across the entire array a? No, it is not. If len_b is much smaller than len_a, and the values in b are in sorted order, then the my_kernel_b launch should probably run faster than the equivalent my_kernel launch. However, my_kernel_b will still potentially be accessing data that is “scattered”. One way to address this would be to reorganize your data: create a new array like a in which the elements selected by the indices in b are grouped together contiguously. I wouldn’t normally suggest that, but you may find that such a reorg/sort of the data improves performance enough to offset its own cost. Only testing can tell you whether that would be worth it; in many cases I would not expect it to be.
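If you do decide to try that reorg, note that it is just a gather operation, and Thrust can do it in a single call. Here is a minimal sketch, assuming device pointers d_a and d_b that correspond to arrays a and b; the helper name pack_selected and the d_packed output buffer are illustrative names of my own, not anything from your code:

#include <thrust/device_ptr.h>
#include <thrust/gather.h>
#include <thrust/sort.h>

// gather a[b[i]] into a contiguous buffer so a dense kernel can index it directly
// d_packed must be a device allocation of len_b floats
void pack_selected(const float *d_a, size_t *d_b, float *d_packed, size_t len_b)
{
  thrust::device_ptr<const float> a(d_a);
  thrust::device_ptr<size_t>      b(d_b);
  thrust::device_ptr<float>       packed(d_packed);
  thrust::sort(b, b + len_b);               // optional: improves gather locality, but reorders b
  thrust::gather(b, b + len_b, a, packed);  // packed[i] = a[b[i]]
}

After that, a plain dense kernel launched over len_b threads can read d_packed[idx] directly with coalesced accesses. Whether that offsets the cost of the sort/gather is, again, something only testing can tell you.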

a=[100,101,102,200,201,202,300,301,302]
len_a=9
b=[0,3,6]
len_b=3
I am expecting to retrieve [100,200,300]
How is

size_t b_idx = b[idx];

going to work?

The concept is indirection. It’s a basic programming concept, not unique or specific to CUDA. Perhaps you should give it a try? I’m not sure how to explain it if you are not able to follow the code I provided. We use an index (array offset) provided in b to select the item to retrieve from a, here:

    float my_val = a[b_idx];  // retrieve value from a
                     ^^^^^
           note use of b_idx here instead of idx

Here is a completely worked example, using the exact code I already provided:

# cat t89.cu
#include <iostream>

__global__ void my_kernel_b(size_t *b, float *a, size_t len_a, size_t len_b, float *r){

  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;  // calculate index
  if (idx < len_b){
    size_t b_idx = b[idx];
    if (b_idx < len_a){
        float my_val = a[b_idx];  // retrieve value from a
        // do some work on the value from a
        r[idx] = my_val;
    }
  }
}


int main(){

  float a[] ={100,101,102,200,201,202,300,301,302};
  size_t b[] = {0, 3, 6};
  size_t len_a = 9;
  size_t len_b = 3;
  float *d_a, *d_r;
  size_t *d_b;
  float *r = new float[len_b];
  cudaMalloc(&d_a, len_a*sizeof(a[0]));
  cudaMalloc(&d_b, len_b*sizeof(b[0]));
  cudaMalloc(&d_r, len_b*sizeof(r[0]));
  cudaMemcpy(d_a, a, len_a*sizeof(a[0]), cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, b, len_b*sizeof(b[0]), cudaMemcpyHostToDevice);
  my_kernel_b<<<(len_b + 255)/256, 256>>>(d_b, d_a, len_a, len_b, d_r);
  cudaMemcpy(r, d_r, len_b*sizeof(r[0]), cudaMemcpyDeviceToHost);
  for (int i = 0; i < len_b; i++) std::cout << r[i] << ",";
  std::cout << std::endl;
}
# nvcc -o t89 t89.cu
# compute-sanitizer ./t89
========= COMPUTE-SANITIZER
100,200,300,
========= ERROR SUMMARY: 0 errors
#

Now I understand it. I was taking the hard path. I was trying to launch the kernel over the larger array d_a, like this:

my_kernel_b<<<>>>(d_a, d_b, len_a, len_b, d_r)

Thanks for the working example.