Mirroring an array


I am trying to mirror an array of length N into an array of length 4*N, writing only the odd indices (every even index is skipped), using a CUDA kernel, and I get weird results. It is probably something simple, but a fresh pair of eyes helps.

Any help would be greatly appreciated.

The code follows:

#include <cuda.h>
#include <stdio.h>
#define TPB 1024

// HOST version
void hostMirror(unsigned int N, float *in, float *out){
  int rinCounter = 0;
  // forward copy: out[1], out[3], ..., out[2*N-1] get in[0..N-1]
  for(int i = 1; i < (int)(2*N); i += 2){
    out[i] = in[rinCounter];
    rinCounter++;
  }
  rinCounter = 0;
  // mirrored copy: out[4*N-1], out[4*N-3], ..., out[2*N+1] get in[0..N-1]
  for(int i = (int)(4*N - 1); i >= (int)(2*N); i -= 2){
    out[i] = in[rinCounter];
    rinCounter++;
  }
}

// DEVICE version
__global__ void kernelMirror(unsigned int N, float *in, float *out){
  unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
  unsigned int j = 2*i + 1;
  if(j < 2*N){
    out[j] = in[i];
  }
  if(j >= 2*N && j < 4*N){
    out[j] = in[4*N-j];
  }
}

int main(){
  unsigned int N = 2; // example is small
  float *h_in, *h_out;
  float *d_in, *d_out, *chk;

  h_in = (float*)malloc(N*sizeof(float));
  h_out = (float*)malloc(4*N*sizeof(float));
  chk = (float*)malloc(4*N*sizeof(float));

  cudaMalloc((void**)&d_in, N*sizeof(float));
  cudaMalloc((void**)&d_out, 4*N*sizeof(float));
  memset(chk, 0, 4*N*sizeof(float));
  memset(h_out, 0, 4*N*sizeof(float));

  // load some dummy data
  for(unsigned int i = 0; i < N; ++i){
    h_in[i] = i + 1;
  }
  cudaMemcpy(d_in, h_in, N*sizeof(float), cudaMemcpyHostToDevice);

  // call HOST version
  hostMirror(N, h_in, h_out);

  unsigned int nb = (4*N + TPB -1) / TPB;  // define DEVICE dimension(s)

  // call DEVICE version
  kernelMirror<<<nb,TPB>>>(N, d_in, d_out);
  // copy results back from DEVICE
  cudaMemcpy(chk, d_out, 4*N*sizeof(float), cudaMemcpyDeviceToHost);

  // compare results
  for(unsigned int i = 0; i < 4*N; ++i){
    printf("Idx: %u -- HOST: %0.4f, DEVICE: %0.4f\n", i, h_out[i], chk[i]);
  }

  // free memory
  free(h_in); free(h_out); free(chk);
  cudaFree(d_in); cudaFree(d_out);

  return 0;
}

Any ideas out there?

It’s weird that the HOST function and DEVICE kernel results match for the first 2*N elements of result array, but the rest of the elements don’t.

When I run your code with cuda-memcheck I get errors.

My recommendation is that any time you are having trouble with a CUDA code, you use proper CUDA error checking and run your code with cuda-memcheck, and that you do so before asking others for help. Otherwise you are ignoring useful information about your code. Even if you don't understand the error report, describing it for others may be useful to those trying to help you.

If you need help getting started with debugging the type of problem that your code is generating, as detected by cuda-memcheck, you might want to review this:


It seems evident that your indexing in this line could not possibly be correct:

out[j] = in[4*N-j];

This is in the portion of your kernel code that handles the case where j >= 2*N and j < 4*N. N is 2, and i is 0,1,2,3,... across threads, so j is 1,3,5,7 across threads. Which of these are less than 4*N (8) but greater than or equal to 2*N (4)? That would be 5 and 7.

So for j values of 5 and/or 7, does 4*N-j generate valid indexing into an input array that has only 2 legal indexes (0 and 1)?

For N of 2 or 4, using indexing of (4*N-1-j)>>1 seems to work.

Thank you for the replies. Yes my indexing is wrong.

I will give the (4*N-1-j)>>1 indexing a try.

You can save 25% of your memory bandwidth by having a single thread read in[i] and then write both out[2*i+1] and out[4*N-1-2*i].

You might also find that thinking this way around simplifies the indexing.

Thanks Robert, the indexing works perfectly. A thousand thank yous :)

@tera I agree, saving memory bandwidth is important and will give your suggestion a shot and see what happens.