Slow Device memory access?

Here is a little test where I wanted to benchmark a bitwise AND between two large vectors.

I get really slowwww results from it, slower than if it were run on a CPU.

Am I doing something wrong? (obviously yes, but what?) (no I’m not compiling with -deviceemu)

Here is the code:

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_COUNT   100
#define THREAD_COUNT  500
#define ARRAY_SIZE    32000000 // size in bits; has to be a multiple of BLOCK_COUNT * THREAD_COUNT

/* --------------------------- target code ------------------------------*/

__global__ void bv_and (int* pTarget, int* pConst, int size)
{
    int i;
    // each thread walks its own contiguous chunk of 32-bit words
    int nStep  = size / (32 * gridDim.x * blockDim.x);
    int nStart = (threadIdx.x + blockIdx.x * blockDim.x) * nStep;
    int nEnd   = min(nStart + nStep, size / 32);

    for (i = nStart; i < nEnd; ++i) {
        pTarget[i] &= pConst[i];
    }
}

/* --------------------------- host code ------------------------------*/

extern "C"
void bv_and_test (void)
{
    cudaError_t cudaStat;
    int* cuTarget;
    int* cuIdx1;
    int* pTarget;
    int* pIdx1;

    printf("Allocating host mem\n");
    pTarget = (int*)malloc(ARRAY_SIZE / 8);  // ARRAY_SIZE bits = ARRAY_SIZE/8 bytes
    pIdx1   = (int*)malloc(ARRAY_SIZE / 8);

    printf("Init host mem\n");
    memset(pTarget, 0xEF, ARRAY_SIZE / 8);
    memset(pIdx1,   0xFE, ARRAY_SIZE / 8);

    printf("Allocating dev mem\n");
    cudaStat = cudaMalloc((void **)&cuTarget, ARRAY_SIZE / 8);
    cudaStat = cudaMalloc((void **)&cuIdx1,   ARRAY_SIZE / 8);

    printf("Init dev mem\n");
    cudaStat = cudaMemcpy(cuTarget, pTarget, ARRAY_SIZE / 8, cudaMemcpyHostToDevice);
    cudaStat = cudaMemcpy(cuIdx1,   pIdx1,   ARRAY_SIZE / 8, cudaMemcpyHostToDevice);
    cudaThreadSynchronize();

    struct timeval start_time, end_time;
    gettimeofday(&start_time, 0);

    printf("Processing\n");

    int i;
    for (i = 0; i < 1; ++i) {
        bv_and<<<BLOCK_COUNT, THREAD_COUNT>>>(cuTarget, cuIdx1, ARRAY_SIZE);
        cudaThreadSynchronize();
    }

    gettimeofday(&end_time, 0);

    printf("Processing time: %f (ms)\n",
        (float)(1000.0 * (end_time.tv_sec - start_time.tv_sec) + 0.001 * (end_time.tv_usec - start_time.tv_usec))
    );

    // Some debug output (host copy still untouched at this point)
    printf("Target before processing:         %X %X %X %X\n", pTarget[0], pTarget[1], pTarget[2], pTarget[3]);

    cudaStat = cudaMemcpy(pTarget, cuTarget, ARRAY_SIZE / 8, cudaMemcpyDeviceToHost);

    // Some debug output
    printf("Target after processing (begin):  %X %X %X %X\n", pTarget[0], pTarget[1], pTarget[2], pTarget[3]);
    printf("Target after processing (end)  :  %X %X %X %X\n", pTarget[ARRAY_SIZE/32-4], pTarget[ARRAY_SIZE/32-3], pTarget[ARRAY_SIZE/32-2], pTarget[ARRAY_SIZE/32-1]);

    free(pTarget);
    free(pIdx1);
    cudaFree(cuTarget);
    cudaFree(cuIdx1);
}

Cheers, Pascal

  1. Don’t time the first run.
    The first run includes some loading time and can be expected to be slow.
  2. Coalescing is very important.
    Just creating one thread for each element would be fine for you.
  3. 500 is a very bad choice for block size. You should use a multiple of 32 or 64. Try 256.

I will reiterate asadafag’s number 2. Coalescing is VERY VERY important. Expect your performance to increase by a factor of 10-20 with coalesced accesses.

Read the section on coalesced memory accesses in the guide, then read it again until you understand it completely… it is somewhat tricky the first time.

The simplest way to get coalescing with a big 1D array like you have is to use one thread per element and have each thread use the index “idx = blockDim.x*blockIdx.x + threadIdx.x”. You should be able to convince yourself that this produces coalesced accesses as long as blockDim.x is a multiple of 32… which it should be anyway.

Indeed, when doing so, I’m able to perform a bitwise operation 80 times between two bit vectors containing 25,600,000 bits each in 14.44 ms (including Dev2Host), compared to the 160 ms it takes on my Core 2 Duo 2.13 GHz.

#define BLOCK_COUNT   3125
#define THREAD_COUNT  256
#define ARRAY_SIZE    25600000 // size in bits; has to be a multiple of BLOCK_COUNT * THREAD_COUNT

/* --------------------------- target code ------------------------------*/

__global__ void bv_and (int* pTarget, int* pConst, int size)
{
    // one thread per 32-bit word: consecutive threads touch consecutive
    // addresses, so the accesses coalesce
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    pTarget[i] &= pConst[i];
}

Thanks :-)