Low Bandwidth with simple data copy

jester09 · December 6, 2011, 2:55pm

Hallo everyone!

I have a strange problem and I am hoping, that someone might help me.

I played around with some CUDA code to test the effective bandwith with my “Quaddro 4000” card (peak bandwidth about 90GB/s).

I wrote a kernel, that simply copys data from global-mem to global-mem and was hoping to get the maximal badwidth with I read from the Compute Visual Profiler!

So, when I do so… copying data with a datatype “double” I get the maximal bandwidth of about 77 GB/s… OK so far I think!

BUT when I copy the same kernel and now FLOAT data my bandwith goes down and I get only about 60 GB/s!

Why that?! I played around with the threads per block and also copied more or less data, but no better value! What am I doing wrong?

Thanks for any help!!

Jester

Here is the code I am using:

#define CU_ERR_HANDLE( err ) (HandleError( err, __FILE__, __LINE__ ))

static void HandleError(cudaError_t err,const char *file,int line) 

{

    if (err != cudaSuccess) 

    {

        printf("%s in %s at line %d\n", cudaGetErrorString( err ),file, line);

        //exit(EXIT_FAILURE);

    }

}

void init_rand(float* data, int n)

{

    for (int i = 0; i < n; ++i)

        data[i] = rand() / (float)RAND_MAX ;

}

__global__ void strided_copy(float* A,  float* C, int M)

{

    int i = (blockDim.x * blockIdx.x + threadIdx.x);

    if (i < M) 

    {

        C[i] = A[i];

    }

}

int main()

{

    int N = 10000000;

    int size = N * sizeof(float);

int threadsPerBlock = 512;

    int numOfBlocks     = (N+threadsPerBlock-1)/threadsPerBlock;

float* h_A;

    float* h_C;

    float* d_A;

    float* d_C;

h_A = (float*)malloc(size);

    h_C = (float*)malloc(size);

init_rand(h_A, N);

// Data to Device

    CU_ERR_HANDLE(cudaMalloc((void**)&d_A, size));

    CU_ERR_HANDLE(cudaMalloc((void**)&d_C, size));

    CU_ERR_HANDLE(cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice));

// KERNEL

    strided_copy<<<numOfBlocks,threadsPerBlock> > >(d_A,d_C,N);

// Data back to Host

    CU_ERR_HANDLE(cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost));

cudaFree(d_A);

    cudaFree(d_C);

    free(h_A);

    free(h_C);

CU_ERR_HANDLE(cudaThreadExit());

return 0;

}

#endif

tera · December 6, 2011, 3:39pm

Because with [font=“Courier New”]double[/font] your memory transactions are twice as wide. Try copying [font=“Courier New”]float2[/font] to achieve the same bandwidth, or [font=“Courier New”]double2[/font] or [font=“Courier New”]float4[/font] for further improvement.

jester09 · December 6, 2011, 3:52pm

Thanks very much for your reply! I will try it…

But let me ask one more question:

Why is the memory width relevant for the bandwidth?

I still copy the same amount of data as [font=“Courier New”]double[/font] or as [font=“Courier New”]floa[/font]t.

In my viewing it should result in the same bandwidth if I copy for example 1 million [font=“Courier New”]double[/font] or 2 million [font=“Courier New”]float[/font]?!

Greetings,

Jester

jester09 · December 7, 2011, 9:05am

To get this straight:

A float access is a 32-bit acces.

A float2 access is a 64-bit access.

If I issue a normal float (32-bit) accesses with a CUDA Device of compute capability 2.0 I will NEVER be able to achieve the full bandwidth?! External Image

jester09 · December 7, 2011, 3:55pm

This is really bad and I assume this should be relevant for everyone who uses CUDA for single precision operations, so this should at least be mentioned by nvidia in the Programming Guide!?

Topic		Replies	Views
How to achieve peak memory bandwidth？ CUDA Programming and Performance	5	2720	July 17, 2020
How much faster you think it can goes? CUDA Programming and Performance	1	491	November 18, 2010
Quadro 4000 Bandwidth The device to device bandwidth obtained with CUDA Programming and Performance	8	3516	March 7, 2011
Device Memory Bandwidth CUDA Programming and Performance	8	1846	March 29, 2015
global memory bandwidth problem CUDA Programming and Performance	4	1407	March 2, 2010
Effective Bandwidth Problem CUDA Programming and Performance	13	7709	March 23, 2011
How to get peak rate with simple opeartion Question about performance optimization CUDA Programming and Performance	17	13629	June 2, 2008
Cannot achieve max shared memory bandwith CUDA Programming and Performance	12	763	November 20, 2023
upper limit for memory bandwidth on the device ? CUDA Programming and Performance	13	11245	July 8, 2009
Bandwith Problem CUDA Programming and Performance	7	2633	March 16, 2009

Low Bandwidth with simple data copy

Related topics