Problem about A800 80GB GPU memory bandwidth test

Rookie_programmer · March 16, 2024, 10:51am

I tried to measure the memory bandwidth of my A800 80GB GPU and observed something strange.

When I used FP32 data for the test, the read bandwidth reached only about half of peak. When I changed FP32 to double, it reached peak.

But when I tested write bandwidth, the data format had no influence on the memory bandwidth.

To avoid L1/L2 cache effects, I access each data element only once.

So is there some difference between FP32 and double when the GPU reads data from DRAM?

My test code is as follows:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void copyRow(float * MatA,float * MatB)
{

  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  float tmp =MatA[idx];

  // Prevents the compiler from optimizing away the load above
  if ( tmp == 123.0f)
  {
    MatA[idx] = tmp ;
  }

}

int main(int argc,char** argv)
{
  printf("starting...\n");
  int nxy=128*1024*1024;
  int nBytes=nxy*sizeof(float);

  //Malloc
  float* A_host=(float*)malloc(nBytes);
  float* B_host=(float*)malloc(nBytes);

  //cudaMalloc
  float *A_dev=NULL;
  float *B_dev=NULL;
  cudaMalloc((void**)&A_dev,nBytes);
  cudaMalloc((void**)&B_dev,nBytes);


  for(int test =0; test<10; test++)
    copyRow<<<nxy/1024,1024>>>(A_dev,B_dev);


  cudaMemcpy(B_host,B_dev,nBytes,cudaMemcpyDeviceToHost);

  cudaFree(A_dev);
  cudaFree(B_dev);
  free(A_host);
  free(B_host);
  cudaDeviceReset();
  return 0;
}

Robert_Crovella · March 18, 2024, 1:35pm

If there is a difference it is most likely due to the number of bytes per thread (4 vs. 8) being loaded, not the actual data type (float vs. double).
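One way to check the bytes-per-thread explanation is to keep the element type as float and vary only the load width. The following is a minimal sketch, not the original poster's benchmark: the names `readOnce`, `isSentinel`, and `measure`, the 256-thread block size, and the 512 MiB buffer are all assumptions chosen for illustration. It times the same read-only kernel with 4, 8, and 16 bytes loaded per thread using CUDA events.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Hypothetical helpers: combine every component so the full-width load
// cannot be narrowed by the compiler. The buffer is zero-filled, so the
// comparison is never true at run time.
__device__ bool isSentinel(float v)  { return v == 123.0f; }
__device__ bool isSentinel(float2 v) { return v.x + v.y == 123.0f; }
__device__ bool isSentinel(float4 v) { return v.x + v.y + v.z + v.w == 123.0f; }

// Each thread loads exactly one element of type T (4, 8, or 16 bytes).
template <typename T>
__global__ void readOnce(T* data, size_t n)
{
    size_t idx = threadIdx.x + (size_t)blockIdx.x * blockDim.x;
    if (idx >= n) return;
    T v = data[idx];
    // Same trick as in the question: a store that never executes
    // keeps the load from being optimized away.
    if (isSentinel(v)) data[idx] = v;
}

// Returns the measured read bandwidth in GB/s, or -1 on a CUDA error.
template <typename T>
float measure(void* buf, size_t bytes, int reps)
{
    size_t n = bytes / sizeof(T);
    unsigned threads = 256;                       // assumed block size
    unsigned blocks = (unsigned)((n + threads - 1) / threads);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    readOnce<T><<<blocks, threads>>>((T*)buf, n); // warm-up launch
    cudaEventRecord(start);
    for (int i = 0; i < reps; i++)
        readOnce<T><<<blocks, threads>>>((T*)buf, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    if (cudaGetLastError() != cudaSuccess) return -1.0f;

    return (float)((double)bytes * reps / (ms * 1e-3) / 1e9);
}

int main()
{
    size_t bytes = (size_t)512 * 1024 * 1024;     // same footprint as the question
    void* buf = NULL;
    if (cudaMalloc(&buf, bytes) != cudaSuccess) return 1;
    cudaMemset(buf, 0, bytes);                    // sentinel never matches zeros

    printf(" 4 bytes/thread (float):  %.1f GB/s\n", measure<float>(buf, bytes, 10));
    printf(" 8 bytes/thread (float2): %.1f GB/s\n", measure<float2>(buf, bytes, 10));
    printf("16 bytes/thread (float4): %.1f GB/s\n", measure<float4>(buf, bytes, 10));

    cudaFree(buf);
    return 0;
}
```

If the float2 result matches what double achieved in the original test, that would suggest the gap comes from the load width per thread rather than from the data type. The actual numbers depend on the GPU, driver, and launch configuration, so none are assumed here.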

Rookie_programmer · March 18, 2024, 1:58pm

Does that mean short, float, and double have different peak read memory bandwidths?
