I’ve already posted this in the GTX 500 series topic, but this topic is a better fit for the question.
I have a new GTX 580 card and I obtained some strange and confusing performance data while programming it. The GTX 580 has 16 SMs with 32 cores each, which is 512 cores altogether.
According to my measurements, it uses only 128 cores in parallel during kernel execution, and I have no idea why.
The following very simple vector addition code adds two arrays into a third one (C = A + B):
__global__ void VecAdd(const float* A, const float* B, float* C, int N)
{
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    int temp1 = 100000; // each thread processes a contiguous chunk of temp1 elements
    for (int i = 0; i < temp1; i++)
    {
        C[index * temp1 + i] = A[index * temp1 + i] + B[index * temp1 + i];
        for (int j = 0; j < 500; j++) C[index * temp1 + i] += A[index * temp1 + i]; // repeat 500 times, just to consume time
    }
}
int main(int argc, char** argv)
{
    // h_A, h_B, h_C, d_A, d_B, d_C (float*) and Cleanup() are declared elsewhere (not shown)
    int N = 60000000;
    size_t size = N * sizeof(float);
    // Allocate input vectors h_A and h_B in host memory
    h_A = (float*)malloc(size); if (h_A == 0) Cleanup();
    h_B = (float*)malloc(size); if (h_B == 0) Cleanup();
    h_C = (float*)malloc(size); if (h_C == 0) Cleanup();
    // Initialize input vectors
    for (int i = 0; i < N; i++) h_A[i] = 1;
    for (int i = 0; i < N; i++) h_B[i] = 2;
    // Allocate vectors in device memory
    cutilSafeCall( cudaMalloc((void**)&d_A, size) );
    cutilSafeCall( cudaMalloc((void**)&d_B, size) );
    cutilSafeCall( cudaMalloc((void**)&d_C, size) );
    // Copy vectors from host memory to device memory
    cutilSafeCall( cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice) );
    cutilSafeCall( cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice) );
    // Invoke kernel
    int threadsPerBlock = 1;
    int blocksPerGrid = 128;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, blocksPerGrid);
    // Copy result from device memory to host memory
    cutilSafeCall( cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost) );
}
The code copies a lot of data (260 MB) to the device and, depending on the blocksPerGrid value, fills the first (blocksPerGrid * 100K) elements.
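As a sanity check on the indexing, here is a small host-side sketch of how many elements a launch touches and whether they stay inside the arrays. The function names `elements_touched` and `fits` are made up for illustration, and the sketch assumes every thread handles one contiguous chunk of temp1 floats, as in the kernel above.

```cpp
#include <cassert>

// Elements written by one launch: each thread handles a contiguous chunk of
// `chunk` floats (temp1 in the kernel). Hypothetical helper, host-side only.
long long elements_touched(int blocks, int threads_per_block, int chunk)
{
    assert(blocks >= 0 && threads_per_block >= 0 && chunk >= 0);
    return 1LL * blocks * threads_per_block * chunk;
}

// True if every index the launch touches lies inside an array of n floats.
bool fits(int blocks, int threads_per_block, int chunk, long long n)
{
    return elements_touched(blocks, threads_per_block, chunk) <= n;
}
```

With threadsPerBlock = 1 and N = 60000000, a call such as `fits(129, 1, 100000, 60000000)` holds, so going from 128 to 129 blocks should not run past the end of the arrays.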
When I launch the kernel with 1 to 128 blocks, the execution time stays the same, but with 129 blocks it doubles, which suggests that only 128 cores can work in parallel.
I already have some experience programming other CUDA devices such as the NVS 3100 (2 SMs * 8 cores = 16 cores), which shows the expected result: 1 to 16 blocks take the same time, and 17 to 32 blocks take twice as long.
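The reasoning behind these numbers can be written as a simple "wave" model. This is only a sketch of my assumption, not a description of how the hardware actually schedules blocks: if at most `slots` blocks run at the same time and every block takes equally long, the kernel time should grow in steps of one wave. The name `waves` and the parameter `slots` are hypothetical.

```cpp
#include <cassert>

// Number of sequential waves needed to run `blocks` thread blocks when at
// most `slots` blocks can execute concurrently (assumed model, all blocks
// taking the same time). Kernel time is then proportional to this value.
int waves(int blocks, int slots)
{
    assert(blocks >= 0 && slots > 0);
    return (blocks + slots - 1) / slots; // ceiling division
}
```

Under this model the measurements on the GTX 580 behave as if slots = 128 (`waves(128, 128)` is 1 and `waves(129, 128)` is 2), while the NVS 3100 behaves as if slots = 16.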
I also can’t see any effect from the caches when I change the size of the data (below or above 768K per core).
I have the latest drivers, and I obtained the same results on 64-bit Linux and on 32-bit Windows XP.
Does anyone have any idea where I went wrong, or whom I should ask?
Thank you in advance.