gtx580 question only 128 parallel block?


I’ve already posted it in the gtx500 series topic, but this topic is better for this question.

I have a new gtx580 card and I obtained some strange and confusing performance data during the programming. Gtx 580 has 32SP and 16proc per SM, which is alltogether 512 Core.
According to my measures it uses only 128 proc parallel during the kernel execution, I have no idea why.

The following very simple vector addition code adds two arrays into a third one (C=A+B)

global void VecAdd(const float* A, const float* B, float* C, int N)
int index = blockDim.x * blockIdx.x + threadIdx.x;
int temp1 = 100000;
for (int i=0;i<temp1;i++)
C[indextemp1+i] = A[indextemp1+i] + B[indextemp1+i];
for (int j=0;j<500;j++) C[index
temp1+i] += A[index*temp1+i]; //repeat 500 times, just to consume time

int main(int argc, char** argv)
int N = 60000000;
size_t size = N * sizeof(float);

// Allocate input vectors h_A and h_B in host memory
h_A = (float*)malloc(size);if (h_A == 0) Cleanup();
h_B = (float*)malloc(size);if (h_B == 0) Cleanup();
h_C = (float*)malloc(size);if (h_C == 0) Cleanup();

// Initialize input vectors
for (int i=0;i<N;i++) h_A[i]=1;
for (int i=0;i<N;i++) h_B[i]=2;

// Allocate vectors in device memory
cutilSafeCall( cudaMalloc((void**)&d_A, size) );
cutilSafeCall( cudaMalloc((void**)&d_B, size) );
cutilSafeCall( cudaMalloc((void**)&d_C, size) );

// Copy vectors from host memory to device memory
cutilSafeCall( cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice) );
cutilSafeCall( cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice) );

// Invoke kernel
int threadsPerBlock = 1;
int blocksPerGrid = 128;

VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, blocksPerGrid);

// Copy result from device memory to host memory
cutilSafeCall( cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost) );

The code copies a lot of data (260MB) into the device and depending on the blocksPerGrid value fills the first (blocksPerGrid100K) element.
When I execute the kernel with 1-128 blocksPerGrid it consumes the same time, but using 129 blocks the execution time doubles, which means only 128 proc can work parallel.

I already have some experience with programming other cuda devices such as NVS3100 (2SP*8Proc= 16 Cores), which shows expectable result 1-16 Block the same time, 17-32 block double time.

I also can’t see the usage of Caches, when I change the size of data (below or above 768K per core).

I have the latest drivers, and obtained the same results on 64bit linux and 32bitXP as well.

Does anyone has any idea, where I was wrong, or who I should ask?

Thank you in advance.

Blocks are scheduled and run at the SM level, not individual cores, and each SM can have at most 8 active blocks (this is in the appendices of the programming guide). On your Fermi card with 16 SM, that would mean that at most 128 blocks are active at any given time. Therefore, observing constant execution time for up 128 block grids, and a doubling of that time for 129 blocks when the block execution time is constant would seem to make perfect sense.

It is also consistent with your other card: 2 SM * 8 blocks per SM = maximum 16 active blocks.

Thank you. But according to the specification gtx580 has 32SM, which is 32*8=256

Is it true, that I don’t have to care about the L1,L2 cache during the programming, it will be used automatically? Beause I also can’t see non linear speeds up, decreasing the amount of data was used.

No it does not. It has 16 SM, each SM has 32 cores, which is 512 cores in total. But the core count is irrelevent in this case. Blocks run on SM, your card has 16, each can maintain a maximum of 8 blocks active at a time, giving you the step in runtime about grids of 128 blocks that you are observing.

You should have 48kb of active L1 cache per MP by default, and L2 cache size and use is invariant. Your code is accessing memory in a very sub-optimal way for the GPU, so cache is probably going to have very little influence on effective memory throughput of that code.

Thank you very much for your answer again. I try to consider it in my original code, which is much more complicated than this simple vector addition. According to your answer everything working fine with my card, unfortunatelly the execution time is under my expectation.

As I pointed out in the previous reply, you code is accessing memory in a very sub-optimal pattern for bandwidth utilization. You might want to look at the discussion in the programming guide on the concept of memory coalescing.

Ok, I going to read through the “Coalesced Access to Global Memory” part of programming guide.