Questions about optimising a double-sum code.

Okay, I have recently succeeded in making my code partially parallel and achieved a 15x speed-up. Looking at it, I am sure it can be optimised further, but I would like input from others on how they might do it.

The general idea behind the code is a double sum over two different indices: N (the index into the vectors) and Nstep (how many times the constant t changes). I have turned one of the for loops into a CUDA kernel that does that part in parallel, and it works perfectly.

The following bits of the code should be enough for this purpose; all memory allocations and deallocations have been made and all values are declared properly.

I have only included A1 in this example; in my actual code there are also A2, A3, and so forth, but their structure is exactly the same as A1: one includes a sine, another a log function, etc.

[codebox]__global__ void VectorKernel(float *B, float *C, float *D, float *A1, int N, float t)

{

int idx = blockIdx.x*blockDim.x + threadIdx.x;

if (idx < N) 

{

A1[idx] = B[idx] * C[idx] * D[idx] * t ;

}

}

int main()

{

… //Memory allocation, declarations and filling the vectors with values.

// Kernel options

int BlockSize = 512;

int Nblocks = N/BlockSize + (N%BlockSize == 0?0:1);

for (int i = 0; i < Nstep + 1; i++){

	A1sum = 0.0;  // Reset the sum for the next iteration

	t = i * constant;

	VectorKernel<<< Nblocks, BlockSize >>>(B_d, C_d, D_d, A1_d,  N, t);

// Copy data from device to host

	cudaMemcpy(A1_h, A1_d, sizeof(float) * N, cudaMemcpyDeviceToHost);

// Calculate the sums

	for (int j = 0; j < N; j++)

	{

		A1sum = A1sum + A1_h[j];

	}

	cout << A1sum << endl;

}

… // Memory deallocation on both host and device etc.

}

[/codebox]

So my questions about optimisation are these:

Since it's a purely associative summation of vector entries, I assume it can be done in parallel. Can anyone verify that, and perhaps give a simple example?

Would it be possible to include the Nstep loop into the kernel so that the entire process could be done in parallel?

I don't entirely understand the kernel launch settings, so please correct me if I am wrong about this: BlockSize is how many threads can be within a single block at a time, yes? And Nblocks is how many blocks are running at once?

If I run the code with low block sizes it still performs correctly, but it is slower than with a block size of 512. Why is that? Is it simply a case of more blocks = more threads = faster computation?

Any suggestions or answers on how to further optimise code of this type would be much appreciated. I hope I have expressed my problem clearly enough; if not, do not hesitate to tell me!

Many thanks in advance