How to accelerate N-Vector Scalar Product

Hi,

I would like to implement a scalar product for N vectors, each of size S. My problem is that these vectors are quite small (S = 7) while the number of vectors is quite large (N = 100,000). I'd like to reach the performance the standard CUBLAS library gets when it computes the scalar product over all 700,000 entries.

My current implementation is essentially the one from the CUDA SDK. I tried some changes to the block size, but "my" implementation didn't get much faster.
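
For reference, this is the computation I mean, written as a plain host-side sketch (the array names and the back-to-back storage layout are just how I happen to store things):

// Host reference: N independent scalar products of length S.
// A and B hold the vectors back to back, C receives one result per vector.
void dotProductsRef(const float *A, const float *B, float *C, int N, int S)
{
    for (int v = 0; v < N; ++v) {
        float sum = 0.0f;
        for (int i = 0; i < S; ++i)
            sum += A[v * S + i] * B[v * S + i];
        C[v] = sum;
    }
}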

Thanks for hints and cheers,
xlro

I tried to optimize performance by using some shared memory, because I have nearly random accesses on vector B (I'm implementing a random sparse matrix-vector multiplication). But the performance got even worse. The code example below shows how I declare and (correctly?) initialize the shared memory:

__global__ void e_w_v_r(
    float *d_C,
    float *d_A,
    float *d_B,
    int *offset,
    int vectorN,
    int elementN
){
    // accumulators cache
    __shared__ float accumResult[ACCUM_N];

    // dynamically sized shared buffer meant to cache vector B
    extern __shared__ float b[];
    d_B = b;

    // ... do some work on d_B and the other vectors ...
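
For completeness, this is roughly what I think a correct cooperative copy into the shared buffer would have to look like, but I'm not sure it is right. It's only a sketch, not my actual kernel; sliceLen, d_out and the launch configuration are made-up placeholders:

// Sketch only: stage a block-sized slice of d_B into dynamic shared memory
// before using it. sliceLen and d_out are placeholder names for illustration.
__global__ void stage_B_sketch(const float *d_B, float *d_out, int sliceLen)
{
    extern __shared__ float b[];                // sized at launch time

    int blockStart = blockIdx.x * sliceLen;     // slice of d_B this block works on

    // cooperative copy: each thread loads a strided part of the slice
    for (int i = threadIdx.x; i < sliceLen; i += blockDim.x)
        b[i] = d_B[blockStart + i];
    __syncthreads();                            // all of b[] is now valid

    // ... from here on, reads of vector B would use b[i] instead of d_B[blockStart + i] ...
    if (threadIdx.x < sliceLen)
        d_out[blockStart + threadIdx.x] = b[threadIdx.x];
}

// dynamic shared memory size is passed as the third launch parameter:
// stage_B_sketch<<<numBlocks, threadsPerBlock, sliceLen * sizeof(float)>>>(d_B, d_out, sliceLen);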

What exactly is your problem?
Sparse matrix x dense vector multiplication could be done really efficiently via alpha blending (without CUDA). Since there are no atomics on the 8800 and nVidia doesn't want to expose alpha blending, I doubt whether CUDA could do better.

I just took a look at some other examples and wrote the following code, but it doesn't do what I want it to do :( I'll give a little example:

I have 40k vector pairs and want to compute the scalar product of each pair. Each vector has 8 entries.

Consequently, I chose the grid size to be 40k and the number of threads per block to be 8.

All vector entries are stored in d_A and d_B, which each contain 320k elements.

The result vector d_C contains only 40k elements (to store the 40k scalar products).

I just can't see the mistake.

texture<float, 1, cudaReadModeElementType> tex;

__global__ void e_w_v_r(
    float *d_C,
    float *d_A,
    float *d_B,
    int *offset,
    int elementN
){
    __shared__ float sum;
    sum = 0;

    // compute scalar products
    // each thread computes one of them
    sum += d_A[threadIdx.x] * tex1Dfetch(tex, offset[threadIdx.x]);
    __syncthreads();

    // write result back to d_C
    d_C[blockIdx.x] = sum;
}

extern "C" void multVecVecRef(float *d_C, float *d_A, float *d_B, int *d_Offset, int vectorN, int elementN)
{
    e_w_v_r<<<vectorN, elementN>>>(d_C, d_A, d_B, d_Offset, elementN);
}

Oh, let's count the problems:
Major bugs:

  1. You only access d_A and offset elements 0-7, since you index them with threadIdx.x.
  2. You have all 8 threads doing += on sum at the same time; each one writes a different value, so the result is undefined.
  3. You then have all threads in the block write the same result out. Technically that should still give a correct value, but it is bad form to have multiple threads write to the same memory location.

Performance problems:

  1. Accesses to d_A and offset are not fully coalesced.

Suggestions: have each thread compute a single dot product in a loop. You won’t be able to coalesce memory reads this way either, but you can make use of the texture units since you’ll have good data locality. Maybe try a 2D texture and make each thread go across a row for the best cache performance.
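
Something along these lines is what I mean by one dot product per thread. It is only a sketch: it assumes your tex reference from above is already bound to d_B, and VEC_LEN = 8 and the launch configuration are just placeholders:

#define VEC_LEN 8

// Sketch: one thread computes one complete dot product in a loop.
// Assumes the "tex" reference from the earlier post is bound to d_B.
__global__ void dot_per_thread(float *d_C, const float *d_A,
                               const int *offset, int vectorN)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;   // one vector per thread
    if (v >= vectorN)
        return;

    float sum = 0.0f;
    for (int i = 0; i < VEC_LEN; ++i) {
        int idx = v * VEC_LEN + i;
        // d_A is read directly; the gathered B entries go through the texture cache
        sum += d_A[idx] * tex1Dfetch(tex, offset[idx]);
    }
    d_C[v] = sum;
}

// possible launch:
// dot_per_thread<<<(vectorN + 255) / 256, 256>>>(d_C, d_A, d_Offset, vectorN);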

That should be simpler to program. If the performance is lacking, you could try going back to the setup you have now with a block doing a scalar product in parallel, but you will need to do a few things to get it to work correctly. Look at the parallel sum reduction in the SDK scalar product example. It will be a bit messy, but you will also need to have each block process SEVERAL scalar products so you can make use of several warps in each block for the best performance.
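
One possible shape for that version, ignoring the texture/offset indirection for a moment and just showing the layout (the constants and names are placeholders, not the SDK code):

#define VEC_LEN         8   // entries per vector
#define PRODS_PER_BLOCK 8   // dot products handled by one block (64 threads total)

// Sketch: each block computes several dot products; a group of VEC_LEN threads
// handles one product and reduces its partial terms in shared memory.
__global__ void dots_per_block(float *d_C, const float *d_A, const float *d_B,
                               int vectorN)
{
    __shared__ float accum[PRODS_PER_BLOCK][VEC_LEN];

    int prod = threadIdx.x / VEC_LEN;               // which product inside the block
    int lane = threadIdx.x % VEC_LEN;               // which element of that product
    int v    = blockIdx.x * PRODS_PER_BLOCK + prod; // global product index

    // each thread contributes one product term
    accum[prod][lane] = (v < vectorN)
                      ? d_A[v * VEC_LEN + lane] * d_B[v * VEC_LEN + lane]
                      : 0.0f;
    __syncthreads();

    // tree reduction over the VEC_LEN partial terms of each product
    for (int stride = VEC_LEN / 2; stride > 0; stride >>= 1) {
        if (lane < stride)
            accum[prod][lane] += accum[prod][lane + stride];
        __syncthreads();
    }

    // one thread per product writes the result
    if (lane == 0 && v < vectorN)
        d_C[v] = accum[prod][0];
}

// launch:
// dots_per_block<<<(vectorN + PRODS_PER_BLOCK - 1) / PRODS_PER_BLOCK,
//                  PRODS_PER_BLOCK * VEC_LEN>>>(d_C, d_A, d_B, vectorN);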