Best possible matrix-vector multiplication performance? A poor guy with only an emulator wonders…

Hi,

I just wrote my first CUDA program. Don’t have a card yet.

I have to multiply ("x = A*v") a small but dense, static matrix A (M x N) with a fresh vector v (N) and then output the result x (M). These tasks cannot be batched. Matrix A has maybe 64k x 4k elements and is the same for all multiplications; only v changes.

The resulting vector must be back in host memory within 500 us, that is, 500 microseconds. For those of you with CUDA-supported cards: do you think 500 us is achievable even in theory? Or does the ~10 ms host/OS process scheduling granularity prevent any success?

I looked at the memory latencies in the GPU specs and CUDA presentations and they seem quite deadly. The emulator doesn't simulate them, so it's hard for me to tell… Some sources said it takes 10 us to start a h2d/d2h cudaMemcpy, and 20 us to invoke a kernel? Are these numbers correct?
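Since I cannot measure any of this myself, here is a minimal sketch (untested, I only have the emulator) of how I would try to measure the host-visible copy and launch overhead on real hardware; the tiny transfer and the empty kernel are just placeholders to expose the fixed cost:

#include <cstdio>
#include <sys/time.h>

// Minimal sketch to measure host-visible CUDA overheads (untested on real hardware).
__global__ void emptyKernel() {}

static double wallMs()
{
	struct timeval tv;
	gettimeofday(&tv, NULL);
	return tv.tv_sec * 1e3 + tv.tv_usec * 1e-3;
}

int main()
{
	float h_tiny = 0.0f, *d_tiny;
	cudaMalloc((void**)&d_tiny, sizeof(float));

	// Warm up so context creation and module loading are not counted.
	emptyKernel<<<1, 1>>>();
	cudaThreadSynchronize();

	double t0 = wallMs();
	cudaMemcpy(d_tiny, &h_tiny, sizeof(float), cudaMemcpyHostToDevice);
	double t1 = wallMs();
	emptyKernel<<<1, 1>>>();
	cudaThreadSynchronize();
	double t2 = wallMs();

	printf("tiny h2d cudaMemcpy : %f ms\n", t1 - t0);
	printf("empty kernel launch : %f ms\n", t2 - t1);

	cudaFree(d_tiny);
	return 0;
}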

Because A is not updated, I put it into texture memory with 2D access. To benefit from caching, every new vector v is cudaMemcpy'd into device memory that is bound to a 1D/linear texture. I guess this is optimal?

Does anyone know how costly texture cache misses are, roughly how many nanoseconds?

The code is attached and a snippet is below. I'm not sure whether there are any faster methods?

[attachment=8211:staticMa…ul_bench.zip]

texture<float, 1, cudaReadModeElementType> texRefV;
texture<float, 2, cudaReadModeElementType> texRefA2D;

__global__ void gpuMulSmv(float* X, const int height, const int width)
{
	// 1 block, and 'height' number of threads.
	// Each thread computes the dot product of one entire matrix row
	// with the full input vector; A2D is the matrix, V is the vector in texture mem.
	// Local/shared memory is useless here(?)
	float sum = 0;
	int row = BLOCK_SIZE * threadIdx.x + threadIdx.y;
	for (int col = 0; col < width; ++col) {
		float a = tex2D(texRefA2D, col, row);
		float v = tex1Dfetch(texRefV, col);
		sum += a * v;
	}
	X[row] = sum;
}

...
	CUDA_SAFE_CALL( cudaMemcpyToArray(d_A2D, 0, 0, h_A, sA, cudaMemcpyHostToDevice) );
	CUDA_SAFE_CALL( cudaBindTexture(NULL, texRefV, d_V) );
	CUDA_SAFE_CALL( cudaBindTextureToArray(&texRefA2D, d_A2D, &chDesc) );
...
	for (int iteration = 0; iteration < 100; iteration++) {
		CUDA_SAFE_CALL( cudaMemcpy(d_V, h_V, mem_size_V, cudaMemcpyHostToDevice) );
		gpuMulSmv<<<grid, threads>>>(d_X, HA, WA);
		CUT_CHECK_ERROR("Kernel execution failed");
		CUDA_SAFE_CALL( cudaMemcpy(h_X, d_X, mem_size_X, cudaMemcpyDeviceToHost) );
	}
...

When I run the program on a Mac Mini with a very small matrix A and without a CUDA-supported card I get:

janwagner:/Developer/CUDA/bin/darwin/emurelease$ ./adaptiveOptics
Matrix dimensions  : 496 actuators x 4864 sensors (9424.00 kB)
Matrix setup time  : 20.716000 (ms)
Avg processing time: 454.931702 (ms)
Avg cudaMemcpy time: 0.015700 (ms)
Avg host time      : 454.949903 (ms)
Host compute time  : 4.102945 (ms)
Test PASSED
Press ENTER to exit...

But all this tells me is that the code works, not how fast it works on a GPU ;-)

If anyone is interested, could you give it a try and see how fast the code runs on some CUDA card?

TIA,

Fedora 10 (32 bit), CUDA SDK 2.1 Beta…

../../bin/linux/release/adaptiveOptics
Using device 0: GeForce 8400M GS
Matrix dimensions  : 496 actuators x 4864 sensors (9424.00 kB)
Matrix setup time  : 4.987000 (ms)
Avg processing time: 0.024000 (ms)
Avg cudaMemcpy time: 27.352200 (ms)
Avg host time      : 27.382700 (ms)
Host compute time  : 4.798044 (ms)
Test FAILED
diff(0,0) CPU=1213.6199,
[...]
CPU=1223.3485, GPU=1223.3469
diff(494,0) CPU=1219.4166, GPU=1219.4147
diff(495,0) CPU=1212.9670, GPU=1212.9657

Total Errors = 473
Press ENTER to exit...

:(…

Using textures made for some amazing performance improvements on my end… thanks for the suggestion! I don't know why I didn't think of it earlier, given the way I was doing the product. (I am also working on an M*V kernel, both with CUBLAS and plain CUDA, for a 4096x4096 matrix times a 4096x1 vector.)

I was using basically the same algorithm as you are, for both CPU and GPU execution; basically this (the array of floats *c is assumed to be filled with 0's in the following code):

void mat_vec(float *a, const int R, const int C, float *b, const int SIZE, float *c)
{
	// A is a column-major ordered matrix
	float temp = 0;
	if (NULL == a || NULL == b || NULL == c) return;
	for (int i = 0; i < R; i++) {
		for (int j = 0; j < C; j++) {
			c[i] += MATC(a, i, j, R) * b[j];
		}
	}
	return;
}

#define MATC(mat, i, j, NROWS) mat[indexC((i),(j),(NROWS))]
#define indexC(i, j, n_rows) (((j) * (n_rows)) + (i)) /**< \brief Column-major order addressing; the first element has index 0 */

As it turns out, it becomes much faster to do things this other way:

void mat_vec(float *a, const int R, const int C, float *b, const int SIZE, float *c)
{
	// A is a column-major ordered matrix
	float temp = 0;
	if (NULL == a || NULL == b || NULL == c) return;
	for (int j = 0; j < C; j++) {
		temp = b[j];
		for (int i = 0; i < R; i++) {
			c[i] += MATC(a, i, j, R) * temp;
		}
	}
	return;
}

The code has to jump around memory a lot less this way: in the first version the inner loop strides through a by R floats per access, while in the second it walks down each column of a sequentially, so it takes a lot less time to do the job…

My first CUDA code (I have thread blocks of 256 threads and ROWS/256 thread blocks… each thread produces one element of the result vector) was this:

#if SHARED_MEM == 0
	d_C[i] = 0.0f;
	for (int c = 0; c < COLS; c++) {
		d_C[i] += d_A[indexC(i,c,ROWS)] * d_B[c];
	}
#endif

This was basically the same algorithm as the one you are using (and the same as my first version of the CPU multiplication code)… on my GeForce 8400M GS it takes about 192+ ms to execute (GPU execution time only, not counting the PCI-Express transfer of the source data and the readback of the result).
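In hindsight, a good chunk of that time is probably the read-modify-write of d_C[i] in global memory on every iteration (d_C can alias, so the compiler cannot keep it in a register). A sketch of what I mean, reusing the same names (untested in isolation), would accumulate locally and store only once:

#if SHARED_MEM == 0
	// Accumulate in a register instead of updating d_C[i] in global
	// memory every iteration; only one global store at the end.
	float acc = 0.0f;
	for (int c = 0; c < COLS; c++) {
		acc += d_A[indexC(i, c, ROWS)] * d_B[c];
	}
	d_C[i] = acc;
#endif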

After that I worked on a CUBLAS-based solution, which takes about 21.81 ms of execution time.
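A rough back-of-the-envelope figure, assuming the matrix has to be streamed from device memory exactly once per multiply:

4096 x 4096 floats x 4 B ≈ 64 MiB per multiply
64 MiB / 21.8 ms         ≈ 3 GB/s effective bandwidth

So the real lever here is memory traffic, not arithmetic.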

Then I implemented a CUDA solution which uses Shared Memory. The M*V problem is bandwidth bound more than execution-resource bound, so to lower the execution time the idea is to reduce the pressure on device memory by reducing the number of accesses performed to it during the kernel's execution… Shared Memory seemed like a good place to start:

	// Block index
	int bx = blockIdx.x;
	// Thread index
	int tx = threadIdx.x;
	int block_id = bx * blockDim.x;
	int i = tx + block_id;
	//printf ("tx: %d\n", tx);
	//printf ("bx: %d\n", bx);
	//printf ("i: %d\n", i);

#if SHARED_MEM == 1
	__shared__ float mA[TBLOCK];
	__shared__ float mA1[TBLOCK];
	float temp = d_B[0];
	int idx = 0;
	mA1[tx] = MATC(d_A, i, 0, ROWS);
	mA[tx] = 0.0f;
	__syncthreads();
	#pragma unroll
	for (idx = 1; idx < COLS; idx++) {
		mA1[tx] *= temp;
		temp = d_B[idx];
		mA[tx] += mA1[tx];
		mA1[tx] = MATC(d_A, i, idx, ROWS);
		__syncthreads();
	}
	mA1[tx] *= temp;
	mA[tx] += mA1[tx];
	d_C[i] = mA[tx];
#endif

This shaved a few ms (between 4 and 8 ms) off the version that was not using Shared Memory, but it was not much of a gain…

I was trying to reduce stalls by mixing loads into Shared Memory with computation (imagine the source matrix A being processed in slices of 256 elements each, with that slice then moving across all columns), but the way I was loading the temp variable from device memory (from d_B) on each iteration of the loop was making the hardware sit idle a lot of the time.

Then I looked at this thread and I was like… wait… that temp variable is used and re-used and re-used for a lot of elements: all the elements of a column of the source matrix are multiplied by the same element of the source vector, so even if every thread calls temp = tex1Dfetch(texVecB, idx), there should only be one cache miss per column processed.

The current code is this:

#if SHARED_MEM == 1
	__shared__ float mA[TBLOCK];
	__shared__ float mA1[TBLOCK];
	float temp = tex1Dfetch(texVecB, 0);
	int idx = 0;
	mA1[tx] = MATC(d_A, i, 0, ROWS);
	mA[tx] = 0.0f;
	__syncthreads();
	#pragma unroll
	for (idx = 1; idx < COLS; idx++) {
		mA1[tx] *= temp;
		temp = tex1Dfetch(texVecB, idx);
		mA[tx] += mA1[tx];
		mA1[tx] = MATC(d_A, i, idx, ROWS);
		__syncthreads();
	}
	mA1[tx] *= temp;
	mA[tx] += mA1[tx];
	d_C[i] = mA[tx];
#endif

… and the output is… ;)…

Initializing data...
...allocating CPU memory.
Matrix is 4096x4096
Vector is 4096x1
Exec time only on CPU: 27.302999 (ms)
...allocating GPU memory.
Using device 0: GeForce 8400M GS
...copying input data to GPU mem.
Data init done.
"Using Shared Memory..."
---256 ---
---16 ---
Reading back GPU result...
Transfer + Exec + Readback time on GPU with CUDA: 59.832001 (ms)
Execution time on GPU with CUDA: 23.636999 (ms)
Transfer to GPU with CUDA: 35.262001 (ms)
Transfer from GPU with CUDA: 0.933000 (ms)
CPU results (C/C++):
C_CPU.x= 2.000000 C_CPU.y= 1.000000 C_CPU.z= 1.000000 C_CPU.w= 1.000000
GPU results (CUDA):
C_GPU.x= 2.000000 C_GPU.y= 1.000000 C_GPU.z= 1.000000 C_GPU.w= 1.000000
h_C_CPU == h_C_GPU... :).
Shutting down...

I am like 2 ms behind the CUBLAS code :).

Yep, doing a dot product down the matrix rows with a cached piece of the input vector is quite fast. I used that in the CUDA scalar code as well as in IBM Cell SIMD code.

Thank you for trying my code and reporting the result! :) Too bad the results failed to verify; maybe I have too many threads there, or a missing __syncthreads(). No matter. The most important (and disappointing) figure was the latency.

So >20ms latency is not uncommon then. Quite dreadful! And strange as well.

Even PCI/PCIe sound cards have 2 ms or lower latency (e.g. with ASIO), so I wonder what the nVidia folks are doing "wrong" in CUDA. Maybe a new SDK version could improve things. Or perhaps some user-managed context and stream approach that skips reloading the shaders at every kernel invocation (I haven't checked the CUDA API docs closely enough yet to see if this is possible).
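For what it's worth, from what I have skimmed so far in the programming guide (I cannot test any of this myself), something along these lines might help: page-locked host buffers from cudaMallocHost() plus asynchronous copies issued on a stream that is created once and reused, so each iteration only pays the raw PCIe transfer plus the launch overhead. A sketch, reusing the names from my earlier snippet:

	// Untested sketch: page-locked host buffers + async copies on one reused stream.
	float *h_V_pinned, *h_X_pinned;
	CUDA_SAFE_CALL( cudaMallocHost((void**)&h_V_pinned, mem_size_V) );
	CUDA_SAFE_CALL( cudaMallocHost((void**)&h_X_pinned, mem_size_X) );

	cudaStream_t stream;
	CUDA_SAFE_CALL( cudaStreamCreate(&stream) );

	for (int iteration = 0; iteration < 100; iteration++) {
		// ... fill h_V_pinned with the fresh input vector ...
		CUDA_SAFE_CALL( cudaMemcpyAsync(d_V, h_V_pinned, mem_size_V, cudaMemcpyHostToDevice, stream) );
		gpuMulSmv<<<grid, threads, 0, stream>>>(d_X, HA, WA);
		CUDA_SAFE_CALL( cudaMemcpyAsync(h_X_pinned, d_X, mem_size_X, cudaMemcpyDeviceToHost, stream) );
		CUDA_SAFE_CALL( cudaStreamSynchronize(stream) );   // result is now in h_X_pinned
	}

	CUDA_SAFE_CALL( cudaStreamDestroy(stream) );
	CUDA_SAFE_CALL( cudaFreeHost(h_V_pinned) );
	CUDA_SAFE_CALL( cudaFreeHost(h_X_pinned) );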

Btw, nice work on your own code speedup experiments!

Not bad – it’s tough to beat the BLAS packages! :)

A minor comment, however. I tried the same computation on an IBM Cell in a Playstation 3 (Debian Lenny, kernel 2.6.26, Cell SDK 3.1). The single-precision BLAS sgemv with a 4096x4096 matrix and a 4096x1 vector takes only about 7.7 ms, and that includes the transfer + exec + readback time, whereas CUBLAS needs 21.8 ms (and your kernel 23.6 ms) for the computation alone. Totally pwned! ;) Looks like level-2 BLAS is not so great on GPUs despite all the optimizations.

Cell sgemv:

General sgemv A=4096x4096, time 7808.923721 usec, computation 4.296934 GFLOPS, goodput 8.007572 GB/s
General sgemv A=4096x4096, time 7842.063904 usec, computation 4.278776 GFLOPS, goodput 7.973732 GB/s
General sgemv A=4096x4096, time 7712.125778 usec, computation 4.350867 GFLOPS, goodput 8.108078 GB/s
General sgemv A=4096x4096, time 7674.932480 usec, computation 4.371951 GFLOPS, goodput 8.147370 GB/s
  • Jan

FYI, these are the results with better hw:

  1. MacosX, CUDA 2.0, 8800GT
    Using device 0: GeForce 8800 GT
    Matrix dimensions : 496 actuators x 4864 sensors (9424.00 kB)
    Matrix setup time : 1.985000 (ms)
    Avg processing time: 0.026000 (ms)
    Avg cudaMemcpy time: 6.423501 (ms)
    Avg host time : 6.449914 (ms)
    Host compute time : 3.171921 (ms)
    Test PASSED

  2. Linux, CUDA 2.1, Tesla C1060
    Using device 0: Tesla T10 Processor
    Matrix dimensions : 496 actuators x 4864 sensors (9424.00 kB)
    Matrix setup time : 1.895000 (ms)
    Avg processing time: 0.011500 (ms)
    Avg cudaMemcpy time: 5.868600 (ms)
    Avg host time : 5.881786 (ms)
    Host compute time : 4.548073 (ms)

Thx for the results! :) Can the latency be improved further?

Judging from your results, a non-laptop PCIe card cuts the latency quite significantly, while the calculation time stays the same or gets about twice as fast (with this code). Do you know if there are still some tricks to cut the ~6 ms "input feed + readback" total latency to a tenth, down to below 0.6 ms?

About the application, it is hard real-time control of E-ELT telescope adaptive optics. Computation and memory bandwidth on the nVidia GPUs are superior to other system solutions. Only the latency makes it a no-go :(

Do you know of any method to cut that latency to below 0.6 ms? Possible or not possible with today's GPUs?

Or is this just some deficiency in the GPU's DMA engine…? The first speedup method I can think of, shaders running in an infinite loop and programming their own DMA fetches from host RAM to global memory, doesn't seem to be supported on GPUs in general?
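Purely speculative on my part (I do not know whether current hardware or driver versions actually expose this): if mapped page-locked host memory were usable from kernels, v could be read by the GPU straight out of host RAM over PCIe and the explicit per-iteration cudaMemcpy would disappear. Roughly what I have in mind:

	// Speculative sketch only; assumes mapped host memory is supported
	// (cudaSetDeviceFlags/cudaHostAlloc/cudaHostGetDevicePointer).
	cudaSetDeviceFlags(cudaDeviceMapHost);        // must happen before the context is created

	float *h_V_mapped, *d_V_alias;
	cudaHostAlloc((void**)&h_V_mapped, mem_size_V, cudaHostAllocMapped);
	cudaHostGetDevicePointer((void**)&d_V_alias, h_V_mapped, 0);

	// A kernel dereferencing d_V_alias would fetch v directly from host RAM
	// while it runs, instead of waiting for a separate cudaMemcpy to finish.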

  • Jan

Hi

can you please send me all the code at amir_b3@walla.com?

thanks a lot

Amir