Hi,
I just wrote my first CUDA program. Don’t have a card yet.
I have to multiply ("x = A*v") a small but dense, static matrix A (MxN) with a fresh vector v (length N) and then return the result x (length M) to the host. These tasks cannot be batched. Matrix A has maybe 64k x 4k elements and is the same for all multiplications; only v changes.
The resulting vector must be back in host memory within 500 us, that is, 500 microseconds. For those of you who have CUDA-capable cards: do you think 500 us is achievable even in theory, or does the ~10 ms scheduling granularity on the host/OS side rule it out?
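My own back-of-the-envelope estimate for the 496 x 4864 case from the benchmark output below looks like this; the bandwidth and per-call overhead figures are pure guesses on my part, not measurements:

/* Rough feasibility estimate. The 70 GB/s device-memory figure, the 3 GB/s
   PCIe figure and the 10 us per-call overhead are assumptions, not measured. */
#include <stdio.h>

int main(void)
{
    const double rows = 496.0, cols = 4864.0;  /* dims from the benchmark run    */
    const double bytes_A = rows * cols * 4.0;  /* ~9.6 MB, read once per kernel  */
    const double bytes_v = cols * 4.0;         /* ~19 kB host -> device          */
    const double bytes_x = rows * 4.0;         /* ~2 kB device -> host           */

    const double bw_device = 70e9;             /* assumed device memory bandwidth */
    const double bw_pcie   = 3e9;              /* assumed effective PCIe bandwidth */
    const double overhead  = 3.0 * 10e-6;      /* assumed ~10 us per memcpy/launch */

    double t = bytes_A / bw_device
             + (bytes_v + bytes_x) / bw_pcie
             + overhead;
    printf("estimated lower bound: %.0f us\n", t * 1e6);  /* prints roughly 180 us */
    return 0;
}

If that guess is anywhere near reality, 500 us would leave some headroom, but only if the kernel really gets close to the device memory bandwidth.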
I looked at the memory latencies in the GPU specs and in the CUDA presentations, and they seem quite deadly. The emulator doesn't simulate them, so it's hard for me to tell... Some sources say it takes about 10 us to start a host-to-device or device-to-host cudaMemcpy, and about 20 us to launch a kernel. Are those figures correct?
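In case someone with a card wants to check those numbers, here is a minimal timing sketch with CUDA events; it is untested on real hardware and reuses the buffers and kernel from the listing further down:

// Measures one host-to-device copy and one kernel launch with CUDA events.
// Only run in the emulator so far; d_V, h_V, mem_size_V, d_X, grid, threads,
// HA and WA are the same variables as in the benchmark code below.
cudaEvent_t start, stop;
float ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cudaMemcpy(d_V, h_V, mem_size_V, cudaMemcpyHostToDevice);   // ~19 kB upload
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
printf("h2d copy : %.1f us\n", ms * 1000.0f);

cudaEventRecord(start, 0);
gpuMulSmv<<<grid, threads>>>(d_X, HA, WA);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
printf("kernel   : %.1f us\n", ms * 1000.0f);

cudaEventDestroy(start);
cudaEventDestroy(stop);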
Because A is never updated, I put it into a 2D texture. To benefit from caching, every new vector v is cudaMemcpy'd into device memory bound to a 1D/linear texture. Is this close to optimal?
Does anyone know how expensive texture cache misses are, roughly in nanoseconds?
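A related idea I wasn't sure about: since every thread reads the same v[col] in the same loop iteration, v (~19 kB) would also fit into constant memory, where such broadcast reads are cached. A sketch of what I mean, with MAX_WIDTH as a made-up compile-time bound:

// Alternative: keep the input vector in constant memory instead of a 1D texture.
// MAX_WIDTH is a hypothetical upper bound on the vector length; constant
// memory is limited to 64 kB, so 4864 floats (~19 kB) fit comfortably.
#define MAX_WIDTH 8192

__constant__ float c_V[MAX_WIDTH];

// Host side, instead of cudaMemcpy + cudaBindTexture for texRefV:
//   cudaMemcpyToSymbol(c_V, h_V, mem_size_V);
// Kernel side, instead of tex1Dfetch(texRefV, col):
//   sum += tex2D(texRefA2D, col, row) * c_V[col];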
The code is attached and a snippet is below. I'm not sure whether there are faster methods; one idea I had is sketched after the listing.
[attachment=8211:staticMa…ul_bench.zip]
texture<float, 1, cudaReadModeElementType> texRefV;
texture<float, 2, cudaReadModeElementType> texRefA2D;

__global__ void gpuMulSmv(float* X, const int height, const int width)
{
    // 1 block, and 'height' number of threads.
    // Each thread computes the dot product of one entire matrix row with the
    // full input vector; A2D is the matrix, V is the vector in texture mem.
    // Local/shared memory is useless here(?)
    float sum = 0;
    int row = BLOCK_SIZE * threadIdx.x + threadIdx.y;
    for (int col = 0; col < width; ++col) {
        float a = tex2D(texRefA2D, col, row);
        float v = tex1Dfetch(texRefV, col);
        sum += a * v;
    }
    X[row] = sum;
}
...
CUDA_SAFE_CALL( cudaMemcpyToArray(d_A2D, 0,0, h_A, sA, cudaMemcpyHostToDevice) );
CUDA_SAFE_CALL( cudaBindTexture(NULL, texRefV, d_V) );
CUDA_SAFE_CALL( cudaBindTextureToArray(&texRefA2D, d_A2D, &chDesc) );
...
for (int iteration = 0; iteration < 100; iteration++) {
    CUDA_SAFE_CALL( cudaMemcpy(d_V, h_V, mem_size_V, cudaMemcpyHostToDevice) );
    gpuMulSmv<<<grid, threads>>>(d_X, HA, WA);
    CUT_CHECK_ERROR("Kernel execution failed");
    CUDA_SAFE_CALL( cudaMemcpy(h_X, d_X, mem_size_X, cudaMemcpyDeviceToHost) );
}
...
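The idea I mentioned above: the kernel uses a single block, so only one multiprocessor ever does any work. An alternative I have only tried in the emulator is to launch one block per matrix row and reduce the per-thread partial sums in shared memory (THREADS and the kernel name are my own placeholders):

// Sketch of an alternative: one block per matrix row, THREADS threads per
// block; each thread sums a strided slice of the row, then the partial sums
// are reduced in shared memory. Uses the same texture bindings as above.
#define THREADS 128

__global__ void gpuMulSmvRows(float* X, const int width)
{
    __shared__ float partial[THREADS];

    const int row = blockIdx.x;          // one block per output element
    const int tid = threadIdx.x;

    float sum = 0.0f;
    for (int col = tid; col < width; col += THREADS) {
        sum += tex2D(texRefA2D, col, row) * tex1Dfetch(texRefV, col);
    }
    partial[tid] = sum;
    __syncthreads();

    // Tree reduction of the partial sums within the block.
    for (int stride = THREADS / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        X[row] = partial[0];
}

It would be launched as gpuMulSmvRows<<<HA, THREADS>>>(d_X, WA);. And if the cudaMemcpy turnaround turns out to dominate, allocating h_V and h_X with cudaMallocHost() (page-locked memory) instead of malloc() should cut the transfer latency further, as far as I understand.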
When I run the program on a Mac Mini, with a very small matrix A and without a CUDA-supported card, I get:
janwagner:/Developer/CUDA/bin/darwin/emurelease$ ./adaptiveOptics
Matrix dimensions : 496 actuators x 4864 sensors (9424.00 kB)
Matrix setup time : 20.716000 (ms)
Avg processing time: 454.931702 (ms)
Avg cudaMemcpy time: 0.015700 (ms)
Avg host time : 454.949903 (ms)
Host compute time : 4.102945 (ms)
Test PASSED
Press ENTER to exit...
But all this tells me is that the code works, not how fast it runs on a GPU ;-)
If anyone is interested, could you give it a try and see how fast the code runs on a real CUDA card?
TIA,
- Jan
staticMat_Vec_mul_bench.zip (3.23 KB)