Matrix multiplication with Streams

Hi There,

I am running some tests, trying to implement Volkov’s matrix multiplication code with streams to see if there’s a performance increase.

The machine I’m working with has the following characteristics:

*   Dual Intel Xeon Quad-Core E5410 at 2.33 GHz (8 cores total)

*   Memory:

      o Main memory: 8 Gbytes FB-DIMM (Fully Buffered RAM)

      o L2 Cache: 12 Mbytes

*   GPU: NVIDIA Tesla C1060

So far I have it working, and I’m getting results such as:

[codebox]1024 x 1024 matrix

Volkov’s code: 14.361200 (ms)

Volkov’s code (with 2 streams): 9.277300 (ms)

2048 x 2048 matrix

Volkov’s code: 76.821701 (ms)

Volkov’s code (with 2 streams): 61.628498 (ms)

4096 x 4096 matrix

Volkov’s code: 483.279297 (ms)

Volkov’s code (with 2 streams): 604.088928 (ms)[/codebox]

When I keep increasing the matrix size, I get really poor results with streams.

I think this could be because I have to call cudaMallocHost() to achieve asynchrony and, according to the manual, a large amount of this kind of memory reduces the memory available to the system for paging, so swap might be used intensively.

I’m only using cudaMallocHost() for matrix B (A x B = C).

I would appreciate it if you could tell me whether this is the source of the problem, or whether there’s another way to implement this.
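For context, the pattern I’m describing is roughly the following. This is a simplified sketch with placeholder names (matMulKernel stands in for Volkov’s kernel, N for the matrix dimension, grid/threads for the launch configuration), not my actual code:

```cuda
// Sketch: overlap the host-to-device copy of B with kernel work
// by splitting B into two halves, one per stream.
float *hA, *hB, *hC;      // hB allocated with cudaMallocHost(), hA/hC with malloc()
float *dA, *dB, *dC;      // device buffers, N*N floats each

cudaStream_t stream[2];
cudaStreamCreate(&stream[0]);
cudaStreamCreate(&stream[1]);

// The async copy only overlaps because hB is pinned memory.
size_t half = (size_t)N * N / 2;
for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(dB + i * half, hB + i * half, half * sizeof(float),
                    cudaMemcpyHostToDevice, stream[i]);
    matMulKernel<<<grid, threads, 0, stream[i]>>>(dA, dB + i * half,
                                                  dC + i * half, N);
}
cudaThreadSynchronize();  // CUDA 2.0-era synchronization call
```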

Thanks in advance

Hi,
I was testing pinned memory in some of my code the other day, and it seemed like malloc/free of pinned memory was a bit more expensive … try allocating only once at start?
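To illustrate the “allocate only once” point, a sketch (buffer names and the loop are made up for illustration):

```cuda
// cudaMallocHost()/cudaFreeHost() are much more expensive than
// malloc()/free(), so hoist them out of any benchmark loop.
float *hB;
cudaMallocHost((void**)&hB, bytes);   // allocate pinned buffer once, at startup

for (int run = 0; run < numRuns; ++run) {
    // ... fill hB, issue async copies and kernel launches, time them ...
}

cudaFreeHost(hB);                     // free once, at the end
```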

Did you try the kernel without any streams / async, to compare between pinned and unpinned memory?

Thanks for the reply. In the timing comparison I posted, for each matrix size the first figure is the version without streams, using only unpinned memory (malloc()).

That’s why the problem appears when I increase the matrix size: although the use of streams should increase performance, it doesn’t, and I don’t know whether that could be caused by the excessive amount of pinned memory a large matrix requires.

Any more ideas about the source of the problem? Could it be a pinned-memory issue, or is it something related to streams? Thanks.

Well, according to your configuration you have 8 GB of RAM, so I guess that unless your system is under heavy load, paging can be excluded…

You may also try the cudaHostAllocMapped functionality from CUDA 2.2.
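Roughly, the mapped (zero-copy) approach looks like this. A sketch under the assumption that the kernel can read B straight from host memory; names are placeholders:

```cuda
// Zero-copy (mapped) pinned memory, available from CUDA 2.2:
// the kernel reads the buffer over PCIe, so no explicit cudaMemcpy for B.
cudaSetDeviceFlags(cudaDeviceMapHost);         // must be set before context creation

float *hB, *dB;
cudaHostAlloc((void**)&hB, bytes, cudaHostAllocMapped);
cudaHostGetDevicePointer((void**)&dB, hB, 0);  // device alias of the host buffer

// Pass dB to the kernel as if it were ordinary device memory.
```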

By the way, I also ran some stream benchmarks with my kernel. If anyone is interested, here are the results.

Note: each block has 128 threads; the size of input + output is approx. 1024 words per block.

I ran several tests with an increasing number of streams and grid sizes (my GPU is a GTX 280):

32K blocks: total amount of data (in + out): 128 MB

no streams: 42 ms

2 streams: 31.3 ms

3 streams: 28 ms **

4 streams: 29 ms

6 streams: 30.7 ms

8 streams: 29.9 ms

64K blocks: total amount of data (in + out): 256 MB

no streams: 80 ms

2 streams: 60 ms

3 streams: 61 ms

4 streams: 56 ms **

6 streams: 58 ms

8 streams: 58 ms

16 streams: 61 ms

128K blocks: total amount of data (in + out): 512 MB

no streams: 156 ms

2 streams: 118 ms

3 streams: 122 ms

4 streams: 123 ms

6 streams: 119 ms **

8 streams: 120 ms

16 streams: 123 ms

32 streams: 123 ms
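For reference, the way the work is split in these runs is roughly the following sketch (placeholder kernel and word counts; I assume the block count divides evenly among the streams):

```cuda
// Split totalBlocks evenly across nStreams; each stream gets its own
// async input copy, kernel launch, and async output copy, so copies
// for one chunk can overlap computation on another.
int per = totalBlocks / nStreams;                 // blocks per stream
for (int s = 0; s < nStreams; ++s) {
    size_t inOff  = (size_t)s * per * IN_WORDS_PER_BLOCK;
    size_t outOff = (size_t)s * per * OUT_WORDS_PER_BLOCK;

    cudaMemcpyAsync(dIn + inOff, hIn + inOff,
                    (size_t)per * IN_WORDS_PER_BLOCK * sizeof(float),
                    cudaMemcpyHostToDevice, st[s]);
    kernel<<<per, 128, 0, st[s]>>>(dIn + inOff, dOut + outOff);
    cudaMemcpyAsync(hOut + outOff, dOut + outOff,
                    (size_t)per * OUT_WORDS_PER_BLOCK * sizeof(float),
                    cudaMemcpyDeviceToHost, st[s]);
}
cudaThreadSynchronize();
```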

Thanks asm, I agree with you: with that much RAM and the machine not under heavy load, we can exclude a paging problem.

So, with more accurate timing, this is what I get when working without streams:

A x B = C

4096 x 4096 Matrices
Copy A (host to device): 34.207699 (ms)
Copy B (host to device): 34.217999 (ms)
Kernel execution: 372.297516 (ms) **
Copy C (device to host): 43.201199 (ms)
Total processing time (copy A and B, kernel exec and copy C): 487.516022 (ms)
Gflops: 28.191

8192 x 8192 Matrices
Copy A (host to device): 136.432098 (ms)
Copy B (host to device): 136.507202 (ms)
Kernel execution: 2958.412598 (ms) **
Copy C (device to host): 172.801193 (ms)
Total processing time (copy A and B, kernel exec and copy C): 3418.319580 (ms)
Gflops: 32.165

while working with streams and these matrix sizes the results are:

4096x4096
8 streams: 610.972717 (ms)
4 streams: 607.909302 (ms)
2 streams: 604.088928 (ms)
1 stream: 602.272705 (ms)

8192x8192
8 streams: 5278.885742 (ms)
4 streams: 5286.395508 (ms)
2 streams: 5288.012207 (ms)
1 stream: 5288.884277 (ms)

How can it be possible that the results with streams are worse than the ones without them?

PS: I am waiting for CUDA 2.2 to be installed on the machine I’m working on; right now CUDA 2.0 is installed. Then I’ll test the new cudaHostAllocMapped functionality.

Yes, this is interesting; according to your benchmarks, streaming is substantially slower.
Perhaps you can post part of your code here, so we can see how you use the streams and do the timing…
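In particular, it matters how the asynchronous work is timed. A common approach is CUDA events recorded on the default stream; a minimal sketch (not your code, names are placeholders):

```cuda
#include <stdio.h>

// Sketch: time async work with CUDA events so that everything queued
// on the streams is included in the measurement.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
// ... issue all cudaMemcpyAsync() calls and kernel launches here ...
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);        // block until all queued work has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("elapsed: %f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```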