I am doing some tests trying to implement Volkov’s matrix multiplication code with Streams to see if there’s a performance increase.
The machine I’m working with has the following characteristics:
* Dual Intel Xeon Quad-Core E5410 at 2.33 GHz (8 cores total)
* Memory:
  o Main memory: 8 GB FB-DIMM (Fully Buffered RAM)
  o L2 cache: 12 MB
* GPU: Nvidia Tesla C1060
So far I have it working, and I’m getting results such as:
[codebox]1024 x 1024 matrix
Volkov’s code: 14.361200 (ms)
Volkov’s code (with 2 streams): 9.277300 (ms)
2048 x 2048 matrix
Volkov’s code: 76.821701 (ms)
Volkov’s code (with 2 streams): 61.628498 (ms)
4096 x 4096 matrix
Volkov’s code: 483.279297 (ms)
Volkov’s code (with 2 streams): 604.088928 (ms)[/codebox]
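To make the trend easier to read, here is a small sketch that turns the timings above into speedup ratios. The function and constant names are just ones I made up for illustration; the numbers are copied from the results above.

```cpp
// Ratio of baseline time to streamed time.
// Values above 1 mean the 2-stream version is faster; below 1 means it is slower.
constexpr double speedup(double baseline_ms, double streamed_ms) {
    return baseline_ms / streamed_ms;
}

// Timings (ms) taken from the measurements listed above.
constexpr double kSpeedup1024 = speedup(14.361200, 9.277300);
constexpr double kSpeedup2048 = speedup(76.821701, 61.628498);
constexpr double kSpeedup4096 = speedup(483.279297, 604.088928);

// Streams help at the two smaller sizes and hurt at the largest one.
static_assert(kSpeedup1024 > 1.0 && kSpeedup2048 > 1.0 && kSpeedup4096 < 1.0,
              "streams win at 1024 and 2048, lose at 4096");
```

That works out to roughly 1.55x at 1024, 1.25x at 2048, and 0.80x at 4096, so the benefit shrinks steadily before turning into a slowdown.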
When I keep increasing the matrix size, the version with streams gets much slower than the original.
I think this could be because I have to call cudaMallocHost() to get page-locked memory for the asynchronous copies and, according to the manual, allocating a large amount of this kind of memory reduces the amount of memory available to the system for paging, so swap ends up being used heavily.
I’m only using cudaMallocHost() for matrix B (A x B = C).
I would appreciate it if you could tell me whether this is the source of the problem, or whether there’s a better way to implement this.
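To put a rough number on how much pinned memory that is, here is a back-of-the-envelope sketch. It assumes single-precision (float) elements and reads the machine's 8 GB of main memory as 8 GiB; both are assumptions, and the helper names are hypothetical.

```cpp
#include <cstdint>

// Bytes needed for one n x n matrix of single-precision floats
// (float is an assumption about the element type).
constexpr std::uint64_t matrix_bytes(std::uint64_t n) {
    return n * n * sizeof(float);
}

// Percentage of host RAM that one pinned n x n matrix would occupy.
constexpr double pinned_share_percent(std::uint64_t n, std::uint64_t host_ram_bytes) {
    return 100.0 * static_cast<double>(matrix_bytes(n)) /
           static_cast<double>(host_ram_bytes);
}

// 8 GiB of host RAM (assumed reading of the machine's main memory size).
constexpr std::uint64_t kHostRam = 8ull << 30;

// One 4096 x 4096 float matrix is 64 MiB.
static_assert(matrix_bytes(4096) == (64ull << 20), "4096 x 4096 floats occupy 64 MiB");
```

Under those assumptions, pinning B at 4096 x 4096 takes 64 MiB, which is under 1% of host RAM. That estimate only covers matrix B itself, so it does not settle whether paging is really what hurts here.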
Thanks in advance