When I keep increasing the matrix size I get really poor results with streams.
I think this could be because I have to call cudaMallocHost() to get asynchronous transfers and, according to the manual, allocating a large amount of this page-locked memory reduces the memory available to the system for paging, so swap may be used heavily.
I’m only using cudaMallocHost() for matrix B (A x B = C)
I would appreciate it if you could tell me whether this is the source of the problem, or whether there is another way to implement this.
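To be concrete, this is roughly what I'm doing (a simplified sketch with illustrative names, not my exact code):

```c
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    // Only B uses cudaMallocHost(); A and C use ordinary malloc().
    size_t n = 4096;
    size_t bytes = n * n * sizeof(float);

    float *h_A = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);
    float *h_B = NULL;
    cudaMallocHost((void**)&h_B, bytes);      // pinned (page-locked) host memory

    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);               // blocking copy of A
    cudaMemcpyAsync(d_B, h_B, bytes, cudaMemcpyHostToDevice, stream);  // async copy of B (pinned source required)

    // ... kernel launch and copy of C back to the host go here ...

    cudaStreamDestroy(stream);
    cudaFreeHost(h_B);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_C);
    return 0;
}
```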
Hi,
I was testing pinned memory in some of my code the other day, and allocating/freeing pinned memory seemed noticeably more expensive than a regular malloc/free … have you tried allocating it only once at startup?
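Something along these lines (just a sketch; `bytes`, `d_B`, `stream` and `numIterations` are placeholders):

```c
// Pay the pinned-allocation cost once, outside the processing loop.
float *h_B = NULL;
cudaMallocHost((void**)&h_B, bytes);      // expensive: do it once at startup

for (int iter = 0; iter < numIterations; ++iter) {
    // refill h_B as needed, then reuse the same pinned buffer every time
    cudaMemcpyAsync(d_B, h_B, bytes, cudaMemcpyHostToDevice, stream);
    // ... kernel launches, device-to-host copies ...
}

cudaFreeHost(h_B);                        // free once at shutdown
```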
Did you try the kernel without any streams or async copies, to compare pinned and unpinned memory?
Thanks for the reply. In the timing comparison I posted, for each matrix size the first version is without streams, using only unpinned memory (malloc()).
That's why my problem appears when I increase the matrix size: at that point, although using streams should improve performance, it doesn't, and I don't know whether this could be caused by the excessive amount of pinned memory a large matrix requires.
Thanks asm, I agree with you: with that much RAM and the machine not under heavy load, we can rule out a paging problem.
So, after doing more accurate timing, this is what I get without streams:
A x B = C
4096 x 4096 Matrices
Copy A (host to device): 34.207699 (ms)
Copy B (host to device): 34.217999 (ms)
Kernel execution: 372.297516 (ms)
Copy C (device to host): 43.201199 (ms)
Total processing time (copy A and B, kernel exec and copy C): 487.516022 (ms)
Gflops: 28.191
8192 x 8192 Matrices
Copy A (host to device): 136.432098 (ms)
Copy B (host to device): 136.507202 (ms)
Kernel execution: 2958.412598 (ms)
Copy C (device to host): 172.801193 (ms)
Total processing time (copy A and B, kernel exec and copy C): 3418.319580 (ms)
Gflops: 32.165
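(For reference, per-step times like the above can be measured with CUDA events; this is a minimal sketch, not my exact code, with `d_A`, `h_A` and `bytes` as placeholders:)

```c
// Minimal event-based timing of one copy; result is reported in milliseconds.
cudaEvent_t start, stop;
float elapsed_ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);   // the step being timed
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                            // wait for the copy to finish
cudaEventElapsedTime(&elapsed_ms, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```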
While working with streams on these same matrix sizes, the results are:
How can it be possible that the results with streams are worse than the ones without them?
PS: I am waiting for CUDA 2.2 to be installed on the machine I'm working on; right now CUDA 2.0 is installed. Then I'll test the new cudaHostAllocMapped (zero-copy) functionality.
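From the 2.2 documentation, I understand the mapped (zero-copy) path would look roughly like this; I haven't been able to test it yet, and `matMulKernel`, `grid`, `block`, `d_A`, `d_C` and `n` are placeholder names:

```c
// Zero-copy sketch for CUDA 2.2 (untested here): the kernel reads B directly from
// mapped, pinned host memory instead of using an explicit cudaMemcpy.
cudaSetDeviceFlags(cudaDeviceMapHost);                 // must run before the CUDA context is created

float *h_B = NULL;
cudaHostAlloc((void**)&h_B, bytes, cudaHostAllocMapped);

float *d_B_mapped = NULL;
cudaHostGetDevicePointer((void**)&d_B_mapped, h_B, 0); // device-side alias of the host buffer

matMulKernel<<<grid, block>>>(d_A, d_B_mapped, d_C, n); // placeholder kernel and launch config
```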
Yes, this is interesting: according to your benchmarks, streaming is substantially slower.
Perhaps you can post part of your code here, so we can see how you use the streams and how you do the timing…
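For comparison, the pattern I would expect for overlapping copies and kernels is roughly the following (just a sketch; `h_in`/`h_out`, `d_in`/`d_out`, `workKernel` and the launch configuration are made-up names):

```c
// Usual copy/kernel overlap pattern: the host buffers must be pinned, and each
// chunk's copy and kernel are issued into the same stream, so the copy of
// chunk i+1 can overlap with the kernel working on chunk i.
const int nStreams = 4;
cudaStream_t streams[nStreams];
for (int i = 0; i < nStreams; ++i)
    cudaStreamCreate(&streams[i]);

size_t chunkBytes = bytes / nStreams;
size_t chunkElems = chunkBytes / sizeof(float);

for (int i = 0; i < nStreams; ++i) {
    size_t offset = i * chunkElems;
    cudaMemcpyAsync(d_in + offset, h_in + offset, chunkBytes,
                    cudaMemcpyHostToDevice, streams[i]);
    workKernel<<<gridPerChunk, block, 0, streams[i]>>>(d_in + offset, d_out + offset, chunkElems);
    cudaMemcpyAsync(h_out + offset, d_out + offset, chunkBytes,
                    cudaMemcpyDeviceToHost, streams[i]);
}

cudaThreadSynchronize();   // wait for all streams (CUDA 2.x API)

for (int i = 0; i < nStreams; ++i)
    cudaStreamDestroy(streams[i]);
```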