MPI-CUDA-Runtime-compare

Hello, I'm doing a matrix multiplication project in both message passing (MPI) and CUDA. What is the most suitable way to measure and compare the runtime of the MPI and CUDA versions? In the message-passing version I've used MPI_Wtime(), which gives the run time in seconds, and in CUDA I used a ready-made routine that reports GB/s, i.e. the throughput of the kernel run.

But is this the correct way to compare? From what I know, the GPU is strong on throughput and the CPU on latency, and while there are a number of links showing how to calculate CPU GFLOPS, none of them mention which function is best to use.

I'm aware that comparing the GPU against MPI might not be entirely fair, but there should be some way to give an indication of the best way to compare the two. I mean, which timing function should I use in each case?
Can you advise on this?

Much appreciated, thanks!

are you comparing apples with apples?

processing/ throughput should be independent of (medium of) communication
peak theoretical throughput should be the same, regardless of the medium of communication
if a particular communication medium achieves lower throughput than another, you would likely focus on communication-specific factors like latency, and attempt to hide or eliminate that

am i missing something?

Hello Little_jimmy, thanks. So what you're saying, if I understood you correctly, is that it's preferable to focus on the latency for the message passing, Pthread and CUDA versions.

Can you tell me which timing functions I should use to compare the latency of these three codes? Sorry, I failed to mention that I am implementing my matrix multiplication in three versions: message passing (MPI), Pthreads and CUDA.

My problem is how to relate these three codes to one another in the analysis and determine which is best from a performance point of view. To make it clearer:

In CUDA, I was advised to use the ready-made code below to measure the run time; as you can see from this part of the code, I calculate the GB/s:

    // Host-side timing code around the kernel launch
    cudaEvent_t startEvent, stopEvent;    // start and stop events for timing
    checkCuda(cudaEventCreate(&startEvent));
    checkCuda(cudaEventCreate(&stopEvent));
    float ms;

    checkCuda(cudaEventRecord(startEvent, 0));
    // some kernel call, for example: TestRC<<<1, 1>>>(d_idataER, d_idataEC);
    checkCuda(cudaEventRecord(stopEvent, 0));
    checkCuda(cudaEventSynchronize(stopEvent));
    checkCuda(cudaEventElapsedTime(&ms, startEvent, stopEvent));

    TimeCalculation(Edge * Ny, ms);       // calculate and report the GB/s figure
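
TimeCalculation() above is just a small helper. The sketch below shows one way such a helper could turn the element count and elapsed milliseconds into a GB/s figure; the body, the signature, and the assumption of one read plus one write per float element are illustrative only, not necessarily the exact code.

    // Illustrative sketch only: one way a TimeCalculation-style helper could report GB/s.
    // Assumes each of the 'elements' float values is read once and written once.
    #include <stdio.h>

    void TimeCalculation(long long elements, float ms)
    {
        double bytes   = 2.0 * (double)elements * sizeof(float);  // assumed traffic: 1 read + 1 write
        double seconds = ms / 1000.0;                             // cudaEventElapsedTime reports milliseconds
        printf("Effective bandwidth: %f GB/s\n", bytes / seconds / 1e9);
    }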

In message passing, I measure the time using MPI_Wtime(); the output is in seconds.
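
Roughly, the pattern I use there looks like the sketch below (illustrative only; the barrier and the commented-out multiply call are placeholders for the actual code):

    // Minimal sketch of the MPI_Wtime() timing pattern; matrix_multiply() is a placeholder name.
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Barrier(MPI_COMM_WORLD);          // line the ranks up before timing
        double t_start = MPI_Wtime();         // wall-clock time in seconds

        /* matrix_multiply(); */              // the actual MPI matrix multiplication goes here

        double t_end = MPI_Wtime();

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("elapsed: %f s\n", t_end - t_start);

        MPI_Finalize();
        return 0;
    }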

In the Pthreads version, I measure using clock_gettime(CLOCK_MONOTONIC, &begin) for the begin and end points; the output is in seconds.
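
The Pthreads timing is along the same lines (again only a sketch; the placeholder comment stands in for the thread create/join code):

    // Minimal sketch of the clock_gettime(CLOCK_MONOTONIC) timing pattern.
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec begin, end;

        clock_gettime(CLOCK_MONOTONIC, &begin);

        /* pthread_multiply(); */             // create and join the worker threads here

        clock_gettime(CLOCK_MONOTONIC, &end);

        double elapsed = (end.tv_sec - begin.tv_sec)
                       + (end.tv_nsec - begin.tv_nsec) / 1e9;   // seconds
        printf("elapsed: %f s\n", elapsed);
        return 0;
    }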

How can I relate the three outputs so they can be compared? I don't think the CUDA GB/s figure is a fair comparison against the other two, right? Am I using the right functions to evaluate the code across these three?

Or should I find a way to calculate GB/s for MPI and Pthreads as well?

I will really appreciate your help, thanks!

i think it is possible to compare:

calculation_time(MPI) :: calculation_time(pthread) :: calculation_time(cuda)

but i am not sure whether it is sensible

if all calculation times are equal, everyone is happy
if the calculation times are not equal, it is to a far lesser extent the fault of the device itself
and in such a case, with a few tweaks here and there that adapt to the differences in communication medium, you would likely end up with comparable calculation times, give or take

you really need to focus on communication-specific measures then, like latency and such
for your measurement to be accurate and valid, you would need to standardize the measurement base across all mediums, and have it as a fraction of the anticipated/expected latency
i would deem your cuda measurement realistic, and your mpi measurement unrealistic, because its base is in seconds, which is unlikely to be a fraction of the true latency
i have not really worked with mpi, but i would think the same (cuda-style) measure can be applied across all mediums - certainly in the case of pthreads; mpi should still support the required underlying apis
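
to illustrate what i mean by standardizing the measurement base: once each version gives you an elapsed time, you can push all three numbers through the same formula and compare like with like. a minimal sketch, assuming a square n x n multiply with roughly 2*n^3 floating point operations; the function name and the sample times are purely illustrative:

    // illustrative only: convert any elapsed time (in seconds) into the same GFLOP/s metric,
    // assuming a square n x n matrix multiply, i.e. roughly 2*n^3 floating point operations
    #include <stdio.h>

    double gflops(long long n, double seconds)
    {
        double flops = 2.0 * (double)n * (double)n * (double)n;  // ~n multiplies and n adds per output element
        return flops / seconds / 1e9;
    }

    int main(void)
    {
        // hypothetical elapsed times, in seconds (the cuda event time in ms divided by 1000 first)
        printf("mpi:     %f GFLOP/s\n", gflops(1024, 0.50));
        printf("pthread: %f GFLOP/s\n", gflops(1024, 0.75));
        printf("cuda:    %f GFLOP/s\n", gflops(1024, 0.02));
        return 0;
    }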

and you probably need to increase your sample size - one measure is hardly comprehensive