MPI-CUDA-Runtime-compare

Hello, I'm doing a matrix multiplication project in both message passing (MPI) and CUDA. What is the most suitable way to measure and compare the runtime of the MPI and CUDA versions? In the message-passing version I've used MPI_Wtime(), which gives the run time in seconds, and in CUDA I used a ready-made routine that reports GB/s, i.e. the throughput of the kernel run.

But is this the correct way to compare? From what I know, the GPU is strong on throughput and the CPU on latency, and while there are a number of links showing how to calculate CPU GFLOPS, none of them mention which function is best to use.

I'm aware that comparing the GPU against MPI might not be entirely fair, but there should be some way to give an indication of the best way to compare the two. I mean, which timing function should I use in each case?
Can you advise on this?

Much appreciated, thanks!

are you comparing apples with apples?

processing/ throughput should be independent of (medium of) communication
peak theoretical throughput should be the same, regardless of the medium of communication
if a particular communication medium achieves lower throughput than another, you would likely focus on communication-specific factors like latency, and attempt to hide or eliminate that

am i missing something?

Hello Little_jimmy, thanks. So what you're saying, if I understood you correctly, is that it's preferable to focus on the latency for the message passing, Pthread and CUDA versions.

Can you tell me which timing functions I should use to compare the latency of these three codes? Sorry, I failed to mention that I am implementing my matrix multiplication in three versions: message passing (MPI), Pthreads and CUDA.

My problem is how to relate these three codes to one another in the analysis and determine which is best from a performance point of view. To make it clearer:

In CUDA, I was advised to use the ready-made code below to measure the run time; as you can see from this part of the code, I calculate the GB/s:

    // Host-side timing code around the kernel launch
    cudaEvent_t startEvent, stopEvent;    // start and stop events for timing
    checkCuda(cudaEventCreate(&startEvent));
    checkCuda(cudaEventCreate(&stopEvent));
    float ms;

    checkCuda(cudaEventRecord(startEvent, 0));
    // some kernel call, for example: TestRC<<<1, 1>>>(d_idataER, d_idataEC);
    checkCuda(cudaEventRecord(stopEvent, 0));
    checkCuda(cudaEventSynchronize(stopEvent));
    checkCuda(cudaEventElapsedTime(&ms, startEvent, stopEvent));

    TimeCalculation(Edge * Ny, ms);       // calculate and report the GB/s figure
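
TimeCalculation() above is just a small helper. The sketch below shows one way such a helper could turn the element count and elapsed milliseconds into a GB/s figure; the body, the signature, and the assumption of one read plus one write per float element are illustrative only, not necessarily the exact code.

    // Illustrative sketch only: one way a TimeCalculation-style helper could report GB/s.
    // Assumes each of the 'elements' float values is read once and written once.
    #include <stdio.h>

    void TimeCalculation(long long elements, float ms)
    {
        double bytes   = 2.0 * (double)elements * sizeof(float);  // assumed traffic: 1 read + 1 write
        double seconds = ms / 1000.0;                             // cudaEventElapsedTime reports milliseconds
        printf("Effective bandwidth: %f GB/s\n", bytes / seconds / 1e9);
    }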

In message passing, I measure the time using MPI_Wtime(); the output is in seconds.
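
Roughly, the pattern I use there looks like the sketch below (illustrative only; the barrier and the commented-out multiply call are placeholders for the actual code):

    // Minimal sketch of the MPI_Wtime() timing pattern; matrix_multiply() is a placeholder name.
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        MPI_Barrier(MPI_COMM_WORLD);          // line the ranks up before timing
        double t_start = MPI_Wtime();         // wall-clock time in seconds

        /* matrix_multiply(); */              // the actual MPI matrix multiplication goes here

        double t_end = MPI_Wtime();

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            printf("elapsed: %f s\n", t_end - t_start);

        MPI_Finalize();
        return 0;
    }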

In the Pthreads version, I measure using clock_gettime(CLOCK_MONOTONIC, &begin) for the begin and end points; the output is in seconds.
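
The Pthreads timing is along the same lines (again only a sketch; the placeholder comment stands in for the thread create/join code):

    // Minimal sketch of the clock_gettime(CLOCK_MONOTONIC) timing pattern.
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec begin, end;

        clock_gettime(CLOCK_MONOTONIC, &begin);

        /* pthread_multiply(); */             // create and join the worker threads here

        clock_gettime(CLOCK_MONOTONIC, &end);

        double elapsed = (end.tv_sec - begin.tv_sec)
                       + (end.tv_nsec - begin.tv_nsec) / 1e9;   // seconds
        printf("elapsed: %f s\n", elapsed);
        return 0;
    }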

How can I relate the three outputs so they can be compared? I don't think the CUDA GB/s figure is a fair comparison against the other two, right? Am I using the right functions to evaluate the code across these three?

Or should I find a way to calculate GB/s for MPI and Pthreads as well?

I will really appreciate your help, thanks!

i think it is possible to compare:

calculation_time(MPI) :: calculation_time(pthread) :: calculation_time(cuda)

but i am not sure whether it is sensible

if all calculation times are equal, everyone is happy
if the calculation times are not equal, it is to a far lesser extent the fault of the device itself
and in such a case, with a few tweaks here and there that adapt to the differences in communication medium, you would likely end up with comparable calculation times, give or take

you really need to focus on communication-specific measures then, like latency and such
for your measurement to be accurate and valid, you would need to standardize the measurement base across all mediums, and have it as a fraction of the anticipated/expected latency
i would deem your cuda measurement realistic, and your mpi measurement unrealistic, because its base is in seconds, which is unlikely to be a fraction of the true latency
i have not really worked with mpi, but i would think the same (cuda-style) measure can be applied across all mediums - certainly in the case of pthreads; mpi should still support the required underlying apis
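
to illustrate what i mean by standardizing the measurement base: once each version gives you an elapsed time, you can push all three numbers through the same formula and compare like with like. a minimal sketch, assuming a square n x n multiply with roughly 2*n^3 floating point operations; the function name and the sample times are purely illustrative:

    // illustrative only: convert any elapsed time (in seconds) into the same GFLOP/s metric,
    // assuming a square n x n matrix multiply, i.e. roughly 2*n^3 floating point operations
    #include <stdio.h>

    double gflops(long long n, double seconds)
    {
        double flops = 2.0 * (double)n * (double)n * (double)n;  // ~n multiplies and n adds per output element
        return flops / seconds / 1e9;
    }

    int main(void)
    {
        // hypothetical elapsed times, in seconds (the cuda event time in ms divided by 1000 first)
        printf("mpi:     %f GFLOP/s\n", gflops(1024, 0.50));
        printf("pthread: %f GFLOP/s\n", gflops(1024, 0.75));
        printf("cuda:    %f GFLOP/s\n", gflops(1024, 0.02));
        return 0;
    }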

and you probably need to increase your sample size - one measure is hardly comprehensive