I have a question about a parallel implementation on the Nvidia Jetson Nano B01. I have a sequential C program that multiplies two square matrices of dimension N=1000. I ran this program on the Jetson Nano using a single CPU core, and it completed successfully. I measured the execution time of the matrix multiplication function.
Next, I parallelized this code with OpenMP, specifically the for loop inside the matrix multiplication function, using the 4 CPU cores available on the Jetson Nano. I then measured the execution time of the parallel version in the same way.
My work focuses on comparing these execution times. The problem is that the parallel execution time is greater than the sequential execution time. Is OpenMP working correctly? I would appreciate help in resolving this problem.