My PC has an Intel Core i7-920 quad-core CPU and a Tesla C2050 GPU card. The code runs in CPU/GPU hybrid mode. A single job takes about 200 seconds in total, and the GPU accounts for roughly half of that. Just out of curiosity, I tried running 4 jobs simultaneously, since there are 4 CPU cores available, and all of them finished in about 300 seconds. That seems quite good, because running the jobs sequentially would take about 800 seconds (4 × 200).
So, the question is: what does this indicate about the code's performance? Possible explanations:
1) My GPU code is inefficient, so a single job does not fully utilize the GPU cores.
2) The GPU sits idle too much with a single job.
3) PCIe transfer is negligible (the CUDA compute profiler reports that PCIe transfers take about 6% to 8% of the total running time).
4) Other factors I am missing?
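A quick back-of-the-envelope check of these timings, written as a small Python sketch. It assumes (this is not measured) that the GPU half of each job is about 100 s of genuinely busy device time, and that GPU work from different processes cannot overlap on the device, i.e. the GPU switches between processes instead of running their kernels at the same time. The function `throughput_summary` and its parameter names are hypothetical.

```python
def throughput_summary(t_single, gpu_fraction, n_jobs, t_concurrent):
    """Hypothetical model: compare sequential vs. concurrent runs of n_jobs."""
    # Time to run all jobs one after another.
    t_sequential = n_jobs * t_single
    # Observed throughput gain from running the jobs concurrently.
    speedup = t_sequential / t_concurrent
    # Assumption: the GPU phases of different jobs must serialize and the GPU
    # is fully busy during them, so the makespan cannot be shorter than this.
    gpu_bound = n_jobs * gpu_fraction * t_single
    return t_sequential, speedup, gpu_bound


# Numbers from the post: 200 s per job, GPU ~50%, 4 jobs, ~300 s concurrently.
seq, speedup, bound = throughput_summary(200.0, 0.5, 4, 300.0)
print(f"sequential: {seq:.0f} s")                 # prints: sequential: 800 s
print(f"speedup: {speedup:.2f}x")                 # prints: speedup: 2.67x
print(f"GPU-serialized lower bound: {bound:.0f} s")  # prints: ... 400 s
```

Under those assumptions the four jobs could not finish in less than about 400 seconds, yet they finished in about 300. If the assumptions hold, a single job leaves the GPU idle for part of its "GPU" time (for example while waiting on the CPU side or on synchronization), which would be consistent with explanation 2).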
Of course, a good sign is that the CPU and GPU job schedulers seem to work well for my code.
Are there any other comments or suggestions? Thanks!