My PC has an i7-920 quad-core CPU and a Tesla C2050 GPU card. The code runs in a CPU/GPU hybrid mode, in which the GPU accounts for about half of the total running time (~200 seconds in total). Just out of curiosity, I tried running 4 jobs simultaneously, since there are 4 CPU cores available, and all of them finished in around 300 seconds. That seems quite good: running the jobs sequentially would take about 800 seconds (200*4).
So the question is: what does this indicate about the code's performance? It could mean that 1) my GPU code is inefficient, since a single job does not fully utilize the GPU cores; 2) the GPU idles too much with a single job; 3) the PCIe transfer cost is negligible (the CUDA compute profiler reports that PCIe transfers take about 6% to 8% of the total running time); or 4) there are other factors that I am missing?
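To put rough numbers on the timings above, here is a minimal sketch of the arithmetic. It assumes (and this is an assumption, not something measured here) that GPU work submitted by separate processes is serialized on the device, so the summed GPU-busy time of all jobs cannot exceed the wall-clock time of the concurrent run. The variable names are made up for illustration, and the inputs are the approximate figures from the post.

```python
# Approximate timings quoted in the post, in seconds.
single_job_s = 200.0       # one job run on its own
concurrent_wall_s = 300.0  # wall-clock time for four jobs run at once
n_jobs = 4
gpu_fraction = 0.5         # "about a half" of a single job is attributed to the GPU

# Throughput gain from running the jobs concurrently.
sequential_s = n_jobs * single_job_s
speedup = sequential_s / concurrent_wall_s

# ASSUMPTION: GPU work from different processes is serialized, so
# total GPU-busy time over all jobs <= concurrent wall-clock time.
max_gpu_busy_per_job_s = concurrent_wall_s / n_jobs

# Time a single job attributes to the GPU, and the minimum share of
# that time during which the GPU cannot have been busy for this job.
attributed_gpu_s = gpu_fraction * single_job_s
min_idle_share = 1.0 - max_gpu_busy_per_job_s / attributed_gpu_s

print(f"sequential estimate: {sequential_s:.0f} s")
print(f"speedup from concurrency: {speedup:.2f}x")
print(f"GPU-busy upper bound per job: {max_gpu_busy_per_job_s:.0f} s")
print(f"attributed GPU time per job: {attributed_gpu_s:.0f} s")
print(f"minimum idle share of attributed GPU time: {min_idle_share:.0%}")
```

Under that assumption, the numbers would be consistent with hypothesis 2: part of the time attributed to the GPU in a single job is spent waiting rather than computing. Since the inputs are rounded, this is only a rough bound, and it says nothing about how well each kernel fills the GPU while it runs.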
Of course, the good sign is that the CPU and GPU job schedulers seem to work well for my code.
Are there any other comments or suggestions? Thanks!
I'm not sure if this will be helpful to you, but I just wanted to share my experience with the C2050 compared to the GTX 285. When the number of thread blocks is small (10-50), the GTX 285 is faster. But when the number of blocks is on the order of hundreds or thousands, the C2050 shows its performance.