Understanding COMPUTE_PROFILE output

Hi all,

I just started learning how to profile CUDA Fortran by doing:

export COMPUTE_PROFILE=1
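
As far as I can tell from the CUDA toolkit documentation, the legacy command-line profiler also reads a few related environment variables (the file names below are just examples I picked):

export COMPUTE_PROFILE=1                      # enable the legacy profiler
export COMPUTE_PROFILE_CSV=1                  # write comma-separated output
export COMPUTE_PROFILE_LOG=cuda_profile.log   # log file name
export COMPUTE_PROFILE_CONFIG=profile.cfg     # extra options, one per line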

But I do not understand the output.

Here is the code I profile:

! Copy coordinates host-to-device asynchronously on stream4
istat = cudaMemcpyAsync(x_d, x, natom, stream4)
istat = cudaMemcpyAsync(y_d, y, natom, stream4)
istat = cudaMemcpyAsync(z_d, z, natom, stream4)

! 2-D copy of the bond list (pitch n_b, width n_b, height nbondpep)
istat = cudaMemcpy2DAsync(bondpep_d, n_b, bondpep, n_b, n_b, nbondpep, &
                          cudaMemcpyHostToDevice, stream4)

istat = cudaMemcpyAsync(bontyp_d, bontyp, nbondpep, stream4)

! Launch the bond-force kernel on stream4
call fpepbonds<<<(nbondpep+1)/512, 512, 0, stream4>>> &
     (FXPEP_d, FYPEP_d, FZPEP_d, X_d, Y_d, Z_d, NBONDPEP, BONDPEP_d, BONTYP_d, &
      K_P, LSTART, LREP, LSWITCH, KSWITCH)

! Copy the resulting forces back to the host on stream4
istat = cudaMemcpyAsync(fxpep, fxpep_d, natom, stream4)
istat = cudaMemcpyAsync(fypep, fypep_d, natom, stream4)
istat = cudaMemcpyAsync(fzpep, fzpep_d, natom, stream4)

! Block the host until all device work has finished
istat = cudaDeviceSynchronize()

And here is the output:

CUDA_PROFILE_LOG_VERSION 2.0

CUDA_DEVICE 0 Tesla K20c

CUDA_CONTEXT 1

TIMESTAMPFACTOR 1387a1b050e913fe

method,gputime,cputime,occupancy
method=[ memcpyHtoD ] gputime=[ 1.344 ] cputime=[ 27.513 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 6.952 ]
method=[ memcpyHtoD ] gputime=[ 1.408 ] cputime=[ 6.278 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 7.962 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 6.998 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 8.255 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 7.303 ]
method=[ memcpyHtoD ] gputime=[ 1.344 ] cputime=[ 7.360 ]
method=[ memcpyHtoD ] gputime=[ 0.864 ] cputime=[ 9.088 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 6.406 ]
method=[ memcpyHtoD ] gputime=[ 1.376 ] cputime=[ 24.377 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 6.414 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 6.023 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 5.962 ]
method=[ memcpyHtoD ] gputime=[ 1.376 ] cputime=[ 5.936 ]
method=[ memcpyHtoD ] gputime=[ 1.344 ] cputime=[ 6.247 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 5.959 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 5.977 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 6.081 ]
method=[ memcpyHtoD ] gputime=[ 1.376 ] cputime=[ 5.833 ]
method=[ memcpyHtoD ] gputime=[ 1.344 ] cputime=[ 6.013 ]
method=[ memcpyHtoD ] gputime=[ 1.344 ] cputime=[ 5.939 ]
method=[ memcpyHtoD ] gputime=[ 0.896 ] cputime=[ 5.255 ]
method=[ memcpyHtoD ] gputime=[ 0.928 ] cputime=[ 4.053 ]
method=[ memcpyHtoDasync ] gputime=[ 55.808 ] cputime=[ 8.239 ]
method=[ memcpyHtoDasync ] gputime=[ 55.520 ] cputime=[ 3.671 ]
method=[ memcpyHtoDasync ] gputime=[ 55.136 ] cputime=[ 3.183 ]
method=[ memcpyHtoDasync ] gputime=[ 326.976 ] cputime=[ 4.503 ]
method=[ memcpyHtoDasync ] gputime=[ 15.776 ] cputime=[ 2.835 ]
method=[ memcpyHtoD ] gputime=[ 1.120 ] cputime=[ 336.131 ]
method=[ memcpyHtoD ] gputime=[ 0.896 ] cputime=[ 4.575 ]
method=[ memcpyHtoD ] gputime=[ 0.896 ] cputime=[ 4.234 ]
method=[ memcpyHtoD ] gputime=[ 0.864 ] cputime=[ 4.744 ]
method=[ memcpyHtoD ] gputime=[ 0.896 ] cputime=[ 4.149 ]
method=[ memcpyHtoD ] gputime=[ 0.896 ] cputime=[ 4.425 ]
method=[ memcpyHtoD ] gputime=[ 0.896 ] cputime=[ 4.339 ]
method=[ memcpyHtoD ] gputime=[ 0.896 ] cputime=[ 4.186 ]
method=[ mods_grow_fpepbonds_ ] gputime=[ 29.184 ] cputime=[ 622.631 ] occupancy=[ 0.750 ]
method=[ memcpyDtoHasync ] gputime=[ 51.712 ] cputime=[ 485.609 ]
method=[ memcpyDtoHasync ] gputime=[ 50.624 ] cputime=[ 440.018 ]
method=[ memcpyDtoHasync ] gputime=[ 49.920 ] cputime=[ 416.596 ]


I wonder why there are so many memcpyHtoD events. Can you please explain?

Thanks,

Lam

Hi Lam,

While I can't be sure, they're most likely the F90 descriptors and the kernel arguments.
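
If that's the case, you should see it in the kernel's interface. Here's a sketch of what I mean (my guess at the declarations, not your actual code): scalar dummies without the VALUE attribute live in device memory, so the runtime stages a small host-to-device copy for each one before the launch, and assumed-shape device arrays need their F90 descriptors copied over as well.

attributes(global) subroutine fpepbonds(fxpep, fypep, fzpep, x, y, z, &
                                        nbondpep, bondpep, bontyp,   &
                                        k_p, lstart, lrep, lswitch, kswitch)
  ! Assumed-shape device arrays: the runtime copies one descriptor HtoD per array
  double precision, device :: fxpep(:), fypep(:), fzpep(:), x(:), y(:), z(:)
  integer, device :: bondpep(:,:), bontyp(:)
  ! Without VALUE, each scalar would be staged in device memory (one small
  ! memcpyHtoD each); with VALUE they travel in the launch parameters instead
  integer, value :: nbondpep
  double precision, value :: k_p, lstart, lrep, lswitch, kswitch
  ! ... force computation ...
end subroutine fpepbonds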

  • Mat

Thanks Mat,

So how do I find out which ones correspond to the F90 descriptors and which to the kernel arguments? Among the arguments, FXPEP_d, FYPEP_d, FZPEP_d, X_d, Y_d, Z_d, BONDPEP_d, and BONTYP_d are arrays, and there are six scalars: NBONDPEP, K_P, LSTART, LREP, LSWITCH, KSWITCH.

There is one which takes a lot of CPU time, I don’t know what makes this:
method=[ memcpyHtoD ] gputime=[ 1.120 ] cputime=[ 336.131 ]

Thanks,

Lam

Hi Lam,

There isn't a good way to correlate which copy goes with which variable, but you can add "memtransfersize" to your COMPUTE_PROFILE_CONFIG file to see how much data is being transferred. That should at least give you a clue.
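
The config file is just a plain-text list of options, one per line. For example (the file name is arbitrary):

streamid
memtransfersize

Then point the profiler at it before running:

export COMPUTE_PROFILE_CONFIG=./profile.cfg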

There is one which takes a lot of CPU time, I don’t know what makes this:
method=[ memcpyHtoD ] gputime=[ 1.120 ] cputime=[ 336.131 ]

This just shows that it’s waiting for something to finish before it transfers the data. Most likely it’s blocked by a stream.

  • Mat

Thanks Mat,

Lam

Hi Mat,

First I create 4 streams like this:

integer(kind=cuda_stream_kind)::stream1,stream2,stream3,stream4

istat=cudaStreamCreate(stream1)
istat=cudaStreamCreate(stream2)
istat=cudaStreamCreate(stream3)
istat=cudaStreamCreate(stream4)


After adding streamid to the COMPUTE_PROFILE_CONFIG file, I find that the memcpyHtoD events are issued on streamid = 1 while everything else runs on streamid = 15, which probably corresponds to stream4.

Does this mean my program is not using the streams efficiently, because the kernel has to wait for streamid = 1 to transfer the scalar arguments?

Thanks,

Lam

Hi Mat,

Here is the output after I add memtransfersize:

CUDA_PROFILE_LOG_VERSION 2.0

CUDA_DEVICE 0 Tesla K20c

CUDA_CONTEXT 1

TIMESTAMPFACTOR 1387a1b0b9e80747

method,gputime,cputime,occupancy,streamid,memtransfersize
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 26.381 ] streamid=[ 1 ] memtransfersize=[ 88 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 6.278 ] streamid=[ 1 ] memtransfersize=[ 88 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 5.783 ] streamid=[ 1 ] memtransfersize=[ 88 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 5.741 ] streamid=[ 1 ] memtransfersize=[ 112 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 5.775 ] streamid=[ 1 ] memtransfersize=[ 88 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 5.719 ] streamid=[ 1 ] memtransfersize=[ 112 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 6.211 ] streamid=[ 1 ] memtransfersize=[ 88 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 5.771 ] streamid=[ 1 ] memtransfersize=[ 112 ]
method=[ memcpyHtoD ] gputime=[ 0.896 ] cputime=[ 4.478 ] streamid=[ 1 ] memtransfersize=[ 88 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 6.009 ] streamid=[ 1 ] memtransfersize=[ 88 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 22.774 ] streamid=[ 1 ] memtransfersize=[ 88 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 5.983 ] streamid=[ 1 ] memtransfersize=[ 88 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 5.898 ] streamid=[ 1 ] memtransfersize=[ 88 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 5.807 ] streamid=[ 1 ] memtransfersize=[ 88 ]
method=[ memcpyHtoD ] gputime=[ 1.344 ] cputime=[ 5.996 ] streamid=[ 1 ] memtransfersize=[ 88 ]
method=[ memcpyHtoD ] gputime=[ 1.280 ] cputime=[ 5.828 ] streamid=[ 1 ] memtransfersize=[ 88 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 5.710 ] streamid=[ 1 ] memtransfersize=[ 88 ]
method=[ memcpyHtoD ] gputime=[ 1.344 ] cputime=[ 5.872 ] streamid=[ 1 ] memtransfersize=[ 88 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 5.754 ] streamid=[ 1 ] memtransfersize=[ 88 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 5.787 ] streamid=[ 1 ] memtransfersize=[ 112 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 6.001 ] streamid=[ 1 ] memtransfersize=[ 112 ]
method=[ memcpyHtoD ] gputime=[ 1.312 ] cputime=[ 5.835 ] streamid=[ 1 ] memtransfersize=[ 112 ]
method=[ memcpyHtoD ] gputime=[ 0.864 ] cputime=[ 5.042 ] streamid=[ 1 ] memtransfersize=[ 4 ]
method=[ memcpyHtoD ] gputime=[ 0.864 ] cputime=[ 3.959 ] streamid=[ 1 ] memtransfersize=[ 4 ]
method=[ memcpyHtoDasync ] gputime=[ 55.776 ] cputime=[ 6.635 ] streamid=[ 15 ] memtransfersize=[ 320000 ]
method=[ memcpyHtoDasync ] gputime=[ 55.328 ] cputime=[ 3.409 ] streamid=[ 15 ] memtransfersize=[ 320000 ]
method=[ memcpyHtoDasync ] gputime=[ 55.040 ] cputime=[ 2.778 ] streamid=[ 15 ] memtransfersize=[ 320000 ]
method=[ memcpyHtoDasync ] gputime=[ 409.600 ] cputime=[ 3.970 ] streamid=[ 15 ] memtransfersize=[ 158400 ]
method=[ memcpyHtoDasync ] gputime=[ 15.680 ] cputime=[ 2.741 ] streamid=[ 15 ] memtransfersize=[ 79200 ]
method=[ memcpyHtoD ] gputime=[ 1.120 ] cputime=[ 419.674 ] streamid=[ 1 ] memtransfersize=[ 72 ]
method=[ memcpyHtoD ] gputime=[ 0.864 ] cputime=[ 4.495 ] streamid=[ 1 ] memtransfersize=[ 72 ]
method=[ memcpyHtoD ] gputime=[ 0.864 ] cputime=[ 4.232 ] streamid=[ 1 ] memtransfersize=[ 72 ]
method=[ memcpyHtoD ] gputime=[ 0.896 ] cputime=[ 4.390 ] streamid=[ 1 ] memtransfersize=[ 72 ]
method=[ memcpyHtoD ] gputime=[ 0.864 ] cputime=[ 4.143 ] streamid=[ 1 ] memtransfersize=[ 72 ]
method=[ memcpyHtoD ] gputime=[ 1.152 ] cputime=[ 4.446 ] streamid=[ 1 ] memtransfersize=[ 72 ]
method=[ memcpyHtoD ] gputime=[ 0.896 ] cputime=[ 4.318 ] streamid=[ 1 ] memtransfersize=[ 96 ]
method=[ memcpyHtoD ] gputime=[ 0.864 ] cputime=[ 4.234 ] streamid=[ 1 ] memtransfersize=[ 72 ]
method=[ mods_grow_fpepbonds_ ] gputime=[ 29.888 ] cputime=[ 252.816 ] occupancy=[ 0.750 ] streamid=[ 15 ]
method=[ memcpyDtoHasync ] gputime=[ 51.648 ] cputime=[ 360.587 ] streamid=[ 15 ] memtransfersize=[ 320000 ]
method=[ memcpyDtoHasync ] gputime=[ 50.304 ] cputime=[ 285.290 ] streamid=[ 15 ] memtransfersize=[ 320000 ]
method=[ memcpyDtoHasync ] gputime=[ 49.792 ] cputime=[ 274.432 ] streamid=[ 15 ] memtransfersize=[ 320000 ]


If the memcpyHtoD events were kernel arguments, then memtransfersize should have been either 4 bytes for an integer or 8 bytes for a double-precision scalar, but most of them are considerably larger (72 to 112 bytes). I don't know how to interpret this.

Also, when I run different sets of (H2D, kernel, D2H) on different streams, there are always these memcpyHtoD events from streamid = 1 in between the operations of the other streams. Does this mean that using different streams is pointless, because streamid = 1 blocks them?

Do you have any suggestion?

Thanks,

Lam