doubt in coalesced reads

Hi All,
I wrote a simple program to compare the timings of fully coalesced reads and partially coalesced reads, but the results look odd: the partially coalesced reads take less time than the fully coalesced reads.

The program is here http://pastebin.com/aJYEmiQL

Time with full coalesced reads : 0.689680 (ms)
Time with partial coalesced reads : 0.658030 (ms)

Compute capability of the device is 1.3

To my understanding, the fully coalesced reads should generate memory transactions like (1x 64B, 1x 64B) 16 times,
while the partially coalesced reads should generate (1x 128B, 1x 64B, 1x 32B) 16 times.
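
The kernels are essentially just the following (a simplified sketch; the exact code is in the pastebin link above):

__global__ void full_coalesced_read(int *d_array, int size)
{
    // consecutive threads read consecutive 4-byte words, aligned to a 64B segment
    int x = d_array[blockIdx.x*blockDim.x + threadIdx.x];
}

__global__ void partial_coalesced_read(int *d_array, int size)
{
    // same pattern shifted by one element, so each half-warp straddles a segment boundary
    int x = d_array[blockIdx.x*blockDim.x + threadIdx.x + 1];
}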

Can you please tell me where I went wrong?

Regards,
M.Kiran Kumar.

I’d think that size=512 is just too small to give any reliable results. Note that the GPU has a memory bandwidth on the order of 100 GB/s. Try a value that is at least several million, and read the memory multiple times.
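
Something along these lines, for example (just a sketch; the kernel name, the output array d_out and the repeat count are placeholders of mine):

// Sketch: use a few million elements and re-read them many times so the memory
// traffic dwarfs the launch and timing overhead. d_out is written so the
// compiler cannot optimise the loads away.
__global__ void read_many(const int *d_array, int n, int *d_out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = 0;
    for (int r = 0; r < 100; ++r)                          // read the data 100 times
        for (int i = idx; i < n; i += gridDim.x * blockDim.x)
            sum += d_array[i];                             // coalesced per half-warp
    d_out[idx] = sum;
}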

Hi tera,

I don't think that is the problem with the program. When I increased the size to 512 * 8192 * 8, the timings were:

Time with full coalesced reads : 0.308870 (ms)

Time with partial coalesced reads : 0.220430 (ms)

Any other possible explanations…?

You don’t write any output, so the kernel is empty after optimization (the compiler removes the dead loads).

You can use the -ptx option to check the PTX file; it should look like this:

.entry _Z19full_coalesced_readPii (
        .param .u64 __cudaparm__Z19full_coalesced_readPii_d_array,
        .param .s32 __cudaparm__Z19full_coalesced_readPii_size)
{
        .loc    17      7       0
$LDWbegin__Z19full_coalesced_readPii:
        .loc    17      10      0
        exit;
$LDWend__Z19full_coalesced_readPii:
} // _Z19full_coalesced_readPii
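
The easiest fix is to make the loaded value reach memory again, e.g. something like this sketch (d_out is an extra output array you would have to allocate; just one possible way to keep the loads alive):

__global__ void full_coalesced_read(int *d_array, int size, int *d_out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int x = d_array[idx];   // no longer dead code,
    d_out[idx] = x;         // because the result is stored to global memory
}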

Even when I write to memory, the kernel timings are still weird. I changed the code to

__global__ void full_coalesced_read(int *d_array, int size)
{
    int x = d_array[blockIdx.x*blockDim.x + threadIdx.x];
    d_array[blockIdx.x*blockDim.x + threadIdx.x] = x + 1;
}

__global__ void partial_coalesced_read(int *d_array, int size)
{
    int x = d_array[blockIdx.x*blockDim.x + threadIdx.x + 1];
    d_array[blockIdx.x*blockDim.x + threadIdx.x + 1] = x + 1;
}

The full program is here http://pastebin.com/Pc4AB7cF

I also experimented to see whether launching a grid with 65536 blocks is better or launching (65536 / 2) blocks twice. I found that launching (65536 / 2) blocks twice is better. The timings are:

Time with num of blocks = 65536 kernel launch 173.744863 (ms)

Time with num of blocks = 32768 kernel launch 53.440391 (ms)

The program is here http://pastebin.com/jqyHLLW7

But the time taken is high. In the previous program (i.e. the experiments with coalesced reads) I launched a kernel of the same size, but it took much less time. Why this odd behaviour?

Also, how do I compile with the -ptx option and view the PTX file? I use a Makefile to compile.

  1. The grid x-dimension cannot exceed 65535 (on compute capability 1.x), so your launch configuration is invalid.

Try the following code:

int size= 512*65536 ;

....

        float avg_timer=0,avg_timer1=0;

        for(int i=0;i<100;i++ )
        {
                CUT_SAFE_CALL( cutStartTimer( timer));
                full_coalesced_read<<<size/512,512>>>(d_array,size);   // size/512 = 65536 blocks > 65535
                cudaThreadSynchronize();
                cudaError_t status = cudaGetLastError();
                if ( cudaSuccess != status ){
                   fprintf(stderr, "Error(full): i = %d,  %s\n", i, cudaGetErrorString(status)) ;
                   exit(1) ;
                }
                CUT_SAFE_CALL( cutStopTimer( timer));
                avg_timer += cutGetTimerValue( timer);
        }

Then you will get the error message

Error(full): i = 0,  invalid configuration argument

Remember that if your configuration is wrong, the timing is nonsense.

Second, you can use the “-ptx” option to generate PTX code, for example

nvcc -arch=sm_13 -ptx -I[include path] xxx.cu
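
Since you build with a Makefile, a rule along these lines would do it (CUDA_INC is a placeholder for your include path; adjust -arch to your device):

# hypothetical Makefile rule: generate foo.ptx from foo.cu
%.ptx: %.cu
	nvcc -arch=sm_13 -ptx -I$(CUDA_INC) $< -o $@

Then open the generated .ptx file in any text editor.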

Hurray! Now I get the correct results. The program is here http://pastebin.com/pMD8AVMv . I kept size as (65536 - 512*4) * 512 and the timings are:

Time with full coalesced reads : 170.208477 (ms)

Time with partial coalesced reads : 279.066875 (ms)

I also tried whether launching a grid with 63488 blocks is better, or (63488 / 2) blocks twice, or (63488 / 4) blocks four times. The program is here http://pastebin.com/NjiHDiMd

The results are:

Time with num of blocks = 63488 kernel launch 196.754746 (ms)

Time with num of blocks = 31744 kernel launch 51.798213 (ms)

Time with num of blocks = 15872 kernel launch 314.474219 (ms)

I see that launching an empty kernel takes more time than a kernel with reads and writes (the read/write time itself is very small, but the difference is 26 ms).

The kernel launch overhead seems to be quite high. Can you please tell me the optimum number of blocks to launch?
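
By launching half the grid twice I mean covering the same array with two launches, roughly like this (a sketch only; the offset parameter here is just an illustration, the exact code is in the pastebin):

__global__ void full_coalesced_read(int *d_array, int size, int offset)
{
    int idx = offset + blockIdx.x * blockDim.x + threadIdx.x;
    int x = d_array[idx];
    d_array[idx] = x + 1;
}

// one launch of 63488 blocks, or two launches of 31744 blocks each:
// full_coalesced_read<<<31744, 512>>>(d_array, size, 0);
// full_coalesced_read<<<31744, 512>>>(d_array, size, 31744 * 512);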

I doubt your timing results. I modified your code (see attached file) to use cudaEvent for timing, and I also compute the effective bandwidth.

size=(65536-512*4) * 512

run on GTX480, cuda 3.2

thread block = 512, grid.x = 63488, do 100 times

Time with coalesced reads : 1.780661 (ms)

Time with uncoalesced reads : 1.814458 (ms)

bandwidth of coalesced = 1.46E+02 GB/s

However, the kernel launch penalty may be (5 microseconds) x 100 ≈ 0.5 ms.
main.cu (2.64 KB)
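
The attachment is not inlined here, but the timing part is essentially the standard cudaEvent pattern, roughly like this sketch (details may differ from the attached main.cu):

#include <cstdio>

__global__ void full_coalesced_read(int *d_array, int size)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int x = d_array[idx];
    d_array[idx] = x + 1;
}

int main()
{
    const int size = (65536 - 512 * 4) * 512;          // 63488 blocks of 512 threads
    int *d_array;
    cudaMalloc((void**)&d_array, size * sizeof(int));
    cudaMemset(d_array, 0, size * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < 100; ++i)
        full_coalesced_read<<<size / 512, 512>>>(d_array, size);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    ms /= 100.0f;                                      // average time of one kernel

    // each pass reads and writes every int once: 2 * size * 4 bytes
    double gbytes = 2.0 * (double)size * sizeof(int) / 1.0e9;
    printf("time %f ms, bandwidth %.2e GB/s\n", ms, gbytes / (ms * 1.0e-3));

    cudaFree(d_array);
    return 0;
}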

Yes, there was a small mistake in my code: I had kept the start and stop of the timer inside the for loop. The timing results after running your code are:

size = 32505856 integer, 1.3002E+02 MB

thread block = 512, grid.x = 63488, do 100 times

Time with coalesced reads : 3.369986 (ms)

Time with uncoalesced reads : 5.607354 (ms)

bandwidth of coalesced = 7.72E+01 GB/s

The difference between fully coalesced and partially coalesced reads is larger on the C1060 than on the GTX 480, since the data is cached on the GTX 480.
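
As a sanity check on that bandwidth figure: counting both the read and the write, one pass moves 2 x 32505856 x 4 B ≈ 0.260 GB, and 0.260 GB / 3.37 ms ≈ 77 GB/s, which matches the 7.72E+01 GB/s reported above.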
