Coalesced memory access example

Hi Folks,

I’m trying to write a simple example program that demonstrates coalesced access to global memory. Unfortunately, I’m having two problems. First, the code:

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
#include <stdlib.h> // rand()
#include <string.h>
#include <sys/timeb.h>
#include <time.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_functions.h>
#include <device_launch_parameters.h>

__device__ int reverse[32] = { 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0};
__device__ int forward[32] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31};

__global__ void no_coalesce(int * data, int n, int iter)
{
    // assume one block of size 32.
    int idx = threadIdx.x;
    __shared__ int sr[32];
    sr[idx] = reverse[idx];
    __syncthreads();
    for (int i = 0; i < iter; ++i)
        data[sr[idx]] += n;
}

__global__ void coalesce(int * data, int n, int iter)
{
    // assume one block of size 32.
    int idx = threadIdx.x;
    __shared__ int sf[32];
    sf[idx] = forward[idx];
    __syncthreads();
    for (int i = 0; i < iter; ++i)
        data[sf[idx]] += n;
}

int main(int argc, char**argv)
{
    argc--; argv++;
    // First argv is an int, cuda device number.
    int rvdev = cudaSetDevice(atoi(*argv));
    int n = 32;
    int * in = (int*)malloc(n * sizeof(int));
    for (int i = 0; i < n; ++i)
        in[i] = 0;
    if (1)
    {
        struct _timeb  t1;
        struct _timeb  t2;
        struct _timeb  t3;
        struct _timeb  t4;
        printf("Starting GPU test v1 ...\n");
        _ftime(&t1);
        int * din;
        cudaMalloc(&din, n * sizeof(int));
        _ftime(&t2);
        cudaMemcpy(din, in, n * sizeof(int), cudaMemcpyHostToDevice);
        _ftime_s(&t3);
        int kernel_calls = 1;
        int internal_iters = 10000000;
        int block_size = 32;
        int blocks = 1;
        dim3 block(block_size);
        dim3 grid(blocks);
        for (int i = 0; i < kernel_calls; ++i)
        {
            no_coalesce<<<grid, block>>>(din, n, internal_iters);
            cudaThreadSynchronize();
            int rv1 = cudaGetLastError();
            if (rv1)
                printf("last error %d\n", rv1);
        }
        _ftime(&t4);
        printf("N Time t4-t3 %f\n", (double)(t4.time - t3.time + ((double)(t4.millitm - t3.millitm))/1000));
        _ftime_s(&t3);
        for (int i = 0; i < kernel_calls; ++i)
        {
            coalesce<<<grid, block>>>(din, n, internal_iters);
            cudaThreadSynchronize();
            int rv1 = cudaGetLastError();
            if (rv1)
                printf("last error %d\n", rv1);
        }
        _ftime(&t4);
        printf("C Time t4-t3 %f\n", (double)(t4.time - t3.time + ((double)(t4.millitm - t3.millitm))/1000));
        cudaFree(din);
    }
    return 0;
}

This code runs two kernels, one whose global memory accesses should coalesce and one whose accesses should not. In the kernel “coalesce”, each thread accesses a 4-byte quantity, each half-warp stays within a 64-byte segment, and the k-th thread accesses the k-th word of that segment. The kernel “no_coalesce” should not coalesce because it violates the k-th-word rule. (See the NVIDIA CUDA C Programming Guide, section G.3.2.1, Devices of Compute Capability 1.0 and 1.1.)
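
To make the intended access pattern concrete, here is a small host-side sketch (not part of the program above) that prints the byte offset each thread of the first half-warp touches under the two mappings. With the forward mapping, thread k reads the k-th 32-bit word of a 64-byte segment, which is what compute 1.0/1.1 coalescing requires; with the reverse mapping the half-warp still stays inside one 64-byte segment, but the thread-to-word order is reversed, so the rule is violated:

#include <stdio.h>

int main(void)
{
    // Byte offsets touched by threads 0..15 of a half-warp for 4-byte ints.
    for (int k = 0; k < 16; ++k) {
        int fwd = k;          // forward mapping: thread k -> word k
        int rev = 31 - k;     // reverse mapping: thread k -> word 31 - k
        printf("thread %2d: forward offset %3d, reverse offset %3d\n",
               k, fwd * (int)sizeof(int), rev * (int)sizeof(int));
    }
    return 0;
}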

The problem is that I do not see a large runtime difference between the two kernels: ~1.2 seconds for no_coalesce versus ~0.8 seconds for coalesce. I was under the impression that the difference would be much larger. The presentation “Optimizing CUDA” by M. Harris, at http://gpgpu.org/static/sc2007/SC07_CUDA_5_Optimization_Harris.pdf, says “Coalesced vs. Non-coalesced = order of magnitude”, but I am not seeing anywhere near an order of magnitude.
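
As a cross-check on the _ftime numbers, here is a minimal sketch of timing one launch with CUDA events instead (it assumes din, grid, block, n and internal_iters are set up as in the program above). cudaEventElapsedTime reports milliseconds, and cudaEventSynchronize makes sure the asynchronous launch has finished before the elapsed time is read:

// Sketch: timing one kernel launch with CUDA events instead of _ftime.
// Assumes din, grid, block, n and internal_iters are set up as above.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
no_coalesce<<<grid, block>>>(din, n, internal_iters);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);               // wait for the asynchronous launch to finish
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds
printf("no_coalesce kernel time: %f ms\n", ms);
cudaEventDestroy(start);
cudaEventDestroy(stop);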

(I am running the program on a GeForce 9800 GT, a compute capability 1.1 device, selected by the command-line argument. I have two graphics cards, the other being a GeForce GTX 470, but I know I am selecting the 9800 GT in this case.)

If I change the mapping for coalesce so that it violates the k-th-word rule (e.g., __device__ int forward[32] = { 0, 1, 2, 3, 4, 5, 6, 8, 7, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22, 21, 23, 24, 25, 26, 27, 28, 29, 30, 31};), then the runtimes for coalesce and no_coalesce are the same. So I do think this program exercises memory coalescing.

The other problem I am having is with the Compute Visual Profiler. When I run my program in the profiler, I get zero counts for both “gld coalesced” and “gld uncoalesced”. That is odd, because the kernel calls return no error and the kernels do alter the data, so the profiler should show non-zero counter values for coalesced or uncoalesced global accesses.
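
A check along the following lines (a sketch, placed just before the cudaFree(din) call in main) is what confirms for me that the kernels do modify the data:

// Sketch: confirm the kernels modified the data (just before cudaFree(din)).
// With kernel_calls == 1, each element should be 2 * internal_iters * n,
// since both kernels add n to every element internal_iters times.
cudaMemcpy(in, din, n * sizeof(int), cudaMemcpyDeviceToHost);
for (int i = 0; i < n; ++i)
    if (in[i] != 2 * internal_iters * n)
        printf("unexpected value at %d: %d\n", i, in[i]);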

I must be doing something obviously wrong, but I just can’t see what it is. Anyone have any ideas?

Ken D.

Ditto.

I am using cutStartTimer and cutStopTimer to measure the difference, and I also see no large gap between the coalesced and non-coalesced versions.
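
Roughly what I am doing (a sketch from memory, using the cutil.h timer helpers that ship with the SDK; the cudaThreadSynchronize() before stopping the timer matters because the kernel launch is asynchronous):

// Sketch: host-side timing with the SDK's cutil.h timer helpers.
unsigned int timer = 0;
cutCreateTimer(&timer);
cutStartTimer(timer);
coalesce<<<grid, block>>>(din, n, internal_iters);
cudaThreadSynchronize();                  // launch is asynchronous; wait before stopping the timer
cutStopTimer(timer);
printf("coalesce: %f ms\n", cutGetTimerValue(timer));
cutDeleteTimer(timer);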