A couple of questions

  1. How do you measure the memory bandwidth of an application? Is there any way to determine this without having to sit and calculate it by hand from the profiler counters?

  2. The branch counter from the profiler shows a non-zero value even though I don't have any branch statements in my kernel. What is the possible reason for this?

No.

But you can automate the profiler counter calculations. It'd be nice if someone wrote a script, but at least you can build a formula in Excel and paste in the counter values. The formula is something like 64 * (CoalescedReads + CoalescedWrites) + 4 * (UncoalescedReads + UncoalescedWrites). A formula for "wasted" bandwidth is 60 * (UncoalescedReads + UncoalescedWrites).
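As a stopgap, here is a minimal C sketch of that calculation (the counter names are placeholders for whatever your profiler output calls them; plug in your own values):

#include <stdio.h>

int main(void){
	/* Placeholder counter values: copy these in from a profiler run. */
	long coalescedReads = 0, coalescedWrites = 0;
	long uncoalescedReads = 0, uncoalescedWrites = 0;

	/* 64 bytes per coalesced transaction, 4 useful bytes per uncoalesced one. */
	long usefulBytes = 64 * (coalescedReads + coalescedWrites)
	                 +  4 * (uncoalescedReads + uncoalescedWrites);
	long wastedBytes = 60 * (uncoalescedReads + uncoalescedWrites);

	printf("useful: %ld bytes, wasted: %ld bytes\n", usefulBytes, wastedBytes);
	return 0;
}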

(Of course, keep in mind that on G200, uncoalesced reads/writes are not reported. So this formula doesn't work there, and you can't calculate bandwidth this way.)

You can calculate bandwidth by doing your own profiling: count up by hand the number of bytes that your kernel reads and writes in each thread, and time the kernel launches.
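For example, here is a minimal self-contained sketch of that approach using CUDA events (the copy kernel and sizes are stand-ins; substitute your own kernel and your hand-counted byte total):

#include <stdio.h>

#include "cuda.h"

/* Stand-in kernel just to have something to time; swap in your own. */
__global__ void copyKernel(float* dst, const float* src){
	int i = blockDim.x * blockIdx.x + threadIdx.x;
	dst[i] = src[i];
}

int main(void){
	const int n = 1 << 20;	/* example size: 1M floats */
	float *src, *dst;
	cudaMalloc((void**)&src, n * sizeof(float));
	cudaMalloc((void**)&dst, n * sizeof(float));

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	cudaEventRecord(start, 0);
	copyKernel<<<n / 256, 256>>>(dst, src);
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);	/* elapsed time in milliseconds */

	/* Hand count: the kernel reads n floats and writes n floats. */
	double bytes = 2.0 * n * sizeof(float);
	printf("Effective bandwidth: %.2f GB/s\n", bytes / (ms * 1.0e6));

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	cudaFree(src);
	cudaFree(dst);
	return 0;
}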

Thanks guys…

Any idea about the second question? Also, there is another one -

  3. When we make a coalesced access, what is the burst size of the data that can be sent? For example, in normal processors, when there is a cache miss, an entire cache line is brought into the cache. So in each coalesced access, how much data comes in? Is it the data required for a half warp, a complete warp, or a fixed size?

We can't exactly help you without seeing your kernel code. A call to a math function or something like that could have branches in it.

Coalesced reads come in several different transaction sizes. On compute 1.0/1.1 hardware, these are 32, 64, and 128 bits in each thread. On compute 1.2 and newer hardware, it is a little different (see the programming guide). And there is no cache. A single memory transaction feeds a half warp (though it is easier to think in terms of warps instead of half warps).
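As a toy illustration (my own example, not from the programming guide): on compute 1.0/1.1 hardware, a half warp coalesces when thread k accesses the k-th word of an aligned segment, and strided access breaks that:

/* Thread k reads word k of an aligned segment: one transaction per half warp. */
__global__ void coalescedCopy(float* out, const float* in){
	int i = blockDim.x * blockIdx.x + threadIdx.x;
	out[i] = in[i];
}

/* With stride > 1 the half warp straddles segments, so on compute 1.0/1.1
   the accesses are serialized into separate transactions. */
__global__ void stridedCopy(float* out, const float* in, int stride){
	int i = (blockDim.x * blockIdx.x + threadIdx.x) * stride;
	out[i] = in[i];
}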

OK. So if I have to bring in, say, two data elements for each thread, is it better to bring them in together as a float2 rather than as two separate float loads?

And regarding the other question about branch counters: this is a test program that does the element-wise product of two vectors, i.e. C[i] = A[i] * B[i], where A, B and C are three vectors.

#include <stdio.h>

#include <stdlib.h>

#include <math.h>

#include "cuda.h"

#define THREADS 256

__global__ void dot(float*, float*, float*, int);

void cpuDot(float*, float*, float*, int);

int main(int argc, char** argv){

	int i, vec_size;

	vec_size = 1024;

	float *A = (float *) malloc (vec_size * sizeof(float));
	float *B = (float *) malloc (vec_size * sizeof(float));
	float *Result = (float *) malloc (vec_size * sizeof(float));
	float *cpuResult = (float *) malloc (vec_size * sizeof(float));

	printf("\n Initializing arrays\n");

	for(i=0;i<vec_size;i++){
		B[i] = i%467;
		A[i] = i%3218;
	}

	printf("Declaring arrays on GPU\n");

	float *Ad, *Bd, *Resultd;

	cudaMalloc((void**)&Ad, vec_size*sizeof(float));
	cudaMalloc((void**)&Bd, vec_size*sizeof(float));
	cudaMalloc((void**)&Resultd, vec_size*sizeof(float));

	cudaMemcpy(Ad, A, vec_size*sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(Bd, B, vec_size*sizeof(float), cudaMemcpyHostToDevice);

	dim3 block(THREADS, 1);
	dim3 grid(vec_size/THREADS, 1);	/* vec_size is an exact multiple of THREADS */

	printf("Starting multiplication on GPU\n");

	dot<<<grid, block>>>(Ad, Bd, Resultd, vec_size);
	cudaThreadSynchronize();

	cudaMemcpy(Result, Resultd, vec_size*sizeof(float), cudaMemcpyDeviceToHost);

	cpuDot(A, B, cpuResult, vec_size);

	float error = 0.0f;

	/* Absolute differences, so positive and negative errors can't cancel. */
	for(i=0;i<vec_size;i++) error += fabsf(Result[i] - cpuResult[i]);

	for(i=0;i<vec_size;i++){
		printf("%d \t %f \t %f \n", i, Result[i], cpuResult[i]);
	}

	printf("\n Avg Error : \t %f \n", error/vec_size);

	cudaFree(Ad);
	cudaFree(Bd);
	cudaFree(Resultd);

	free(A);
	free(B);
	free(Result);
	free(cpuResult);

	return 0;

}

__global__ void dot(float* r1, float* r2, float* result, int vec_size){

	int tid = blockDim.x * blockIdx.x + threadIdx.x;

	__shared__ float Adata[THREADS];
	__shared__ float Bdata[THREADS];
	__shared__ float product[THREADS];

	Adata[threadIdx.x] = r1[tid];
	Bdata[threadIdx.x] = r2[tid];

	__syncthreads();

	product[threadIdx.x] = Adata[threadIdx.x] * Bdata[threadIdx.x];

	__syncthreads();

	result[tid] = product[threadIdx.x];

}

void cpuDot(float* r1, float* r2, float* cpuResult, int vec_size){

	for(int i = 0; i < vec_size; ++i){
		cpuResult[i] = r1[i] * r2[i];	/* '=' rather than '+=': cpuResult is not zero-initialized */
	}

}

The way I see it, there are no conflicts or branch statements, but the profiler shows its branch counter = 16. How is this possible?