CUDA clock() issue

The data read into j is never used anywhere, so without the “-G” flag the memory read is just optimized away.
Use it to calculate something you write out at the end of the kernel.

Also, the placement of the clock reads in the compiled code is probably not what you expect (the compiler will happily reorder memory reads around them). Use “cuobjdump -sass test” to look at the compiled code and experiment with barriers, dependencies, and asm volatile(“”); / asm volatile(“” ::: “memory”); statements until you get the code you want to time.
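
For example, a minimal sketch of that kind of fencing around a single timed load (the function and variable names are made up just for illustration; depending on what the SASS shows, you may still need real data dependencies or membars):

__device__ unsigned long long time_one_load(const unsigned int *addr, unsigned int *sink)
{
	unsigned int value;
	unsigned long long start, end;

	asm volatile("" ::: "memory");   // compiler barrier: keep earlier accesses above this point
	start = clock64();
	value = *addr;                   // the single access we want to time
	asm volatile("" ::: "memory");   // compiler barrier: keep the load between the two clock reads
	end = clock64();

	*sink = value;                   // consume the value so the load cannot be optimized away
	return end - start;
}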

Hi tera,

Thanks for the reply, I have tried what you proposed, and the kernel code now looks like this:

__global__ void clock_test(UINT64 *timer, UINT32 *data, UINT32 *data_copy)
{
	register volatile UINT32 i, j = 0;
	register volatile UINT64 start, end;

	
	for (i = 0; i < ELEMENTS; ++i)
	{		
		asm volatile("membar.cta;");
		start = clock64();
			
		j = data[i];
		asm volatile("membar.cta;");		
		end = clock64();		
		
		data_copy[i] = j;
		timer[i] = end - start;
	}
	
}

When I run the code (compiled without -G), I get this:

0 1357
1 539
2 540
3 539
4 540
5 540
6 540
7 539
8 539
9 540
10 540
11 540
12 539
13 540
14 540
15 539
16 540
17 540
18 540
19 540
20 540
21 539
22 540
23 540
24 539
25 540
26 540
27 539
28 540
29 540
30 539

It shouldn’t be like this. I have tried with barriers at different places in the code, but the values stay constant no matter how many barriers I add. I don’t get what is wrong…

These results look sensible to me. What do you think is suspicious?
If you want to discuss this further, please also show the code for how you call the kernel and what it is you are printing out.

Hi tera,

The idea is the following: I want to expose the L1 cache line size. To do that I read values from L1, do something with them and measure the latency. It should be almost constant while reading within a cache line, and then a miss should come; this should repeat every N elements, and that N indicates the cache line size. I need to verify the cache line size for optimizations, and I do not get a pattern or “real latencies”.

The whole code:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

#define STRIDE 1
#define ELEMENTS 512

#define UINT64	unsigned long long
#define UINT32	unsigned int

__global__ void clock_test(UINT64 *timer, UINT32 *data, UINT32 *data_copy)
{
	register volatile UINT32 i, j = 0;
	register volatile UINT64 start, end;

	
	for (i = 0; i < ELEMENTS; ++i)
	{		
		asm volatile("membar.cta;");
		start = clock64();		
		j = data[i];
		asm volatile("membar.cta;");		
		end = clock64();		
		data_copy[i] = j;
		timer[i] = end - start;
	}

	
}

int main(int argc, char **argv)
{
	UINT64 h_duration[ELEMENTS];
	UINT32 h_data[ELEMENTS];
	UINT32 h_data_copy[ELEMENTS];

	printf("%d\n", ELEMENTS);

	UINT64 *d_duration;
	cudaError_t error_id;
	error_id = cudaMalloc(&d_duration,  ELEMENTS * sizeof(UINT64));
	if (error_id != cudaSuccess) {
		printf("Could not allocate d_duration %s\n", cudaGetErrorString(error_id));
	}
	
	UINT32 *d_data;
	error_id = cudaMalloc(&d_data,  ELEMENTS * sizeof(UINT32));
	if (error_id != cudaSuccess) {
		printf("Could not allocate d_data %s\n", cudaGetErrorString(error_id));
	}

	UINT32 *d_data_copy;
	error_id = cudaMalloc(&d_data_copy,  ELEMENTS * sizeof(UINT32));
	if (error_id != cudaSuccess) {
		printf("Could not allocate d_data_copy %s\n", cudaGetErrorString(error_id));
	}

	for (int i = 0; i < ELEMENTS; ++i) {
		h_data[i] = (i + STRIDE) % ELEMENTS;
	}
	
	cudaMemcpy((void*) d_data, (void*) h_data, ELEMENTS * sizeof(UINT32), cudaMemcpyHostToDevice);
	cudaDeviceSynchronize();

	dim3 Db = dim3(1);
	dim3 Dg = dim3(1,1,1);

	clock_test <<<Dg, Db>>> (d_duration, d_data, d_data_copy);
	cudaDeviceSynchronize();

	cudaMemcpy((void *)h_duration, (void *)d_duration, ELEMENTS * sizeof(UINT64) , cudaMemcpyDeviceToHost);
	cudaMemcpy((void *)h_data_copy, (void *)d_data_copy, ELEMENTS * sizeof(UINT32) , cudaMemcpyDeviceToHost);
	cudaDeviceSynchronize();	

	for (int i = 0; i < ELEMENTS; ++i) {
		printf("%d %llu\n", i, h_duration[i]);
		
		if (h_data[i] != h_data_copy[i]) {
			printf("TEST FAILED !");
			exit(-1);
		}
	}
}

Output:

0 1338
1 533
2 534
3 534
4 532
5 532
6 532
7 531
8 532
9 532
10 532
11 532
12 534
13 532
14 531
15 531
16 531
17 531
18 534
19 534
20 532
21 532
22 532
23 532
24 532
25 534
26 534
27 534
28 532
29 532
30 532
31 534
32 534
33 532
34 531
35 534
36 534
37 532
38 531
39 531
40 531
41 532
42 532
43 532
44 531
45 532
46 534

How the output should be theoretically:

0 670
1 115
2 90
3 91
4 90
5 91
6 92
7 90
8 170
9 90
10 90
11 92
12 91
13 90
14 92
15 90
16 90
17 170
18 90

But it does not happen: there are too many clock cycles in the output, and they are constant… they shouldn’t be, since I am iterating over data that should already be in L1. The output is the clock difference before and after

j = data[i];

, and it shouldn’t be that big. Am I missing something?

What you are seeing is global memory latency on the first access, then L2 latency for the following ones.
You need to explicitly opt in to using the L1 cache by adding “-Xptxas -dlcm=ca” to the nvcc command line.
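
For example, a build line along these lines (the source and output names are just placeholders):

nvcc -O3 -Xptxas -dlcm=ca -o clock_test clock_test.cu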

Are you sure that you really need to measure that yourself instead of reading about it? GPUs are different from CPUs, so if you just use your CPU experience to explore the GPU, this may fail.

my own list of low-level benchmarks:
http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
Demystifying GPU Microarchitecture through Microbenchmarking | stuffedcow
http://asg.ict.ac.cn/dgemm/microbenchs.tar.gz
Understanding the ISA impact on GPU Architecture.
https://hal.inria.fr/file/index/docid/789958/filename/112_Lai.pdf
Dissecting GPU Memory Hierarchy through Microbenchmarking | hgpu.org
Understanding Latency Hiding on GPUs by Vasily Volkov | hgpu.org

full list here:

Hi guys,

So I have tried with “-Xptxas -dlcm=ca”, and indeed with this flag I do get the pattern I was expecting; it seems that by default nvcc is using “-Xptxas -dlcm=cg”. But the latency is still high… I don’t get it… it’s 1 block, 1 thread, precisely so that I can expose the latency, not hide it.

Thx for all the links BulatZiganshin. I think that although GPUs are different, the cache memory should be similar in structure to a CPU’s; access time and speed may differ. If I am wrong please correct me, but with “-Xptxas -dlcm=ca” the pattern confirms the initial theory.

This confirms that the L1 cache line is 32 B, which is exactly 8 vector elements (8 * sizeof(UINT32) = 8 * 4 B = 32 B). The only issue that remains is the latency: should it be that high?

The output with “-Xptxas -dlcm=ca”:

0 1171
1 369
2 370
3 369
4 369
5 370
6 369
7 369
8 515 ---> miss
9 369
10 369
11 369
12 369
13 369
14 369
15 369
16 515 ---> miss
17 369
18 370
19 369
20 370
21 369
22 369
23 369
24 515 ---> miss
25 370
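
As a side note, here is a small host-side sketch of how the line size could be read off such a trace automatically (it reuses the UINT32/UINT64 defines from the code above; the function name and the 4/3 “miss” threshold are arbitrary choices of mine, not anything measured):

/* Treat any access noticeably slower than the fastest hit as a miss; the
   distance between two consecutive misses, times the element size, is the
   inferred line size. Returns -1 if fewer than two misses are found. */
static int infer_line_size(const UINT64 *timer, int n)
{
	UINT64 hit = timer[1];                  /* skip the cold first access */
	for (int i = 2; i < n; ++i)
		if (timer[i] < hit)
			hit = timer[i];         /* fastest observed access = a hit */

	int last_miss = -1;
	for (int i = 1; i < n; ++i) {
		if (timer[i] > hit + hit / 3) { /* > ~1.33x a hit => call it a miss */
			if (last_miss >= 0)
				return (i - last_miss) * (int)sizeof(UINT32);
			last_miss = i;
		}
	}
	return -1;
}

With the trace above this would report (16 - 8) * sizeof(UINT32) = 32 bytes.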

I will go over the documentation and see if I can understand this better.

Thx

Yes, it can be high, since GPUs prioritize throughput over latency and can run 16 threads per core. That said, Maxwell’s L1 cache latency AFAIR is 30-50 cycles, as measured in one of these papers.

Hello,

I’ve been doing some modifications to the kernel code and observed the following behavior. When I declare the “j” variable as a simple register (removing volatile), the timing changes significantly: the latency drops from ~240 to 75 (and stays constant). From what I have read in other topics, if volatile is not used, the compiler is free to optimize reads/writes of the variable by caching the data in a local register. I’m still confused about whether to use volatile or not.

register volatile UINT32 i;
register volatile UINT32 j = 0;
clock_t start, end;

for (int k = 0; k < ITERATIONS; k++)
{
	for (i = 0; i < ELEMENTS; ++i)
	{
		start = clock64();
		asm volatile("ld.cg.u32 %0, [%1];" : "=r"(j) : "l"(&data[j]));
		end = clock64();

		data_copy[i] = j;
		timer[i] = end - start;
	}
}

As the “register” keyword already indicates, you want j to be in a register rather than in memory, so you don’t produce additional memory accesses beyond the one you want to measure. So the “volatile” modifier on it should go.

On a side note, the “register” keyword is just ignored by the compiler. It will place all variables in registers anyway (unless it’s not able to, in which case even the register keyword will not change that).
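
So, as a sketch, the declarations in your snippet could simply become something like this (same names as above; I have also widened start/end, since clock64() returns a 64-bit value and clock_t may be narrower on the device):

	UINT32 i;                        // loop counter
	UINT32 j = 0;                    // loaded value; plain variable, no volatile, no register
	unsigned long long start, end;   // clock64() returns long long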

You should really look at the generated code as displayed by “cuobjdump -sass”. Without checking what you are actually timing, microbenchmarking really is meaningless.
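
For example (the executable name is assumed), dump the SASS to a file and check what actually ends up between the two clock reads:

cuobjdump -sass clock_test > clock_test.sass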

Hi tera,

I see, thx a lot for the info, will do that!