Modifying simulator to work with CUDA?

Hi, I’m wondering if the following concept would be workable on a GPU with CUDA.

As of now, I have a simulator (in C) that does the following. I basically have a number of objects that constantly move around at random. At each cycle, the simulator calculates the neighboring objects (within a certain distance) to any given object and stores those neighboring objects in an array specific to that object.

If that doesn’t make sense, here’s the code:

void queryAllMOs(MobileObject MOs[]) {
	int i, j;
	double distance, dx, dy;

	for (i = 0; i < numMobileObjects; ++i) {
		for (j = 0; j < numMobileObjects; ++j) {
			if (i != j) {
				// So here we are calculating the distance between object "i" and ALL other "j" objects,
				// by computing the distance formula based on their x and y coordinates.
				// (Note the explicit multiplications: ^ in C is bitwise XOR, not exponentiation.)
				dx = MOs[j].x - MOs[i].x;
				dy = MOs[j].y - MOs[i].y;
				distance = sqrt(dx*dx + dy*dy);

				if (distance < queryRadius) {
					// Here is where I now add this j object to the neighbor array of object i.
					AddNeighbor(MOs, i, j);
				}
			}
		}
	}
}

Is this something I can do in CUDA? All of these distance calculations can happen in parallel, so that part makes sense.

But for me, I'm just flat out stumped on how to make it happen. I've been trying to look at examples of matrix code, as this should ideally (?) shed some light on how to work with two dimensions (two coordinates).

And also, all the basic examples I’ve seen on Dr. Dobbs and other tutorial websites always use a basic array on the host, whether it be a 1d, 2d, or m-d array.

In this example, on the host, I have an array of structs. Each struct has the x and y coordinates as well as other information for each object.

Is this workable?

Any ideas here? I’m really at a loss.

Is there any sample project that I can somewhat map to this problem?

It may be a bit overkill for you, but I’ve worked extensively with lists of neighbors on the GPU. All the details are in the paper here
http://www.ameslab.gov/hoomd/about.html#paper - click download preprint if your library doesn’t subscribe to the journal. I can of course answer any specific questions you might have.

If all you really need is an example code of how to efficiently calculate a neighbor list in O(N^2) time on a GPU, HOOMD’s code has one of these too:
http://trac2.assembla.com/hoomd/browser/tr…orListNsqGPU.cu

As for decoupling your data arrays from your structures, well, that is just something you have to do with CUDA. On GPUs, structures of arrays are much more efficient than arrays of structures because of the memory access patterns.
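To make that concrete, here is a small sketch (the struct layout and names are made up for illustration, not taken from the original simulator) of unpacking an array of structs into the separate flat arrays you would actually copy to the device:

```c
#include <stdlib.h>

/* Hypothetical mobile-object struct, mirroring the kind of layout the
 * simulator uses: coordinates plus other per-object data. */
typedef struct {
    int id;
    float x, y;
} MobileObject;

/* Unpack the array of structs into separate x and y arrays (the
 * "structure of arrays" layout). These flat arrays are what you would
 * hand to cudaMemcpy; the rest of the struct can stay host-side. */
void unpackCoords(const MobileObject *objs, int n, float *xs, float *ys) {
    for (int i = 0; i < n; ++i) {
        xs[i] = objs[i].x;
        ys[i] = objs[i].y;
    }
}
```

The point is that neighboring threads then read neighboring memory locations, which is what makes the GPU's memory accesses coalesce.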

As of now, on my simulator, I have an array of structures. How do you suggest I work this so it is an easy transition to the GPU?

Is it perhaps easiest to just take out the information I need from the structures (the coordinates) and make a separate coordinate array? That way I can keep my array of structures, and I can also have my 2d array of coordinates. Thus, I can pass my array of coordinates to the kernel and process it that way.

Is that what you were suggesting?

I didn’t understand the “structures of arrays” part.

Thanks.

It is standard language, but can be confusing when you first see it.

Array of structures:

struct my_object
   {
   float mass;
   float x,y,z;   // position
   float vx,vy,vz;  // velocity
   float ax,ay,az;  // acceleration
   };

my_object data_array[100];

Structure of arrays:

struct my_object
   {
   float *mass;
   float4 *pos;   // position
   float4 *vel;   // velocity
   float4 *accel; // acceleration
   };

// each pointer is allocated to be of length N
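Spelling out that last comment (a plain-C sketch, not code from HOOMD; the position is split into separate float components here since plain C has no float4):

```c
#include <stdlib.h>

struct particle_arrays {
    float *mass;
    float *x, *y, *z;  /* position, one array per component */
};

/* Allocate every per-particle array to length n up front.
 * Returns 0 on success, -1 if any allocation failed. */
int alloc_particle_arrays(struct particle_arrays *p, int n) {
    p->mass = malloc(n * sizeof(float));
    p->x    = malloc(n * sizeof(float));
    p->y    = malloc(n * sizeof(float));
    p->z    = malloc(n * sizeof(float));
    return (p->mass && p->x && p->y && p->z) ? 0 : -1;
}
```

So instead of one array of 100 structs, you get a handful of arrays, each of length N, and N plays the same role as the 100 in the array-of-structures example.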

Let me start by expressing my thanks for spending the time to help. It is greatly appreciated.

To be honest, even after you posted this, I was confused and began just googling “structure of arrays”. But then it hit me (I think). Instead of an array of structures, you are just making separate arrays for EACH “thing” (a separate array for mass, a separate array for pos, for vel, and a separate array for accel).

Then, just to package it all together, you are putting all these arrays in a struct. This last part really isn’t necessary, right? It’s just suggested to keep the feel of everything being “together”?

And to confirm my humble understanding of this, the entire purpose of writing it as a structure of arrays is because we need to be able to pass the entire array to the GPU and have the GPU work on the entire array…and this is because the GPU can’t decouple the information we need from an array of structs…or if it can, it can’t do it nearly as quickly.

Do I sort of have this?

And lastly, what did you mean, “each pointer is allocated to be of length N”? Was N referring to the 100 in the example on array of structures?

Yes, I believe that’s what he meant.

Okay, so now let’s see if I can make sense of my original question…transforming those for loops, which were used to calculate distances between objects, into CUDA code to be used on the GPU. Hopefully you all can give me some pointers here…

For the sake of easy “coding” on this thread, let’s just say I don’t use a struct, and that I just have separate arrays for each of my important things.

So, previously, I had an array of structs, where each struct had an ID, an x coordinate, a y coordinate, and other things.

Now, I will have several arrays: an array of IDs, a 2d array of x and y coordinates, and other arrays for other data. Or I could even have a separate x-coordinate array and a separate y-coordinate array.

So now, in the main program, I will send pointers to the x and y coordinate arrays (or the one 2d array) to the kernel. So if there are 1000 objects, there will be an array of length 1000 for both the x and y coordinates. I will then need to use these values to calculate the distances between each and every object (saved as an int for space purposes).

A very humble, and perhaps naive, way would be to create a new “distance” array on both the host and device.

This distance array would be a 2d array of size N*N, where N is the # of objects. The indices of the array would simply refer to a specific object. So for example, the value stored at distance(212,487) would refer to the distance between object #212 and object #487.

This distance array is what I would like the GPU to calculate for me. On the host, this would boil down to two for loops, with a running time of O(n^2). Considering that I keep reading that the GPU is best used for huge for loops that can be parallelized, I’m guessing this would be considerably sped up on the GPU?

So, now the question is, how do I make this happen.

Here’s a bit of code that I have NOT actually done anything with. I literally just typed this up in notepad for the purpose of pasting here to see if i even remotely get this.

My confusion is with the execution configuration and the kernel.

int numObjects = 100000;

int main() {
	// calculate "size" to be used for allocating memory of x & y coord. arrays
	int size = numObjects*sizeof(int);

	// Also calculate "size" of the distance array
	// (size_t, since numObjects*numObjects overflows a 32-bit int)
	size_t size_d = (size_t)numObjects*numObjects*sizeof(int);

	// make arrays for x and y coordinates, as well as the "to be computed"
	// distance array, and allocate necessary space on host
	int *xArray, *yArray, *distanceArray;
	xArray = malloc(size);
	yArray = malloc(size);
	distanceArray = malloc(size_d);
	// distance array is a 2d array as described above

	// now assume those x & y arrays have values in them
	// and we want to calculate the "distance" array mentioned previously

	// So we need to allocate space on the Device for the x & y arrays
	// as well as the distance array we will calculate.
	// So I start by making pointers to these arrays on the Device:
	int *x_onDevice, *y_onDevice, *distanceArray_onDevice;

	// the following just stores a result if I want to check what happened
	cudaError_t result;

	// And now we allocate memory on the Device for each of these arrays:
	result = cudaMalloc((void**)&x_onDevice, size);
	result = cudaMalloc((void**)&y_onDevice, size);
	result = cudaMalloc((void**)&distanceArray_onDevice, size_d);

	// Now I have to copy the x & y arrays to the device:
	result = cudaMemcpy(x_onDevice, xArray, size, cudaMemcpyHostToDevice);
	result = cudaMemcpy(y_onDevice, yArray, size, cudaMemcpyHostToDevice);

	//*********************************************************
	// Now I need to make the execution configuration which I am confused with
	// I'm not sure as to how I choose the values for dimBlock and dimGrid
	//*********************************************************
	dim3 dimBlock( blocksize, blocksize );
	dim3 dimGrid( numObjects/dimBlock.x, numObjects/dimBlock.y );

	computeDistanceArray<<<dimGrid, dimBlock>>>(x_onDevice, y_onDevice, distanceArray_onDevice, numObjects);

	// Now I simply bring the result back to the Host
	result = cudaMemcpy(distanceArray, distanceArray_onDevice, size_d, cudaMemcpyDeviceToHost);

	// Do other computation here as needed for the program

	// Now I can free all allocated space on Host and on Device:
	free(xArray); free(yArray); free(distanceArray);
	cudaFree(x_onDevice); cudaFree(y_onDevice); cudaFree(distanceArray_onDevice);
}

And here is the actual kernel:

__global__ void computeDistanceArray (int *x, int *y, int *d, int numObjects) {
	// I'm stuck here
	// I need to somehow use the x and y coordinates, pair them up, and then
	// calculate the distances between each and every object.
}

So the two main things, and yes they are big things, are:

(1) my confusion on what to specify for the execution configuration

(2) how I calculate the distance array on the GPU

For (1), I’m guessing I should make my blocksize equal to 16. Thus, this would result in a dimBlock of 256 threads…which seems to be a popular number people reference on these forums.

But for dimGrid, I really don’t get what I’m trying to accomplish there.
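(For reference, the usual recipe for a 1D configuration is a ceiling division: launch enough blocks that blocks times threads covers every object, and mask off the leftover threads with a bounds check inside the kernel. A sketch of just the arithmetic, with a made-up function name:)

```c
/* Ceiling division: the smallest number of blocks of `blockSize`
 * threads that covers n items. The last block may have idle threads,
 * which the kernel must skip with an `if (i < n)` guard. */
int gridSizeFor(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}
```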

And then comes (2), which is the biggie. How to make this happen in the first place.

Whew.

Whoever, can help, thank you in advance.

And just to help with the translation from C to CUDA, this is how I would calculate the distance array on the host (with explicit multiplications for the squares, since ^ in C is XOR):

int i, j, index;

for (i = 0; i < numObjects; ++i) {
	for (j = 0; j < numObjects; ++j) {
		// Just calculates the distance between objects "i" and "j"
		// by referencing the indices of the xArray and yArray and using the distance formula
		distance[i][j] = (int) sqrt((double)((xArray[i] - xArray[j])*(xArray[i] - xArray[j]) + (yArray[i] - yArray[j])*(yArray[i] - yArray[j])));

		// Now, couldn't I calculate the index of the storage mapping function directly:
		// index = i + j*numObjects;
		// distance[index] = (int) sqrt((double)((xArray[i] - xArray[j])*(xArray[i] - xArray[j]) + (yArray[i] - yArray[j])*(yArray[i] - yArray[j])));
		// ??? But does this help me somehow in the translation
	}
}

On many of the 2 for-loop examples I’ve seen, they compute the storage mapping function and use that to reference the index of the arrays. I’m guessing this is to somewhat clarify the translation from C to CUDA?

I’m referring to the example shown on page 7 here:

http://heim.ifi.uio.no/~knutm/geilo2008/seland.pdf

My confusion is how to make this happen with my specific array to be calculated (distance), and then what to put in the execution configuration and how to use those values in the kernel.

Thanks.

Okay, I’m trying to make sense of things as I wait for your response…so this may be waaaaay off. How does this look:

Here is the Execution Configuration:

dim3 dimBlock(blockSize);
dim3 dimGrid(numObjects/blockSize + (numObjects%blockSize == 0 ? 0:1);
computeDistanceArray<<<dimGrid, dimBlock>>>(x_onDevice, y_onDevice, distanceArray_onDevice, numObjects);

Considering that my xArray and yArray are both 1d arrays, I really don’t need to specify more than one dimension for the dimBlock or dimGrid, right? Is there any advantage of doing so?

So here, blockSize is 256, and then based on the example above with numObjects = 100,000, dimGrid would get the value of 391. So there would be 391 blocks, each with 256 threads…so a few more threads than needed, and we take care of this in the Kernel…

Here is the Kernel:

__global__ void computeDistanceArray (int *x, int *y, int *d, int numObjects) {
	int i = blockIdx.x*blockDim.x + threadIdx.x;
	int j = blockIdx.x*blockDim.x + threadIdx.x;
	int idx = i + j*numObjects;
	if (i < numObjects && j < numObjects)
		d[idx] = (int) sqrt(((x[i] - x[j])^2) + ((y[i] - y[j])^2));
}

To be honest, I’m doubting myself on this one, as it just seems funky. I tried to write what made sense.

Of course, I need to get the id of the threads, and that is done with the int i variable. However, as you can see, I simply repeated this again with the int j variable. Why? My logic was that when I compute the distance array, since the x and y coordinates pair up (meaning that xArray[412] must be paired with yArray[412]), I needed to somehow show this in the equation.

But does this work? I’m not too confident.

And I’m not sure about the distance array. Can I use this instead:

distance[i][j] = (int) sqrt(((xArray[i] - xArray[j])^2) + ((yArray[i] - yArray[j])^2));

Thanks for the help!

Okay, I’ve now tried to implement this, and I got all excited thinking it was working. But something is definitely wrong. Here’s the gist of my code:

// ******* KERNEL *******//

__global__ void computeDistanceArray (int *x, int *y, int *d, int numObjects) {
	int i = blockIdx.x*blockDim.x + threadIdx.x;
	int j = blockIdx.x*blockDim.x + threadIdx.x;
	int idx = i + j*numObjects;
	if (i < numObjects && j < numObjects)
		d[idx] = (int) sqrt(((x[i] - x[j])^2) + ((y[i] - y[j])^2));
}

int blockSize = 256;
int numObjects = 100000;

int main() {
	// calculate "size" to be used for allocating memory of x & y coord. arrays
	int size = numObjects*sizeof(int);

	// Also calculate "size" of the distance array
	// (size_t, since numObjects*numObjects overflows a 32-bit int)
	size_t size_d = (size_t)numObjects*numObjects*sizeof(int);

	// make arrays for x and y coordinates, as well as the "to be computed"
	// distance array, and allocate necessary space on host
	int *xArray, *yArray, *distanceArray;
	xArray = malloc(size);
	yArray = malloc(size);
	distanceArray = malloc(size_d);
	// distance array is a 2d array as described above

	// now assume those x & y arrays have values in them
	// and we want to calculate the "distance" array mentioned previously

	// So we need to allocate space on the Device for the x & y arrays
	// as well as the distance array we will calculate.
	// So I start by making pointers to these arrays on the Device:
	int *x_onDevice, *y_onDevice, *distanceArray_onDevice;

	// the following just stores a result if I want to check what happened
	cudaError_t result;

	// And now we allocate memory on the Device for each of these arrays:
	result = cudaMalloc((void**)&x_onDevice, size);
	result = cudaMalloc((void**)&y_onDevice, size);
	result = cudaMalloc((void**)&distanceArray_onDevice, size_d);

	// Now I have to copy the x & y arrays to the device:
	result = cudaMemcpy(x_onDevice, xArray, size, cudaMemcpyHostToDevice);
	result = cudaMemcpy(y_onDevice, yArray, size, cudaMemcpyHostToDevice);

	//*********************************************************
	// Now I need to make the execution configuration which I am confused with
	// I'm not sure as to how I choose the values for dimBlock and dimGrid
	//*********************************************************
	dim3 dimBlock(blockSize);
	dim3 dimGrid(numObjects/blockSize + (numObjects%blockSize == 0 ? 0:1);

	computeDistanceArray<<<dimGrid, dimBlock>>>(x_onDevice, y_onDevice, distanceArray_onDevice, numObjects);

	// Now I simply bring the result back to the Host
	result = cudaMemcpy(distanceArray, distanceArray_onDevice, size_d, cudaMemcpyDeviceToHost);

	// Do other computation here as needed for the program

	// Now I can free all allocated space on Host and on Device:
	free(xArray); free(yArray); free(distanceArray);
	cudaFree(x_onDevice); cudaFree(y_onDevice); cudaFree(distanceArray_onDevice);
}

That is NOT exactly copied from my program, so expect there are perhaps some misspellings, etc.

But it conveys what I’m trying to do. Something I am doing, probably with the execution configuration or with the Kernel itself (or both), is way wrong.

Any help would be appreciated.

Thanks.

Ok, well, it is not all bad. I added many switch statements to the code, allowing me to check the value of “result” after each CUDA call.

Everything returned cudaSuccess EXCEPT when I tried to cudaMalloc the space for the distanceArray_onDevice. It’s the third one shown below.

result = cudaMalloc((void**)&x_onDevice, size);
result = cudaMalloc((void**)&y_onDevice, size);
result = cudaMalloc((void**)&distanceArray_onDevice, size_d);

I don’t even know what error that returned because it wasn’t one of the two errors provided in the reference manual. So my switch statement just printed “other”.

I’m guessing this was just because it was a HUUUUUGE amount of space? I had the program set to 10,000 objects, so size_d was equal to (10,000)*(10,000)*sizeof(int), i.e. roughly 400 MB with 4-byte ints. Bottom line, a HUGE number. I then reduced the # of objects to 1,000 and it did give me a cudaSuccess on that particular malloc.

It’s still NOT WORKING…so I’m still back to looking at the execution configuration and the kernel. However, that initial problem seems to have been that I was asking to allocate too much space? Does that sound right?
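(The size arithmetic itself is worth double-checking in 64-bit: with plain int math, 100,000 * 100,000 overflows a 32-bit int long before sizeof is even applied. A sketch, assuming 4-byte int:)

```c
#include <stdint.h>

/* Bytes needed for a flat n*n array of int, computed in 64-bit so the
 * multiplication cannot overflow a 32-bit int along the way. */
uint64_t distanceBytes(uint64_t n) {
    return n * n * sizeof(int);
}
```

At 10,000 objects that is 400,000,000 bytes, about 400 MB in a single allocation; at 100,000 objects it is around 40 GB, which no card of this era can hold, so the N*N buffer has to go regardless of the kernel fix.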

Well, just before I head off to sleep, let me say what is happening in case anyone reads this.

All of the cuda calls result in “cudaSuccess”. So those are working.

However, the distance_d array that is computed on the device is resulting in an array of all ZEROES.

Once the resulting array is copied back, which does copy successfully, for testing purposes I printed out several array indices, and all of the distance values are 0. So something is wrong with my coding of the kernel.

Ideas?

I would help more, but I’m at a conference all week (and had to prepare for the last couple days). Maybe someone else can step in and help…?

You are on the right track, though.

To avoid so much memory usage, you should consider not storing all the distance values. Instead, just run one thread per object in your simulator and have it build the neighbor list for that object. Did you look at the information I linked you to before? It documents a sliding window technique to maximize use of shared memory for these types of problems. The Nbody sample in the SDK uses the same technique.

Thanks. Hopefully someone else can help out a bit.

For now, I’m not worried about the most effective or efficient solution. I knew that storing all the distance values would be a problem and a bottleneck, and I will work on that later.

But for now, I just want to first get this trivial method working…heck, even if it is slower! I just want it to calculate the distances and know that the kernel is doing it properly. As mentioned, it is currently returning all zeroes for the distance values. So I am stumped here. But yes, once I get that fixed, I can work on more effective techniques.

So can anyone take a look and explain what I’m doing wrong with my kernel or perhaps the execution configuration (or both)? Why is it returning all zeroes?

Thanks.

Can anyone take a look at my Kernel and shed some light as to what is wrong? Once I get the distance array back on the host side, I print out the values of a few random indices, and they are all ZERO. Each and every index has the value 0 stored in it. Why? What am I doing wrong with the Kernel?

Regarding the coding of the Kernel, I explained my perhaps crazy thought process previously:

So, bottom line, it is not working. It is getting all zeroes. But why?

Looks like you’re missing a parenthesis here? Not sure how that compiled like that…

dim3 dimGrid(numObjects/blockSize + (numObjects%blockSize == 0 ? 0:1);

I don’t really think that’s your problem though. Perhaps I will take a look tomorrow if no one else does before then.

Wow. GREAT eye.

Um, but as mentioned, I originally typed this code here on the forum. Afterwards, I took it into my simulator, fixed any mistakes, typos, etc. (including the one you mentioned), and then tried to run it. The compiler did indeed yell at me about that missing parenthesis; I corrected it yesterday. But either way, good catch, as I’m not sure too many people would have noticed.

If you do have the time tomorrow, please do see if you can assist with why the distance array is filling with all zeroes.

Thanks for your help.

// ******* KERNEL *******//

__global__ void computeDistanceArray (int *x, int *y, int *d, int numObjects) {
	int i = blockIdx.x*blockDim.x + threadIdx.x;
	int j = blockIdx.x*blockDim.x + threadIdx.x;
	int idx = i + j*numObjects;
	if (i < numObjects && j < numObjects)
		d[idx] = (int) sqrt(((x[i] - x[j])^2) + ((y[i] - y[j])^2));
}

Hope that’s a typo, but you are using the same calculation for i and j.

So what you are basically doing is something like sqrt((a-a)^2 + (b-b)^2), which will always be zero.

The second problem is that you are launching about numObjects threads on the GPU, each of which calculates one distance, yet you expect numObjects * numObjects results.

So you either have to implement a for-loop in your kernel or launch about numObjects^2 threads. Maybe someone can tell which would be more effective in terms of memory access, but I guess the version with more threads will be the better one.