Suggestions / help: hit a rut

Hi all,

I’ve been working on a summer research project to recreate the stationary phase of chromatography. The idea is to rewrite a VBA program in C++ and then parallelize it with CUDA. The C++ version is now fully functional and working, so I began porting the project to CUDA to run on the GPU (a Tesla C2050).

When planning the project I expected this part to go relatively smoothly. I planned on simply taking the bulk of the program, putting it in a global kernel, and turning the external functions into device calls. The idea was to run multiple instances of the program at once, so instead of getting 1 million results I could get many millions at once to analyze.

However, since I am posting this it should be obvious this didn’t turn out to be as simple as I expected.
I’m having difficulty passing variables between device functions without overwriting the data.
I am also having trouble with my calls to the Mersenne Twister, since they are technically host calls inside a device/global function.
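For context, here is roughly what I think the in-kernel RNG would have to look like if I switch from the host-side Mersenne Twister to the cuRAND device API (just a sketch; the kernel and variable names are made up, and I realize cuRAND's default device generator is XORWOW rather than MT, which may or may not matter for my purposes):

```cuda
#include <curand_kernel.h>

// Each thread keeps its own RNG state, so draws are independent per thread.
__global__ void simulate(unsigned long long seed, float *results)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    curandState state;
    // Same seed, unique sequence number per thread -> independent streams.
    curand_init(seed, tid, 0, &state);

    float r = curand_uniform(&state);  // uniform in (0, 1]
    results[tid] = r;
}
```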

My plan is to launch ~400 instances of the main program and have them run in parallel, with the results (essentially one variable per instance) stored in a global array indexed by blockIdx, to be summed after all instances have run. Since I have this working correctly in C++, I was wondering if I might be approaching the port poorly.
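To be concrete, this is how I picture the result array working (a hypothetical sketch; names are placeholders and error checking is omitted):

```cuda
// One result slot per block; slots are summed on the host after the kernel.
__global__ void run_instance(double *block_results)
{
    // ... one full instance of the simulation runs per block ...
    double result = 0.0;  // placeholder for the instance's final output

    if (threadIdx.x == 0)
        block_results[blockIdx.x] = result;
}

// Host side:
//   double *d_results;
//   cudaMalloc(&d_results, nBlocks * sizeof(double));
//   run_instance<<<nBlocks, 1>>>(d_results);
//   cudaMemcpy(h_results, d_results, nBlocks * sizeof(double),
//              cudaMemcpyDeviceToHost);
//   ...then loop over h_results[0..nBlocks-1] on the CPU to get the sum.
```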

My approach has been to take the original variables and create dev_ versions that are allocated on the GPU, copying the values from the host variables over to the device. I figured I had to do this inside the kernel so that each thread gets its own instance of every variable, avoiding 400+ threads accessing the same memory locations at once. There are probably 40-50 variables throughout the program. From there I planned to pass the values around through parameters, since there doesn’t seem to be a way to make a variable “global to the thread’s scope” as far as I can tell. The problem seems to be in passing the variables around and keeping the pointers/references straight while moving them.
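To show what I mean by wanting something “global to the thread’s scope”: would grouping the per-thread variables into a struct declared locally in the kernel, and passing a pointer to it into the device functions, be the right idea? A hypothetical sketch (these are not my actual variables):

```cuda
// All per-thread state in one struct; a local instance is private to the
// thread that declares it, so no two threads share memory.
struct SimState {
    float position;
    float velocity;
    int   step;
    // ... the other ~40 variables ...
};

__device__ void advance(SimState *s)
{
    // Reads and writes go through the pointer, so every device function
    // sees the same thread-private copy of the state.
    s->position += s->velocity;
    s->step += 1;
}

__global__ void simulate(float *results)
{
    SimState s = {0.0f, 1.0f, 0};  // lives in registers/local memory

    for (int i = 0; i < 100; ++i)
        advance(&s);

    if (threadIdx.x == 0)
        results[blockIdx.x] = s.position;
}
```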

Basically, I am looking for suggestions or tips on the easiest and best way to go about this. Any and all thoughts are more than welcome, and will help far more than banging my head against the wall.

Thank you for the read, I know it’s long.

Is it possible to use memcpy from device to device as a means of copying data from one function to another within the kernel?
Or is device-to-device copying intended for multi-GPU situations?
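To clarify what I’m asking, here is the kind of thing I have in mind (hypothetical names):

```cuda
// Host side: cudaMemcpyDeviceToDevice copies between two buffers on the
// same GPU (copies between GPUs go through cudaMemcpyPeer instead):
//   cudaMemcpy(d_dst, d_src, n * sizeof(float), cudaMemcpyDeviceToDevice);

// Inside a kernel, is a plain assignment loop the right way to move data
// between buffers that two device functions share?
__global__ void copy_within_kernel(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];
}
```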
