Are there any examples of CUDA kernel executions that run for a long time?

I’m a researcher looking for examples of computations that spend a long time on the GPU, anything over 30 minutes for example. To be clear, the kernel execution alone should take that time, not the entire application execution or the sum time of multiple kernels launched in sequence.

Pointers to any GPU computing group that does this kind of thing or published work is most welcome.

This will take at least a day on one GPU:

https://sites.google.com/site/cudamagicsquare/

These types of problems split well across multiple GPUs, and the permutation problem will take a long time once you go over 16!

https://sites.google.com/site/cudapermutations/

Thanks, that definitely helped. Would you happen to know of a more practical, real-world example that has a long-running kernel?

In terms of practicality, GPU-based particle simulations in complex 3D geometries could be configured to run for over 30 minutes. Also, a large case of the Floyd–Warshall all-pairs shortest-path algorithm (over 20,000 vertices) would take some time.

Thanks, this has been very helpful!

What is it that you are researching ? Can you be more specific ?

My research looks into resilience in high-performance computing (HPC). High performance compute clusters typically encounter failures that are handled in various ways. One very common way to handle failures is to periodically checkpoint the currently executing application so that a restart can be made from a previous checkpoint should any failure occur. In a nutshell, the application is suspended, and its current execution state is saved. So checkpoint/restart has been well documented and is so far the most reliable method to achieve resilience in compute clusters.

Within recent years GPUs have become attractive to HPC because they offer a cheap way to add more compute power. However, GPUs have their own memory system and execution model which does not work with the traditional checkpointing approach. What my research specifically looks at is resilience in the context of GPUs. That is, deriving a way that allows for successful application suspension when GPUs are used to accelerate computation. To motivate my research I am looking for practical, real-world examples that have long running kernels.

Most GPU-accelerated applications (wisely) use short-running kernels, and are therefore compatible with the traditional checkpoint & resume approach. One example would be the Folding@Home application. Is your research specifically about how to extend checkpoint & resume to applications with long-running kernels?

“” = oakleaf’s text.

“My research”

Is your research “theoretical” or also “practical”, with the latter meaning access to a compute-cluster on which to run, for example, my would-be-provided example ?

“looks into resilience in high-performance computing (HPC). High performance compute clusters typically encounter failures that are handled in various ways.”

I assume hardware failures because of overheating or damaged hardware components/GPUs ? Or do you also mean calculation errors because of bit/memory flips ?

“One very common way to handle failures is to periodically checkpoint the currently executing application so that a restart can be made from a previous checkpoint should any failure occur.”

Could you be more specific about what is considered a failure ?

“In a nutshell, the application is suspended, and its current execution state is saved. So checkpoint/restart has been well documented and is so far the most reliable method to achieve resilience in compute clusters.”

What if errors are saved ? Does this mean it’s not about soft errors but hardware failures ?

“Within recent years GPUs have become attractive to HPC because they offer a cheap way to add more compute power. However, GPUs have their own memory system and execution model which does not work with the traditional checkpointing approach.”

Seems like this traditional approach worked as follows:

  1. On detection of failure, save all state, and once problem is solved, restart from that state.

Why would this not work with cuda cores/gpus/compute clusters ? A signal could be “sent” to the kernel being executed that it should enter this procedure and “save state”.

"What my research specifically looks at is resilience in the context of GPUs. That is, deriving a way that allows for successful application suspension when GPUs are used to accelerate computation. To motivate my research I am looking for practical, real-world examples that have long running kernels.
"

What do you consider practical ? (Short code/kernels ?)

What do you consider real-world examples ?

I do have a kernel which is quite compute-intense; it’s only 461 lines long in total for the whole software package. It basically computes which attack pattern for a 4 vs 4 navy fleet would be best.

It was solved on a single GT 520 graphics card with only 48 CUDA cores; it would take a few hours, if I remember correctly. On a 2.0 GHz CPU core it would take 12 hours to compute. The number of attack patterns/combinations/permutations is already huge for such a small fleet; I consider this “computational warfare” :)

The example could be scaled up to more ships and it will probably easily dwarf the computational power of any computer cluster you might have access to.

In case you do have access to a computer-cluster then I have some questions about that:

  1. How many GPUs/graphics cards does it have ?

  2. How many cuda cores in total ?

(My kernel would require very little data bandwidth, mostly just branch/lookup processing; not 100% sure about that, but I think so)

  1. Would you be allowed to share kernel computation results with me ? (They could be interesting to use in a computer game :))

  2. I have never programmed a “compute-cluster” kernel. So I am not sure what it would take to turn a single gpu kernel into a “compute-cluster” kernel. I do know that nvidia does have some gpu-to-gpu apis and such which might be of some use.

However, a compute-cluster has multiple computers/PCs, so would that involve “network communications” programming ? Or is there something already available that takes care of that ? And is it from NVIDIA or others ?

  1. Are you capable of taking a “simple” compute kernel and then modifying it so that it runs on a compute-cluster ?

For example, it processes compute problems 0 to N-1. Would you be able to simply split N up in such a way that the pieces are distributed across the computer cluster ?

How much work/coding is/would be required for that ? (I may look into this myself later on.)

  1. If not, then I am guessing you are looking for a “compute-cluster”-ready example ? I don’t think many people on this planet have access to such a beast… so this might get a bit tricky in your search for something like this, hence your request here I guess ;)

In case you are interested in this kernel: what it does is basically explained in this usenet thread, where I described the problem a bit more generally/abstractly in terms of “numbers attacking other numbers” instead of calling them ships or ship types and such. At the time a competition was active and I did not want any of the competitors to catch wind of this piece of software, for fear that it might be used against me :) so that’s why I described it more abstractly.

The link to this description is in here:

https://groups.google.com/forum/#!topic/comp.lang.python/PvkWLQeNzoM

The thread was called:

“Best attack order for groups of numbers trying to destroy each other, given a victory chance for number to number attack.”

For the participants it was apparently still hard to understand what it was trying to do.

I will add a little bit of potentially new information here:

Basically there is a 4 vs 4 navy ship/fleet situation. 4 ships fighting 4 ships.

Destroyer + Cruiser + Battleship + Carrier vs Destroyer + Cruiser + Battleship + Carrier

The question is which ship should be destroyed first to maximize winning chance for the team.

The question is then which ship to destroy next and so forth until all enemy ships are dead.

The question is also more complex than it seems. Should the entire fleet work together to destroy a single ship, or should each ship attack a different ship ?

The kernel works with “rounds”; at each round a ship’s decision is computed and such.

It’s basically brute force in the sense that it tries all possible combinations and permutations of this compute problem.

In the game, let’s call it “reality”, though it’s “virtual reality”, though it could be real, the compute problem is much more complex, since there are also “bombers”, “fighters” and many more ships.

This could be added to the kernel/model for even higher compute complexity.

I do find the 64-bit integers to be limiting when trying to compute “total complexity”. Total complexity can easily go beyond 64 bits, I would guess, for computer clusters.

This is a hint for NVIDIA to include 128-bit and perhaps even 256-bit integers, to be able to compute some large numbers which can then be divided back into smaller parts for parallel solving.

These 128- to 256-bit integers may be software-computed/simulated, but they would certainly make life a bit easier for intense compute kernels.

The competition itself was 7 vs 7 ships. If a compute-cluster would be capable of computing/solving this problem within let’s say 1 day or something I would be impressed ! =D

I am a bit reluctant to share this kernel with you because:

  1. It might be used by a real navy, which is kinda scary :)

So who do you work for ? :)

(The competitions will return in the near future, not sure yet if I will be taking part in it again, so this remains somewhat interesting for me to work on ;) so having the results of a bigger computation than just 4 v 4 ships could interest me, also out of curiosity.)

Bye,
Skybuck :)

Yes, the specific focus of my research is extending the checkpoint/resume to applications with long-running kernels.

“My research”

Is your research “theoretical” or also “practical”, with the latter meaning access to a compute-cluster on which to run, for example, my would-be-provided example ?

[b]

My research is also practical, I do have access to a compute cluster, so I should be able to run any provided examples.

[/b]

“looks into resilience in high-performance computing (HPC). High performance compute clusters typically encounter failures that are handled in various ways.”

I assume hardware failures because of overheating or damaged hardware components/GPUs ? Or do you also mean calculation errors because of bit/memory flips ?

[b]

Yes, hardware failures because of heating, or due to communication failures. I don’t look at soft failures; I think there are several proposed solutions for soft failures already.

[/b]

“One very common way to handle failures is to periodically checkpoint the currently executing application so that a restart can be made from a previous checkpoint should any failure occur.”

Could you be more specific about what is considered a failure ?

[b]

A failure is anything that would cause the application to have to restart from a previous checkpoint. While my focus is on GPU failure, the host application can fail as well (for whatever reason, hardware or software) and would therefore need to be restarted. Berkeley Lab Checkpoint/Restart (BLCR) has emerged as a popular way of performing checkpoint/restart for host application failures but does not currently handle GPUs.

[/b]

“In a nutshell, the application is suspended, and its current execution state is saved. So checkpoint/restart has been well documented and is so far the most reliable method to achieve resilience in compute clusters.”

What if errors are saved ? Does this mean it’s not about soft errors but hardware failures ?

[b]

Yes exactly, I only focus on hardware failures.

[/b]

“Within recent years GPUs have become attractive to HPC because they offer a cheap way to add more compute power. However, GPUs have their own memory system and execution model which does not work with the traditional checkpointing approach.”

Seems like this traditional approach worked as follows:

  1. On detection of failure, save all state, and once problem is solved, restart from that state.

Why would this not work with cuda cores/gpus/compute clusters ? A signal could be “sent” to the kernel being executed that it should enter this procedure and “save state”.

[b]

Not all checkpoint/restart schemes do checkpointing at the moment of failure, as this may be too late to properly save the application’s state due to the nature of the failure. What is typically done is periodic checkpointing, so that if anything fails, a restart from the most recent checkpoint is valid.

Hmm, I’m not sure how such a signal can be sent; after a kernel is launched, doesn’t it run to completion? The reason a simple save of state does not work for GPUs has to do with the CUDA context. When any CUDA API call is made it is associated with a context; if no context exists, one is created. This context resides in host memory, interfaces with the device and does some state management (the documentation does not explain contexts in technical detail). Now, with traditional checkpointing, when a snapshot of the application state is made, this context is captured as well (as it resides in host memory). Restarting from a checkpoint that contains a CUDA context does not work, because the restored CUDA context is now invalid, and future CUDA API calls then attempt to hook into an invalid context. (I hope I’ve explained this well; feel free to ask for more clarity if necessary.)

[/b]

"What my research specifically looks at is resilience in the context of GPUs. That is, deriving a way that allows for successful application suspension when GPUs are used to accelerate computation. To motivate my research I am looking for practical, real-world examples that have long running kernels.
"

What do you consider practical ? (Short code/kernels ?)
[b]

Hmm, I perhaps should not have said “practical”, but what I mean by that is some example where a kernel’s execution time naturally runs for a long time. I don’t know much outside of computer science, but I was wondering if there were any physics simulations, weather simulations or anything in some other area that is computationally demanding enough that kernel executions are not short.

[/b]

What do you consider real-world examples ?

[b]

Any example that does something useful; crunches some data required for use in industry. I could create huge floating-point matrix multiplication operations, but I need to have some reference that proves that this is something that needs to be done.

[/b]

I do have a kernel which is quite compute-intense; it’s only 461 lines long in total for the whole software package. It basically computes which attack pattern for a 4 vs 4 navy fleet would be best.

It was solved on a single GT 520 graphics card with only 48 CUDA cores; it would take a few hours, if I remember correctly. On a 2.0 GHz CPU core it would take 12 hours to compute. The number of attack patterns/combinations/permutations is already huge for such a small fleet; I consider this “computational warfare” :)

[b]

Is the solution to this problem useful to anyone in particular? If so, might you know of published work? I can come up with a few long-running kernels too, but to have a somewhat formal basis for justification and motivation for my work I need to be able to point to some kind of problem requiring this (or any) long-running kernel.

[/b]

The example could be scaled up to more ships and it will probably easily dwarf the computational power of any computer cluster you might have access to.

In case you do have access to a computer-cluster then I have some questions about that:

  1. How many GPUs/graphics cards does it have ?

  2. How many cuda cores in total ?

(My kernel would require very little data bandwidth, mostly just branch/lookup processing; not 100% sure about that, but I think so)

[b]

The cluster I have access to has six NVIDIA K20s, one Quadro K6000 and an NVIDIA Titan XP.

[/b]

  1. Would you be allowed to share kernel computation results with me ? (They could be interesting to use in a computer game :))

[b]

Sure, sharing computation results isn’t a problem :-)

[/b]

  1. I have never programmed a “compute-cluster” kernel. So I am not sure what it would take to turn a single gpu kernel into a “compute-cluster” kernel. I do know that nvidia does have some gpu-to-gpu apis and such which might be of some use.

However, a compute-cluster has multiple computers/PCs, so would that involve “network communications” programming ? Or is there something already available that takes care of that ? And is it from NVIDIA or others ?

[b]

I’m not too concerned if the application is able to automatically adapt to the resources it has access to. Just a kernel that takes a while to complete even if it is on a single GPU. I want to work on a way to demonstrate how this can be handled, so whether the application scales at this point or not is irrelevant.

[/b]

  1. Are you capable of taking a “simple” compute kernel and then modifying it so that it runs on a compute-cluster ?

[b]

Yep, I am a big fan of simple!

[/b]

For example, it processes compute problems 0 to N-1. Would you be able to simply split N up in such a way that the pieces are distributed across the computer cluster ?

How much work/coding is/would be required for that ? (I may look into this myself later on.)

[b]

Hmm, I’ve never done that so it’s a bit hard for me to say. But as I said previously, a simple, single execution is fine.

[/b]

  1. If not, then I am guessing you are looking for a “compute-cluster”-ready example ? I don’t think many people on this planet have access to such a beast… so this might get a bit tricky in your search for something like this, hence your request here I guess ;)

    ===================
    Lol, well I have no clue what might be required in industry, so I was just asking if anyone knew of anything. I just need something to point to for justification for my work. I don’t actually have to be able to run it myself :-)
    ===================

In case you are interested in this kernel: what it does is basically explained in this usenet thread, where I described the problem a bit more generally/abstractly in terms of “numbers attacking other numbers” instead of calling them ships or ship types and such. At the time a competition was active and I did not want any of the competitors to catch wind of this piece of software, for fear that it might be used against me :) so that’s why I described it more abstractly.

The link to this description is in here:

https://groups.google.com/forum/#!topic/comp.lang.python/PvkWLQeNzoM

The thread was called:

“Best attack order for groups of numbers trying to destroy each other, given a victory chance for number to number attack.”

For the participants it was apparently still hard to understand what it was trying to do.

I will add a little bit of potentially new information here:

Basically there is a 4 vs 4 navy ship/fleet situation. 4 ships fighting 4 ships.

Destroyer + Cruiser + Battleship + Carrier vs Destroyer + Cruiser + Battleship + Carrier

The question is which ship should be destroyed first to maximize winning chance for the team.

The question is then which ship to destroy next and so forth until all enemy ships are dead.

The question is also more complex than it seems. Should the entire fleet work together to destroy a single ship, or should each ship attack a different ship ?

The kernel works with “rounds”; at each round a ship’s decision is computed and such.

It’s basically brute force in the sense that it tries all possible combinations and permutations of this compute problem.

In the game, let’s call it “reality”, though it’s “virtual reality”, though it could be real, the compute problem is much more complex, since there are also “bombers”, “fighters” and many more ships.

This could be added to the kernel/model for even higher compute complexity.

I do find the 64-bit integers to be limiting when trying to compute “total complexity”. Total complexity can easily go beyond 64 bits, I would guess, for computer clusters.

This is a hint for NVIDIA to include 128-bit and perhaps even 256-bit integers, to be able to compute some large numbers which can then be divided back into smaller parts for parallel solving.

These 128- to 256-bit integers may be software-computed/simulated, but they would certainly make life a bit easier for intense compute kernels.

The competition itself was 7 vs 7 ships. If a compute-cluster would be capable of computing/solving this problem within let’s say 1 day or something I would be impressed ! =D

I am a bit reluctant to share this kernel with you because:

  1. It might be used by a real navy, which is kinda scary :)

So who do you work for ? :)

(The competitions will return in the near future, not sure yet if I will be taking part in it again, so this remains somewhat interesting for me to work on ;) so having the results of a bigger computation than just 4 v 4 ships could interest me, also out of curiosity.)

[b]

Lol, I totally understand your reluctance to share the code; I’m just a student trying to obtain a PhD at a university in the UK. However, I do find it really interesting that your work involves permutations! I’ve found some published work that does genome sequencing and needs to do permutations that take a very long time (a single invocation of their kernel ran for 25 minutes on the K20 I have access to). Nevertheless, thanks for your questions and lengthy response!! Feel free to write again.

[/b]

Bye,
Skybuck :)

[b]

A failure is anything that would cause the application to have to restart from a previous checkpoint. While my focus is on GPU failure, the host application can fail as well (for whatever reason, hardware or software) and would therefore need to be restarted. Berkeley Lab Checkpoint/Restart (BLCR) has emerged as a popular way of performing checkpoint/restart for host application failures but does not currently handle GPUs.

[/b]

This is very interesting; I did not know that such a feature existed at all. Though this is for Linux only, so is your compute cluster/software limited to Linux only ?

[b]

Not all checkpoint/restart schemes do checkpointing at the moment of failure, as this may be too late to properly save the application’s state due to the nature of the failure. What is typically done is periodic checkpointing, so that if anything fails, a restart from the most recent checkpoint is valid.

[/b]

"What my research specifically looks at is resilience in the context of GPUs. That is, deriving a way that allows for successful application suspension when GPUs are used to accelerate computation.
"

I think you should look into “gpu drivers” and see if such a feature can be built into gpu drivers; that would probably most closely resemble the technique you mentioned.

So your solution would then be in “kernel/driver space”, or is that too “deep” for you and are you thinking of an “application space” solution ?

For the “driver space” solution you might have to work together with nvidia driver programmers, maybe they would be interested in trying to help you out with that.

"
To motivate my research I am looking for practical, real-world examples that have long running kernels.
"

Concerning running time I have this piece of code in my kernel software:

__device__ unsigned int ClockTick()
{
    unsigned int vClockTick;
    asm volatile("mov.u32 %0, %%clock;" : "=r"(vClockTick));
    return vClockTick;
}

Perhaps this routine could be called in a while loop, simply looping around until a certain amount of clock ticks has passed, perhaps even doing a little bit of “random processing”, or even working towards some goal which could already be pre-computed, just to compare whether the final computation matches. So why not use something like this ? Perhaps it might be a problem for “checkpoint/restart”; perhaps the “tick count” might be way off ? That in itself could also be an interesting problem to investigate.

What do you consider real-world examples ?

[b]

Any example that does something useful; crunches some data required for use in industry. I could create huge floating-point matrix multiplication operations, but I need to have some reference that proves that this is something that needs to be done.

[/b]

I find this answer a bit weird; I can only imagine this might be useful to prove that the checkpoint/restart method worked, so that this requirement would be a “verification method” ?

[b]

Is the solution to this problem useful to anyone in particular? If so, might you know of published work? I can come up with a few long-running kernels too, but to have a somewhat formal basis for justification and motivation for my work I need to be able to point to some kind of problem requiring this (or any) long-running kernel.

[/b]

It gives some insight for gamers; no published work exists as far as I know of :) Again I find this “need for proof” weird. Does your teacher not believe that there exist real-world long-running kernels ?

[b]

The cluster I have access to has six NVIDIA K20s, one Quadro K6000 and an NVIDIA Titan XP.

[/b]

I am not sure if this is a mix of “consumer and pro” gfx cards, or completely “pro”.

I know Quadro is “pro”; the “pro” cards do support special APIs for P2P GPU communication and such.

[b]

I’m not too concerned if the application is able to automatically adapt to the resources it has access to. Just a kernel that takes a while to complete even if it is on a single GPU. I want to work on a way to demonstrate how this can be handled, so whether the application scales at this point or not is irrelevant.

[/b]

What kind of work do you plan to do ? Again will be this an “application space” solution ? Or something more advanced like a “driver space” solution ?

For example, it processes compute problems 0 to N-1. Would you be able to simply split N up in such a way that the pieces are distributed across the computer cluster ?

How much work/coding is/would be required for that ? (I may look into this myself later on.)

[b]

Hmm, I’ve never done that so it’s a bit hard for me to say. But as I said previously, a simple, single execution is fine.

[/b]

This is a bit weird, but ultimately it doesn’t matter; you can start somewhere. But it might make a bit more sense to first make a “full compute cluster example”; otherwise your solution will possibly be “half-baked” and will have to be “re-done” / “re-designed”, which might lead to “wasted effort”.

Though a solution for a single gpu would already be somewhat useful. A solution for a full compute cluster would of course be great and would surely lead to an A / 10 rating for you :)

[b]

Lol, well I have no clue what might be required in industry, so I was just asking if anyone knew of anything. I just need something to point to for justification for my work. I don’t actually have to be able to run it myself :-)

[/b]

Here are some links to “tech” possibly used for compute clusters:

https://devblogs.nvidia.com/parallelforall/how-build-gpu-accelerated-research-cluster/

https://developer.nvidia.com/gpudirect

https://developer.nvidia.com/mpi-solutions-gpus

Again I find this “justification reasoning” a bit weird :)

You could ask your teacher if it’s ok to use an example which solves a “gaming problem”. You could tell him there is this game called “World of Warships” where navy fleets have to destroy each other. And the question is how they should destroy each other. Which kind of ship should be destroyed first ? Destroyers, Cruisers, Battleships, Carriers ? Each ship type has a certain “victory chance” against another ship type, and so forth it’s quite a complex story ;) Maybe he will find it interesting. If he says, “it’s too easy” challenge him to a couple of games ! LOL and make him prove he can win 100% of the time ! =D

[b]

Lol, I totally understand your reluctance to share the code; I’m just a student trying to obtain a PhD at a university in the UK. However, I do find it really interesting that your work involves permutations! I’ve found some published work that does genome sequencing and needs to do permutations that take a very long time (a single invocation of their kernel ran for 25 minutes on the K20 I have access to). Nevertheless, thanks for your questions and lengthy response!! Feel free to write again.

[/b]

I have played around with “evolution” and “gene-like” stuff which could also be considered some form of permutations, indeed it was interesting.

In this case the number of permutations for navy fleets seemed limited enough that trying all of them is feasible on my system for small fleets, somewhat tougher for bigger fleets, and perhaps impractical for large fleets; in that case the “gene” approach might be tried as well, to see if any “lucky” gene sequences are found ;)

First I would need to know if your teacher is “cool” with this “gaming justification” if not alas ! ;)

Bye,
Skybuck :)

(Oh by the way concerning the signal, global memory might be used for that… it might have to do with “locking” or “pinning” that memory and “cuda streams”. I am not sure how reliable this would be.)

[b]

A failure is anything that would cause the application to have to restart from a previous checkpoint. While my focus is on GPU failure, the host application can fail as well (for whatever reason, hardware or software) and would therefore need to be restarted. Berkeley Lab Checkpoint/Restart (BLCR) has emerged as a popular way of performing checkpoint/restart for host application failures but does not currently handle GPUs.

[/b]

This is very interesting; I did not know that such a feature existed at all. Though this is for Linux only, so is your compute cluster/software limited to Linux only ?


Yeah, my compute cluster is for Linux only. So far I’ve only come across Linux systems being used for compute clusters. I have never heard of (nor have I looked for) any other operating system that is used. There may be others, but Linux is definitely a popular choice.


[b]

Not all checkpoint/restart schemes do checkpointing at the moment of failure, as this may be too late to properly save the application’s state due to the nature of the failure. What is typically done is periodic checkpointing, so that if anything fails, a restart from the most recent checkpoint is valid.

[/b]

"What my research specifically looks at is resilience in the context of GPUs. That is, deriving a way that allows for successful application suspension when GPUs are used to accelerate computation.
"

I think you should look into “gpu drivers” and see if such a feature can be built into gpu drivers; that would probably most closely resemble the technique you mentioned.

So your solution would then be in “kernel/driver space”, or is that too “deep” for you and are you thinking of an “application space” solution ?


Oh, definitely application space solution. Kernel/driver space would be more ideal but right now I’m working on more of a proof of concept before heading any deeper.


For the “driver space” solution you might have to work together with nvidia driver programmers, maybe they would be interested in trying to help you out with that.


Yeah, rather unfortunately NVIDIA is not open source :-(


"
To motivate my research I am looking for practical, real-world examples that have long running kernels.
"

Concerning running time I have this piece of code in my kernel software:

__device__ unsigned int ClockTick()
{
    unsigned int vClockTick;
    asm volatile("mov.u32 %0, %%clock;" : "=r"(vClockTick));
    return vClockTick;
}

Perhaps this routine could be called in a while loop that simply spins until a certain amount of clock ticks has passed, perhaps doing a little bit of “random processing”, or even working towards some goal that was already pre-computed, just to compare whether the final computation matches. So why not use something like this? Perhaps it might be a problem for “checkpoint/restart” that the “tick count” would be way off after a restart? That in itself could also be an interesting problem to investigate.
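A minimal sketch of that idea (my own illustration, not code from this thread): note that the 32-bit %clock register wraps roughly every four seconds at typical GPU clock rates, so for a deliberately long-running kernel the 64-bit clock64() intrinsic is safer. Names and the "random processing" step are placeholders.

```cuda
// Hypothetical spin kernel: burns GPU time until a target number of clock
// ticks has passed on this SM. clock64() avoids the ~4 s wrap-around of
// the 32-bit %clock register.
__global__ void SpinKernel(long long vTargetTicks, unsigned int *vResult)
{
    long long vStart = clock64();
    unsigned int vWork = 0;

    while (clock64() - vStart < vTargetTicks)
    {
        vWork = vWork * 1664525u + 1013904223u;  // a little "random processing"
    }

    *vResult = vWork;  // write something so the loop is not optimized away
}
```

Launched as e.g. SpinKernel<<<1, 1>>>(ticks, dResult) with ticks = GPU clock rate in Hz times the desired seconds, this would give a kernel of any chosen duration to test checkpoint/restart against.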


Oh, excellent idea! I haven’t put too much thought into experiments since I’m not quite there yet.


What do you consider real-world examples ?

[b]

Any example that does something useful; crunches some data required for use in industry. I could create huge floating-point matrix multiplication operations, but I need to have some reference that proves that this is something that needs to be done.

[/b]

I find this answer a bit weird. I can only imagine this might be useful to prove that the checkpoint/restart method worked, so this requirement would be a “verification method”?


Yeah, the thing about these real-world examples really has to do with trying to get some work published. I’ve never published any work before, and this has to do with adding credibility and showing that there are real applications that take this much time. My supervisor has been around academia all his life and has supervised many students. I get the feeling that his recommending I find “real world” examples comes from that experience.


[b]

Is the solution to this problem useful to anyone in particular? If so, might you know of published work? I can come up with a few long-running kernels too, but to have a somewhat formal basis for justification and motivation for my work I need to be able to point to some kind of problem requiring this (or any) long-running kernel.

[/b]

It gives some insight for gamers; no published work exists as far as I know :) Again I find this “need for proof” weird. Does your teacher not believe that real-world long-running kernels exist?


Oh, he does believe. It’s just like I said before, for strengthening the publication we need to point to some existing work that uses long-running GPU kernels.


[b]

The cluster I have access to has six NVIDIA K20s, one Quadro K6000 and an NVIDIA Titan XP.

[/b]

I am not sure if this is a mix of “consumer and pro” gfx cards, or completely “pro”.

I know Quadro is “pro”; the “pro” cards do support special APIs for P2P GPU communication and such.
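As a small, hedged illustration of what that P2P support looks like in practice (standard CUDA runtime calls; device indices 0 and 1 are just examples), one can query whether two cards in a mixed setup can talk to each other directly:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal check for peer-to-peer support between device 0 and device 1.
// On a mix of consumer and pro cards this is where differences would show up.
int main()
{
    int vCanAccess = 0;
    cudaDeviceCanAccessPeer(&vCanAccess, 0, 1);
    printf("GPU 0 -> GPU 1 P2P supported: %s\n", vCanAccess ? "yes" : "no");

    if (vCanAccess)
    {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // second argument (flags) must be 0
    }
    return 0;
}
```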


What do you mean by “pro” and “consumer”?


[b]

I’m not too concerned if the application is able to automatically adapt to the resources it has access to. Just a kernel that takes a while to complete even if it is on a single GPU. I want to work on a way to demonstrate how this can be handled, so whether the application scales at this point or not is irrelevant.

[/b]

What kind of work do you plan to do ? Again will be this an “application space” solution ? Or something more advanced like a “driver space” solution ?


Yeah, this will be an “application space” solution.


For example, suppose it processes compute problems 0 to N-1. Would you be able to simply split N up so that the pieces are distributed across the compute cluster?

How much work/coding is/would be required for that ? (I may look into this myself later on.)


The current solution I’m working on aims to be independent of the kind of work being done. I just aim to be able to suspend and resume long-running kernels. For simplicity of testing however, a single kernel executing on one device would be ideal.
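One common application-space pattern for that goal (a sketch of my understanding, not the poster's actual solution; all names are placeholders) is to keep every bit of kernel state in global memory and launch the long computation as a series of short slices, so the host can checkpoint or resume between any two slices:

```cuda
#include <cuda_runtime.h>

// Hypothetical resumable computation: state lives entirely in global memory,
// and the work is cut into slices. Between slices the host may copy the state
// out and save it to disk together with the progress counter.
__global__ void SliceKernel(float *vState, int vN, int vFirst, int vCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= vN) return;

    for (int s = 0; s < vCount; s++)
        vState[i] = vState[i] * 0.9999f + (float)(vFirst + s);  // placeholder work
}

void RunResumable(float *dState, int n, int totalSteps, int sliceSteps)
{
    for (int done = 0; done < totalSteps; done += sliceSteps)
    {
        SliceKernel<<<(n + 255) / 256, 256>>>(dState, n, done, sliceSteps);
        cudaDeviceSynchronize();
        // Checkpoint opportunity: cudaMemcpy dState back to the host and
        // write it to disk along with (done + sliceSteps), so a restart
        // can re-enter this loop at the saved position.
    }
}
```

The appeal of this shape is that it is independent of what the slice actually computes, which matches the stated goal; the cost is that the kernel has to be written (or transformed) so that its state is externalizable at slice boundaries.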


[b]

Hmm, I’ve never done that so it’s a bit hard for me to say. But as I said previously, a simple, single execution is fine.

[/b]

This is a bit weird, but eventually it doesn’t matter; you can start somewhere. It might make a bit more sense to first make a “full compute cluster example”, otherwise your solution will possibly be “half-baked” and will have to be “re-done”/“re-designed”, which might lead to “wasted effort”.

Though a solution for a single GPU would already be somewhat useful. A solution for a full compute cluster would of course be great and would surely lead to an A / 10 rating for you :)


Haha, a full compute cluster example would be the holy grail for me indeed! But small steps. I’ve started on a few big attempts before, and things very quickly got complex and messy without me even knowing if the approach worked at all. So now I take small steps, and if they work, then I think about going bigger.


[b]

Lol, well I have no clue what might be required in industry, so I was just asking if anyone knew of anything. I just need something to point to for justification for my work. I don’t actually have to be able to run it myself :-)

[/b]

Here are some links to “tech” possibly used for compute clusters:

https://devblogs.nvidia.com/parallelforall/how-build-gpu-accelerated-research-cluster/

https://developer.nvidia.com/gpudirect

https://developer.nvidia.com/mpi-solutions-gpus

Again I find this “justification reasoning” a bit weird :)


Thanks for the links, yeah the justification thing comes from how things work in the academic world. Solutions need to be derived for real examples. I’m still learning about this world and how to navigate it :-)


You could ask your teacher if it’s ok to use an example which solves a “gaming problem”. You could tell him there is this game called “World of Warships” where navy fleets have to destroy each other, and the question is how they should destroy each other. Which kind of ship should be destroyed first? Destroyers, cruisers, battleships, carriers? Each ship type has a certain “victory chance” against each other ship type, and so forth; it’s quite a complex story ;) Maybe he will find it interesting. If he says “it’s too easy”, challenge him to a couple of games! LOL, and make him prove he can win 100% of the time! =D


Lol, good suggestion! Is there a formal description of this game? Because he might have questions I don’t know how to answer. Also, this sounds a bit like Battleships; is it?


[b]

Lol, I totally understand your reluctance to share the code; I’m just a student trying to obtain a PhD at a university in the UK. However, I do find it really interesting that your work involves permutations! I’ve found some published work that does genome sequencing and needs to do permutations that take a very long time (a single invocation of their kernel ran for 25 minutes on the K20 I have access to). Nevertheless, thanks for your questions and lengthy response!! Feel free to write again.

[/b]

I have played around with “evolution” and “gene-like” stuff which could also be considered some form of permutations, indeed it was interesting.

In this case the number of permutations for navy fleets seemed limited enough that trying all of them is feasible on my system for small fleets, somewhat tougher for bigger fleets, and perhaps impractical for large fleets; in that case the “gene” approach might be tried as well, to see if any “lucky” gene sequences are found ;)

First I would need to know if your teacher is “cool” with this “gaming justification” if not alas ! ;)

Bye,
Skybuck :)

(Oh, by the way, concerning the signal, global memory might be used for that… it might have to do with “locking” or “pinning” that memory, and with CUDA streams. I am not sure how reliable this would be.)
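That pinned-memory signal idea can be sketched concretely (my own hedged illustration of the suggestion above, not tested code from the thread): the host allocates a mapped, pinned flag, and the running kernel polls it. How promptly the kernel observes the write is not guaranteed; the volatile qualifier is the bare minimum.

```cuda
// Sketch of the "signal through memory" idea: a flag in zero-copy (mapped,
// pinned) host memory that the running kernel polls to decide when to stop.
__global__ void PollingKernel(volatile int *vStopFlag, float *vState)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    while (!*vStopFlag)
    {
        vState[i] = vState[i] * 0.9999f + 1.0f;  // placeholder work
    }
}

// Host side (error checking omitted):
//   int *hFlag, *dFlag;
//   cudaHostAlloc(&hFlag, sizeof(int), cudaHostAllocMapped);
//   *hFlag = 0;
//   cudaHostGetDevicePointer((void **)&dFlag, hFlag, 0);
//   PollingKernel<<<blocks, threads>>>(dFlag, dState);
//   *hFlag = 1;                 // request suspension from the CPU
//   cudaDeviceSynchronize();    // kernel exits once every thread sees the flag
```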

For the “driver space” solution you might have to work together with nvidia driver programmers, maybe they would be interested in trying to help you out with that.


Yeah, rather unfortunately NVIDIA is not open source :-(


I know there are some open source nVidia GPU drivers out there because I read about them once… (I think the same applies to ATI/AMD)

A quick Google turns this up:

https://nouveau.freedesktop.org/wiki/

Not sure if this is also suited for CUDA/PTX/Compute Clusters.

There is also this:

http://www.nvidia.com/object/unix.html

And at the bottom it says:

"
Open source drivers for NVIDIA nForce hardware are included in the standard Linux kernel and leading Linux distributions. This page includes information on open source drivers, and driver disks for older Linux distributions including 32-bit and 64-bit versions of Linux.
"

Kinda weird how it mixes Unix and Linux info? :) I am not sure if Unix drivers work on Linux or vice versa! ;) :)


Yeah, the thing about these real-world examples really has to do with trying to get some work published. I’ve never published any work before, and this has to do with adding credibility and showing that there are real applications that take this much time. My supervisor has been around academia all his life and has supervised many students. I get the feeling that his recommending I find “real world” examples comes from that experience.


Why is publishing important? And for whom is it important? For you, for him, or for academia?

What is gained from it? Attention? Money? Fame? Will this help your career in the future? (I guess so ;))

Personally I think the requirement of “real-world examples” might be a bit silly, though real-world examples might have aspects we did not yet think about, for example the timer/timing related things… there may be others.

Also, what if the “real world” has to adapt to new coding techniques to make it possible? ;)

Personally, if I were interested in the “checkpoint/restart method” for GPUs, I would already be happy with a document describing a technique for “synthetic examples”.

Once the technique works for “synthetic examples” it can then be tried on “real-world examples”. This also offers you the possibility of publishing twice.

First for “easy/simplistic/short” examples, which might already be very helpful for readers.
Second for “complex examples” and deeply interested individuals or organizations.

As you already note further down below, one has to start somewhere. I’ve read some academic papers myself, or skimmed over them… most are probably way too thick/too deep/too complex for me to be of any use.

The best one I came across so far was “fast grid traversal”… that one was short and to the point, some info was lacking but could be derived.

As you also noted, you already tried a bigger solution and it was tough and complex? Why make it even more complex with a “real-world” example? :):):)

It gives some insight for gamers; no published work exists as far as I know :) Again I find this “need for proof” weird. Does your teacher not believe that real-world long-running kernels exist?


Oh, he does believe. It’s just like I said before, for strengthening the publication we need to point to some existing work that uses long-running GPU kernels.


Ok, now I understand why your teacher wants this. Your teacher believes that if your document is linked to existing work, this will give your document more credibility.

One possible problem with this is that compute clusters/CUDA/PTX, etc. are still kind of a “new world”, let alone “well known”.

This is new stuff, this is the frontier :) You may have to pave the way yourself ! =D That’s the fun part about it =D

Also see it as an opportunity… “the frontier legend”! =D


What do you mean by “pro” and “consumer”?


nVidia sells two different kinds of cards/brands: Quadro is for “pro” and GeForce is for “consumers/gamers”. The GeForce may have some features disabled or may lack certain API support.


Yeah, this will be an “application space” solution.


So what is the plan here ? Do you intend to “transform/modify” an existing “real world example” ? Or do you intend to try and find a “generic solution” to “real world examples” ?

The last one will get tricky, and will thus probably require GPU driver support.

I can understand that you might want to go with the first one “transform/modify” to see what can be done to the real world examples at application space.

However describing this in a document would get a bit tricky… people would then need to understand the real world example and then “admire” the “transformed/modified” example.

For the reader this might be a bit much if it’s a big real world example. Perhaps you want to try and see what “real world examples” run into… to get a taste of what the problem entails (?)

Then again, it is CUDA C, and by now CUDA C++, we are discussing, and perhaps C++ has some features that I don’t know about… perhaps you could come up with some template or some kind of C++ trick which could create a “checkpoint” for any piece of random software…

Or perhaps you might even need to write some kind of C/C++ interpreter/compiler that analyzes the CUDA C/C++ kernel code and then generates automatic checkpointing code which can be embedded into any “real world example” :)


The current solution I’m working on aims to be independent of the kind of work being done. I just aim to be able to suspend and resume long-running kernels. For simplicity of testing however, a single kernel executing on one device would be ideal.


;) Hmmm did I guess it right that you might be working on some cuda c/c++ compiler/interpreter ? :)


Haha, a full compute cluster example would be the holy grail for me indeed! but small steps. I’ve started on a few big attempts before and things very quickly got complex and messy without me even knowing if the approach even works at all. So now I do small steps and if they work then I think about going bigger.


Yes, for now it looks like it’s pretty messy out there… It could be nice if the CUDA API were expanded to support compute clusters, with all kinds of network/computer communication done below that CUDA API ;) to relieve CUDA/parallel programmers from having to deal with all that plumbing ;)


Lol, good suggestion! Is there a formal description of this game? Because he might have questions I don’t know how to answer. Also this sounds a bit like Battleships? is it?


The game is called “World of Warships”, for more information about it try and google it and youtube it then you will get a sense of what it is.

The comparison to Battleships is kinda interesting though! WOWS could be reduced to such an easier/simpler version which includes position (my kernel currently ignores position).

However, consider World of Warships a slight variation on “Battleships”. As far as I know, in the board game “Battleships” that I once played, each player (only two) gets to fire a single shot at the other each turn.

The “World of Warships Battleships” board game would work as follows:

Each ship that is still alive gets to fire one shot at the enemy each turn. However, the bigger ships might get 2 or 3 shots… perhaps this is a bit much though.

However to make things more fair… the “shells” could already be considered “in flight”.

So for example during round 1, each player gets to fire something like 10 shells or so… interleaved with the enemy’s shots.

After both players have fired 10 shots… the damage is assessed. Sunken ships are removed from the game.

The new amount of shells that each player is allowed to fire is calculated, and then round 2 starts.

Also it would be cool to simply give each other a list of “fired upon” positions and then verify if they actually hit anything… (after round 1) to prevent leaking info to the enemy during the shots. (Instead of saying “splash” or “boom” after each shot, which would allow “next shot” adjustment.)

If I ever play Battleships again with somebody… I will try and play it like this !

Should be much fun ! And makes the game more realistic and much more complex ! =D

(To make it even more “insane”/“realistic”, allow ships to move 1 square or so each round! Haha! :) Or perhaps even different amounts of squares: little ships can move 3 or 4, and bigger ships only 1 or 2.)

Thanks for writing… perhaps I gave you some ideas… perhaps not… perhaps you already had them…

I wish you good luck and lots of fun talking to your teacher about all of this.

And to you I also write: do write again if you feel like it! =D