You are viewing an historical archive of past issues. Please report new issues to the appropriate project issue tracker on GitHub.

Feature request #1484: External monitoring and automatic respawning of EC2 instances

Kind feature request
Product wincent.dev
When 2010-02-20T09:31:02Z
Status open
Reporter Greg Hurrell
Tags no tags

Description

At the moment I have internal monitoring of EC2 instances via Monit: if my Rails app stops responding, for example, it is restarted automatically.

But if the EC2 instance itself fails, then I have no remote monitoring in place.

The first step, then, is to monitor it remotely. Initially I can just do this from my local machine. I could even use Monit for this purpose, I think, or just a custom script.
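As an illustration of the "custom script" option, here is a minimal sketch of a remote probe in Python. The function name `is_up`, the 10-second timeout, and the injectable `opener` parameter are assumptions made for this example and are not part of the original request; a real check might also need to cover SSH or other services.

```python
import urllib.error
import urllib.request


def is_up(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if `url` answers with a 2xx/3xx status within `timeout` seconds.

    `opener` is injectable (an assumption of this sketch) so the probe can be
    exercised without touching the network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeouts, and HTTP 4xx/5xx
        # (raised by urlopen as HTTPError) all count as "down".
        return False
```

Run from cron or a simple loop on the local machine, a probe like this only reports whether the instance answers; deciding when to alert is a separate concern.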

When I am alerted to a failure I will have to intervene manually. This is partly necessary because I am new to AWS and need to figure out how to intervene in this kind of case.

The second step is to make that intervention automatic: once I actually know what steps are required, I can automate them. It will probably look something like this:

  • After X failures within a period of Y minutes:
    • Alert admin
    • Clone EBS root volume (still need to decide whether from latest snapshot, or by detaching from dead instance and cloning that)
    • Clone EBS data volume (again still need to decide how)
    • Launch new instance using those cloned volumes
    • Point elastic IP at new instance
    • Notify admin that new instance is up
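The "X failures within Y minutes" trigger and the ordered recovery steps could be sketched as follows. This is only an outline under stated assumptions: `FailureWindow`, `recover`, and the step callables are hypothetical names for illustration, and the actual AWS calls (cloning volumes, launching the instance, moving the elastic IP) are deliberately left as injected functions because the issue has not yet settled how they should work.

```python
from collections import deque


class FailureWindow:
    """Track failed checks and report when X failures fall within Y seconds."""

    def __init__(self, max_failures, period_seconds):
        self.max_failures = max_failures
        self.period_seconds = period_seconds
        self._failures = deque()

    def record_failure(self, now):
        """Record a failure at time `now` (seconds); return True if the threshold is reached."""
        self._failures.append(now)
        # Drop failures that are older than the window.
        while self._failures and now - self._failures[0] > self.period_seconds:
            self._failures.popleft()
        return len(self._failures) >= self.max_failures

    def reset(self):
        """Forget past failures, for example after a successful recovery."""
        self._failures.clear()


def recover(steps, notify):
    """Run (name, callable) recovery steps in order; stop at the first error."""
    notify("failure threshold reached, starting recovery")
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            notify(f"recovery aborted at step {name!r}: {exc}")
            return False
    notify("new instance is up")
    return True
```

Keeping the steps as plain callables means the manual procedure can be worked out first and each step automated later, one at a time.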

I can then examine the dead instance or the associated volumes to see what went wrong.

The EBS root volume shouldn't change in any interesting way very often, so cloning from the last snapshot is probably fine.
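Picking the "last snapshot" for a volume could look like the sketch below. It assumes snapshot records are dictionaries with `SnapshotId`, `VolumeId`, `State`, and `StartTime` keys, as in the entries returned by the EC2 DescribeSnapshots API; the function name and the sample data in the usage note are hypothetical.

```python
def latest_completed_snapshot(snapshots, volume_id):
    """Return the newest completed snapshot record for `volume_id`, or None."""
    candidates = [
        s for s in snapshots
        # Skip snapshots of other volumes and ones still pending or in error.
        if s["VolumeId"] == volume_id and s["State"] == "completed"
    ]
    if not candidates:
        return None
    # StartTime values must be mutually comparable (e.g. datetime objects).
    return max(candidates, key=lambda s: s["StartTime"])
```

For example, given two completed snapshots of the same volume, the one with the later `StartTime` is returned; a volume with no completed snapshots yields `None`, which is the case where detaching and cloning would be the only option.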

The EBS data volume is another story, however, so I will have to think carefully about whether to use a snapshot or to detach and clone. Cloning the volume from the dead server may also clone whatever problem caused the fault; however, I expect most if not all instance failures to be caused by external factors, so it may not be worth worrying about that.
