Failover: Keeping your Environment alive!

Why Everything I Have Is Broken

Computers break… it’s a fact of life! Sometimes it’s a nice quick fix, such as the one cunningly suggested by Randall Munroe. Sometimes it can take hours of trawling through logs. Whatever the cause, in an ideal world you want to fix the problem as soon as possible; the next best thing is to have a Failover – another server that works just as well!

In a Logscape environment, your Management Agent is the central point of your environment. It controls alerts, provides user access and runs the entire system: without it, you have nothing. So how do you make sure that your environment is resilient against a Management Agent failure? Simple… you add another one!

The recent 3.2 Logscape release has added new and improved Failover capabilities, making it easier to provide a seamless environment for your users. Having recently implemented this feature in my environment, I thought I’d share with you both the benefits – and the possible pitfalls – of installing this useful bit of kit.

You will need:

  • An existing Logscape Environment with a subscription (Failover is not supported without a license) running at least version 3.2.
  • A server you wish to make your new Failover Agent.
  • A little bit of understanding about ports.
  • A pinch of bravery.

Step 1: Take a Backup!!

Here is the backup screen

Remember I mentioned pitfalls earlier? Here’s the first one. Your workspaces, searches, alerts and users are all stored in the Space directory on the Manager. When you join a Failover to the environment, the Failover will sync this folder with its own copy, to ensure that they remain consistent. However, it is equally possible to sync your Manager to the Failover – which is the equivalent of wiping your environment!

Of course, you are not going to make such a mistake, dear reader; you have this blog post to assist and guide you. But just in case… take a backup. Simply use the Configure button on the Logscape screen, go to the Backup screen and click download to get an XML backup of your environment. That way, if anything goes wrong, you can get back to where you started.
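
Before moving on, it is worth a quick sanity check that the file you downloaded really is a usable backup. The sketch below only confirms that the file exists, is not empty and parses as XML; it knows nothing about Logscape's backup schema, and the function name is my own invention rather than anything shipped with Logscape.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def backup_looks_valid(path):
    """Return True if the file exists, is non-empty and parses as XML.

    This is a generic well-formedness check only; it does not validate
    the contents against any Logscape-specific structure.
    """
    backup = Path(path)
    if not backup.is_file() or backup.stat().st_size == 0:
        return False
    try:
        ET.parse(str(backup))
    except ET.ParseError:
        return False
    return True
```

Call it with the path to wherever your browser saved the download, for example backup_looks_valid("logscape-backup.xml") (a made-up file name), and only carry on to Step 2 if it returns True.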

Step 2: Configuring your new Failover Agent

The next element of setting up a Failover is relatively straightforward: installing Logscape on your new machine. Make sure it has sufficient resources – use this to help you.

If your machine has never had Logscape installed…

You can use the install instructions to get your server up and running; and if it’s a Windows host you just use the MSI. Make sure that the role has the name Failover at the end.

If your machine previously had Logscape installed…

Take care! If the machine was previously part of this (or any other) environment, then it is possible to overwrite your current environment. If you don’t mind losing previous data, I’d recommend uninstalling Logscape and completely deleting the install folder.

However, if you wish to keep some of the existing data, make sure you do the following:

  1. Stop the Logscape Service.
  2. Delete the contents of the logscape/downloads folder – replace them with the contents of the logscape/downloads folder found in the Logscape install zip file.
  3. Delete the logscape/work/DB folder
  4. Delete the following if they exist*:
    1. logscape/space folder
    2. logscape/work/jetty* folders
  5. Run the logscape/scripts/configure script (configure.bat or configure.sh) and reconfigure the agent as a Failover.

*They will only exist if the machine was previously a management agent rather than a Forwarder.
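
If you would rather script the deletions in steps 3 and 4 than do them by hand, something along these lines would work. It is a rough sketch: the install root is whatever folder you pass in, the function names are my own, and it deliberately does not touch the downloads folder from step 2 or run the configure script. Stop the Logscape service first, and keep the default dry run on until you are happy with what it reports.

```python
import shutil
from pathlib import Path


def stale_paths(install_root):
    """List the folders named in steps 3 and 4 that actually exist."""
    root = Path(install_root)
    candidates = [root / "work" / "DB", root / "space"]
    candidates += sorted((root / "work").glob("jetty*"))
    return [p for p in candidates if p.is_dir()]


def clean_for_failover(install_root, dry_run=True):
    """Delete the stale folders; with dry_run=True only report them."""
    found = stale_paths(install_root)
    for path in found:
        if dry_run:
            print(f"would delete {path}")
        else:
            shutil.rmtree(path)
            print(f"deleted {path}")
    return found
```

Run clean_for_failover("/path/to/logscape") once to see the list, then again with dry_run=False to actually remove the folders.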

Step 3: Check your Failover has started

Once started, your Failover should start acting as a Management Agent: providing a web front end, indexing data etc. It will take a few minutes to do this as it has to expand the Management files, copy various data files from the Manager and start its new services – so go make a quick cup of tea.

Once it’s been up for a few minutes, go to http://yourservernamehere:8080 – you should have a front end that looks remarkably similar to your Manager!
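
If you don't fancy refreshing the browser while the kettle boils, you can poll the front end instead. This is a minimal sketch using only the Python standard library; the five minute timeout and ten second interval are my own guesses, and it treats any HTTP 200 response as "up" without checking that the page is really the Logscape front end.

```python
import http.client
import time
import urllib.error
import urllib.request


def wait_for_front_end(url, timeout_s=300, interval_s=10):
    """Poll url until it answers with HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, http.client.HTTPException, OSError):
            pass  # not answering yet; try again after a pause
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For example, wait_for_front_end("http://yourservernamehere:8080") returns True as soon as the page responds, or False if nothing has answered by the time the timeout runs out.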

Step 4: Add the connection.properties file

You’re not home and dry yet. Whilst the Failover has started, it will not yet take over in the event of a crisis. For that, you need to create a connection.properties file. This should have the following contents:

manager.address=stcp://YOURMANAGERHOSTNAME:PORT
failover.address=stcp://YOURFAILOVERHOSTNAME:PORT

The default port is 11000 – so if you haven’t changed the ports from the default settings, use that. If you have amended them, ensure they are correct for each host.

Save that file as connection.properties and upload it on the deployment screen. Once the hosts have downloaded that file, bounce (restart) the system and they will be ready for Failover.
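
To avoid typos in the two addresses, you could generate the file contents rather than type them. The sketch below simply fills in the two lines shown above and defaults both ports to 11000; the function name and the host names in the example are placeholders of my own.

```python
def connection_properties(manager_host, failover_host,
                          manager_port=11000, failover_port=11000):
    """Build the two-line connection.properties content as a string."""
    for port in (manager_port, failover_port):
        if not 1 <= int(port) <= 65535:
            raise ValueError(f"not a valid TCP port: {port}")
    return (
        f"manager.address=stcp://{manager_host}:{manager_port}\n"
        f"failover.address=stcp://{failover_host}:{failover_port}\n"
    )


if __name__ == "__main__":
    # Placeholder host names; substitute your own Manager and Failover.
    with open("connection.properties", "w") as handle:
        handle.write(connection_properties("manager01", "failover01"))
```

Running the script writes connection.properties into the current directory, ready to upload on the deployment screen.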

Step 5: Failover Test!

You should now be in a position to test that the Failover is working. To make it easier to check, edit the logscape/agent-log4j.properties file and add the following line to turn on Failover logging:

log4j.logger.com.liquidlabs.vso.agent.outtage=DEBUG
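
If you have several hosts to update, the line can be added with a small script that is safe to run more than once. This is a sketch of my own rather than a Logscape tool: it appends the logger line only when an identical line is not already in the file, and it assumes the file already exists at the path you give it.

```python
from pathlib import Path

FAILOVER_LOGGER = "log4j.logger.com.liquidlabs.vso.agent.outtage=DEBUG"


def ensure_line(path, line=FAILOVER_LOGGER):
    """Append line to the file unless an identical line is already present.

    Returns True if the file was changed, False if it was left alone.
    """
    config = Path(path)
    text = config.read_text()
    if line in (existing.strip() for existing in text.splitlines()):
        return False
    separator = "" if text == "" or text.endswith("\n") else "\n"
    config.write_text(text + separator + line + "\n")
    return True
```

Calling ensure_line("logscape/agent-log4j.properties") a second time leaves the file untouched, so there is no harm in rerunning it.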

The test process is as follows:

  1. Log on to Manager and Failover – confirm they both work.
  2. Create a search on the Manager – confirm it appears on the Failover.
  3. Stop the Manager. The Failover should bounce twice: first when the Manager shuts down, and again once it has tried and failed to reach the Manager 10 times. After this, the agents will come to the Failover once they too have failed to reach the Manager.
  4. Confirm that the Failover is still working as expected.
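
While you are running through the steps above, a quick way to see which of the two hosts is currently listening is to try opening a TCP connection to each. This sketch assumes the stcp:// addresses in connection.properties correspond to plain TCP ports that accept connections, which is my reading rather than something the Logscape documentation confirms here; the helper name is also my own.

```python
import socket


def port_is_open(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False
```

With the Manager stopped, you would expect port_is_open("YOURMANAGERHOSTNAME", 11000) to return False while the same call against the Failover host returns True.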

Step 6: Return to normal operations

Assuming the above test has worked, you should be pretty pleased that you have a more resilient environment. Now you want the Manager back up and running, so start it back up.

At present, the Failover process does not automatically fail back. Once the Manager is back up and running, you will need to bounce the Failover yourself; doing this will restart the agents and they will return to the Manager.

It’s worth bearing this in mind – if you don’t bounce the Failover and return the agents, then it is possible that you will end up with the Manager and Failover running independently – meaning they will no longer be in sync.

Hopefully you have found this guide useful – please let us know if it’s helped you or if there are any other areas which could use some guidance!