Using Logscape with HPC, 1 of 3

Today we have a blog article by guest writer Ben Newton. Ben manages an HPC Grid where Logscape is used in anger for all their management needs. This is the first of a three-part series by Ben on how they use Logscape and built their monitoring solution. You can find more of Ben’s work on his GitHub page, or his LinkedIn.

Microsoft HPC Server 2012 – More Compute, More Monitoring 

640K ought to be enough for anybody.

    – Unknown… but not Bill Gates!

Businesses have more data than ever and increasingly ingenious ideas about how to use it. The end result is that even a small start-up can end up with operations that might run for hours on a single desktop machine. You might hear rumours of Excel spreadsheets that are left to run on a desktop overnight – the crash of a machine or an unexpected outage causes curses and massive delays.

The answer is to use High Performance Computing (‘HPC’): much bigger quantities of compute power used to tear through the calculations at a greater speed. Whilst those with deep wallets may still pay for mainframes, most would consider Grid computing the sensible answer – multiple machines linked together in a Grid, with a master scheduler distributing the work and collecting the results. This can result in vastly improved performance and fault tolerance. However, the middleware that organises the workload balancing becomes crucial to your operation – which means it needs careful monitoring. What else would we choose for that but Logscape?

Our recent efforts focused on monitoring Microsoft HPC Server 2012 R2 – the giant’s latest offering. It’s a step up from the previous 2008 incarnation in many respects, notably:

  • A better looking interface – very important when you look at it all day.
  • Better Azure support – allowing you to burst work to Azure Nodes on demand directly, even using Linux Nodes.
  • Finer degrees of control around workload balancing and resource pooling.

The latest version does include some native monitoring, perhaps enough for a casual user or a small business. However, some of the main issues we found were:

  1. The data isn’t granular enough to report on individual node performance.
  2. You can’t create graphs which overlay different metrics over one another.
  3. You can only report on one cluster at a time – making it laborious to cross-check multi-cluster environments.
  4. You can only filter on one criterion at a time, making multiple reports time-consuming.
  5. You need to be an Administrator to see the Charts and Reports section.

Whilst this monitoring is a step up from what Microsoft previously offered[1], a serious team running the Grid will want more detail and information. This has culminated in the fresh release of the Microsoft HPC App for Logscape. It simply plugs into your Logscape environment and gives you instant visibility over your cluster, with flexibility in both monitoring and reporting.

This article will focus on some of the challenges we faced as well as technical explanations as to how we overcame them. However, if you just want to get straight into monitoring your Grid, download the app and follow the quick start guide!

Defining the Requirements: What metrics are useful?

There are known knowns … there are known unknowns … But there are also unknown unknowns – the ones we don’t know we don’t know.

    – Donald Rumsfeld

In order to be clear about how HPC Server works, it’s probably worth defining a few terms:

Head Node: The master scheduler, which controls the grid and ensures Nodes are properly balanced.

Broker Node: An optional but recommended Node type, this connects the client machines to the Grid, ensuring work is processed and queued efficiently.

Compute Node: A computer which will run the tasks set by the Head and Broker Nodes.

Job: The parent operation, submitted by the user. It contains one or more tasks.

Task: The initial division of the workload; tasks are queued and distributed to the Nodes of the Grid. Each core of each Node can work on one task at a time. Tasks may require one or more calls.

Call: A call to the system components – the individual actions of the task.
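As an illustrative sketch – the class and field names below are our own, not part of HPC Server’s API – the Job → Task → Call hierarchy can be modelled like this:

```python
from dataclasses import dataclass, field

@dataclass
class Call:
    """A single action executed as part of a task."""
    duration_ms: float

@dataclass
class Task:
    """A unit of work queued to a core; may issue many calls."""
    calls: list = field(default_factory=list)

@dataclass
class Job:
    """The parent operation, submitted by a user; contains one or more tasks."""
    owner: str
    tasks: list = field(default_factory=list)

    def total_calls(self):
        # Calls roll up through tasks to the job.
        return sum(len(t.calls) for t in self.tasks)

job = Job(owner="ben",
          tasks=[Task(calls=[Call(12.5), Call(8.0)]),
                 Task(calls=[Call(3.1)])])
print(job.total_calls())  # → 3
```

The point of the hierarchy is that the natural unit of “work done” sits at the bottom, in the calls – which matters later when we look at SOA jobs.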

Our initial user stories when asked to build the monitoring ran along the following lines:

  1. I want to know how many jobs are in progress and how far they’ve progressed.
  2. I want to be able to monitor the current state and performance of a compute node, a group of nodes or the entire grid.
  3. I want to be alerted when bottlenecks occur in time to resolve them.
  4. I want to know the utilisation of the Grid at any point.
  5. I want to know how much work each user or project has put through.

So we wished to monitor the current state of the Grid as well as report the history of utilisation and performance for long-term trends. Some of these sound easy – how hard can it be to tell how far a job has progressed? Well, it all depends on your Job type.

Job Types

Microsoft HPC Server 2012 R2 allows multiple job types – methods of putting work on the Grid. The simplest to use is a Batch Job: a single, independent script that generates a large number of tasks which are independent of each other. When these jobs are placed on the HPC Grid, they are relatively simple to monitor, because the completion time of each task should be similar for every core.

A Parametric Sweep is similar to a batch, except that HPC passes a limited selection of input parameters to the same script on every core. Again, the task time should be similar for every core, and task queues may well be sufficient for monitoring.
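As a rough sketch of how a sweep fans out – the placeholder convention mirrors HPC’s asterisk substitution, but the helper itself is our own illustration:

```python
def expand_sweep(command, start, end, step=1):
    """Expand a parametric sweep definition into one task per parameter
    value, substituting '*' with the current value."""
    return [command.replace("*", str(i)) for i in range(start, end + 1, step)]

# Four tasks, each running the same script against a different input file.
tasks = expand_sweep("process.exe -input data_*.csv", 1, 4)
for t in tasks:
    print(t)
```

Because every task is the same script over a different parameter, the per-task runtime is roughly uniform – which is exactly why task-level queue counts are a good enough signal here.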

However, HPC Server also allows for Service Oriented Architecture Jobs (‘SOA’) – which do not work in such a simplistic manner. Using SOA, each core hosts an SOA service, which may use the Windows Communication Foundation (‘WCF’) to send and receive information and run work. This would appear to be the most useful model – since it means the job does not have to be isolated but can act recursively, react to fresh input or change in scale as required. It means that as long as the Task persists, the connection to the service remains and work can be passed through.

This is the preferred option for developers: a much more flexible attitude to Grid computing. Unfortunately, it renders Task tracking and various inbuilt monitoring less useful. In an SOA job, a Task will last as long as the SOA session persists – which could be seconds, minutes or hours depending on the underlying program logic. In theory a Job and an associated Task could persist for hours, without any Calls to process. This is why we resorted to measuring System Calls as a method of determining workload and current queue length – they give us the most accurate picture of the state of the Grid.
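The queue-length idea reduces to a simple difference: calls submitted minus calls completed across the nodes. A minimal sketch – the sample shape and field names here are our own illustration, not HPC’s counter schema:

```python
def outstanding_calls(samples):
    """Estimate current grid queue length from per-node counter samples
    as calls submitted minus calls completed."""
    submitted = sum(s["calls_submitted"] for s in samples)
    completed = sum(s["calls_completed"] for s in samples)
    return submitted - completed

samples = [
    {"node": "CN01", "calls_submitted": 1200, "calls_completed": 1150},
    {"node": "CN02", "calls_submitted": 900,  "calls_completed": 880},
]
print(outstanding_calls(samples))  # → 70
```

Unlike Task counts, this number moves even while a long-lived SOA session sits idle or churns through work, so it tracks the real load on the Grid.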

For tracking the Node performance, we decided to use traditional metrics: disk throughput, processor use and context switching among others. We also needed to ensure the Broker metrics were being collected, including the WCF and MSMQ traffic. Some of this could theoretically be gained using the Windows App – however, it became clear that it would need to be integrated with the HPC App in order to deal with Windows Azure, where the user may not have a Forwarder on the system. Therefore, we made the decision to integrate these metrics into the HPC App.
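Counters such as context switches and disk bytes are typically exposed as cumulative totals, so a monitor differences successive snapshots to get per-second rates. A minimal sketch, with illustrative counter names of our own choosing:

```python
def rate(prev, curr, interval_s):
    """Convert two cumulative counter snapshots into per-second rates
    over the polling interval."""
    return {k: (curr[k] - prev[k]) / interval_s for k in curr}

# Two snapshots taken one polling interval (60 s) apart.
prev = {"ctx_switches": 10_000, "disk_read_bytes": 5_000_000}
curr = {"ctx_switches": 16_000, "disk_read_bytes": 8_000_000}
print(rate(prev, curr, interval_s=60))
```

The same differencing applies to any cumulative counter the Broker exposes, such as WCF message or MSMQ queue totals.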

It became clear we were going to need to split our requirements in two. Monitoring logs would poll data frequently, giving us the information to build granular queries, provide up-to-the-minute monitoring and keep an eye out for problems. However, to meet the needs of reporting, aggregate data would also need to be collected so we could provide monthly or annual statistics and chart long-term trends.
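One way to picture the split: frequent monitoring samples get rolled up into coarser buckets for reporting. A sketch of an hourly rollup, with timestamps and values of our own invention:

```python
from collections import defaultdict
from datetime import datetime

def hourly_rollup(samples):
    """Aggregate frequent (timestamp, value) monitoring samples into
    hourly averages suitable for long-term reporting."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Truncate each timestamp to the top of its hour.
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(value)
    return {hour: sum(v) / len(v) for hour, v in buckets.items()}

samples = [
    (datetime(2014, 5, 1, 9, 5), 40.0),
    (datetime(2014, 5, 1, 9, 35), 60.0),
    (datetime(2014, 5, 1, 10, 5), 80.0),
]
print(hourly_rollup(samples))
```

The raw samples answer “what is happening right now?”, while the rollups answer “what happened this month?” – the same data serving two different requirements.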

Once we were clear on how we would measure a job’s progress (Calls) and the Nodes’ performance, we then decided to cover utilisation at both a Node and core level, since that would give us more granular information. Now all we had to do was find the data that would give us this information.