Today Ben Newton returns for the second in a series of three blog articles covering his progression through building a monitoring solution for Microsoft HPC with Logscape. Today's article covers data collection: where the data was sourced, and how he chose to format it. You can find more of Ben's work on his Github page, or his LinkedIn.
“Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay.”
– Sherlock Holmes, The Adventure of the Copper Beeches
Having decided what we wanted to measure, we now needed to collect the raw data. Since HPC Server 2012 is a distributed network of machines, our first challenge was to discover where Microsoft actually stores the kind of data we were interested in. It is simply not possible to retrieve all the information that might be needed from one source – we discovered that there were in fact three locations, each containing different data:
- The Head Node, containing up to date information regarding the current state of the Grid.
- The Database, containing the history of actions and changes.
- The Compute Nodes, hosting application logs and detailed action logging.
Each source required a different data mining technique.
Head Node – PowerShell
One of the most useful tools HPC Server comes with is the PowerShell Snap-In – a collection of command line tools that allow you to interact with the HPC Cluster. You can administer the grid entirely programmatically if you choose, which includes pulling back useful monitoring information.
To implement Logscape monitoring, a parent script runs PowerShell scripts as forked processes under Logscape. This meant we could design some simple PowerShell scripts to pull the monitoring information we required and write it to a file, which Logscape could then ingest. All of the PowerShell scripts are endless loops – they start once and then output data on a schedule. This reduces the overhead of repeatedly starting and stopping them.
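The endless-loop pattern is simple enough to sketch in a few lines. This is an illustrative Python sketch of the pattern only – the App itself uses PowerShell, and the file name, interval and snapshot fields below are placeholders:

```python
import json
import time
from datetime import datetime, timezone

OUTPUT_FILE = "cluster-overview.log"   # hypothetical output file, tailed by the ingester

def collect_snapshot():
    # Placeholder for the real collection step (the App calls HPC cmdlets here).
    return {"timestamp": datetime.now(timezone.utc).isoformat(), "jobsRunning": 0}

def run_collector(iterations=None, interval=30):
    """Start once, then emit a snapshot on a schedule; no per-run start-up cost.

    `iterations` exists only so this sketch can be exercised finitely;
    the real services effectively pass None and loop forever.
    """
    count = 0
    while iterations is None or count < iterations:
        with open(OUTPUT_FILE, "a") as f:   # append, so only new lines are picked up
            f.write(json.dumps(collect_snapshot()) + "\n")
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval)
```

The key point is that the loop body only ever appends a line and sleeps, which is why the dormant processes cost so little.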
This method has several advantages, the first being flexibility – these services can in theory run on any host connected to the Cluster to pull back the monitoring for that cluster. In practice, we found it sensible to run the scripts on the Head Node of each Cluster, which lets the host field of the data act as the Cluster delineation.
Originally, we had some concerns that there might be a performance penalty, but none proved noticeable. Most of the time the services are simply counting down to the next request, leaving the PowerShell processes dormant.
Here is a brief overview of the Services provided – if you want to examine the code download the app and look in the lib directory.
Cluster Overview
A snapshot of the current Grid status: Jobs running, Cores available and so on. Despite its simplicity, this is one of the most useful outputs we monitor, because it gives a reactive, high-level view of changes in queues and Node availability. Since it provides the aggregate view of Cluster activity, it's perfect for base-lining alerts.
Cluster Settings
Information on the Cluster settings – the configuration set in the Head Node. These should rarely change, so the script runs only every few hours. The main use for this information is to let Administrators compare two clusters against each other, or track changes to underlying settings made by colleagues.
Node Metrics
The current state of each Node, and performance metrics as seen by the Head Node. One advantage of tracking Node performance and availability this way is that it more closely reflects how the Node interacts with the cluster. It also means that you do not necessarily require a Logscape Forwarder on each compute node.
We found this especially useful with Azure Compute Nodes, because their status or performance may well be abstracted from any on-premises monitoring or interface. Of course, should you delete the Node from the Cluster or prevent its communication, you will lose performance metrics until it returns.
Event Log
The HPC Event Log entries from the Head Node – with SQL Success events removed, as they were too numerous and obscured useful data. Useful for diagnosing errors, and for discovering whether the HPC Services themselves were restarted.
Job History
Pulls back a detailed Job History, which is useful for analysing historic trends to anticipate future requirements. Unfortunately, the HPC Server documentation makes clear that the Job History takes a while to stabilise – so rather than fetching a job's history as soon as it completes, the script fetches older history on a slightly slower schedule.
Live Call Totals
This query pulls back the number of completed and queued calls across the Cluster as a whole, as well as the average duration of those calls. If there are no calls running, it returns 0. This is a highly important data feed, because it shows how much work is actually queued inside the SOA sessions, and the average call duration indicates how quickly that queue is likely to drain.
The standard implementation runs this query every 30 seconds, which we found a good balance between regular information and performance. This information combines brilliantly with the Cluster Overview data, because both take a high-level, aggregate view.
Live Call Individuals
Provides the number of Calls pending and completed for each active Job, as well as Job submission information. Unlike the previous data set, this only returns data if there are Jobs in an active state. This data allows you to track live Grid utilisation by Project, Pool or Template – so it becomes evident if a certain user base has queued up too much work for their available capacity.
The PowerShell information is incredibly useful, simple to obtain, and comes from a single source – which is why a large amount of the monitoring data comes from here. However, for aggregate reporting and utilisation analysis, snapshot data is less useful.
The Database – SQL
Having mined all the data we could from PowerShell, we still had unfulfilled requirements, largely around reporting and error handling. To find reporting data, we had to investigate the SQL database which HPC Server uses to store its information.
There are two databases of interest for monitoring: Reporting and Scheduler. Reporting is used by the internal reporting tool for long term utilisation metrics, whilst Scheduler is used in the running of HPC to record actions.
It should be noted that Microsoft advises against using the Scheduler database for monitoring to avoid database locks – which is why they have abstracted the Reporting information to a separate database. An earlier iteration of the App did use some small SQL queries to collect the Live Call information – and despite months of testing, no errors became apparent. However, in order to prevent any latent issues and to make implementation easier, they were redesigned as PowerShell scripts.
Logscape runs a Groovy script on a schedule that connects to the database, runs the query and then outputs the data to a file. Unlike the PowerShell methodology, no process persists between executions. Whilst this increases the overhead slightly on the machine running the scripts, it prevents the SQL connection from staying open which we wished to avoid at all costs.
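The pattern is deliberately simple: connect, query, write, disconnect. Here is a Python sketch of that pattern using sqlite3 as a stand-in – the App itself uses a Groovy script against SQL Server, and the table and columns below are illustrative, not the real Reporting schema:

```python
import csv
import sqlite3

def run_report_query(db_path, output_path):
    """One scheduled run: open a connection, query, write the results, close.

    Nothing persists between runs, so no SQL connection is ever left open.
    """
    conn = sqlite3.connect(db_path)          # real App: a JDBC connection to the Reporting DB
    try:
        rows = conn.execute(
            "SELECT JobId, Owner, CoreSeconds FROM JobUsage"  # illustrative query
        ).fetchall()
    finally:
        conn.close()                         # closed before anything is written out
    with open(output_path, "a", newline="") as f:
        csv.writer(f).writerows(rows)        # file output for Logscape to ingest
    return len(rows)
```

Closing the connection before writing the file keeps its lifetime as short as possible, which is the property we cared about.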
This is essentially the aggregate history of your Grid's work, in a format slightly more convenient for producing reports. Combined with the Job History data, it gives you a complete overview of your historic use.
This collects the Utilisation of the Grid. Since you can break it down by Node, you can determine fairly quickly how optimal your set-up is, and whether or not you need further resources.
The database queries complete the monitoring and reporting aspect of HPC – allowing oversight of usage and activity. However, they do not cover errors that have occurred on the Nodes themselves.
Node Logging – HPC Parser
Any SOA application written to use HPC Server should log its own actions: most write to a local file. You can then use Logscape to collect and ingest those application logs. Since the log format will be designed by you, its details are beyond the scope of this article.
HPC Server also performs its own action logging, compressed and stored locally on each Node. If you wish to analyse these logs for errors, you normally have to go through a long manual process on every single Node, using the hpctrace command to extract the files you're interested in.
In order to make this simpler, the HPC Parser was developed – it is now integrated into the HPC App. It allows administrators to schedule automatic extraction of logs, with criteria determining the detail level and which logs are extracted. In addition, there is a manual option which allows a user to log onto an affected machine and choose from a menu which logs need extraction.
The upside is that, with a correctly configured HPC Parser process, you can automatically ingest important warning and error logs. However, since it runs on each node individually, there must be a Logscape Forwarder on each node – and if you are using Windows Azure, you will need to ensure your environment is Cloud Compatible.
Output and Some Lessons Learnt
Check your format is suitable
Most of these scripts output comma-separated values (CSV). The advantage is that CSV is the most disk-space-efficient format. The disadvantage is that the data must be typed correctly within Logscape – losing or changing the data type will render the data useless.
For the Node data we switched to JSON output, which uses key-value pairs. This data is much more portable, although it takes up more space. It also requires less typing, since Logscape can extract the values directly.
This switch was forced on us because our original implementation was flawed: each node attribute was a separate line (data format: Date, Node Name, Metric Name, Metric Value). Whilst this worked fine in a dev environment, in an environment with 15 Nodes it produced 225 lines of data every 30 seconds (15 nodes × 15 metrics). With the new JSON implementation, the line count has dropped to 15 lines every 30 seconds, making for a much more manageable file. It's also much more functional, and allows for cross-referencing.
Make sure you collect all the meta-data
When we first started collecting data from HPC Server, we collected information on which Job was involved. Then someone asked about the Project, so we added Project. Then someone needed the owner… you get the idea. The more meta-data you attach, the more ways you can split the data and report on it.
As a general rule, we discovered that for the SOA application logs it was essential to log the Job ID and the Session/Task or Scenario ID on every line. If a node deals with 200 tasks and one errors, it is much easier to determine which task was at fault if that task's ID is recorded. Also, ensure you log a final success/failure state for each Job ID and Task ID if you'd like a comfortable life!
For monitoring, percentages beat numbers
Is 2345MB RAM in use a problem on Host X? You don't know. Is 15% in use a problem? No. If you can make sure all your data resolves into percentages, your monitoring charts will always be consistent and easier to work with. Also, make sure you have a field for 100Pct (and maybe 50Pct) in your data-type. This means your graph will always resolve to 100%; otherwise Logscape will scale the graph to the data, making it less useful as a live monitor because you need to check the scale first.
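As a sketch of the idea in Python – field names are illustrative, and assuming a hypothetical 16384MB host, so that 2345MB comes out at roughly the 15% in the example above:

```python
def memory_event(host, used_mb, total_mb):
    # Emit memory use as a percentage, plus constant 100Pct/50Pct fields so a
    # chart built on this data always resolves to a fixed 0-100 scale.
    return {
        "Host": host,
        "MemUsedPct": round(100.0 * used_mb / total_mb, 1),
        "100Pct": 100,
        "50Pct": 50,
    }

print(memory_event("HostX", 2345, 16384))  # MemUsedPct comes out at 14.3
```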
Heat maps work
Data can be heat mapped – assigned colours according to either a numeric scale or its relative position. All HPC App data has been heat mapped – make sure yours is too! It makes it far easier to spot outliers and issues quickly.
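At its simplest, a heat map is just a bucketing function over a numeric scale. A minimal Python sketch – the 50/80 thresholds are illustrative, not the App's defaults:

```python
def heat_colour(pct, amber_at=50.0, red_at=80.0):
    # Bucket a percentage into traffic-light colours so outliers stand out.
    if pct >= red_at:
        return "red"
    if pct >= amber_at:
        return "amber"
    return "green"

print([heat_colour(v) for v in (12.0, 65.0, 97.0)])  # green, amber, red
```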
At last, we have all our data! Now all we needed to do was provide a nice, simple interface…
Don’t forget to check back tomorrow for the third and final instalment from Ben where he will go over the process of building the App within the Logscape environment.