Log Analysis Performance:
...is akin to driving a car, taking the family on holiday, hitting a big hill and struggling to get to the top. Once you finally reach the peak you are told you need a bigger car, more cars or a bigger engine (or that your license doesn’t cover this much luggage!), leaving you in the state of mind: ‘I’m here now, I can’t go back... but I need to handle more data ($$)’.
Jokes aside – we have been putting a lot of effort into performance improvements on the latest release. Log monitoring just got faster and more affordable.
Before digging deeper, it makes sense to cover the pain points of anything that solves data-centric problems: the cost of storage and of processing capability. Over the last couple of years, architecture has come to accept that it’s not scale-up (bigger boxes) but the ability to scale out (more boxes) that is the golden arrow. Everything has a cost, so the limiting factors determine a suitable architecture. For example, if my licenses are cheap or data-centric then I can use more commodity (existing) hardware; however, its disks are likely to be slow. Scaling out solves this by throwing more machines at the problem (power and maintenance costs aside). Mind you, SSD prevalence is impending! In our case, we recognise the trade-off between increasing server performance (disk performance, core counts) and the relative savings. It is reasonable to assume that a 2-3 server deployment is easier to live with than a 10-server deployment.
Q: What balance do we strike between data and processing density?
To serve as a sizing guide, follow us through the benchmarking analysis we performed on the latest Logscape release. We are focusing on an IndexStore whose only purpose in life is to receive remote data, make it searchable and serve the searches.
What to benchmark?
The answer is another question: what is the server doing? It’s participating in a distributed compute environment: multiple remote servers stream data to it, and that data is persisted, indexed and searched.
The 2 most important elements are:
- How much data can I index per day? (WRITE)
i.e. the sustainable index rate in MB/s
- How fast can data be processed at search time? (READ)
i.e. how quickly can we serve user search requests?
Data Processing Hardware
- HP PROLIANT DL380 G5,
- DUAL INTEL XEON QUAD CORE X5460 @ 3.16GHz
- 16GB RAM
- 4 x 146GB 10000RPM HDD’s [Raid-0]
- Ubuntu 12.x
Yes, the server is a bit dated, but it serves well as an industry benchmark. We named it “battlestar”.
IO Subsystem performance
Using dd we can measure sequential write and read performance in MB/s.
logscape@battlestar:~$ dd bs=100M count=10 if=/dev/zero of=test conv=fdatasync
10+0 records in
10+0 records out
1048576000 bytes (1.0 GB) copied, 3.93265 s, 267 MB/s
logscape@battlestar:~$ dd bs=100M count=10 of=/dev/zero if=test conv=fdatasync
10+0 records in
10+0 records out
1048576000 bytes (1.0 GB) copied, 0.701993 s, 1.5 GB/s
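We can convert this to IOPS using a simple disk-oriented formula; from memory, each of these disks runs at about 200 IOPS. Note that the read test ran straight after the write, so the 1.5 GB/s figure probably reflects the OS page cache rather than the disks themselves.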
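The dd figures above can be cross-checked with a few lines of arithmetic. The sketch below recomputes throughput from the byte counts and elapsed times that dd printed, and applies the usual rule of thumb relating IOPS to throughput (throughput = IOPS x I/O size). The 64 KiB I/O size and the helper names are assumptions made for illustration; only the roughly 200 IOPS per disk and the 4-disk array come from this post.

```python
# Cross-check of the dd output. dd reports decimal units (1 MB = 10**6 bytes).
BYTES_COPIED = 1_048_576_000   # 10 blocks of 100 MiB, as printed by dd
WRITE_SECONDS = 3.93265
READ_SECONDS = 0.701993


def throughput_mb_s(num_bytes: int, seconds: float) -> float:
    """Throughput in decimal megabytes per second."""
    return num_bytes / seconds / 1e6


def iops_to_mb_s(iops: float, io_size_bytes: int) -> float:
    """Rule of thumb: throughput = IOPS x I/O size."""
    return iops * io_size_bytes / 1e6


write_rate = throughput_mb_s(BYTES_COPIED, WRITE_SECONDS)
read_rate = throughput_mb_s(BYTES_COPIED, READ_SECONDS)

# Assumptions for illustration: ~200 IOPS per disk (quoted from memory in the
# post), 4 disks striped in RAID-0, and a hypothetical 64 KiB random I/O size.
DISKS = 4
IOPS_PER_DISK = 200
IO_SIZE_BYTES = 64 * 1024
random_rate = iops_to_mb_s(DISKS * IOPS_PER_DISK, IO_SIZE_BYTES)

print(f"sequential write: {write_rate:.0f} MB/s")
print(f"sequential read:  {read_rate:.0f} MB/s")
print(f"random I/O at 64 KiB: {random_rate:.1f} MB/s")
```

Under these assumptions, random I/O throughput comes out far below the sequential figures, which is why the access pattern of the workload matters as much as the headline MB/s.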
We used a RAID-0 configuration because we wanted to ensure that disk wasn’t the limiting factor. After all, we have 8 x 3.16 GHz cores to milk 😉
Logscape Agent JVM Configuration
Our technology stack is Java based. We are using an older JDK, jdk1.7.0_07 (to be updated).
Given that we have 6 months of historic data we want to maximize the use of Heap and Off-Heap storage.
- JVM Heap: -Xms4G -Xmx4G -XX:MaxDirectMemorySize=10G
(Heap = 4GB, off-heap indexes = 10GB)
- Data tenuring period: sysprops: -Dlog.max.slow.days=999
(this tells the agent to treat all data as new and process it in normal-priority threads; without this setting, long historical searches are treated as background tasks)
- Threads: sysprops: -Dlog.search.threads=8
(use 8 processing threads; otherwise Logscape allocates the number of cores minus 2)
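As a rough sanity check, the settings above can be laid against the hardware. This is only a back-of-envelope sketch: the variable names are made up for illustration, and the numbers are taken from the hardware list and the JVM flags in this post.

```python
# Rough memory and thread budget for the agent settings (illustrative only).
TOTAL_RAM_GB = 16    # from the hardware list
HEAP_GB = 4          # -Xms4G -Xmx4G
DIRECT_GB = 10       # -XX:MaxDirectMemorySize=10G
CORES = 8            # dual quad-core Xeon

# Heap plus off-heap (direct) memory the JVM is allowed to claim.
jvm_budget_gb = HEAP_GB + DIRECT_GB

# What remains for the OS, the page cache and other JVM overhead.
headroom_gb = TOTAL_RAM_GB - jvm_budget_gb

# Default described in the post (cores minus 2) versus the configured value.
default_search_threads = CORES - 2
configured_search_threads = 8    # -Dlog.search.threads=8

print(f"JVM heap + off-heap: {jvm_budget_gb} GB of {TOTAL_RAM_GB} GB")
print(f"headroom: {headroom_gb} GB")
print(f"search threads: default {default_search_threads}, configured {configured_search_threads}")
```

The real headroom is somewhat smaller than the 2 GB computed here, because a JVM also uses memory outside the heap and direct buffers (thread stacks, class metadata and so on).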
It’s all about the Data:
We are processing 2 sorts of data: SysLog [4.4GB] and Application Log [2.2GB] (log4j). Each data type has a different profile in terms of standard fields and discovered fields. Note: field discovery is on.
Benchmark – Indexing Performance
Importing the Syslog data took 4 minutes with 40% of CPU resources allocated to the task. That leaves 60% of CPU free for other tasks such as incoming streams and search requests.
Indexing occurs at a rate of 18.3MB/s.
Scaled out to minutes, hours and days, 40% CPU allows for 1.1GB/min, 66GB/h and 1584GB/day.
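The scaling above is simple multiplication, sketched here for anyone who wants to plug in their own index rate. The function name is hypothetical and decimal units (1 GB = 1000 MB) are assumed. Computed without intermediate rounding, the daily figure lands slightly below the 1584GB quoted above, which comes from rounding to 1.1GB per minute first.

```python
# Scale a sustained index rate up to per-minute, per-hour and per-day volumes.
INDEX_RATE_MB_S = 18.3   # measured index rate from this post
SYSLOG_GB = 4.4          # size of the Syslog data set


def scale_rate(mb_per_s: float) -> tuple[float, float, float]:
    """Return (GB per minute, GB per hour, GB per day), assuming 1 GB = 1000 MB."""
    per_min_gb = mb_per_s * 60 / 1000
    per_hour_gb = per_min_gb * 60
    per_day_gb = per_hour_gb * 24
    return per_min_gb, per_hour_gb, per_day_gb


per_min, per_hour, per_day = scale_rate(INDEX_RATE_MB_S)

# Time to import the Syslog data set at that rate.
import_seconds = SYSLOG_GB * 1000 / INDEX_RATE_MB_S

print(f"{per_min:.2f} GB/min, {per_hour:.1f} GB/h, {per_day:.0f} GB/day")
print(f"Syslog import: {import_seconds / 60:.1f} minutes")
```

The import time works out at about 4 minutes, consistent with the measured run.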
Benchmark – Search Performance
What’s the point of indexing 1584GB/day if you can’t serve it up to users? The real challenge here is to understand where the bottleneck is in the READ process. Search use cases are as varied as the British weather.
The worst case is the brute-force ad hoc search by a power user who wants the world!
The Search (7 days):
* | _host.count() _agent.equals(lab.uk.IndexStore)
This will pull back everything from this particular IndexStore (there are 12 servers in this environment)
We can use the Logscape UI as an indicator: this search took 22s to return 8,953,585 results, giving us just over 400K events per second. Search performance is also audited in logscape/work/event.log, where in this case we see 492K events per second. The difference is due to about 3 seconds of search-completion coordination.
Search performance is about 500K events per second.
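The two rates can be reconciled with a short calculation. This sketch uses only the figures quoted above; the variable names are illustrative, and the "overhead" is simply whatever part of the 22 seconds is not explained by the audited rate.

```python
# Reconcile the UI timing with the rate audited in event.log.
RESULTS = 8_953_585             # events returned by the search
UI_SECONDS = 22                 # wall-clock time seen in the Logscape UI
AUDITED_EVENTS_PER_S = 492_000  # rate reported in logscape/work/event.log

# End-to-end rate as experienced by the user.
ui_rate = RESULTS / UI_SECONDS

# Search time implied by the audited rate, and the remainder of the 22 seconds.
implied_search_seconds = RESULTS / AUDITED_EVENTS_PER_S
overhead_seconds = UI_SECONDS - implied_search_seconds

print(f"UI rate: {ui_rate:,.0f} events/s")
print(f"implied search time: {implied_search_seconds:.1f} s")
print(f"unexplained overhead: {overhead_seconds:.1f} s")
```

The overhead comes out at a few seconds, in the same ballpark as the coordination time mentioned above.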
Performing another search against syslog (mail log) data, we see a rate of 292K events per second.
We are currently performing more analysis, upgrading JDKs etc. We will come back and update with a few screen grabs to show the tools we used as part of this process.