Killing your Logstash performance with Grok
Your team is developing the next-generation software-as-a-service solution, one which promises to revolutionize the software industry. You have been reading up on all the latest trends in cloud development: 12-factor apps, microservices, continuous deployment, you name it.
However, you quickly realize that operating such a software solution will not be child's play. With so many virtual machines or containers running small software components written in various languages, it becomes impossible to troubleshoot issues without a proper logging and monitoring stack in place.
Luckily, you stumble upon the Logstash, ElasticSearch, and Kibana trio. The first promises to enable easy log collection and processing, whereas the latter two should allow for distributed storage and proper visualization of the data.
Having set up ElasticSearch and Kibana, you turn your attention to Logstash. You see that there is a vast set of plugins available, and one catches your eye: Grok. It should allow you to extract structured data from your otherwise plain line-based logs, so you start writing Grok expressions and are satisfied with the amount of useful data you are able to extract (log level, actual timestamp, source machine name, and so on), all fields you can later filter on in Kibana.
All is going well, except that the performance you get is not what you had expected. The VM running Logstash is on fire, with the CPU pegged at 100%.
Logstash Tuning
You start looking on the web and realize that you forgot to tune some Logstash settings. Nothing to worry about.
You use the LS_HEAP_SIZE environment variable to allocate a fixed chunk of memory for the heap, ensuring no swapping to disk occurs on the machine. When this does not help, you find out there is a -w option to configure the number of Logstash workers. With still no improvement, you turn to the -b option, which lets you control the batch size in Logstash. After that option does not produce any visible improvement either, you give up and just scale the Logstash component horizontally to handle the load.
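For reference, the knobs above look roughly like this on the command line. This is only a sketch for the Logstash versions of that era (the 1.x/2.x series): in newer releases the heap is configured in jvm.options and the flags are spelled --pipeline.workers and --pipeline.batch.size, and the heap size, worker count, and paths below are placeholders.
# Give the JVM a fixed heap so the machine never swaps (legacy LS_HEAP_SIZE variable)
export LS_HEAP_SIZE="2g"
# -w sets the number of filter workers, -b the number of events each worker collects per batch
bin/logstash -w 4 -b 250 -f /etc/logstash/conf.d/pipeline.conf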
Grok Regex
I have been in a similar situation, so I know the pain. After some further research and troubleshooting, the problem was narrowed down to the Grok plugin, and a few experiments later I finally discovered what had been done wrong.
None of the Grok expressions had any regex anchors (^ and $), resulting in much slower string rejection times. Let me use an example to visualize the problem.
Let’s say that you want to handle the following type of log messages.
192.168.10.15 - - [07/Mar/2016:13:10:02 -0800] "GET /products/0/price HTTP/1.1" 200 51
You would probably implement a Grok filter that looks similar to this.
grok {
  match => {
    "message" => [
      "%{NOTSPACE:remote} - - \[%{HTTPDATE:timestamp}\] \"%{WORD:method} %{NOTSPACE:path} %{DATA:protocol}\" %{NUMBER:bytes} (?:-|%{NUMBER:duration})"
    ]
  }
}
Now imagine that a new microservice is introduced to your system that produces log messages of the following format.
192.168.10.31 - - [12-03-2004:11:01:28 -0800] "GET /category/0/products HTTP/1.1" 200 815
So you decide to extend your filter to support this new format as well. You end up using the multi-match support in Grok.
grok {
  break_on_match => true
  match => {
    "message" => [
      "%{NOTSPACE:remote} - - \[%{HTTPDATE:timestamp}\] \"%{WORD:method} %{NOTSPACE:path} %{DATA:protocol}\" %{NUMBER:bytes} (?:-|%{NUMBER:duration})",
      "%{NOTSPACE:remote} - - \[%{NOTSPACE:timestamp} %{INT:timezone}\] \"%{WORD:method} %{NOTSPACE:path} %{DATA:protocol}\" %{NUMBER:bytes} (?:-|%{NUMBER:duration})"
    ]
  }
}
Side note: I know that we could have stuffed everything into a single expression. There are cases, however, where the two formats differ so much that you would still end up in the scenario shown here.
So the filter we have written is somewhat readable and works. Nothing to worry about, right? Wrong!
What the Grok documentation and examples fail to mention is that these expressions perform substring matching. That is, our filter would also match the following log line.
PREFIX 192.168.10.15 - - [07/Mar/2016:13:10:02 -0800] "GET /products/0/price HTTP/1.1" 200 51 SUFFIX
The performance problem arises when Grok fails to match a string. Without anchors, the regex engine retries the match starting at every character position of the line, and each attempt can backtrack through greedy sub-patterns such as DATA before the line is finally rejected; with a ^ anchor, a non-matching single-line event is rejected after one attempt at the start. (OK, this is an oversimplification of what actually happens inside the engine, but you get the idea.) Precious processing power is being wasted.
The correct way to implement your regex expressions would be to use the ^ and $ anchor characters, resulting in the following Grok filter.
grok {
  break_on_match => true
  match => {
    "message" => [
      "^%{NOTSPACE:remote} - - \[%{HTTPDATE:timestamp}\] \"%{WORD:method} %{NOTSPACE:path} %{DATA:protocol}\" %{NUMBER:bytes} (?:-|%{NUMBER:duration})$",
      "^%{NOTSPACE:remote} - - \[%{NOTSPACE:timestamp} %{INT:timezone}\] \"%{WORD:method} %{NOTSPACE:path} %{DATA:protocol}\" %{NUMBER:bytes} (?:-|%{NUMBER:duration})$"
    ]
  }
}
I downloaded sample log data from the internet and performed some performance tests with the filter above. I was able to get a 35% improvement by introducing the anchor characters in the expressions.
I have seen scenarios with real data where, at times, improvements were up to 90%, reducing CPU spikes and load drastically.
If you have regular expressions that are always expected to match, the anchors are unlikely to produce any improvement. The problem arises only once your log data gets polluted or you start relying on patterns that fail for part of your traffic, as in the multi-match filter above, where every line of the second format must first fail against the first pattern.
In retrospect, I should have known better from the start. Regular expressions should never be taken lightly. As powerful as they are, they can easily bring your software down out of nowhere. Just take a look at the following outage report from Stack Exchange: stackstatus.net/post/147710624694/outage-po..
I find myself blaming Grok's documentation to a large extent. In none of their examples do they use any anchors, nor do they document the default behavior, allowing you to think that this is handled for you behind the scenes.
This assumption shows up in open-source software out there as well; the Grok filters in the logsearch-boshrelease project, for instance, are written without anchors in the same way.
To finish things off, I would like to say that maybe going for Grok is a bad idea overall. For legacy software, you might not have an alternative, but if you are starting your software solution from scratch, you should consider having your logs in a structured format (e.g. JSON) from the get-go (i.e. have the applications output it that way).
Receiving structured logs, your Logstash installation would barely need to do anything, except maybe delete or rename some unneeded fields and manage index creation and data insertion into ElasticSearch (or whichever output you use).
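As a rough sketch (the input plugin, port, field name, and index pattern here are placeholders rather than a recommendation), such a pipeline barely needs a filter section at all:
input {
  tcp {
    port => 5514
    codec => json   # events arrive already structured, no Grok involved
  }
}
filter {
  mutate {
    # drop a hypothetical field the applications emit but you do not need
    remove_field => [ "internal_debug_info" ]
  }
}
output {
  elasticsearch {
    hosts => [ "localhost:9200" ]
    index => "logs-%{+YYYY.MM.dd}"
  }
}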
One distributed system that has gone down this road is Cloud Foundry, where components log JSON messages to standard output / error. Unfortunately, all the logs from the machines are sent via a custom, non-standardized syslog format, so you still end up having to write Grok filters to get to the JSON content.
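If you are stuck with such a setup, a workable compromise is a single anchored Grok pattern that only peels off the envelope, followed by the json filter for the payload. The envelope below is made up for illustration and is not Cloud Foundry's actual format:
filter {
  # hypothetical envelope: "<13>2016-03-07T13:10:02Z some-vm doppler: { ...json... }"
  grok {
    match => {
      "message" => [
        "^<%{NONNEGINT:syslog_pri}>%{TIMESTAMP_ISO8601:syslog_timestamp} %{NOTSPACE:syslog_host} %{NOTSPACE:syslog_program}: %{GREEDYDATA:json_payload}$"
      ]
    }
  }
  json {
    source => "json_payload"
    target => "app"
  }
}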
Note: This is an old article ported from Medium. Some of the content may not be accurate anymore.