Writing maintainable Logstash filters

Momchil Atanasov
Nov 4, 2022

8 min read

We can agree that Logstash is a powerful tool for log collection and processing. In fact, its design is so flexible that it can be used to process all sorts of data. With the various filter plugins it packs, one can easily extract structured data from raw text entries.

But as we all know — "with great power comes great responsibility".

If you ever thought having to work with poorly written source code was hard to swallow, you have clearly not seen poorly written Logstash filters. When designed to handle complex scenarios, such as multiple input data formats from various producers, these configurations can easily get out of hand.

If you are new to Logstash, here are three tips to get you started with writing readable and maintainable Logstash filters.

Conditional parsing

If you are doing log collection for a large, heterogeneous, polyglot distributed system, the format of log messages is likely to change over time. This could be due to a redesign of the log format or the introduction of a new component. Either way, it is best to design your Logstash filters to handle failures.

Let us consider the following log message.

07-01-2017 21:13:00 Hello World

We can quickly put together a filter that can parse this.

grok {
  match => {
    "message" => "^%{DATE_EU:[@metadata][tmp_date]} %{TIME:[@metadata][tmp_time]} %{GREEDYDATA:content}$"
  }
}
mutate {
  add_field => {
    "[@metadata][tmp_timestamp]" => "%{[@metadata][tmp_date]} %{[@metadata][tmp_time]}"
  }
}
date {
  match => ["[@metadata][tmp_timestamp]", "dd-MM-yyyy HH:mm:ss"]
}

Notice how we are using the @metadata field to prevent any temporary fields from ending up in the actual output. You can find more information in the Logstash documentation on event metadata.
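If you want to double-check what ends up under @metadata while developing a filter, the rubydebug codec can be asked to print it. Below is a minimal sketch of a debugging output, intended only for local experiments.

output {
  stdout {
    # rubydebug hides @metadata by default; this flag makes it visible.
    codec => rubydebug {
      metadata => true
    }
  }
}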

This works fine until someone deploys a new component to our system that logs in a different format.

2017-01-07T21:13:00Z Hello World

The grok plugin will fail to parse the message. Execution does not stop there, however. Instead, the mutate plugin will construct the [@metadata][tmp_timestamp] field with the literal, unresolved content %{[@metadata][tmp_date]} %{[@metadata][tmp_time]}, which causes the date plugin to fail as well.

Worst of all, the date plugin will complain and report a warning message to the Logstash log.

Failed parsing date from field {:field=>"[@metadata][tmp_timestamp]", :value=>"%{[@metadata][tmp_date]} %{[@metadata][tmp_time]}", :exception=>"Invalid format: \"%{[@metadata][tmp_date]} %{[@met...\"", :config_parsers=>"dd-MM-yyyy HH:mm:ss", :config_locale=>"default=en_US", :level=>:warn}

Unless there is a good log rotation policy in place, this can quickly fill up the disk of the host machine and disrupt the log processing pipeline.

One way we can solve this is to check whether the grok plugin managed to parse the message successfully.

grok {
  match => {
    "message" => "^%{DATE_EU:[@metadata][tmp_date]} %{TIME:[@metadata][tmp_time]} %{GREEDYDATA:content}$"
  }
}
if "_grokparsefailure" not in [tags] {
  mutate {
    add_field => {
      "[@metadata][tmp_timestamp]" => "%{[@metadata][tmp_date]} %{[@metadata][tmp_time]}"
    }
  }
  date {
    match => ["[@metadata][tmp_timestamp]", "dd-MM-yyyy HH:mm:ss"]
  }
}

Though this does the job, as additional checks accumulate it can easily result in heavily nested and indented filters that are hard to follow. A flat design, I have found, is usually easier to comprehend. We can achieve that by making operations conditional on the existence of the fields they depend on.

grok {
  match => {
    "message" => "^%{DATE_EU:[@metadata][tmp_date]} %{TIME:[@metadata][tmp_time]} %{GREEDYDATA:content}$"
  }
}
if [@metadata][tmp_date] and [@metadata][tmp_time] {
  mutate {
    add_field => {
      "[@metadata][tmp_timestamp]" => "%{[@metadata][tmp_date]} %{[@metadata][tmp_time]}"
    }
  }
}
if [@metadata][tmp_timestamp] {
  date {
    match => ["[@metadata][tmp_timestamp]", "dd-MM-yyyy HH:mm:ss"]
  }
}

The benefit of this approach is that we clearly state which data is required to perform a given operation, and it produces flat filters that are much easier to follow, especially as configurations grow longer.

Side note: I went overboard here by adding the two conditions. The mutate filter is unlikely to fail, so we could have gone with a single if statement instead. However, I wanted to showcase how this approach produces flat filters.

The conditional approach is often the preferred way to safeguard ourselves from unsupported messages or unexpected situations. In this exact case, however, there is another (more elegant) approach.

It turns out that the date filter will not complain if the field we match on (in our case [@metadata][tmp_timestamp]) does not exist at all. If we perform the add_field operation within the grok plugin itself, we can be sure that the field will not exist on grok failure, since these field operations are only executed on a successful match.

grok {
  match => {
    "message" => "^%{DATE_EU:[@metadata][tmp_date]} %{TIME:[@metadata][tmp_time]} %{GREEDYDATA:content}$"
  }
  add_field => {
    "[@metadata][tmp_timestamp]" => "%{[@metadata][tmp_date]} %{[@metadata][tmp_time]}"
  }
}
date {
  match => ["[@metadata][tmp_timestamp]", "dd-MM-yyyy HH:mm:ss"]
}

The [@metadata][tmp_timestamp] field will only be created if the grok filter manages to parse the message successfully. We have produced a more compact filter configuration.
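With the same building blocks we can also cope with the ISO8601 messages from earlier. One possible sketch is to give grok a list of patterns and the date filter a list of timestamp formats.

grok {
  match => {
    "message" => [
      # Newer components that log an ISO8601 timestamp.
      "^%{TIMESTAMP_ISO8601:[@metadata][tmp_timestamp]} %{GREEDYDATA:content}$",
      # The original dd-MM-yyyy HH:mm:ss format.
      "^%{DATE_EU:[@metadata][tmp_date]} %{TIME:[@metadata][tmp_time]} %{GREEDYDATA:content}$"
    ]
  }
}
if [@metadata][tmp_date] and [@metadata][tmp_time] {
  mutate {
    add_field => {
      "[@metadata][tmp_timestamp]" => "%{[@metadata][tmp_date]} %{[@metadata][tmp_time]}"
    }
  }
}
date {
  # Format patterns are tried in order until one of them succeeds.
  match => ["[@metadata][tmp_timestamp]", "dd-MM-yyyy HH:mm:ss", "ISO8601"]
}

If neither grok pattern matches, [@metadata][tmp_timestamp] is never created and the date filter stays quiet, just as before.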

Splitting filters apart

The messages we try to parse will likely follow a logical structure, often a hierarchical one. For example, we might be receiving syslog messages that wrap our actual log messages. We can parse the syslog message first and only afterwards, depending on the source of the syslog message, perform additional parsing on the syslog payload.

In such situations, it can be beneficial to have our filter configurations split apart. We would have a separate filter file just for the syslog parsing and then subsequent ones for the payload. This is especially useful when writing tests for our filters, which will be explored in the next section.

Syslog is not the most trivial format to parse nor use as an example, so let’s use a simplified, custom-made format for log shipping.

<timestamp> <level> <source> "<content>"

Our logs could then look as follows.

2017-01-05T14:28:00 ERROR catalog "product.go:27: invalid id"
2017-01-06T21:13:00 WARN web "GET /index.html"

The first one is clearly from a Go application called catalog and the second one is from a static content hosting application called web. They share the same shipping format, though the payload differs. In this case, it might make sense to have three filter configurations.

The first one will deal with the parsing of the shipping format only. Following is one possible implementation.

grok {
  match => {
    "message" => "^%{TIMESTAMP_ISO8601:[@metadata][timestamp]} %{WORD:level} %{NOTSPACE:source} \"%{DATA:message}\"$"
  }
  overwrite => ["message"]
}
date {
  match => ["[@metadata][timestamp]", "ISO8601"]
}

Next, we can have a separate filter that deals with the log messages from our catalog application.

if [source] and [source] == "catalog" {
  grok {
    match => {
      "message" => "^%{DATA:filename}:%{NUMBER:line}: %{DATA:message}$"
    }
    overwrite => ["message"]
  }
}

Finally, we implement one last filter for the web application logs.

if [source] and [source] == "web" {
  grok {
    match => {
      "message" => "^%{WORD:method} %{NOTSPACE:path}$"
    }
  }
}

We could have placed all of that logic into a single filter configuration. In fact, we could have implemented it with a single multi-match grok filter. In the long run, though, it might become much harder to maintain and test.
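To get a feel for why, here is a rough sketch of what such a single multi-match grok might look like. Notice how the shipping format has to be repeated for every application.

grok {
  match => {
    "message" => [
      # Shipping format and catalog payload squeezed into one pattern.
      "^%{TIMESTAMP_ISO8601:[@metadata][timestamp]} %{WORD:level} (?<source>catalog) \"%{DATA:filename}:%{NUMBER:line}: %{DATA:message}\"$",
      # Shipping format and web payload squeezed into one pattern.
      "^%{TIMESTAMP_ISO8601:[@metadata][timestamp]} %{WORD:level} (?<source>web) \"%{WORD:method} %{NOTSPACE:path}\"$"
    ]
  }
  overwrite => ["message"]
}
date {
  match => ["[@metadata][timestamp]", "ISO8601"]
}

Every new application adds yet another long pattern to this list, which is exactly the kind of growth that becomes painful to maintain and test.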

The three filters can either be placed and sorted alphanumerically in a filter folder or a templating mechanism like ERB could be used during provisioning to concatenate them into a single filter file.
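For the directory approach, the layout could look something like the sketch below (the file names and path are only an example). When pointed at a directory, Logstash concatenates the files in lexicographical order, so numeric prefixes control the order in which the filters are applied.

filters/
  01-shipper.conf
  20-catalog.conf
  30-web.conf

bin/logstash -f filters/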

Testing filters

The last aspect I would like to explore is filter testing. I believe this to be the most important approach to maintainable filters. A suite of tests can be the best specification for your filters and a great support when needing to make a change. In fact, if there was ever a perfect scenario to do TDD, then this has to be it. It is extremely easy to describe your requirements in a test, and only afterwards add the parsing logic.

In order to write tests, one first needs to prepare their Logstash installation.

bin/logstash-plugin install --development

Next, we write our test. Let's presume that all of our filters are placed in a folder called filters. We'll use a separate folder called tests to store our tests. We create a shipper_spec.rb file and place it inside the tests folder.

# encoding: utf-8
require 'logstash/devutils/rspec/spec_helper'
describe 'shipper' do
  config <<-EOF
    #{File.read(File.dirname(__FILE__) + '/../filters/01-shipper.conf')}
  EOF
  sample({
    'message' => '2017-01-06T21:13:00Z WARN software "hello world"'
  }) do
    insist { subject['tags'] }.nil?
    insist { subject['@timestamp'].to_s } == '2017-01-06T21:13:00.000Z'
    insist { subject['level'] } == 'WARN'
    insist { subject['source'] } == 'software'
    insist { subject['message'] } == 'hello world'
  end
end

For those of you who are familiar with RSpec in Ruby, you will quickly notice that this test shares many similarities with it. In fact, one can also use nested describe and it blocks to add structure to larger tests.

We can run the test above with the following command.

bin/rspec tests/

This will run all of our *_spec.rb tests in the tests folder.

Note: Since version 5.0.0 of Logstash, the rspec script in the bin folder is no longer available, and at the time of writing there is no good documentation on how to run similar kinds of tests. This is tracked as an issue in the Logstash project. It seems the new preferred way is to write integration tests against the whole filter configuration and treat Logstash as a black box instead. I cannot say I fully agree with this decision.

Conclusion

Well, there you have it. As promised, three approaches to writing maintainable Logstash filters. If you have been using Logstash for a while, you have probably already found other design patterns that best suit your needs. But for beginners, I think these are good ones to start with and keep in mind.


Note: This is an old article ported from medium. Some of the content may not be accurate anymore.

 