Logstash Processing Pipeline

edit

The Logstash event processing pipeline has three stages: inputs → filters → outputs. Inputs generate events, filters modify them, and outputs ship them elsewhere. Inputs and outputs support codecs that enable you to encode or decode the data as it enters or exits the pipeline without having to use a separate filter.

Inputs

edit

You use inputs to get data into Logstash. Some of the more commonly-used inputs are:

  • file: reads from a file on the filesystem, much like the UNIX command tail -0F
  • syslog: listens on the well-known port 514 for syslog messages and parses according to the RFC3164 format
  • redis: reads from a redis server, using both redis channels and redis lists. Redis is often used as a "broker" in a centralized Logstash installation, which queues Logstash events from remote Logstash "shippers".
  • beats: processes events sent by Filebeat.

For more information about the available inputs, see Input Plugins.

Filters

edit

Filters are intermediary processing devices in the Logstash pipeline. You can combine filters with conditionals to perform an action on an event if it meets certain criteria. Some useful filters include:

  • grok: parse and structure arbitrary text. Grok is currently the best way in Logstash to parse unstructured log data into something structured and queryable. With 120 patterns built-in to Logstash, it’s more than likely you’ll find one that meets your needs!
  • mutate: perform general transformations on event fields. You can rename, remove, replace, and modify fields in your events.
  • drop: drop an event completely, for example, debug events.
  • clone: make a copy of an event, possibly adding or removing fields.
  • geoip: add information about geographical location of IP addresses (also displays amazing charts in Kibana!)

For more information about the available filters, see Filter Plugins.

Outputs

edit

Outputs are the final phase of the Logstash pipeline. An event can pass through multiple outputs, but once all output processing is complete, the event has finished its execution. Some commonly used outputs include:

  • elasticsearch: send event data to Elasticsearch. If you’re planning to save your data in an efficient, convenient, and easily queryable format…​ Elasticsearch is the way to go. Period. Yes, we’re biased :)
  • file: write event data to a file on disk.
  • graphite: send event data to graphite, a popular open source tool for storing and graphing metrics. http://graphite.wikidot.com/
  • statsd: send event data to statsd, a service that "listens for statistics, like counters and timers, sent over UDP and sends aggregates to one or more pluggable backend services". If you’re already using statsd, this could be useful for you!

For more information about the available outputs, see Output Plugins.

Codecs

edit

Codecs are basically stream filters that can operate as part of an input or output. Codecs enable you to easily separate the transport of your messages from the serialization process. Popular codecs include json, msgpack, and plain (text).

  • json: encode or decode data in the JSON format.
  • multiline: merge multiple-line text events such as java exception and stacktrace messages into a single event.

For more information about the available codecs, see Codec Plugins.

Fault Tolerance

edit

Events are passed from stage to stage using internal queues implemented with a Ruby SizedQueue. A SizedQueue has a maximum number of items it can contain. When the queue is at maximum capacity, all writes to the queue are blocked.

Logstash sets the size of each queue to 20. This means a maximum of 20 events can be pending for the next stage, which helps prevent data loss and keeps Logstash from acting as a data storage system. These internal queues are not intended for storing messages long-term.

The small queue sizes mean that Logstash simply blocks and stalls safely when there’s a heavy load or temporary pipeline problems. The alternatives would be to either have an unlimited queue or drop messages when there’s a problem. An unlimited queue can grow unbounded and eventually exceed memory, causing a crash that loses all of the queued messages. In most cases, dropping messages outright is equally undesirable.

An output can fail or have problems due to downstream issues, such as a full disk, permissions problems, temporary network failures, or service outages. Most outputs keep retrying to ship events affected by the failure.

If an output is failing, the output thread waits until the output is able to successfully send the message. The output stops reading from the output queue, which means the queue can fill up with events.

When the output queue is full, filters are blocked because they cannot write new events to the output queue. While they are blocked from writing to the output queue, filters stop reading from the filter queue. Eventually, this can cause the filter queue (input → filter) to fill up.

A full filter queue blocks inputs from writing to the filters. This causes all inputs to stop processing data from wherever they’re getting new events.

In ideal circumstances, this behaves similarly to when the tcp window closes to 0. No new data is sent because the receiver hasn’t finished processing the current queue of data, but as soon as the downstream (output) problem is resolved, messages start flowing again.

Thread Model

edit

The thread model in Logstash is currently:

input threads | filter worker threads | output worker

Filters are optional, so if you have no filters defined it is simply:

input threads | output worker

Each input runs in a thread by itself. This prevents busier inputs from being blocked by slower ones. It also allows for easier containment of scope because each input has a thread.

The filter thread model is a worker model where each worker receives an event and applies all filters, in order, before sending it on to the output queue. This allows scalability across CPUs because many filters are CPU intensive (permitting that we have thread safety).

The default number of filter workers is 1, but you can increase this number by specifying the -w flag when you run the Logstash agent.

The output worker model is currently a single thread. Outputs receive events in the order the outputs are defined in the config file.

Outputs might decide to temporarily buffer events before publishing them. One example of this is the elasticsearch output, which buffers events and flushes them all at once using a separate thread. This mechanism (buffering many events and writing in a separate thread) can improve performance because it prevents the Logstash pipeline from being stalled waiting for a response from elasticsearch.

Resource Usage

edit

Logstash typically has at least 3 threads (2 if you have no filters). One input thread, one filter worker thread, and one output thread. If you see Logstash using multiple CPUs, this is likely why. If you want to know more about what each thread is doing, you should read this article: Debugging Java Performance. Threads in Java have names and you can use jstack and top to figure out who is using what resources.

On Linux platforms, Logstash labels all the threads it can with something descriptive. For example, inputs show up as <inputname, filter workers show up as |worker, and outputs show up as >outputworker. Where possible, other threads are also labeled to help you identify their purpose should you wonder why they are consuming resources!