Fiverr DevOps Team: Introducing ELK (Elasticsearch, Logstash, Kibana)

Written by Marina Turetsky, DevOps Engineer at Fiverr.

As you probably know, Fiverr® is a global online marketplace, serving millions of buyers and sellers who generate millions of requests on our site. But what you may not know is that these requests are recorded into Fiverr traffic logs.

Our load balancer logs are important because they allow us to gain valuable insights into the site and its traffic. It’s crucial for both our DevOps and Dev teams to have a high-level view into this resource, as well as the ability to drill down into a single request. We need to be able to detect abnormal behavior in traffic volume (both drops and spikes), to aggregate data (by specific routes or backends, by error code, or by specific internal service), or even to weed out malicious or fraudulent users. To do any of this, we need a quick and simple way to search through, analyze, and clearly display all our data. This allows us to ensure immediate response and provide a reliable, user-friendly experience at any given time.

You could say that our lives here at Fiverr DevOps were divided into two eras: pre-ELK and post-ELK.

Pre-ELK

Before the implementation of the ELK stack, analyzing huge text logs was extremely difficult. It required lots of manual scripting and simple Linux commands like grep, sed, and awk. To put it bluntly, it was very time consuming. Sometimes we would spend hours upon hours digging into those logs in order to find any piece of actionable data. Needless to say, there was no way to see trends and behaviors over time, and no way to compare today’s data to data from the day before. It was ugly, it was time consuming, and it was at best a partial solution. In addition, since we have several load balancers managing our internal and external traffic, we had no centralized log management, no way to see all the logs in the same place and aggregate our analysis.

Post-ELK

Then, one bright day, we tried the ELK stack. We installed the stack with a basic setup, all default parameters, just to perform a quick test of the product. We ran a logstash agent on one of our load balancers, and set up a new machine for the logstash indexer, elasticsearch, and kibana.

Then the data began to stream and we instantly fell in love.

Having a system that allows us to search and analyze data, and to enjoy real-time visual insights into the heart of our system, was a huge step up. There was no question about the added value we were going to receive from ELK.

So after a bit of planning we decided to implement ELK with the recommended design, including Redis as a broker between the logstash agents and the logstash indexer.
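
To give a feel for that wiring, here is a minimal sketch of a logstash agent (shipper) configuration that tails a load balancer log and pushes events into Redis. The log path, Redis host, and list key are illustrative assumptions, not our actual values:

    # Logstash shipper running on a load balancer host (illustrative sketch)
    input {
      file {
        path => "/var/log/haproxy.log"   # assumed location of the load balancer log
        type => "lb-access"
      }
    }
    output {
      redis {
        host => "10.0.0.10"              # assumed address of the Redis broker
        data_type => "list"              # push events onto a Redis list
        key => "logstash"                # assumed list key, shared with the indexer
      }
    }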

At first we implemented two elasticsearch data nodes, each one an 8GB RAM server with SSD disks for fast performance. Since we wanted to have an odd number of nodes in the elasticsearch cluster, we installed another node, not hosting data, to participate in the cluster as well (another 8GB server). On this data-less server we also installed the logstash indexer and Redis.
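In elasticsearch terms, a data-less node is simply one configured not to hold any shards. A sketch of the relevant elasticsearch.yml settings for that third node (the cluster and node names here are made up for illustration):

    # elasticsearch.yml on the data-less node (illustrative names)
    cluster.name: logging          # assumed cluster name
    node.name: es-arbiter-1        # assumed node name
    node.master: true              # eligible for master election, keeping an odd-sized quorum
    node.data: false               # holds no index data; only participates in the cluster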

Things ran smoothly for some time. We sat back and smiled happily at each other every time we needed to use ELK. We used it, congratulated ourselves, and then we smiled some more—until it stopped working smoothly.

As we added more and more data sources to the system the volume of data we sent to ELK increased exponentially. The searches became slow, the system was nearly unresponsive, and we started to get ominous Java exceptions. We needed to restart elasticsearch often in order to recover. We needed a stable solution, and fast.

At first, we tried to resolve this by upgrading our two elasticsearch data nodes to 16GB RAM servers. It helped initially, but it didn’t resolve our performance issues completely, and we still suffered from intermittent stability issues. So we dug into the ELK documentation in order to understand how elasticsearch could be fine-tuned.

After trying out several different configurations, we found our solution: we added one more 16GB data node to the elasticsearch cluster, and limited the field data cache to 40 percent of the elasticsearch memory heap size (the field data cache is used when sorting on fields and for faceted search). Additionally, we now run two instances of the logstash indexer, both for redundancy and to keep the indexer from becoming a bottleneck in our ELK flow due to high traffic volumes.
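
For reference, that cap is a single line in elasticsearch.yml on each data node (the 40 percent figure is simply the value we settled on, not a universal recommendation):

    # Limit the field data cache (used for sorting on fields and faceted search)
    # to 40% of the elasticsearch heap
    indices.fielddata.cache.size: 40%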

Our Final Topology

[Diagram: our final ELK topology]

Scaling

After passing the initial instability phase, things have looked good for several months now. Our ELK stack is running and responding fast. However, since our traffic volume is constantly growing, we have no doubt that at some point in the future we will need to scale the system again. Thanks to our hard work and experience, we are ready! We’ve learned that it is better to scale the elasticsearch cluster horizontally rather than vertically. The scaling process itself is very simple: just join a new elasticsearch node to the cluster and wait for data rebalancing, which happens automatically when a new node successfully joins the cluster.
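
To show how little is involved, a new data node only needs to know the cluster name and how to find its peers; something along these lines in its elasticsearch.yml (the hostnames are placeholders):

    # elasticsearch.yml on a new data node joining the existing cluster (sketch)
    cluster.name: logging                                           # must match the existing cluster
    node.data: true                                                 # this node will hold index data
    discovery.zen.ping.unicast.hosts: ["es-data-1", "es-data-2"]    # placeholder addresses of existing nodes

Once the node joins, elasticsearch rebalances shards onto it on its own.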

The logstash indexer can be scaled out as well. This procedure requires cloning the existing indexer configuration and then starting another logstash indexer process. It’s really that simple. When several instances of the logstash indexer read from the same Redis, blocking reads ensure that each Redis entry is read by one indexer only. This prevents any danger of writing duplicate data.
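
An indexer instance, sketched below with placeholder hosts, is essentially a Redis input feeding an elasticsearch output; starting a second copy of the same configuration is all the “cloning” amounts to (newer logstash versions spell the output option hosts rather than host):

    # Logstash indexer configuration (illustrative sketch)
    input {
      redis {
        host => "127.0.0.1"      # Redis runs alongside the indexer on the data-less node
        data_type => "list"      # list mode uses blocking reads, so each entry is consumed once
        key => "logstash"        # assumed list key, matching the shippers
      }
    }
    output {
      elasticsearch {
        host => "es-data-1"      # placeholder address of an elasticsearch node
      }
    }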

Kibana Web Interface

The Kibana web interface is the frontend tool that makes the ELK stack the power player it is in the monitoring world.

With Kibana, filtering and searching through data is fast and yields useful results. Searches can be run either as free-text queries in the query panel or by filtering on one or more parameters in customized panels we build. Once a filter or search is set, the results are adjusted across all panels in the dashboard, giving broad visibility into each query.

We now have all of the Fiverr traffic data immediately available at our fingertips, as opposed to the monolithic logs that stood in our way in the past. Once we know what we are looking for, accessing the data itself happens almost instantaneously. Search results can be adjusted just by clicking on a desired field in a displayed graph. We are able to display as much data as we need, without being overwhelmed by traffic irrelevant to our search. We can filter results using several diverse search fields, which allows us to drill down into the data, even to the level of a single request.
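
As a made-up example (the field names depend entirely on how the logs are parsed), a query like the following in the query panel narrows a dashboard down to server errors on a single route:

    status:[500 TO 599] AND route:"/search"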

In addition, it’s become indispensable in diagnosing abnormal traffic or system behavior. With a quick look at our Kibana dashboards, we can pick up on any abnormalities. These can range from a sudden peak in overall traffic caused by an unusually high number of requests from a specific user, to a growing number of error response codes coming from a specific service, server, or route. Whatever it is, we identify the existence of an issue fast and can locate its root cause quickly, so we can take action to fix it.

Wishlist

We still haven’t tried Kibana 4, which is currently in beta, but we look forward to taking it for a test drive in the near future.

On a more personal note, I would like to see an alerting mechanism integrated into ELK in order to take a more proactive approach to using it. Being able to configure thresholds and receive a notification when they are reached would be a useful function, since here at Fiverr we love to constantly monitor the pulse of our system’s performance.

To summarize

The ELK stack is an excellent system which we were glad to add to our monitoring arsenal, and it’s now in constant use by all Fiverr technical teams. We’ve already added Graylog data (used for application errors) to the ELK flow and are looking forward to discovering how we can integrate more data sources, gain improved visibility, and continue to improve Fiverr’s performance.
