Monitoring the Biggest Marketplace for Service at Fiverr

Our last post was about how we scaled our application platform at Fiverr®. This time we’re going to walk you through the journey of creating the monitoring solution that ensures our system health.

We started with dependable tools like Nagios, Ganglia and New Relic. These gave us insight into system health at the infrastructure and application levels. However, production incidents were still being reported to the DevOps team by the Customer Support team.

Several postmortems later, we decided to rethink our approach.

Our goal:

  • Implement an automatic monitoring system that detects problems and reports them to the DevOps team in real time.

Given this, we decided to recognize two major types of monitors:

  • Business markers (revenue, user registration, etc.)
  • Technical markers (machine disk space, service response time, etc.)

Technical markers were further divided into several levels following the application layers:

  • Infrastructure level (network, resources, server health, etc.)
  • Application platform level (service health, worker health, etc.)
  • End user level (client side performance)

Then we looked at the existing monitoring tools.

There were a number of excellent tools capable of providing a monitoring solution for a specific level. For example:

  • Nagios for infrastructure health monitoring.
  • New Relic for application platform health monitoring.
  • Kissmetrics or Google Analytics for web application analytics.

We adopted these tools at Fiverr where applicable. The real challenge, however, was monitoring all system levels together in order to:

  • Detect anomalies in business markers as fast as possible.
  • Drill down to the cause of the problem as fast as possible by inspecting anomalies in technical markers.

Keeping that in mind, we compiled a list of requirements for our monitoring system:

  • Ability to monitor technical markers on all system levels.
  • Ability to monitor business markers.
  • Ability to monitor external events (e.g. software deploys and the start/end of marketing and PR campaigns).
  • Ability to detect a correlation between trends of different markers.
  • Ability for development teams to add new markers as non-intrusively as possible.
  • Minimal impact on the host system’s performance.
  • Ability to detect anomalies and send proper alerts.

We felt that implementing such a system requires a deep understanding of Fiverr’s business rules and domain-specific language on one side, and of the application architecture and design on the other. We also wanted it to be based on open source software as much as possible. In the end, we decided to combine in-house development with existing open source tools.

And here is how we did it:

We selected Etsy’s statsd aggregation daemon for the following reasons:

  • It uses UDP and therefore decouples sender from receiver.
  • It supports counters, timings and gauges.
  • It supports the backend plugins we use:
    • Graphite backend for visualization.
    • RabbitMQ backend for subsequent persistence to a DB.

It also supports a multi-instance layout, which we use as follows:

  • One instance for tech events with a 10-second aggregation window.
  • One instance for biz events with a 5-minute aggregation window.
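
To make the UDP decoupling concrete, here is a minimal Ruby sketch of the statsd plain-text line format for the three metric types, plus a fire-and-forget send. The host, the ports, the constant names and the helper names are assumptions made for illustration only, not our production configuration.

```ruby
require 'socket'

# Hypothetical endpoints: one statsd instance per aggregation window.
TECH_STATSD = ['127.0.0.1', 8125].freeze # 10-second window (port assumed)
BIZ_STATSD  = ['127.0.0.1', 8135].freeze # 5-minute window (port assumed)

# statsd line format: <name>:<value>|<type>
def counter_line(name, delta = 1)
  "#{name}:#{delta}|c"
end

def timing_line(name, millis)
  "#{name}:#{millis}|ms"
end

def gauge_line(name, value)
  "#{name}:#{value}|g"
end

# Fire-and-forget: UDP needs no handshake and no reply, so the sender
# never waits on the receiver. Returns the number of bytes sent.
def send_line(line, endpoint)
  host, port = endpoint
  socket = UDPSocket.new
  socket.send(line, 0, host, port)
ensure
  socket&.close
end
```

For example, `send_line(counter_line('page_view'), TECH_STATSD)` emits the datagram `page_view:1|c`. If no daemon is listening, the datagram is simply dropped and the caller carries on.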

We implemented our own statsd driver that can be used by all system modules (RoR, microservices, workers). All a developer has to do to add a new monitoring marker is add a line of code such as one of these:

  • FiverrRackStats::Monitor.instance.record_long_interval_timing("some_api_response_time", duration_in_millisecs)
  • FiverrRackStats::Monitor.instance.increment_long_interval_counter("user_acount_activated")
  • FiverrRackStats::Monitor.instance.increment_counter('page_view')

The corresponding graph is automatically created in Graphite.
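
For readers curious what sits behind one-liners like these, below is a minimal sketch of such a driver. It is not the actual FiverrRackStats code. It assumes that the "long interval" methods route to the 5-minute biz instance and the other methods to the 10-second tech instance. The `SketchStats` module name, the hosts and the ports are hypothetical.

```ruby
require 'singleton'
require 'socket'

module SketchStats # hypothetical namespace, not the real driver
  class Monitor
    include Singleton

    TECH = { host: '127.0.0.1', port: 8125 }.freeze # assumed endpoint
    BIZ  = { host: '127.0.0.1', port: 8135 }.freeze # assumed endpoint

    # Lets callers (or tests) swap the UDP transport for any callable.
    attr_writer :transport

    def increment_counter(name)
      emit(TECH, "#{name}:1|c")
    end

    def increment_long_interval_counter(name)
      emit(BIZ, "#{name}:1|c")
    end

    def record_long_interval_timing(name, millis)
      emit(BIZ, "#{name}:#{millis.round}|ms")
    end

    private

    # Monitoring must never break the host application, so any
    # transport error is swallowed here.
    def emit(target, payload)
      transport.call(target, payload)
    rescue StandardError
      nil
    end

    # Default transport: a single UDP datagram, no reply expected.
    def transport
      @transport ||= lambda do |target, payload|
        socket.send(payload, 0, target[:host], target[:port])
      end
    end

    def socket
      @socket ||= UDPSocket.new
    end
  end
end
```

Keeping the whole API behind a singleton with one-line calls is what makes adding a marker non-intrusive, and swallowing transport errors is one way to honor the requirement that monitoring does not affect the host system.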

  • We selected Graphite (http://graphite.wikidot.com) as our graphing tool.
  • We selected Grafana (http://grafana.org) for dashboarding and annotations.
  • We selected Seyren (https://github.com/scobal/seyren) as an alerting dashboard for Graphite.

biggestmarketplace_1
 
This was a combined effort of our DevOps and Labs teams. We had so much fun working with statsd, Graphite and WebPageTest! Graphite’s statistical analysis abilities fascinated us because they were so simple and brought us so much value. It felt like coming out of the dark.

The only thing we miss from the old days is the 2:00 AM phone call from one of our friendly Customer Support gals. Now we just get a simple SMS.

Here are some examples:

1. Business markers example:

Graph showing a real-time comparison of three business markers with their corresponding average values over the previous 3 weeks. The two dotted horizontal lines act as alert thresholds (WARNING at 70% and ERROR at 50%).

biggestmarketplace_2
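
The comparison behind this kind of graph can be sketched in a few lines of Ruby. This is an illustration under assumptions, not our alerting code: it assumes the baseline is the plain average of the same time window in each of the previous 3 weeks, and that an alert fires when the live value drops below the given percentage of that baseline. The method and constant names are hypothetical.

```ruby
# Thresholds taken from the graph: WARNING at 70%, ERROR at 50% of baseline.
WARNING_RATIO = 0.7
ERROR_RATIO   = 0.5

# current: the marker's value in the current window.
# history: values of the same window in each of the previous weeks.
def marker_status(current, history)
  return :unknown if history.empty?

  baseline = history.sum.to_f / history.size
  return :unknown if baseline.zero?

  ratio = current / baseline
  if ratio < ERROR_RATIO
    :error
  elsif ratio < WARNING_RATIO
    :warning
  else
    :ok
  end
end
```

For instance, with a baseline of 100 a live value of 65 would be classified as `:warning` and a value of 40 as `:error`.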

Graph showing the Marketplace CTA.

biggestmarketplace_3

2. Technical events example:

Graph showing the total number of HTTP requests, the mean response time, and the mean 90th-percentile response time.
Vertical dotted lines represent software deploys.

biggestmarketplace_4
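
As a rough illustration of how timing summaries like these can be derived from the raw samples in one aggregation window, here is a small Ruby sketch. It assumes that the "mean 90th-percentile" series is the mean of the fastest 90% of requests, in the style of statsd's `mean_90` and `upper_90` timer statistics. The method name is hypothetical and the rounding rule is a simplification.

```ruby
# Summarizes one aggregation window of response times (in milliseconds).
# Returns nil when the window holds no samples.
def timer_stats(timings_ms, percentile = 90)
  return nil if timings_ms.empty?

  sorted = timings_ms.sort
  count  = sorted.size

  # Number of samples kept after dropping the slowest (100 - percentile)%.
  kept = count - ((100 - percentile) / 100.0 * count).round
  kept = 1 if kept < 1
  within = sorted.first(kept)

  {
    mean: sorted.sum.to_f / count,                   # mean over all samples
    :"mean_#{percentile}" => within.sum.to_f / kept, # mean without the slowest tail
    :"upper_#{percentile}" => within.last            # slowest sample that was kept
  }
end
```

Plotting the trimmed mean next to the plain mean makes it easier to tell whether a spike comes from a few slow outliers or from a general slowdown.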

Graph showing a microservice’s technical markers.

biggestmarketplace_5

3. Client-side performance monitors example:

Graph showing the weight of one of our web pages.

biggestmarketplace_6

Graph showing the speed index of the same page. The correlation is clearly visible.

biggestmarketplace_7
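
A correlation that is visible by eye can also be quantified. The sketch below computes the Pearson correlation coefficient between two equally sampled series in plain Ruby. It is only an illustration of the idea, not part of our tooling, and the method name is hypothetical.

```ruby
# Pearson correlation coefficient of two series sampled at the same times.
# Returns a value between -1.0 and 1.0, or 0.0 if either series is constant.
def pearson(xs, ys)
  if xs.empty? || xs.size != ys.size
    raise ArgumentError, 'series must have equal, non-zero length'
  end

  n = xs.size.to_f
  mean_x = xs.sum / n
  mean_y = ys.sum / n

  cov   = xs.zip(ys).sum { |x, y| (x - mean_x) * (y - mean_y) }
  var_x = xs.sum { |x| (x - mean_x)**2 }
  var_y = ys.sum { |y| (y - mean_y)**2 }
  return 0.0 if var_x.zero? || var_y.zero?

  cov / Math.sqrt(var_x * var_y)
end
```

Feeding it the page-weight series and the speed-index series for the same timestamps yields a value close to 1 when the two markers rise and fall together.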

4. Annotations example:

An annotation showing the start of a PR campaign.

biggestmarketplace_8

5. Checks and Alerts example:

biggestmarketplace_9

The post Monitoring the Biggest Marketplace for Service at Fiverr appeared first on Official Fiverr Blog.
