
From Scala Unified Logging to Full System Observability
Part 1 of 3: Our Original State of Logging

Jonathan is a platform engineer at VictorOps, responsible for system scalability and performance. This is Part 1 in a 3-part series on system visibility, the detection part of incident management.

These days, with infrastructures spanning tens, hundreds, even thousands of running instances, piping a log file into less is no longer an acceptable means of log research and debugging. Instead, sending logs to a log aggregation service such as Sumo, Elastic, or Splunk is commonplace because searchability is king.

Unfortunately, the pursuit of searchability can lead to undesirable side effects like unreadable, inconsistent, and just plain ugly log statements. It invades our codebases with even more custom formatting (on top of string interpolation, etc.) that's not only distracting but also hard to do well with anything less than superhuman string-formatting skills. In short, the side effect on log statements can be detrimental.

Logging is just the starting point

First off, let's bring a little context to this pursuit. At VictorOps, we use logging as a research and debugging tool. However, logging isn't and shouldn't be the primary heartbeat of your systems. That's where metrics come into the picture, which we'll discuss in Part 3 of this series. Before going there, let's talk about where we started with logging at VictorOps. We needed some major improvements in this first line of troubleshooting for when things go bad.

As Dave Hahn, a senior SRE from Netflix, recently shared with us, “Be willing to have a problem before you solve it.” In line with that advice, we recently identified multiple problems to solve relating to the research and debugging done through our logs. To top it all off, when I noticed that our logging interfaces were not unified, it became clear that it was time to make both our logging interface and log output great again.

I hope that our experience at VictorOps will give you ideas on how to improve logging at your organization.

The current state: how we use logs at VictorOps

Sumo Logic is our logging platform and we use it heavily throughout the development lifecycle.

There are four primary ways we use logs:

  1. To get visibility into what's going on during releases. Through logs, we can see if there are errors that persist after a release. If so, there is probably a hole in our alerting – some problem that we aren't yet monitoring.
  2. To create VictorOps incidents for relevant alerts. When we know that a particular log statement indicates a problem someone needs to get involved in, we hook a scheduled Sumo search up to the VictorOps platform to create an incident from it. The goal for most of these alerts is to migrate them to a metric-based alert instead of basing them on a log statement – more on that in our metrics discussion in Part 3 of this series.
  3. To see how something is working in production. We might want to see how some new feature is working in production, so we’ll review the log statements. The production environment is always the most valuable for feedback because that’s where real customers have real accounts, alerts, users, and escalation policies. It may look fine in staging, but if there is a use case we didn’t test for that shows up in production, you can see the details in a log.
  4. To investigate high-dimensionality information. Organization, user, and API key (and for that matter, any sort of UUID) are all great examples of metadata that typically won’t be available in a metric and thus logging (or eventing) is where we’ll find that data.

We had three main players in our logging

Our Scala backend used three different logging frameworks. Some code used the SLF4J logging framework directly; SLF4J is widely used and provides a rich feature set. Other code, inside Akka actors, used Akka's actor logging, which has a scaled-down interface and feature set and is configured to use SLF4J. Some of our Play code used Play's own logging, which is extremely simplistic and is also configured to use SLF4J. All of these were configured with Logback, SLF4J's native implementation. Here are some details:

SLF4J

SLF4J is likely the most widely used Java logging facade, with multiple implementations and a massive user base. Performance depends entirely on how you configure the appender you're using. By default, Logback uses a synchronous appender, but you can easily configure an asynchronous one. A synchronous appender uses the calling thread to actually write the log statement to file or network, whereas an asynchronous appender lightens the load on the calling thread by simply handing the log statement over to the appender, which writes it to file or network at some point in the future.
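
To make the synchronous-versus-asynchronous distinction concrete, here is a minimal sketch of wrapping a Logback FileAppender in an AsyncAppender programmatically. This is illustrative only, not our production setup (most projects express the same thing in logback.xml), and the file name and pattern are placeholders.

  import ch.qos.logback.classic.{AsyncAppender, LoggerContext}
  import ch.qos.logback.classic.encoder.PatternLayoutEncoder
  import ch.qos.logback.classic.spi.ILoggingEvent
  import ch.qos.logback.core.FileAppender
  import org.slf4j.{Logger, LoggerFactory}

  object AsyncLogbackSetup {
    def configure(): Unit = {
      val context = LoggerFactory.getILoggerFactory.asInstanceOf[LoggerContext]

      val encoder = new PatternLayoutEncoder
      encoder.setContext(context)
      encoder.setPattern("%d{ISO8601} %-5level %logger{36} - %msg%n")
      encoder.start()

      // Synchronous by default: this appender writes on the calling thread.
      val fileAppender = new FileAppender[ILoggingEvent]
      fileAppender.setContext(context)
      fileAppender.setFile("application.log") // placeholder path
      fileAppender.setEncoder(encoder)
      fileAppender.start()

      // AsyncAppender queues events and writes them on a background thread,
      // so the call site only pays for enqueuing the event.
      val asyncAppender = new AsyncAppender
      asyncAppender.setContext(context)
      asyncAppender.addAppender(fileAppender)
      asyncAppender.start()

      context.getLogger(Logger.ROOT_LOGGER_NAME).addAppender(asyncAppender)
    }
  }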

Akka logging

Akka's actor-based logging is event driven and is easily configured to use SLF4J. In the actor itself you write log.info("this message"), and behind the scenes an event is sent to the actor system's event stream and you're done. Creating the log statement adds almost no overhead at the call site, because the actual writing happens somewhere else.
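
For illustration, a bare-bones actor that logs this way (the actor and message are invented for the example). With Akka configured to use the SLF4J logger backend, the event published here ends up in Logback like any other SLF4J statement.

  import akka.actor.{Actor, ActorLogging, ActorSystem, Props}

  class AlertActor extends Actor with ActorLogging {
    def receive: Receive = {
      case alert: String =>
        // log.info only constructs and publishes an event to the event
        // stream; a logger actor does the actual writing later.
        log.info("received alert: {}", alert)
    }
  }

  object AlertApp extends App {
    val system = ActorSystem("alerts")
    system.actorOf(Props[AlertActor](), "alert-actor")
  }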

Play logging

Play has its own simplistic logger that's even more stripped down than the Akka logger and by default uses SLF4J. Its logging methods take up to two arguments: the string you're logging and an optional exception. The most recent version (2.6.x) added support for SLF4J markers.
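
A minimal sketch of the Play logger for comparison; the class and messages are invented for the example. Note that the optional Throwable comes second, unlike Akka's interface.

  import play.api.Logger

  class AlertDelivery {
    private val logger = Logger(this.getClass)

    def deliver(payload: String): Unit =
      try {
        // ... deliver the alert somewhere ...
        logger.info(s"delivered alert payload of ${payload.length} bytes")
      } catch {
        case e: Exception =>
          // Message first, Throwable second (both are by-name arguments).
          logger.error("failed to deliver alert", e)
      }
  }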

Why change how we do logging?

Strategy concerns

These concerns have to do with the various strategies taken by these different logging interfaces.

  • Call-site performance: All SLF4J interfaces rely on the caller to provide pre-computed strings and arguments before checking whether that log level (info, debug, trace, etc.) is enabled. There are simple ways around this, like Play's interface, which uses a by-name argument for the string. This essentially creates an anonymous function that is executed only after the log level has been checked (see the sketch after this list). For example, without by-name arguments, the statement below requires the mkString method to execute on a potentially large collection before the info method checks whether info-level log statements are enabled: log.info(s"Team $team has users: ${users.mkString(separator)}")
  • Conflicting interfaces: The largest effect of conflicting interfaces is developer confusion and frustration. The next problem is that it leads to incorrect log statements. If logs are to save you when things go awry, then an incorrect log statement is like a carabiner with a broken arm–looks like a useful thing but is completely useless for the intended user. For example, below are the error methods from these three interfaces. Notice how the location of the Throwable argument changes? Now, imagine working in a codebase where all three of these interfaces are being used. A little scary.
    • SLF4J: void error(String msg, Throwable t)
    • Akka: def error(cause: Throwable, message: String): Unit
    • Play: def error(message: ⇒ String, error: ⇒ Throwable)
  • Appender performance: All three of these have configurable backends and appenders, but it is worth noting that any interface you use will need to have its configuration examined. Most default appenders are synchronous and therefore write the log statement to file (or whatever destination) at the call-site. However, this can be changed easily by configuring an asynchronous appender. This clearly improves call-site performance by requiring only the string to be built before asynchronously handing it off to the appender, which will write the statement to file on its own time.
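
To illustrate the by-name point above, here is a minimal, hypothetical wrapper (not the interface we actually built) that defers building the message until after the level check, so users.mkString never runs when info logging is disabled.

  import org.slf4j.{Logger, LoggerFactory}

  class LazyLogger(underlying: Logger) {
    // `msg` is a by-name argument: it is evaluated only if info is enabled.
    def info(msg: => String): Unit =
      if (underlying.isInfoEnabled) underlying.info(msg)
  }

  object TeamReport {
    private val log = new LazyLogger(LoggerFactory.getLogger(getClass))

    def report(team: String, users: Seq[String], separator: String): Unit =
      // The interpolation and mkString run only when info logging is on.
      log.info(s"Team $team has users: ${users.mkString(separator)}")
  }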

Developer concerns

How did using multiple logging libraries affect the developers?

  • Too many decisions: Choosing between three different loggers for any given class.
  • Conflicting interfaces: From a developer perspective, this causes confusion and requires you to pay more attention to your logger than you really should.
  • Inconsistency: Having more than one logger in a class, which is clearly unnecessary, and having logger fields inconsistently named, e.g. log and logger.

Functionality needs

What functionality do the developers need for a maintainable codebase and effective log portfolio?

  • Unified interface: A single interface allows you to add new features in one place and enables the power of easily refactoring logging on a large scale.
  • Support for log variables: Extracting specific information from a log statement is easier if it’s been given special formatting. Once standardized, this can be utilized in our Sumo queries.
  • Implicit loggers for utility classes: Utility classes lack their own identity in terms of data flow. Implicitly passing in a logger, which has identifying information from the caller (its class and log variables), provides rich log statements within utility code.
  • Further consistency: This is icing on the cake. Things like a very simple Logging trait to standardize the log field name, the logger name (used when writing the log statements), and the logger identity (based on log variables). A rough sketch combining these ideas follows below.
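
To make these needs more tangible, here is a rough, hypothetical sketch of a unified Logging trait with log variables and an implicit logger for utility code. All names are invented for illustration; Part 2 covers what we actually built.

  import org.slf4j.LoggerFactory

  // A logger identity: the logger name plus standardized "log variables"
  // rendered in a fixed, query-friendly format, e.g. [orgId=abc123].
  final case class Log(name: String, variables: Map[String, String]) {
    private val underlying = LoggerFactory.getLogger(name)
    private def vars: String =
      variables.map { case (k, v) => s"[$k=$v]" }.mkString(" ")

    def info(msg: => String): Unit =
      if (underlying.isInfoEnabled) underlying.info(s"$vars $msg".trim)
  }

  // A very simple trait standardizing the log field name and logger identity.
  trait Logging {
    implicit lazy val log: Log = Log(getClass.getName, Map.empty)
  }

  object UserLookup {
    // Utility code has no identity of its own; it logs with the caller's
    // logger, picked up implicitly.
    def find(id: String)(implicit log: Log): Unit =
      log.info(s"looking up user $id")
  }

  class EscalationService extends Logging {
    override implicit lazy val log: Log =
      Log(getClass.getName, Map("component" -> "escalation"))

    def run(): Unit = UserLookup.find("u-123") // logs with this class's identity
  }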

Up next

Now that we’ve set the stage, in Part 2, we’ll explore how we addressed these concerns in order to make logging great again.


