This is the first in a series on “All Things Kubernetes”. Each post highlights common issues that people encounter with Kubernetes and how to overcome them. In this post, we describe the common challenges of log aggregation and practical approaches to address them.
You just rolled out your first application on Kubernetes and you are starting to hear complaints from your users that they are seeing random HTTP 500 errors.
Why is it suddenly so hard and time-consuming to identify the root cause of the 500 errors?
We have observed that many customers who have recently transitioned to a cloud native architecture face issues like this. All of them come from a world where their traditional monolithic app only had a few places where logs were generated. As they adopted and rolled out a modern microservice-based architecture, inherent deficiencies in their previous logging models were amplified.
Why Do Logs Matter?
Logs provide the critical information needed to monitor the health of an application and to diagnose issues quickly. Logs are typically closest to the underlying issue, so they provide rich context that helps the developer get to the bottom of it quickly. Timely access to logs is critical in production, and it can also dramatically increase developer productivity during development and testing.
Challenges with Logging on Kubernetes
Cloud native applications are typically operated in Kubernetes clusters. These applications comprise multiple microservices that interact with one another. There are a number of new challenges that have to be addressed to ensure log aggregation works well.
Challenge #1 - Requests are routed between microservices that may be operating on completely different nodes. Logging in just one place is not an option.
Challenge #2 - A typical production environment for a popular application can have hundreds of pods during peak usage, all of them producing logs. The logging infrastructure needs to scale up and down seamlessly along with the pods, with zero manual intervention.
Challenge #3 - By default, Kubernetes does not maintain a history of logs. When pods are terminated or rescheduled, the logs from an older instance of a pod may no longer be available. You cannot afford to have blind spots in your logs.
Challenge #4 - All the challenges above are amplified multifold when the application needs to be operated across multiple Kubernetes clusters that are geographically distributed and across disparate infrastructure providers.
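To make Challenge #1 concrete, here is a minimal sketch of what aggregation buys you: log records from pods on different nodes merged into one timeline for a single request. The `LogRecord` shape, the `request_id` field and the function names are assumptions made for illustration only. Real applications would need to emit some form of request identifier for this to work.

```python
import heapq
from datetime import datetime
from typing import Iterable, Iterator, List, NamedTuple


class LogRecord(NamedTuple):
    """Hypothetical record shape; real log formats will differ."""
    timestamp: datetime
    pod: str
    request_id: str
    message: str


def merge_streams(*streams: Iterable[LogRecord]) -> Iterator[LogRecord]:
    """Merge per-pod streams into one timeline.

    Each input stream is assumed to be sorted by timestamp already.
    """
    return heapq.merge(*streams, key=lambda record: record.timestamp)


def trace_request(request_id: str, *streams: Iterable[LogRecord]) -> List[LogRecord]:
    """Return every record for one request, across all pods, in time order."""
    return [r for r in merge_streams(*streams) if r.request_id == request_id]
```

Without a central place to run a query like `trace_request`, the same investigation means collecting logs from each node by hand.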
A Practical Architectural Pattern for Logging on Kubernetes
We see customers standardize on an architectural pattern that is both practical and simple to manage, operate and maintain. It consists of three logical components:
1. Log Collection & Aggregation
The most common and practical approach is to deploy a node-level logging agent on each node of your cluster. This agent is ideally a Fluentd Operator for Kubernetes (KFO) that has access to log files of all application containers running on that node.
Application containers should log to stdout or stderr. For applications that only log to files on disk, configure them to write to a file at a path that is watched by the logging agent. The advantages of a “node level logging agent” approach are:
- Only one agent per node is required (unlike sidecar containers, which need to be created for each application running in your cluster)
- No changes to the application are needed
- Log caching and rotation can be configured
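As an illustration of what a node-level agent works with, the sketch below parses the container log files that the kubelet typically exposes under /var/log/containers. It assumes the common file naming scheme (pod, namespace, container name and container ID) and the two common line formats (Docker's JSON format and the CRI text format). The function names are hypothetical, and this is not how Fluentd itself is implemented.

```python
import json
import re
from typing import Dict

# Assumed file name layout: <pod>_<namespace>_<container>-<64-hex-container-id>.log
FILENAME_RE = re.compile(
    r"^(?P<pod>[^_]+)_(?P<namespace>[^_]+)_"
    r"(?P<container>.+)-(?P<container_id>[0-9a-f]{64})\.log$"
)


def parse_filename(name: str) -> Dict[str, str]:
    """Extract pod metadata from a container log file name."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected log file name: {name}")
    return match.groupdict()


def parse_line(line: str) -> Dict[str, str]:
    """Parse one log line in either Docker JSON or CRI text format."""
    line = line.rstrip("\n")
    if line.startswith("{"):
        # Docker JSON format: {"log": "...", "stream": "...", "time": "..."}
        entry = json.loads(line)
        return {
            "time": entry["time"],
            "stream": entry["stream"],
            "message": entry["log"].rstrip("\n"),
        }
    # CRI text format: <time> <stream> <P|F> <message>
    time, stream, _tag, message = line.split(" ", 3)
    return {"time": time, "stream": stream, "message": message}
```

Because the metadata comes from the file name and the runtime's own line format, the application does not need to change.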
2. Log Storage & Archival
This component’s primary job is to serve as a highly scalable and resilient store for the log collection and aggregation component.
A very good option is AWS S3, an object storage service designed for 99.999999999% (11 9's) of durability. Since there is no storage infrastructure to manage or maintain, it greatly reduces the risk of losing log data to infrastructure failures.
AWS S3 has advanced built-in security capabilities such as access control and encryption. It also maintains compliance programs such as PCI, HIPAA, etc. Finally, it has integrated lifecycle policies to automatically transfer historic data to lower cost storage classes.
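A rough sketch of how log objects and a lifecycle policy might be laid out is shown below. The key layout, rule ID, day thresholds and function names are all assumptions chosen for illustration. The dictionary follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` argument, but the sketch only builds the data and does not call AWS.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def object_key(cluster: str, namespace: str, pod: str, when: datetime) -> str:
    """Build a time-partitioned object key (hypothetical layout).

    `when` is expected to be timezone-aware; it is normalized to UTC.
    """
    hour = when.astimezone(timezone.utc).strftime("%Y/%m/%d/%H")
    return f"logs/{cluster}/{hour}/{namespace}/{pod}.log.gz"


def lifecycle_configuration(prefix: str = "logs/") -> Dict[str, Any]:
    """Move older logs to cheaper storage classes, then expire them.

    The thresholds are illustrative, not recommendations.
    """
    return {
        "Rules": [
            {
                "ID": "archive-old-logs",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    }
```

Partitioning keys by cluster and hour keeps each cluster's logs for a given time window under one prefix, which makes both lifecycle rules and later ingestion easier to scope.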
3. Log Indexer & Visualizer
This component ingests the logs on an ongoing basis from the log storage component to create inverted indices to ensure efficient search is possible. It also provides an interface for users to search the logs efficiently.
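To show why an inverted index makes search efficient, here is a toy version: each token maps to the set of log lines that contain it, so a query only touches the postings for its own terms instead of scanning every line. This is a simplified sketch with hypothetical function names. Production indexers add analysis, ranking and on-disk structures that are not shown here.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def build_index(lines: List[str]) -> Dict[str, Set[int]]:
    """Map each token to the set of line numbers that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for line_no, line in enumerate(lines):
        for token in tokenize(line):
            index[token].add(line_no)
    return index


def search(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return line numbers that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    postings = [index.get(token, set()) for token in tokens]
    return sorted(set.intersection(*postings))
```

For example, after indexing a list of log lines, `search(index, "checkout 500")` returns only the lines that mention both terms.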
A combination of Elasticsearch and Kibana is an extremely popular choice. Elasticsearch supports clusters with multiple master-eligible nodes and can shard and replicate data for scalability and high availability. Kibana provides an excellent user interface with sophisticated query capabilities that make it easy to search application logs, correlate events and debug issues.
Customers that intend to use AWS S3 for log storage and archiving will benefit from using AWS’s managed Elasticsearch service. It provides fully managed Elasticsearch and Kibana that can be operated at scale with minimal downtime and operational overhead.
Most importantly, since both components run within AWS infrastructure, customers avoid the added latency and network transfer costs of sending logs to a third-party service.
How Rafay helps Customers with Log Aggregation
Zero Code Setup
A guided workflow dramatically improves developer productivity and operational simplicity. The entire setup and configuration for log aggregation for multi-cluster deployments can be performed in a few seconds. Developers and Operators can focus on their application instead of worrying about how to tune and optimize infrastructural components.
Simplified Configuration of the Log Aggregation Endpoint
No YAML Learning Curve
With Rafay, developers and operators no longer have to deal with the learning curve associated with writing and maintaining the YAML configuration for Fluentd and every other infrastructural component in Kubernetes. This configuration is automatically generated and used behind the scenes by the Rafay platform. Once enabled, Rafay automatically provisions Fluentd and all necessary infrastructural components on the managed Kubernetes clusters.
The Traditional Way with YAML
The Simplified Way with Rafay
Dynamic Configuration Updates
Configuration updates, such as changes to credentials, are picked up dynamically in seconds without the need to republish or restart the application.
Ongoing Health and Performance Monitoring
Once the logging infrastructure is deployed and operational, the Rafay platform continuously monitors its health and will make necessary adjustments.
Managed Software Updates
Customers do not have to deal with the operational burden associated with ongoing software updates of the logging infrastructure components. Rafay’s specialists monitor, evaluate, test and release updates on an ongoing basis.
Log aggregation is a mission-critical capability. Failures related to log aggregation can mask serious issues in production deployments. Log aggregation in Kubernetes requires you to think and operate differently. You can overcome the challenges described above by combining industry best practices with tools that help you implement them efficiently.