Developing the Ability to Infer Internal States of a System Based on the System’s External Outputs

Matheus Leal
Published in Nerd For Tech
Mar 16, 2021

Even if you have written the perfect code, a node may fail, a connection may time out, or participating servers may act arbitrarily. Being able to identify and fix the problem as quickly as possible, before it affects customers or your organization’s reputation, is a competitive advantage. Observability is a way to minimize the impact of this challenge. In this text we will discuss each of the 3 pillars of observability, presenting open-source solutions to start your knowledge journey.

Observability in action - Retrieved from LeadDev

Observability

One of the biggest challenges in creating, developing, and deploying software products is heterogeneous environments. When your systems are distributed, several things can go wrong. Even if you have written the perfect code, a node may fail, a connection may time out, or participating servers may act arbitrarily. Being able to identify and fix the problem as quickly as possible, before it degrades the performance of the entire system, can become a competitive advantage. Observability is a way to minimize the impact of this challenge.

Honeycomb, in its Guide to Achieving Observability, describes observability as the ability to ask arbitrary questions about your production environment without having to know in advance what you want to ask. We understand observability as the ability to see how your system is functioning from the outside. An observable system provides, in real time, all the information you need to resolve the day-to-day questions you may have about it.

It is important to emphasize that monitoring and observability are different concepts. New Relic defines monitoring as the instrumentation that software teams use to collect data about their systems and respond quickly to errors and problems. Observability is a broader practice, spanning concepts, tools, and best practices. The 3 pillars of observability are Logs, Metrics, and Tracing. In this text we will discuss each of the 3, presenting open-source solutions to start your knowledge journey.

Retrieved from Dynatrace

Logs

A log file records events that happen in an operating system or in a piece of software that is running. Logging is the act of keeping such a log, and it enables developers to understand what the code is doing and how the workflow behaves. The ELK stack (Elasticsearch, Logstash, and Kibana) is a collection of three open-source tools that help provide real-time insights from log data.

Retrieved from freecodecamp

We are going to introduce a simple logging implementation in Python that feeds the ELK stack. In a real system, you should first analyze the mass of data that will be generated in order to define an efficient structure for large-scale logging.

import logging
import random

# Append timestamped records to the file that Logstash will ingest.
logging.basicConfig(filename="logFile.txt",
                    filemode='a',
                    format='%(asctime)s %(levelname)s-%(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S')

for i in range(0, 15):
    x = random.randint(0, 2)
    if x == 0:
        logging.warning('Log Message')
    elif x == 1:
        logging.critical('Log Message')
    else:
        logging.error('Log Message')

In Python, logging can be done at 5 different levels, each indicating the severity of the event: Debug, Info, Warning, Error, and Critical. To set up Elasticsearch, Logstash, and Kibana, first download the three open-source packages from their respective links, unzip the files, and put them all in the project folder. In separate terminals run the following commands:

$ bin\kibana
(...)
$ bin\elasticsearch
(...)

To check that the services are running, open localhost:5601 (Kibana) and localhost:9200 (Elasticsearch). Before starting Logstash, a configuration file is created in which the details of the input file, output location, and filter methods are specified. Take a look at the filter{grok{…}} block. This is the Grok filter plugin; Grok is a great way to parse unstructured log data into something structured and queryable.

input {
  file {
    path => "full/path/to/log_file/location/logFile.txt"
    start_position => "beginning"
  }
}
filter {
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:log-level}-%{GREEDYDATA:message}" }
  }
  date {
    match => ["timestamp", "ISO8601"]
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "index_name"
  }
  stdout { codec => rubydebug }
}

Now save the file in the Logstash folder and start the Logstash service:

$ bin\logstash -f logstash-simple.conf
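
With the pipeline running, you can quickly confirm that log events are reaching Elasticsearch. A minimal check, assuming the index name index_name from the configuration above and the default Elasticsearch port:

$ curl "localhost:9200/_cat/indices?v"
$ curl "localhost:9200/index_name/_search?pretty&size=1"

The second query should return a document whose fields include the timestamp and log level extracted by the Grok filter, which you can then explore in Kibana.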

Metrics

Metrics are a numerical representation of data that can be used to determine a service’s or component’s overall behavior over time. Metrics comprise a set of attributes that carry information about SLAs, SLOs, and SLIs. Unlike an event log, which records specific events, metrics are measured values derived from system performance. Prometheus is a monitoring system and time-series database for services and applications. It collects metrics from your targets at certain intervals, evaluates rules, and can also trigger alerts. Prometheus supports a combination of instrumentation and exporters. Among the many exporters, we can highlight the Node Exporter, which is used to monitor the host’s hardware and kernel metrics.
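
To illustrate the instrumentation side, here is a minimal sketch using the official prometheus_client library for Python; the metric names, the port 8000, and the simulated work are illustrative assumptions rather than part of any real service.

import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metrics for an example service.
REQUESTS = Counter('app_requests_total', 'Total requests processed', ['status'])
LATENCY = Histogram('app_request_latency_seconds', 'Request latency in seconds')

def handle_request():
    with LATENCY.time():  # observe how long the work took
        time.sleep(random.uniform(0.01, 0.2))  # simulate some work
        status = 'ok' if random.random() > 0.1 else 'error'
        REQUESTS.labels(status=status).inc()

if __name__ == '__main__':
    start_http_server(8000)  # metrics served at localhost:8000/metrics
    while True:
        handle_request()

Pointing a Prometheus scrape job at localhost:8000 would then collect these application metrics alongside whatever the exporters provide.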

Illustration of a Prometheus architecture — Retrieved from Grafana

You can launch a Prometheus container to try it out; for real systems, a more robust and scalable setup would be more appropriate. Once running, it will be reachable at localhost:9090:

$ docker run --name prometheus -d -p 127.0.0.1:9090:9090 prom/prometheus

The Prometheus Node Exporter is a single static binary; once you have downloaded and extracted it, you can run it on your host:

$ wget https://github.com/prometheus/node_exporter/releases/download/v*/node_exporter-*.*-amd64.tar.gz
$ tar xvfz node_exporter-*.*-amd64.tar.gz
$ cd node_exporter-*.*-amd64
$ ./node_exporter

Once it has started, the Node Exporter exposes metrics on port 9100 (at localhost:9100/metrics), and you should see output like this:

# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 3.8996e-05
go_gc_duration_seconds{quantile="0.25"} 4.5926e-05
go_gc_duration_seconds{quantile="0.5"} 5.846e-05
# etc.

The Prometheus instance needs to be configured to scrape the Node Exporter metrics. The following prometheus.yml example configuration file tells the Prometheus instance what to scrape from the Node Exporter and how frequently:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ['localhost:9100']

You can then start Prometheus, using the --config.file flag to point to the configuration file that you created above.
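
For example, assuming the file above is saved as prometheus.yml next to the binary, starting the server could look like this (if you use the Docker image instead, mount the file at /etc/prometheus/prometheus.yml, and keep in mind that localhost inside the container is not the host running the Node Exporter):

$ ./prometheus --config.file=prometheus.yml

Once it is up, open localhost:9090 and query a Node Exporter metric such as node_cpu_seconds_total to confirm that scraping works.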

Tracing

Distributed tracing is a way to observe and understand the entire chain of events across microservices. In this type of architecture, a single call to the application can invoke different services that interact with each other. How do developers and engineers isolate a problem when an error occurs or a request is slow? We need a way to track all these connections, and this is where tracing comes into play. Jaeger is open-source software for tracing transactions between distributed services; it uses distributed tracing to follow the path of a request across different microservices.

Illustration of architecture with Kafka as intermediate buffer — Retrieved from Jaeger

The simplest way to start Jaeger is to use the pre-built image published on DockerHub. For real solutions, study more robust setups, such as running it on Kubernetes.

$ docker run -d --name jaeger -p 16686:16686 -p 6831:6831/udp jaegertracing/all-in-one:1.9

Jaeger supports the OpenTracing standard for code instrumentation. Once instrumented, the application sends trace data to Jaeger, which displays it in a simple UI.

import sys
import time
import logging
import random
from jaeger_client import Config
from opentracing_instrumentation.request_context import get_current_span, span_in_context

def init_tracer(service):
    logging.getLogger('').handlers = []
    logging.basicConfig(format='%(message)s', level=logging.DEBUG)
    config = Config(
        config={
            'sampler': {
                'type': 'const',
                'param': 1,
            },
            'logging': True,
        },
        service_name=service,
    )
    return config.initialize_tracer()

def booking_mgr(movie):
    with tracer.start_span('booking') as span:
        span.set_tag('Movie', movie)
        with span_in_context(span):
            cinema_details = check_cinema(movie)
            showtime_details = check_showtime(cinema_details)
            book_show(showtime_details)

def check_cinema(movie):
    with tracer.start_span('CheckCinema', child_of=get_current_span()) as span:
        with span_in_context(span):
            num = random.randint(1, 30)
            time.sleep(num)
            cinema_details = "Cinema Details"
            flags = ['false', 'true', 'false']
            random_flag = random.choice(flags)
            span.set_tag('error', random_flag)
            span.log_kv({'event': 'CheckCinema', 'value': cinema_details})
            return cinema_details

assert len(sys.argv) == 2
tracer = init_tracer('booking')
movie = sys.argv[1]
booking_mgr(movie)
# yield to IOLoop to flush the spans
time.sleep(2)
tracer.close()
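
The snippet calls check_showtime and book_show without showing them; a minimal sketch, following the same pattern as check_cinema and placed before the module-level calls, could look like this (the span names, return values, and sleep times are assumptions for illustration):

def check_showtime(cinema_details):
    # Hypothetical helper: creates a child span of the current request context.
    with tracer.start_span('CheckShowtime', child_of=get_current_span()) as span:
        with span_in_context(span):
            time.sleep(random.randint(1, 5))
            showtime_details = "Showtime Details"
            span.log_kv({'event': 'CheckShowtime', 'value': showtime_details})
            return showtime_details

def book_show(showtime_details):
    # Hypothetical helper: records the booking as a final child span.
    with tracer.start_span('BookShow', child_of=get_current_span()) as span:
        with span_in_context(span):
            time.sleep(random.randint(1, 5))
            span.log_kv({'event': 'BookShow', 'value': showtime_details})

Saving the whole script as, say, booking.py and running python booking.py "The Matrix" against the Jaeger container above should then produce a trace you can inspect in the UI at localhost:16686.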

Conclusion

When your systems are distributed, several things can go wrong. Being able to identify and fix a problem as quickly as possible, before it degrades the performance of the entire system or affects customers and your organization’s reputation, is essential. Observability is a way to minimize the impact of this challenge, and it can be divided into 3 pillars: Logs, Metrics, and Tracing. In this text we presented each of the 3 pillars, along with open-source solutions that will help you start your knowledge journey. Practicing observability helps you improve your understanding of your product.

Matheus Leal
M.Sc at Pontifícia Universidade Católica do Rio de Janeiro & Applying DevOps culture at Globo