Metrinome – Continuous Monitoring and Security Validation of Distributed Systems


Distributed enterprise systems consist of a collection of interlinked services and components that exchange information to collectively implement functionality in support of (sometimes mission-critical) workflows. Systematic experimental testing and continuous runtime monitoring of these large-scale distributed systems, including event interpretation and aggregation, are key to ensuring that the system’s implementation functions as expected and that its security is not compromised.

To illustrate the need, consider an example Information Management System (IMS) that enables sharing of sensitive information between information publishing and consuming clients. Problems associated with configuration management can easily lead to situations in which the IMS allows unauthenticated clients to participate in information exchanges or allows unauthorized information to be disseminated to consumers. Furthermore, the loose coupling between subscribers and the IMS can lead to situations in which the IMS is unavailable while consumers believe that no new information is being published, causing significant misunderstandings across information sharing relationships. Finally, remnant vulnerabilities in the IMS can cause failures at any time and, if not dealt with in real time, significant damage to mission execution. Unavailability of information sharing directly reduces situational awareness, loss of integrity can give adversaries control over mission execution, and loss of confidentiality can be detrimental to the reputation of actors and/or mission goals in general.

Monitoring and validation of IMS and client operations can aid in detection, diagnosis, and correction of situations like these. This is particularly important since 92% of reported vulnerabilities are located at the application layer [1]. Despite the importance of experimental validation and continuous monitoring, and the increased support for adopting security assessment as part of the software development life cycle, current approaches suffer from a number of shortcomings that limit their application in continuous monitoring situations and their use in the validation of assurance claims.

First, current test practices favor unit tests over integrated tests for establishing correct functionality. Unit testing, e.g., performed via JUnit [2], checks program functionality piece-by-piece but provides little to assess the overall information assurance claims of a system under test. Various tools exist for actively assessing the security of distributed systems, e.g., Nessus [3] and HP Fortify [4] to name a few, but their functionality is achieved by running specialized unit tests for security properties against either the code or the running system. In contrast, integrated end-to-end testing tools, such as YourKit [5] or Grinder [6], focus on performance and scalability. These tools enable operators to find bottlenecks or provision computing resources, but lack metrics associated with assessing security and correct functionality.

Second, integrated and end-to-end testing and experimentation is often postponed until software artifacts have matured significantly. This is because integrated testing and experimentation can be time-consuming and effort-intensive, and the perception is that the cost of manually performing experiments early on frequently outweighs the benefits.

Finally, common testing and metrics frameworks introduce additional dependencies into existing systems, in the form of libraries that need to be loaded into the system under test and lines of code added in support of instrumentation. This not only increases software complexity but, more importantly, can cause version dependency issues. It can also have unintended side effects on certification and accreditation, as the software now contains additional code that must be certified but that is not part of the core functionality, i.e., it exists only to support continuous monitoring.

This article describes Metrinome, a metrics framework written in Java that is specifically designed to provide a platform for structured continuous security assessments throughout the software lifecycle. The novelty of Metrinome lies in its loose coupling with the system under test and its integration of end-to-end testing with continuous application-level remote monitoring. Specifically, Metrinome provides (1) runtime computation of a wide range of metrics from log messages generated by distributed components during system execution, (2) execution of assertions over the metrics to determine correct functionality while the system is operating, and (3) improved situational awareness via dashboard views and generation of experimentation reports. The outputs of Metrinome-based assessments can be used as input to Certification and Accreditation (C&A) processes to precisely document the assertions that were checked and shown to hold in the system. Metrinome is available free of charge to government entities through AFRL.
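To make points (1) and (2) concrete, the following minimal Java sketch illustrates the general idea of deriving a count metric from log messages and evaluating an assertion over it. This is illustrative only and is not Metrinome's actual implementation; the class name, log format, and matching logic are invented for this example.

```java
import java.util.List;
import java.util.regex.Pattern;

// Illustrative sketch (not Metrinome code): compute a count metric from
// log messages and evaluate an assertion over the resulting value.
public class LogMetricSketch {

    // Metric: count the log lines that match a pattern,
    // e.g., rejected unauthenticated publish attempts.
    static long countMatches(List<String> logLines, Pattern pattern) {
        return logLines.stream()
                       .filter(line -> pattern.matcher(line).find())
                       .count();
    }

    // Assertion: the metric value must not exceed a threshold.
    static boolean assertAtMost(long value, long maxAllowed) {
        return value <= maxAllowed;
    }

    public static void main(String[] args) {
        // Hypothetical log messages from two IMS processes.
        List<String> logs = List.of(
            "ims-node-1 INFO  publish accepted client=alpha",
            "ims-node-2 WARN  publish rejected client=anon reason=UNAUTHENTICATED",
            "ims-node-1 INFO  publish accepted client=beta");

        long rejected = countMatches(logs, Pattern.compile("UNAUTHENTICATED"));
        System.out.println("unauthenticated publishes: " + rejected);
        System.out.println("assertion holds: " + assertAtMost(rejected, 0));
    }
}
```

In Metrinome, this kind of computation happens continuously at runtime, over log messages streamed from the monitored processes rather than over an in-memory list.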

II. Related Work

A. SNMP Dashboards

A number of management platforms exist that use the Simple Network Management Protocol (SNMP) for monitoring devices and nodes. Network Management Information System (NMIS) [7] operates at the networking level and enables monitoring, fault detection, and configuration management of large complex networks. Its main metrics deal with device reachability, availability, and performance. HP OpenView, IBM Tivoli, and Nagios provide similar functionality. Unlike these platforms, Metrinome specializes in monitoring at the application level and execution of fine-grained assertions.

B. Distributed Testing

Software Testing Automation Framework (STAF) [8] is an open source multi-platform, multi-language framework that provides a set of functionalities including logging, monitoring, and process invocation for the main purpose of testing. STAF operates in a peer environment; a network of STAF-enabled machines is built by running STAF agents across a set of networked hosts. In contrast to STAF, the goal of Metrinome is more focused and hence no agents need to be installed. Avoiding agents not only reduces maintenance costs but also significantly reduces the attack surface across networked systems under test. Due to their complementary nature, we have used Metrinome in conjunction with STAF for continuous testing and integration.

C. Application-level Metrics Frameworks

Several application-level metrics frameworks exist to monitor and measure the performance of applications. For example, Javasimon [9] exposes an API which can be placed into the code and allows inline computation of count metrics and measurement of durations. Metrics [10] is similar to Javasimon but allows data to be streamed to other reporting systems, e.g., Ganglia [11] and Graphite [12].

An important distinction between Metrinome and the above mentioned frameworks is Metrinome’s use of log messages to provide the same monitoring functionality. This makes Metrinome loosely coupled with the system being monitored and makes it applicable to any application that generates log messages, e.g., using Log4j or Logback.

D. Reporting/Graphing Backends

Ganglia, Graphite, and Splunk [13] are examples of highly popular platforms that offer the ability to search, analyze, and visualize data in real time. Typically these frameworks consist of a processing backend that collects and stores the data. They also use statistical methods that provide new insight and intelligence about the data. Metrinome provides functionalities that intersect with the above mentioned applications, such as dashboard views and experimentation reports. One difference is that Metrinome focuses less on scalability and more on ensuring correct execution of a system under test through the validation of assertions.

E. SIEM Platforms

Security Information and Event Management (SIEM) platforms, e.g., ArcSight [14], adopt many of the technologies described above, such as SNMP dashboards and reporting backends, to provide users with the ability to query and analyze security threats generated by both hardware and software applications. Unlike Metrinome, these platforms require the deployment of agents on networked hosts to collect and report events.

III. Design and Architecture

Metrinome is designed to achieve three objectives: portability, minimal coding overhead, and ease of use.

  • Portability – Metrinome can monitor a system independent of the implementation of the system.
  • Minimal coding overhead – Rather than adding new instrumentation libraries to monitored processes (causing versioning conflicts and Java classpath pollution), Metrinome interfaces with existing logging and auditing frameworks, e.g., Logback [15].
  • Ease of use – To be of immediate use to experimenters and administrators, it should be easy to specify metrics and assertions that must hold over the metrics in a systematic way. In addition, results of metric computation need to be readily accessible by humans or other programs through a well-defined Application Programming Interface (API) and Graphical User Interface (GUI).


Figure 1: Metrinome High-Level Architecture

Figure 1 provides an overview of Metrinome's high-level architecture. Metrinome works with a set of monitored processes that send log messages over TCP connections to the ingest API provided by the Metrics Server. Ingestion is enabled via simple logging configuration changes on the monitored processes, e.g., by specifying the use of a SocketAppender in Logback to send certain log messages remotely to the Metrics Server over TCP connections, in addition to or instead of sending those messages to the console or a local file.
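As a concrete illustration, a Logback configuration along the following lines would forward log events to a remote server in addition to the console. The host name and port are placeholders, and the exact configuration expected by the Metrics Server is not specified in this article:

```xml
<configuration>
  <!-- Local console output is retained alongside remote forwarding -->
  <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d %-5level [%thread] %logger - %msg%n</pattern>
    </encoder>
  </appender>

  <!-- Forward log events over TCP to the Metrics Server
       (host and port below are placeholders) -->
  <appender name="METRICS" class="ch.qos.logback.classic.net.SocketAppender">
    <remoteHost>metrics-server.example.org</remoteHost>
    <port>4560</port>
    <reconnectionDelay>10000</reconnectionDelay>
  </appender>

  <root level="INFO">
    <appender-ref ref="CONSOLE"/>
    <appender-ref ref="METRICS"/>
  </root>
</configuration>
```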

Because log messages issued by different processes may be similar, particularly if the processes are executing the same code base on different physical machines, the Metrics Server requires a descriptive unique process name, associated with a specific logging instance, as part of each log message. Support for this requirement is already built into most logging and auditing frameworks, enabling filtering of messages based on process names within Metrinome. The processing performed by Metrinome on received messages is defined using an XML-based Domain Specific Language (DSL), describing concepts such as sections, metrics, functions, and assertions. The Metrinome DSL allows administrators to specify processing logic in one file that can be dynamically loaded into the Metrics Server.
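A hypothetical DSL fragment along these lines might express a metric and an assertion over it. The element names below are invented for illustration; the article does not show Metrinome's actual DSL syntax:

```xml
<!-- Hypothetical illustration only: element and attribute names are
     invented; Metrinome's real DSL syntax is not shown in this article. -->
<monitoring>
  <section name="authentication">
    <metric name="rejectedPublishes" type="count">
      <match process="ims-*" message="UNAUTHENTICATED"/>
    </metric>
    <assertion name="noUnauthenticatedPublishes">
      <condition>rejectedPublishes == 0</condition>
    </assertion>
  </section>
</monitoring>
```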

Finally, to ease access to information, Metrinome offers two interfaces: (1) a GUI, implemented in HTML and accessible through common web browsers using HTTP(S), and (2) a RESTful [16] secure Web Services API for use by external programs.
