SUPERMON

Welcome to the Supermon web page. This page has been revised from its older format to allow easier updating and cross-referencing for users who want to find answers to their questions quickly.

Supermon is a high-speed cluster monitoring system that emphasizes LowPerturbation, high SamplingRates, and an extensible DataProtocol and ProgrammingInterface. It has been shown to scale from the most basic single-processor machines to large-scale, 1024 (2048) processor clusters. Although it is most frequently used on Linux-based clusters, the infrastructure is not limited to Linux, provided that data can be extracted from the operating system and presented in the correct format.

The project page containing file downloads and the CVS repository is here.

A list of FrequentlyAskedQuestions is available. Some PublishedPapers (and presentations) are also available. We are currently in the process of compiling performance results and comparisons with other monitoring systems to help users choose which monitoring system best fits their needs.

Supermon is a flexible set of tools for high-speed, scalable cluster monitoring. Node behavior can be monitored much faster than with other commonly used methods (e.g., rstatd). In addition, Supermon uses a data protocol based on symbolic expressions (S-expressions) at all levels, from individual nodes to entire clusters. This contributes to Supermon's scalability and allows it to function in a heterogeneous environment.
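To make the idea of an S-expression data protocol concrete, here is a minimal Python sketch of a parser that turns an S-expression string into nested lists. The sample string and its field names (node, cpu, mem, and so on) are hypothetical and are not taken from Supermon's actual data format; the sketch only illustrates why such data is easy to parse and to nest at any level.

```python
def tokenize(text):
    """Split an S-expression string into parentheses and atoms."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Recursively build nested Python lists from a list of tokens."""
    if not tokens:
        raise ValueError("unexpected end of input")
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens and tokens[0] != ")":
            items.append(parse(tokens))
        if not tokens:
            raise ValueError("missing closing parenthesis")
        tokens.pop(0)  # discard the matching ")"
        return items
    if token == ")":
        raise ValueError("unexpected closing parenthesis")
    # Atoms: convert integers, leave everything else as a string.
    try:
        return int(token)
    except ValueError:
        return token


def parse_sexpr(text):
    """Parse one complete S-expression and reject trailing input."""
    tokens = tokenize(text)
    result = parse(tokens)
    if tokens:
        raise ValueError("trailing input after expression")
    return result


# Hypothetical sample; the field names are illustrative only.
sample = "(node (name n01) (cpu (user 120) (system 45)) (mem (free 2048)))"
parsed = parse_sexpr(sample)
```

Because a list of node records is itself an S-expression, the same parser can handle data from a single node or from a whole cluster without changes.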

Mon and Supermon

Two small server programs move data from the kernel to clients, and provide that data via TCP at both single and multiple node levels. At a single node, a kernel module provides data in its two /proc entries (see above). The mon server acts as a filter between /proc and the TCP clients: it parses the S-expressions found in /proc, adds a minimal amount of information, and passes that data to clients on demand. For each client that connects to it, mon maintains a bitmask reflecting the data fields that particular client requests in a sample. That way, mon filters data and reduces wasteful network traffic.

A second server, Supermon, lets clients see a snapshot of a set of nodes in each sample. Supermon connects to nodes that run mon servers and concentrates their data. It then presents the data sampled from many mon servers in a single data sample. The data format provided to clients by Supermon is identical to mon's data format. That allows many Supermon servers to be created, each sampling from a subset of the nodes within a cluster. New Supermon servers could then be started to connect to the Supermon servers already monitoring portions of the cluster.
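The per-client bitmask described for mon can be sketched as follows. This is a minimal illustration in Python, not Supermon's implementation: the field names in FIELDS, the MonFilter class, and the dictionary representation of a sample are all assumptions made for the example.

```python
# Hypothetical field order; Supermon's real field set is not shown here.
FIELDS = ["cpu_user", "cpu_system", "mem_free", "net_rx", "net_tx"]


def mask_for(names):
    """Build a bitmask with one bit set per requested field."""
    mask = 0
    for name in names:
        mask |= 1 << FIELDS.index(name)
    return mask


def filter_sample(sample, mask):
    """Keep only the fields whose bit is set in the given mask."""
    return {
        name: sample[name]
        for bit, name in enumerate(FIELDS)
        if mask & (1 << bit) and name in sample
    }


class MonFilter:
    """Tracks one bitmask per connected client (hypothetical helper)."""

    def __init__(self):
        self.client_masks = {}

    def register(self, client_id, names):
        self.client_masks[client_id] = mask_for(names)

    def sample_for(self, client_id, sample):
        return filter_sample(sample, self.client_masks[client_id])


mon = MonFilter()
mon.register("client-a", ["cpu_user", "mem_free"])
full = {"cpu_user": 120, "cpu_system": 45, "mem_free": 2048,
        "net_rx": 9, "net_tx": 7}
reply = mon.sample_for("client-a", full)
```

Each client receives only the fields it asked for, so a client interested in two values does not pay the network cost of the whole sample.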


Hierarchical Supermon servers improve performance when a cluster has many nodes and sampling rates are high. Like mon, Supermon keeps a bitmask-based filter for each client, which is used to improve efficiency on both the Supermon/mon and Supermon/client connections.
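The hierarchy works because a Supermon server presents data in the same format as the mon servers it samples. The Python sketch below models that with in-memory stand-ins and no networking. The class names, the sample(fields) method, and the record layout are hypothetical, and a plain list of field names stands in for the bitmask.

```python
class Mon:
    """In-memory stand-in for a per-node mon server (no networking)."""

    def __init__(self, name, values):
        self.name = name
        self.values = values  # dict: field name -> value

    def sample(self, fields):
        # One record per node, restricted to the requested fields.
        record = {f: self.values[f] for f in fields if f in self.values}
        return [(self.name, record)]


class Supermon:
    """Concentrates samples from any sources that offer sample(fields)."""

    def __init__(self, sources):
        self.sources = sources

    def sample(self, fields):
        # Forward the field selection upstream and merge the replies.
        records = []
        for source in self.sources:
            records.extend(source.sample(fields))
        return records


# Four hypothetical nodes, split between two mid-level concentrators.
nodes = [Mon(f"n{i:02d}", {"cpu_user": i, "mem_free": 100 * i})
         for i in range(4)]
left = Supermon(nodes[:2])
right = Supermon(nodes[2:])
root = Supermon([left, right])
snapshot = root.sample(["cpu_user"])
```

Since Mon and Supermon expose the same interface in this sketch, a client cannot tell whether it is talking to a flat concentrator or to the root of a tree, and the field selection is applied at every hop rather than only at the last one.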

 

Updated 08-13-2008