February 21, 2007
My OpenSSI contributions.
Let me preface this by pointing out that the overwhelming majority of my contributions were made prior to August 2000, when I left Compaq for a position at BSDi. Despite that fact, however, perusal of the code itself shows me that it has changed little in the intervening seven years, and then for the most part only in terms of bug fixes and minor enhancements.
During my six-year stint as part of the SSI cluster group headed by Bruce Walker at four different companies, I worked on most parts of the code, including many that were specific to SVR4 or UnixWare. Of the parts that made it to OpenSSI and CI-Linux, I did minor work in the virtual process (vproc) support code, the IPC code and a few other areas. The majority of my work, though, and the work of which I am most proud, was in the Cluster Membership Service (CLMS) and related work in the Internode Communication Subsystem (ICS).
In 1995 our company, Locus, became part of Platinum Technologies. At that time our largest customer was Tandem Computers; we understood that we would write a skeletal CLMS and the folks at Tandem would replace it with a more complete one. In 1996 Platinum sold the group and the software to Tandem, at which point we found out that the module with which the Tandem folks had intended to replace our CLMS was not sufficient for the needs of the cluster itself. We therefore had to replace our interim, skeletal CLMS with one that was more complete. That ended up being my job.
From 1996 to roughly 1998 I worked on that, splitting the existing single file into several and adding a large number of new features. Today CLMS is still structured largely as I designed it then. Among the new features was support for elimination of single points of failure for certain key services needed by the cluster, such as the node running the init process or the "surrogate origin" node for processes whose true origin node has left the cluster. It allowed more than one node to offer those key services and, if the node currently running one or more key services failed, it performed a graceful reassignment of those services to a new node.
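The key service reassignment described above can be illustrated with a minimal sketch. This is not the OpenSSI code: the `key_service` structure, the `key_service_nodedown()` function, the 32-node bitmask and the "lowest-numbered candidate wins" policy are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t nodeset_t;     /* bit n set means node n; assumes at most 32 nodes */
#define NO_NODE (-1)

/* Hypothetical record for one key service (for example, the init node). */
struct key_service {
    nodeset_t offering;         /* nodes that have offered to run the service */
    int current;                /* node running it now, or NO_NODE */
};

/*
 * Handle the failure of node `failed`. The failed node stops being a
 * candidate; if it was the current provider, the service moves to the
 * lowest-numbered node that both offers the service and is still up.
 * Returns the provider after reassignment, or NO_NODE if none is left.
 */
int key_service_nodedown(struct key_service *svc, unsigned failed,
                         nodeset_t up_nodes)
{
    assert(failed < 32);
    svc->offering &= ~((nodeset_t)1 << failed);
    if (svc->current != (int)failed)
        return svc->current;    /* provider unaffected */

    svc->current = NO_NODE;
    nodeset_t candidates = svc->offering & up_nodes;
    for (unsigned n = 0; n < 32; n++) {
        if (candidates & ((nodeset_t)1 << n)) {
            svc->current = (int)n;
            break;
        }
    }
    return svc->current;
}
```

The point of the sketch is only that more than one node may offer a service, so the loss of the current provider leaves a candidate to take over. The real reassignment was considerably more involved.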
In addition to the key service work I did a lot of work in CLMS on the details of handling nodes coming into and going out of the cluster. Although there was no facility for a graceful node exit when I left, we of course had to handle ungraceful exits, such as when a node crashed. This is well known to be a pretty complex problem. I generally solved it by serializing nodeups, allowing only one node to come in at a time, and blocking nodeups entirely when nodedown processing was going on. Nodedown processing of course cannot proceed one node at a time, since such processing for one down node may depend on cleaning up a resource actually held by another node which is also down. I could go into more detail, but that would be overkill for a "brief" discussion, so suffice it to say that the problem was complex and I spent a lot of time on it.
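The admission rule for nodeups and nodedowns can be sketched as a small gate. This is a single-threaded illustration under stated assumptions, not the CLMS implementation: `transition_gate` and its functions are hypothetical names, locking is omitted, and the choice to let a nodedown start while a nodeup is still in flight is a guess.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t nodeset_t;     /* bit n set means node n; assumes at most 32 nodes */

/* Hypothetical gate tracking which transitions are in progress. */
struct transition_gate {
    bool nodeup_active;         /* a single nodeup is being processed */
    nodeset_t nodedown_active;  /* nodes currently in nodedown processing */
};

/* A nodeup is admitted only when no other nodeup and no nodedown is running. */
bool nodeup_begin(struct transition_gate *g)
{
    if (g->nodeup_active || g->nodedown_active != 0)
        return false;           /* caller has to retry later */
    g->nodeup_active = true;
    return true;
}

void nodeup_end(struct transition_gate *g)
{
    g->nodeup_active = false;
}

/* Nodedowns are never refused; several failed nodes are processed together. */
void nodedown_begin(struct transition_gate *g, unsigned node)
{
    assert(node < 32);
    g->nodedown_active |= (nodeset_t)1 << node;
}

void nodedown_end(struct transition_gate *g, unsigned node)
{
    assert(node < 32);
    g->nodedown_active &= ~((nodeset_t)1 << node);
}
```

Nodeups stay blocked until the last outstanding nodedown has finished, which reflects the asymmetry described above: joins are serialized, failures are handled as a group.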
The other major piece of CLMS work was in late 1997 and early 1998 and was actually a combination of two requirements. First, there was a requirement that we eliminate the single point of failure for CLMS itself, which at the time ran on only one node for the life of the cluster. (And when I say "for the life of the cluster," I mean by definition, because losing the single CLMS master meant losing the entire cluster.) This was originally dealt with by the simple expedient of arbitrarily assigning a takeover node for the CLMS master. On takeover, that node polled all the nodes it knew about and determined the cluster membership from the intersection of the responses and its own list of "up" nodes. Any node thought to be up by some nodes but not by others was kicked out unconditionally. This scheme, however, was seriously complicated by the second requirement.
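The original takeover rule amounts to a set intersection, which the following minimal sketch illustrates. The `nodeset_t` bitmask and the `takeover_membership()` function are hypothetical, and the sketch leaves out what happens to nodes that never answer the poll.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t nodeset_t;     /* bit n set means node n is believed up */

/*
 * Compute the membership after a CLMS master takeover: start from the
 * takeover node's own list of up nodes and intersect it with the list
 * reported by every node that answered the poll. A node that any
 * respondent believes to be down drops out of the result.
 */
nodeset_t takeover_membership(nodeset_t own_view,
                              const nodeset_t *replies, size_t nreplies)
{
    assert(replies != NULL || nreplies == 0);
    nodeset_t members = own_view;
    for (size_t i = 0; i < nreplies; i++)
        members &= replies[i];
    return members;
}
```

For example, if the takeover node believes nodes 0 through 3 are up and one respondent reports only nodes 0 through 2, node 3 is removed from the cluster even though everyone else thought it was up.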
The second requirement was that applications must see cluster- and node-state transitions as coherent and monotonic with respect to time. In other words, a process that migrates from one node to another must never see on the destination node a cluster or node state that is older than the state seen on the origin node. One other problem with which I had to deal was potentially node-state-sensitive data in transit between nodes when a state transition took place, a longstanding problem that now had a potential solution.
This ended up being a more complex problem than the one I described earlier, but as this is getting long I'll only describe it briefly. The design called for a vector of node transitions and a monotonic timestamp as the mechanism used to order transitions, along with a two-stage commit mechanism and a lock used to block applications from seeing transitions while they were taking place. Each node in the cluster kept a history of outstanding or recent state transitions; the new CLMS constructed a list of "canonical" transitions from the union of the histories kept by each node in the cluster, inserting dummy transitions as necessary to keep the list consistent. It then rolled that list forward on all nodes, at the end of which process all nodes would agree on the state of the cluster and all nodes therein. Of course, there were other cases with which I had to deal, but this was the overall design.
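A rough sketch of the "union of histories, then roll forward" idea follows. Everything in it is an assumption made for illustration: the `transition` layout, the function names, and the simplification that a timestamp uniquely identifies a transition. The dummy transitions, the two-stage commit and the lock are not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record of one node state transition. */
struct transition {
    uint64_t stamp;     /* monotonic, cluster-wide timestamp */
    unsigned node;      /* node whose state changed */
    int new_state;      /* illustrative state code */
};

static int by_stamp(const void *a, const void *b)
{
    const struct transition *x = a, *y = b;
    return (x->stamp > y->stamp) - (x->stamp < y->stamp);
}

/*
 * `all` holds the histories from every node, concatenated. Sort them by
 * timestamp and drop duplicates in place, leaving the canonical list at
 * the front of the array. Returns the number of canonical entries.
 */
size_t build_canonical(struct transition *all, size_t n)
{
    assert(all != NULL || n == 0);
    if (n == 0)
        return 0;
    qsort(all, n, sizeof all[0], by_stamp);
    size_t out = 1;
    for (size_t i = 1; i < n; i++) {
        if (all[i].stamp != all[out - 1].stamp)
            all[out++] = all[i];
    }
    return out;
}

/*
 * Apply every canonical transition newer than `last_seen` to one node's
 * state table and return the newest timestamp applied.
 */
uint64_t roll_forward(int *node_state, size_t nnodes,
                      const struct transition *canon, size_t count,
                      uint64_t last_seen)
{
    for (size_t i = 0; i < count; i++) {
        if (canon[i].stamp <= last_seen)
            continue;   /* this node already saw it */
        if (canon[i].node < nnodes)
            node_state[canon[i].node] = canon[i].new_state;
        last_seen = canon[i].stamp;
    }
    return last_seen;
}
```

Because each node applies only the transitions newer than the last one it saw, and applies them in timestamp order, every node ends at the same state without ever stepping backward in time.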
One note: It's not clear in the above description, but CLMS has a quite close relationship to ICS and therefore I did a lot of work there, as well.
It was a fun and interesting project and quite complex. The nicest thing about it, though, was when I looked at the code again, for the first time in (at the time) five years, in 2003. To my mild surprise I found the code understandable, mostly because I went to great lengths to make it so when I wrote it.
While most of the documents supporting this are lost to history or dusty archives, the source itself is available online. The most interesting bits that relate to this document are on the CI-Linux page. The source itself is here and here. Interesting files are pretty much everything beginning with "clms_," in particular clms_api.c, clms_client.c, clms_failover.c, clms_master.c, clms_mgmt.c, clms_subr.c and clms_svcmgmt.c. Also much of the ics_ stuff.
As it happens, most of the routines with commentary that looks like the example below were mine, although by this point most of them have changes made by others:
/*
 * icssvr_llhandle_init()
 *	Initialize a handle that has not been used
 *
 * Description:
 *	This routine is called by ICS high level code to initialize the
 *	low-level ICS portion of a handle.
 *
 * Parameters:
 *	handle		the handle to be initialized
 *	sleep_flag	this routine can sleep waiting for resources/memory
 *
 * Return value:
 *	If sleep_flag is TRUE, then this routine always returns 0 (success).
 */
Posted by Frank at February 21, 2007 1:02 PM
Hi there. As a user of OpenSSI, it's interesting to hear about its history from one of its "fathers." Thank you for your work!
Posted by: Taeho at February 28, 2007 1:48 AM
You're welcome. I hope to at some point write a bit more about it; there are a couple of things to which I gave short shrift and they could stand to be expanded a bit.
Posted by: Frank at February 28, 2007 9:12 AM