Recent Entries
Links
Other Tech Blogs
Referrers
Search


Archives
Powered by
Movable Type 4.1

April 8, 2008

Documenting the undocumented

A quick note regarding Linux "tasks."

In an exchange regarding a race between timer ticks (and updates of task->utime versus task->signal->(somepointer)->utime), Roland McGrath told me that

current->exit_state can go from zero to nonzero only by current running code in the do_exit path. current does not progress on that path while current is inside one update_process_times call.

The key bit of knowledge that I was lacking and that this (when placed with other information I had gained in the extended email exchange) was that a “task” is really a Linux scheduling unit. From the scheduler’s perspective, “task” is the same as “thread.” The only thing that makes a set of threads into a multithreaded process is that they share a signal_struct (and their memory map, of course, and a several other things that, as Roland says, are implicitly required to be shared when signal_struct is shared). So a “task” can only be executed on a single cpu at any time, it can’t be executed on more than one cpu at a time. Therefore if a “task” is executing and is interrupted, the value of “current” at the interrupt will be that task, which is entirely suspended for the duration of the interrupt.

Posted by Frank at 11:17 AM | Comments (0) | TrackBack (0)

October 18, 2007

Software Engineering

A few hard-won lessons regarding software and coding.

This is a short list of Rules to Live By, as generated by some twenty years of experience writing code that, at one time or another, undoubtedly broke all of these rules. This is about programming in C and mostly in the kernel, so add spices as needed for taste.

Always bear in mind that it might be YOU who in five years has to come back and maintain the code you’re writing. As anyone who has done this knows, if you haven’t looked at a piece of code in a few years, it may as well have been written by someone else.

In general, no matter what your favorite indentation scheme might be, use the indentation scheme of the code that you’re changing. Mixed indentation just makes the code more confusing and harder to read. Many, many bugs have been overlooked because the screwed-up indentation made the code look right when it in fact was broken.

Only change indentation when doing a wholesale overhaul of a file or routine. If it’s your new file, use whatever scheme suits you, but if it’s existing code, don’t gratuitously change it. Gratuitous whitespace changes just increase the size of the changeset and make it more difficult to track the history of the code. Only change indentation when you’re essentially replacing a routine or when the indentation is already screwed up and you’re making it consistent while you replace most of the code.

USE COMMENTARY! C is not self-documenting. I had a disagreement not long ago with someone who claimed that well-written code doesn’t need commentary. This is someone who codes almost exclusively in object-oriented languages, but even there I think that he’s just wrong. Commentary is absolutely necessary. Of course, if you write commentary of the form

        a++;    /* Add one to a. */

then you’re asking for me to come slap you. The commentary should be concise and should provide the information that the code itself does not provide. The code itself provides the “what,” good commentary provides the “why.” Decent commentary also enumerates the assumptions made by the person writing the code. If code has side effects, commentary should list them and point to the code affected. If there are races, the races should be described and, if it’s a relatively complex race, commentary should give the solution rationale. If there are invisible assumptions behind the code (along the lines of “the value of ‘x’ is filled in by another thread” or “this race doesn’t matter because we don’t care who wins” or even “we have to do things in this particular way because our algorithm depends on doing so, for these reasons…”), COMMENT THOSE ASSUMPTIONS. This is probably the single most important kind of commentary one can write.

If you change the code and, due to your change, the commentary no longer matches what the code does, fix it! If the commentary is broken, the code is broken.

Never reinvent the wheel. If there’s already a routine that does the job, use it. If there isn’t, look again before writing your own. It may be that a very minor change will generalize an existing routine. The less code you write, the fewer bugs you’ll introduce.

“Measure twice, cut once.” No amount of after-the-fact bug fixing can compensate for insufficient design. I have found that by spending most of my time on design, the actual implementation goes amazingly quickly and the only bugs tend to be minor ones. If you find yourself constantly adding and removing code to cover cases you originally didn’t handle, you didn’t spend enough time on the design. If you do it right, even if you don’t actually code all the cases up front, adding those cases when necessary is pretty trivial.

Don’t try to write C++/FORTRAN/<your favorite language> when you’re actually writing C. Use C constructs, don’t try to force the shape of another language on it. One can certainly write object-oriented code in C even though the language doesn’t support all the C++/Java/etc. bells and whistles. Just because something doesn’t obviously look object-oriented doesn’t mean that it isn’t. (This is another place where good commentary comes in, btw.)

Your code will be around long after you’ve gone on to other things. It might even outlive you. Try to make sure that those who have to look at it after you find reason to appreciate your ability to write clean, well-commented and clear code that’s easy to understand and maintain. Otherwise, you could have those people cursing you loudly and at length. Remember, that person cursing you could, in fact, be YOU, five years from now.

If you want a very, very good negative example, just look at the xfrm code in Linux 2.6. I did. I had to change it. I can only hope that the person who wrote it is condemned to maintain it.

Posted by Frank at 3:21 PM | Comments (0) | TrackBack (0)

February 21, 2007

Operating Systems

My OpenSSI contributions.

Let me preface this by pointing out that the overwhelming majority of my contributions were made prior to August 2000, when I left Compaq for a position at BSDi. Despite that fact, however, perusal of the code itself shows me that it has changed little in the intervening seven years, and then for the most part only in terms of bug fixes and minor enhancements.

During my six-year stint as part of the SSI cluster group headed by Bruce Walker at four different companies, I worked on most parts of the code, including many that were specific to SVR4 or Unixware. Of the parts that made it to OpenSSI and CI-Linux, I did minor work in the virtual process (vproc) support code, the IPC code and a few other areas. The majority of my work, though, and the work of which I am most proud, was in the Cluster Membership Service (CLMS) and related work in the Internode Communication Subsystem (ICS).

In 1995 our company, Locus, became part of Platinum Technologies. At that time our largest customer was Tandem Computers; we understood that we would write a skeletal CLMS and the folks at Tandem would replace it with a more complete one. In 1996 Platinum sold the group and the software to Tandem, at which point we found out that the module the Tandem folks had intended to replace our CLMS was not sufficient for the needs of the cluster itself. We therefore had to replace an interim, skeletal CLMS with one that was more complete. That ended up being my job.

From 1996 to roughly 1998 I worked on that, splitting the existing single file into several and adding a large number of new features. Today CLMS is still structured largely as I designed it then. Among the new features was support for elimination of single points of failure for certain key services needed by the cluster, such as the node running the init process or the “surrogate origin” node for processes whose true origin node has left the cluster. It allowed more than one node to offer those key services and, if the node currently running one or more key services failed, it performed a graceful reassignment of those services to a new node.

In addition to the key service work I did a lot of work in CLMS on the details of handling nodes coming into and going out of the cluster. While when I left there was no facility for a graceful node exit, we of course had to handle ungraceful exits, such as when a node crashed. This is well known to be a pretty complex problem. I generally solved it by serializing nodeups, allowing only one node to come in at a time, and blocking nodeups entirely when nodedown processing was going on. Nodedown processing of course cannot proceed one node at a time, since such processing for one down node may depend on cleaning up a resource actually held by another node which is also down. I could go into more detail that would be overkill for a “brief” discussion, so suffice it to say that the problem was complex and I spent a lot of time on it.

The other major piece of CLMS work was in late 1997 and early 1998 and was actually a combination of two requirements. First, there was a requirement that we eliminate the single point of failure for CLMS itself, which at the time ran on only one node for the life of the cluster. (And when I say “for the life of the cluster,” I mean by definition, because losing the single CLMS master meant losing the entire cluster.) This was originally dealt with by the simple expedient of arbitrarily assigning a takeover node for the CLMS master, which polled all the nodes it knew and determined the cluster membership from the intersection of the responses and its own list of “up” nodes. Any node thought to be up some nodes but not by others was kicked out unconditionally. This scheme, however, was seriously complicated by the second requirement.

The second requirement was that applications must see cluster- and node- state transitions as coherent and monotonic with respect to time. In other words, a process that migrates from one node to another must never see on the destination node a cluster or node state that is older than the state seen on the origin node. One other problem with which I had to deal was potentially node-state-sensitive data in transit between nodes when a state transition took place, a longstanding problem that now had a potential solution.

This ended up being a more complex problem that the one I described earlier, but as this is getting long I’ll only describe it briefly. The design called for a vector of node transitions and a monotonic timestamp as the mechanism used to order transitions, along with a two-stage commit mechanism and a lock used to block applications from seeing transitions while they were taking place. Each node in the cluster kept a history of outstanding or recent state transitions; the new CLMS constructed a list of “canonical” transitions from the union of the the history kept by each node in the cluster, inserting dummy transitions as necessary to keep the list consistent. It then rolled that list forward on all nodes, at the end of which process all nodes would agree on the state of the cluster and all nodes therein. Of course, there were other cases with which I had to deal, but this was the overall design.

One note: It’s not clear in the above description, but CLMS has a quite close relationship to ICS and therefore I did a lot of work there, as well.

It was a fun and interesting project and quite complex. The nicest thing about it, though, was when I looked at the code again, for the first time in (at the time) five years, in 2003. To my mild surprise I found the code understandable, mostly because I went to great lengths to make it so when I wrote it.

While most of the documents supporting this are lost to history or dusty archives, the source itself is available online. The most interesting bits that relate to this document are on the CI-Linux page. The source itself is here and here. Interesting files are pretty much everything beginning with “clms_,” in particular clms_api.c, clms_client.c, clms_failover.c, clms_master.c, clms_mgmt.c, clms_subr.c and clms_svcmgmt.c. Also much of the ics_ stuff.

As it happens, most of the routines with commentary that looks like the example below were mine, although by this point most of them have changes made by others:
/*
 * icssvr_llhandle_init()
 *	Initialize a handle that has not been used
 *
 * Description:
 *	This routine is called by ICS high level code to initialize the
 *	low-level ICS portion of a handle.
 *
 * Parameters:
 *	handle		the handle to be initialized
 *	sleep_flag	this routine can sleep waiting for resources/memory
 *
 * Return value:
 *	If sleep_flag is TRUE, then this routine always returns 0 (success).
 */
 
Posted by Frank at 1:02 PM | Comments (2) | TrackBack (0)

February 7, 2007

Operating Systems

UFS2 version of ffsrecov for FreeBSD.

I just released a heavily modified version of John-Mark Gurney’s ffsrecov, adapted to use libufs and to work (only) with UFS2 file systems. I call it ffs2recov and it is available here.

I wrote this a couple of years ago so that I could recover a file system that had been stomped by a misconfigured RAID controller. It worked well enough for me to recover a couple of hundred gigabytes worth of data, which was a great relief (although some stuff was gone forever, sigh). I had intended to polish it up and release it long before now, but I’ve never managed to get around to doing the polishing. In particular, while it has a nice little summary of implemented options, the manpage needs a lot of work.

On the positive side, however, I extended it to be a lot more robust in the face of corrupt pointers and file system offsets, so it doesn’t just fall over when it sees garbage in a block address or whatnot.

I’m releasing it under the BSD two-clause license, with due credit to John-Mark. It’s my hope that someone else will take it, clean it up a bit, rewrite the manpage and maybe make a port out of it. If you do and you need a place to host the distfile, let me know.

Since I released it, a couple of people have cleaned it up a tiny bit and I’ve integrated their changes. There’s now a port in the FreeBSD ports collection, sysutils/ffs2recov.

Posted by Frank at 10:36 PM | Comments (0) | TrackBack (0)

January 5, 2006

Operating Systems

A new way of doing things.

I just read The Internet Is Broken at Technology Review. In it, David Clark at MIT asserts that the Internet needs to be, not fixed, but replaced with something better, something designed from the start to deal with the problems and the sheer size of the current Internet.

I think he’s right. In fact, I’ve been thinking along those lines for some time, only I’ve had something a bit more radical in mind.

From 1994 to 2000 I worked with Bruce Walker and others through four different companies on single-system-image clustering software. Our team worked on what eventually became OpenSSI. The system embodied a lot of very good ideas, but unfortunately was a victim of indifference and the “good is the enemy of the best” phenomenon. It’s still alive, after a fashion, but it never really took off. Personally, I think that our biggest mistake was in not thinking big enough.

Imagine a world in which your PDA, laptop computer, watch, DVR and desktop computer all talk to each other automatically, sharing not only data but computing cycles, memory, disk space and network connectivity. One in which the environment you use at home is exactly the same one that you use at work or in your car, right down to the fonts and wallpaper (not to mention the applications). One in which you never have to worry about a virus or trojan horse, since the authentication and authorization mechanisms necessary to prevent such things are built into the fabric of the system.

Imagine being notified on your laptop that your car needs gas or that it’s time for its regular maintenance since it has passed 3500 miles since the last one. Automatically, because agents that run all the time have contacted the car and gotten its state without your intervention.

Imagine being in the middle of some particularly important and knotty problem at work when it’s time to go home, then going home and picking up exactly where you left off, without doing anything special at all.

Imagine never, ever being offline because the mesh that your personal devices are a part of maintains its connectivity constantly, despite failure of individual components. Imagine building a sensor network or a compute cluster just by getting the appropriate hardware and doing some minor configuration.

Some of this is already happening, with meshes, grid computing, metropolitan-scale wireless networks, sensor nets and mobile ad-hoc networking. Some of it is a bit science-fictional. It’s all possible, though, even practical. All of the problems have been or are being solved. The challenge is in designing the system to be robust enough to be useful and then in implementing that design.

And I have a few ideas about how to do it.

Posted by Frank at 10:03 PM | Comments (2)

September 2, 2004

Tips and tricks

64-bit divides on Intel 32-bit machines.

I recently had a need to perform a divide of a 64-bit signed integer by a 32-bit unsigned integer in an Pentium IV-based realtime system (running RTLinux and Linux 2.4.22). After some grubbing around on Google I came across John Eckerdal’s Assembly Gems page, which features (among many other routines) a generalized routine for dividing 64-bit unsigned integers.

That routine is wonderful, but I needed to divide a signed 64-bit integer by an unsigned 32-bit integer. This is really just a special case of Eckerdal’s routine, though, with a small wrapper to handle the sign of the dividend, viz:

if (dividend < 0) {
	neg = 1;
	dividend = -dividend;
}
__asm__ __volatile__(
"	movl	%2, %%eax\n"	/* Dividend low into eax.     */
"	movl	%3, %%edx\n"	/* Dividend high into edx.    */
"	movl	%4, %%ebx\n"	/* Divisor into ebx.          */
"	xorl	%%ecx, %%ecx\n"	/* Clear temp quotient high.  */
"	cmpl	%%ebx, %%edx\n"	/* If dividend high < divisor,*/
				/* result will fit in 32 bits,*/
				/* so we can use one divide.  */
"	jb	1f\n"		/* Yes, skip next divide.     */
"	movl	%%eax, %%ecx\n" /* Save dividend low in ecx.  */
/* Divide <dividend high>:<dividend low> by divisor.          */
"	divl	%%ebx\n"	/* Quotient in eax, remainder */
				/* in edx.                    */
"	xchgl	%%ecx, %%eax\n"	/* Quotient high to ecx,      */
				/* dividend low to eax.       */
/* Divide <high remainder>:<dividend low> by divisor.         */
"1:	divl	%%ebx\n"	/* Quotient low to eax,       */
				/* remainder to edx.          */
"	movl	%%ecx, %1\n"	/* Quotient high to result high. */
"	movl	%%eax, %0"	/* Quotient low to result low.   */
	: "=g" (ll_low(result)), "=g" (ll_high(result))
	: "g" (ll_low(dividend)), "g" (ll_high(dividend)),
	  "g" (divisor)
	: "memory", "eax", "ebx", "ecx", "edx", "cc");
if (neg)
	result = -result;

The commentary pretty much says it all. This is made a lot simpler by the constraints of my problem, in that I only have to divide by a 32-bit unsigned quantity, so I can skip most of the work Eckerdal does. My only extension is to convert the dividend to unsigned beforehand and then convert the result back to signed afterwards.

Simple, effective, and many times faster than the default Gnu C __divdi3() call to do the same thing.

Posted by Frank at 8:31 AM | TrackBack (0)

July 15, 2004

Documenting the undocumented

The path of a forwarded packet.

I’ve just had occasion to figure out (with the help of kgdb) the path a forwarded packet takes through the Linux 2.6 stack. In terms of just called functions, here it is:

netif_receive_skb()	from network driver
ip_rcv()		via packet_type.func
ip_rcv_finish()		via NF_HOOK(PF_INET,
				    NF_IP_PRE_ROUTING,
				    skb, dev, NULL, ip_rcv_finish);
	calls ip_route_input() (which calls ip_route_input_slow())
	to route packet internally
dst_input()		directly
ip_forward()		via skb->dst.input(); set up by
			ip_route_input_slow()
	calls xfrm4_policy_check(NULL, XFRM_POLICY_FWD, skb)
		may drop packet if blocked here
	calls xfrm4_route_forward(skb)
		may drop packet if blocked here
ip_forward_finish()	via NF_HOOK(PF_INET, NF_IP_FORWARD,
				    skb, skb->dev,
				    rt->u.dst.dev, ip_forward_finish);
dst_output()
ip_output()		via skb->dst.output(), op cit
ip_finish_output()	if no fragmenting needed
ip_finish_output2()	via NF_HOOK(PF_INET, NF_IP_POST_ROUTING,
				    skb, NULL, dev, ip_finish_output2);
...then device output routine...
Posted by Frank at 1:48 PM | TrackBack (0)

June 16, 2004

Documenting the undocumented

Mapping the Linux networking stack.

The DataTAG folks have produced the final draft of “A Map of the Networking Code in Linux Kernel 2.4.20. It is an excellent reference for those working in Linux networking. Especially considering that the stuff is barely documented anywhere else.

They also refer to a set of lectures by Dr. Jerry Cooperstein at Axian concerning changes made in Linux 2.6, the slides for which are available on their pubs site. These are extremely sketchy but still somewhat useful, at least in that they give one some idea of where to look for the changes.

Posted by Frank at 12:55 PM | TrackBack (0)

May 4, 2004

Tips and tricks

64-bit atomic operations.

At least on the Intel architecture, it is possible to do atomic operations on 64-bit entities even when running on a 32-bit processor, as long as it’s a Pentium or later. This is courtesy of the cmpxchg8b instruction, which does a compare-and-exchange of the 64-bit value in %edx:%edx with the target. With the lock prefix, it’s atomic. This helps a lot when having to avoid locks; it turns out that the RTLinux scheduler has no provision to handle priority inversion, so use of a spinlock in certain situations can lead pretty quickly to deadlock. I had to implement a global ‘clock’ in 64 bits and this instruction saved my bacon.

To read the value:

__asm__ __volatile__(
"       xorl            %%eax, %%eax\n"
"       xorl            %%edx, %%edx\n"
"       xorl            %%ebx, %%ebx\n"
"       xorl            %%ecx, %%ecx\n"
"lock;  cmpxchg8b       %2\n"
"       movl            %%eax, %0\n"
"       movl            %%edx, %1"
        : "=m" (ll_low(out)), "=m" (ll_high(out))
        : "o" (target)
        : "memory", "eax", "ebx", "ecx", "edx", "cc");

The “ll_low()” and “ll_high()” macros are Linuxisms to isolate the low and high word, respectively, of a 64-bit entity. They are defined as:

#define ll_low(x)       *(((unsigned int*)&(x))+0)
#define ll_high(x)      *(((unsigned int*)&(x))+1)

To increment the target by a given value:

__asm__ __volatile__(
"       movl            %0, %%eax\n"
"       movl            %0+4, %%edx\n"
"       movl            %1, %%ebx\n"
"       xorl            %%ecx, %%ecx\n"
"1:    addl            %%eax, %%ebx\n"
"       adcl            %%edx, %%ecx\n"
"lock;  cmpxchg8b       %0\n"
"       jnz             1b"
        : "+o" (target)
        : "m" (incr)
        : "memory", "eax", "ebx", "ecx", "edx", "cc");

I loop and reincrement if the compare failed; in that case %edx:%eax is reloaded with the target and I have to retry the operation.

Enjoy. It’s a lot faster than using spinlocks.

Posted by Frank at 10:07 PM | Comments (9) | TrackBack (0)

March 21, 2004

Metadiscussion

A new weblog.

This is a place for me to post and keep various interesting bits and pieces of technical information I come across, derive or invent. It probably won’t be very useful to anyone else but if you want to read it, feel free.

Otherwise you might want to take a look at my political weblog, I Protest.

Posted by Frank at 9:42 PM | TrackBack (0)
Copyright 2004-2008 Frank Mayhar
All Rights Reserved