<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" 
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
    xmlns:admin="http://webns.net/mvcb/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:content="http://purl.org/rss/1.0/modules/content/">

  <channel>
    <title>Frank&apos;s Technotes</title>
    <link>http://www.exit.com/blog/technotes/</link>
    <description>A place for interesting technical tidbits, links, algorithms and other useful pieces of information.</description>
    <dc:language>en-us</dc:language>
    <dc:creator>frank@exit.com</dc:creator>
    <dc:rights>Copyright 2008 Frank Mayhar</dc:rights>
    <dc:date>2008-04-08T11:17:33-08:00</dc:date>
    <admin:generatorAgent rdf:resource="http://www.movabletype.org/?v=3.34" />
    <admin:errorReportsTo rdf:resource="mailto:frank@exit.com"/>
    <sy:updatePeriod>hourly</sy:updatePeriod>
    <sy:updateFrequency>1</sy:updateFrequency>
    <sy:updateBase>2000-01-01T12:00+00:00</sy:updateBase>

    <item>
      <title>A quick note regarding Linux &quot;tasks.&quot;</title>
      <link>http://www.exit.com/blog/archives/000433.html</link>
      <description>In an exchange regarding a race between timer ticks (and updates of task-&gt;utime versus task-&gt;signal-&gt;(somepointer)-&gt;utime), Roland McGrath told me that current-&gt;exit_state can go from zero to nonzero only by current running code in the do_exit path. current does not progress...</description>
      <guid isPermaLink="false">433@http://www.exit.com/blog/technotes/</guid>
      <content:encoded><![CDATA[<p>In an exchange regarding a race between timer ticks (and updates of task->utime versus task->signal->(somepointer)->utime), Roland McGrath told me that
</p>
<blockquote>
current->exit_state can go from zero to nonzero only by current running code in the do_exit path.  current does not progress on that path while current is inside one update_process_times call.
</blockquote>
<p>The key bit of knowledge that I was lacking and that this (when placed with other information I had gained in the extended email exchange) was that a "task" is really a Linux scheduling unit. From the scheduler's perspective, "task" is the same as "thread."  The only thing that makes a set of threads into a multithreaded process is that they share a signal_struct (and their memory map, of course, and a several other things that, as Roland says, are implicitly required to be shared when signal_struct is shared).  So a "task" can only be executed on a single cpu at any time, it can't be executed on more than one cpu at a time.  Therefore if a "task" is executing and is interrupted, the value of "current" at the interrupt will be that task, which is entirely suspended for the duration of the interrupt.</p>]]></content:encoded>
      <dc:subject>Documenting the undocumented</dc:subject>
      <dc:date>2008-04-08T11:17:33-08:00</dc:date>
    </item>
    <item>
      <title>A few hard-won lessons regarding software and coding.</title>
      <link>http://www.exit.com/blog/archives/000432.html</link>
      <description>This is a short list of Rules to Live By, as generated by some twenty years of experience writing code that, at one time or another, undoubtedly broke all of these rules. This is about programming in C and mostly...</description>
      <guid isPermaLink="false">432@http://www.exit.com/blog/technotes/</guid>
      <content:encoded><![CDATA[<p>This is a short list of Rules to Live By, as generated by some twenty years of experience writing code that, at one time or another, undoubtedly broke all of these rules.  This is about programming in C and mostly in the kernel, so add spices as needed for taste.</p>

<p>Always bear in mind that it might be <strong>YOU</strong> who in five years has to come back and maintain the code you're writing.  As anyone who has done this knows, if you haven't looked at a piece of code in a few years, it may as well have been written by someone else.</p>

<p>In general, no matter what your favorite indentation scheme might be, use the indentation scheme of the code that you're changing.  Mixed indentation just makes the code more confusing and harder to read.  Many, many bugs have been overlooked because the screwed-up indentation made the code look right when it in fact was broken.</p>

<p>Only change indentation when doing a wholesale overhaul of a file or routine.  If it's your new file, use whatever scheme suits you, but if it's existing code, don't gratuitously change it.  Gratuitous whitespace changes just increase the size of the changeset and make it more difficult to track the history of the code.  Only change indentation when you're essentially replacing a routine or when the indentation is already screwed up and you're making it consistent <em>while</em> you replace most of the code.</p>

<p><strong>USE COMMENTARY</strong>!  C is not self-documenting.  I had a disagreement not long ago with someone who claimed that well-written code doesn't need commentary.  This is someone who codes almost exclusively in object-oriented languages, but even there I think that he's just wrong.  Commentary is absolutely necessary.  Of course, if you write commentary
of the form</p>
<pre>
        a++;    /* Add one to a. */
</pre>
<p>then you're asking for me to come slap you.  The commentary should be concise and should provide the information that the code itself does not provide.  The code itself provides the "what," good commentary provides the "why."  Decent commentary also enumerates the assumptions
made by the person writing the code.  If code has side effects, commentary should list them and point to the code affected.  If there are races, the races should be described and, if it's a relatively complex race, commentary should give the solution rationale.  If there are invisible assumptions behind the code (along the lines of "the value of 'x' is filled in by another thread" or "this race doesn't matter because we don't care who wins" or even "we have to do things in this particular way because our algorithm depends on doing so, for these reasons..."), <strong>COMMENT THOSE ASSUMPTIONS</strong>.  This is probably the single most important kind of commentary one can write.</p>

<p>If you change the code and, due to your change, the commentary no longer matches what the code does, fix it!  If the commentary is broken, the code is broken.</p>

<p>Never reinvent the wheel.  If there's already a routine that does the job, use it.  If there isn't, look again before writing your own.  It may be that a very minor change will generalize an existing routine.  The less code you write, the fewer bugs you'll introduce.</p>

<p>"Measure twice, cut once."  No amount of after-the-fact bug fixing can compensate for insufficient design.  I have found that by spending most of my time on design, the actual implementation goes amazingly quickly and the only bugs tend to be minor ones.  If you find yourself constantly adding and removing code to cover cases you originally didn't handle, you didn't spend enough time on the design.  If you do it right, even if you don't actually code all the cases up front, adding those cases when necessary is pretty trivial.</p>

<p>Don't try to write C++/FORTRAN/&lt;your favorite language&gt; when you're actually writing C.  Use C constructs, don't try to force the shape of another language on it.  One can certainly write object-oriented code in C even though the language doesn't support all the C++/Java/etc. bells and whistles.  Just because something doesn't obviously <em>look</em> object-oriented doesn't mean that it isn't.  (This is another place where good commentary comes in, btw.)</p>

<p>Your code will be around long after you've gone on to other things.  It might even outlive you.  Try to make sure that those who have to look at it after you find reason to appreciate your ability to write clean, well-commented and clear code that's easy to understand and maintain.  Otherwise, you could have those people cursing you loudly and at length.  Remember, that person cursing you could, in fact, be YOU, five years from now.</p>

<p>If you want a very, very good negative example, just look at the xfrm code in Linux 2.6.  I did.  I had to change it.  I can only hope that the person who wrote it is condemned to maintain it.</p>]]></content:encoded>
      <dc:subject>Software Engineering</dc:subject>
      <dc:date>2007-10-18T15:21:16-08:00</dc:date>
    </item>
    <item>
      <title>My OpenSSI contributions.</title>
      <link>http://www.exit.com/blog/archives/000430.html</link>
      <description> Let me preface this by pointing out that the overwhelming majority of my contributions were made prior to August 2000, when I left Compaq for a position at BSDi. Despite that fact, however, perusal of the code itself shows...</description>
      <guid isPermaLink="false">430@http://www.exit.com/blog/technotes/</guid>
      <content:encoded><![CDATA[<p>
Let me preface this by pointing out that the overwhelming majority of my
contributions were made prior to August 2000, when I left Compaq for a
position at BSDi.  Despite that fact, however, perusal of the code itself
shows me that it has changed little in the intervening seven years, and
then for the most part only in terms of bug fixes and minor enhancements.
</p>
<p>
During my six-year stint as part of the SSI cluster group headed by Bruce
Walker at four different companies, I worked on most parts of the code,
including many that were specific to SVR4 or Unixware.  Of the parts that
made it to OpenSSI and CI-Linux, I did minor work in the virtual process
(vproc) support code, the IPC code and a few other areas.  The majority of
my work, though, and the work of which I am most proud, was in the Cluster
Membership Service (CLMS) and related work in the Internode Communication
Subsystem (ICS).
</p>
<p>
In 1995 our company, Locus, became part of Platinum Technologies.  At that time
our largest customer was Tandem Computers; we understood that we would
write a skeletal CLMS and the folks at Tandem would replace it with a
more complete one.  In 1996 Platinum sold the group and the software to
Tandem, at which point we found out that the module the Tandem folks had
intended to replace our CLMS was not sufficient for the needs of the
cluster itself.  We therefore had to replace an interim, skeletal CLMS
with one that was more complete.  That ended up being my job.
</p>
<p>
From 1996 to roughly 1998 I worked on that, splitting the existing single
file into several and adding a large number of new features.  Today CLMS
is still structured largely as I designed it then.  Among the new features
was support for elimination of single points of failure for certain key
services needed by the cluster, such as the node running the init process
or the "surrogate origin" node for processes whose true origin node has
left the cluster.  It allowed more than one node to offer those key
services and, if the node currently running one or more key services
failed, it performed a graceful reassignment of those services to a new
node.
</p>
<p>
In addition to the key service work I did a lot of work in CLMS on the
details of handling nodes coming into and going out of the cluster.  While
when I left there was no facility for a graceful node exit, we of course
had to handle ungraceful exits, such as when a node crashed.  This is
well known to be a pretty complex problem.  I generally solved it by
serializing nodeups, allowing only one node to come in at a time, and
blocking nodeups entirely when nodedown processing was going on.  Nodedown
processing of course cannot proceed one node at a time, since such
processing for one down node may depend on cleaning up a resource actually
held by another node which is also down.  I could go into more detail that
would be overkill for a "brief" discussion, so suffice it to say that the
problem was complex and I spent a lot of time on it.
</p>
<p>
The other major piece of CLMS work was in late 1997 and early 1998 and
was actually a combination of two requirements.  First, there was a
requirement that we eliminate the single point of failure for CLMS itself,
which at the time ran on only one node for the life of the cluster.  (And
when I say "for the life of the cluster," I mean <u>by definition</u>, because
losing the single CLMS master meant losing the entire cluster.)  This was
originally dealt with by the simple expedient of arbitrarily assigning a
takeover node for the CLMS master, which polled all the nodes it knew
and determined the cluster membership from the intersection of the
responses and its own list of "up" nodes.  Any node thought to be up some
nodes but not by others was kicked out unconditionally.  This scheme,
however, was seriously complicated by the second requirement.
</p>
<p>
The second requirement was that applications must see cluster- and node-
state transitions as coherent and monotonic with respect to time.  In
other words, a process that migrates from one node to another must never
see on the destination node a cluster or node state that is older than the
state seen on the origin node.  One other problem with which I had to
deal was potentially node-state-sensitive data in transit between nodes
when a state transition took place, a longstanding problem that now had a
potential solution.
</p>
<p>
This ended up being a more complex problem that the one I described
earlier, but as this is getting long I'll only describe it briefly.  The
design called for a vector of node transitions and a monotonic timestamp
as the mechanism used to order transitions, along with a two-stage commit
mechanism and a lock used to block applications from seeing transitions
while they were taking place.  Each node in the cluster kept a history of
outstanding or recent state transitions; the new CLMS constructed a list
of "canonical" transitions from the union of the the history kept by each
node in the cluster, inserting dummy transitions as necessary to keep the list consistent.  It then
rolled that list forward on all nodes, at the end of which process all
nodes would agree on the state of the cluster and all nodes therein.  Of
course, there were other cases with which I had to deal, but this was the
overall design.
</p>
<p>
One note:  It's not clear in the above description, but CLMS has a quite
close relationship to ICS and therefore I did a lot of work there, as
well.
</p>
<p>
It was a fun and interesting project and quite complex.  The nicest thing
about it, though, was when I looked at the code again, for the first time
in (at the time) five years, in 2003.  To my mild surprise I found the
code understandable, mostly because I went to great lengths to make it so
when I wrote it.
</p>
<p>
While most of the documents supporting this are lost to history or dusty
archives, the source itself is available online.  The most interesting
bits that relate to this document are on <a href="http://sourceforge.net/projects/ci-linux/" target="_top">the CI-Linux page</a>.

The source itself is <a href="http://ci-linux.cvs.sourceforge.net/ci-linux/ci/kernel/cluster/clms/">here</a>
and <a href="http://ci-linux.cvs.sourceforge.net/ci-linux/ci/kernel/cluster/ics/">here</a>.

Interesting files are pretty much everything beginning with "clms_," in
particular clms_api.c, clms_client.c, clms_failover.c, clms_master.c,
clms_mgmt.c, clms_subr.c and clms_svcmgmt.c.  Also much of the ics_ stuff.
</p>
As it happens, most of the routines with commentary that looks like the
example below were mine, although by this point most of them have changes
made by others:
<pre>
/*
 * icssvr_llhandle_init()
 *	Initialize a handle that has not been used
 *
 * Description:
 *	This routine is called by ICS high level code to initialize the
 *	low-level ICS portion of a handle.
 *
 * Parameters:
 *	handle		the handle to be initialized
 *	sleep_flag	this routine can sleep waiting for resources/memory
 *
 * Return value:
 *	If sleep_flag is TRUE, then this routine always returns 0 (success).
 */
 </pre>]]></content:encoded>
      <dc:subject>Operating Systems</dc:subject>
      <dc:date>2007-02-21T13:02:21-08:00</dc:date>
    </item>
    <item>
      <title>UFS2 version of ffsrecov for FreeBSD.</title>
      <link>http://www.exit.com/blog/archives/000429.html</link>
      <description>I just released a heavily modified version of John-Mark Gurney&apos;s ffsrecov, adapted to use libufs and to work (only) with UFS2 file systems. I call it ffs2recov and it is available here. I wrote this a couple of years ago...</description>
      <guid isPermaLink="false">429@http://www.exit.com/blog/technotes/</guid>
      <content:encoded><![CDATA[<p>I just released a heavily modified version of John-Mark Gurney's ffsrecov, adapted to use libufs and to work (only) with UFS2 file systems.  I call it ffs2recov and it is available <a href="http://www.exit.com/Archives/FreeBSD/ffs2recov-1.0.tar.bz2">here</a>.</p>

<p>I wrote this a couple of years ago so that I could recover a file system that had been stomped by a misconfigured RAID controller.  It worked well enough for me to recover a couple of hundred gigabytes worth of data, which was a great relief (although some stuff was gone forever, sigh).  I had intended to polish it up and release it long before now, but I've never managed to get around to doing the polishing.  In particular, while it has a nice little summary of implemented options, the manpage needs a lot of work.</p>

<p>On the positive side, however, I extended it to be a lot more robust in the face of corrupt pointers and file system offsets, so it doesn't just fall over when it sees garbage in a block address or whatnot.</p>

<p>I'm releasing it under the BSD two-clause license, with due credit to John-Mark.  It's my hope that someone else will take it, clean it up a bit, rewrite the manpage and maybe make a port out of it.  If you do and you need a place to host the distfile, let me know.</p>

<p>Since I released it, a couple of people have cleaned it up a <i>tiny</i> bit and I've integrated their changes.  There's now a port in the FreeBSD ports collection, sysutils/ffs2recov.</p>]]></content:encoded>
      <dc:subject>Operating Systems</dc:subject>
      <dc:date>2007-02-07T22:36:06-08:00</dc:date>
    </item>
    <item>
      <title>A new way of doing things.</title>
      <link>http://www.exit.com/blog/archives/000416.html</link>
      <description>I just read The Internet Is Broken at Technology Review. In it, David Clark at MIT asserts that the Internet needs to be, not fixed, but replaced with something better, something designed from the start to deal with the problems...</description>
      <guid isPermaLink="false">416@http://www.exit.com/blog/technotes/</guid>
      <content:encoded><![CDATA[<p>I just read <a title="The Impact of Emerging Technologies: The Internet Is Broken" href="http://www.technologyreview.com/infotech/wtr_16051,258,p1.html">The Internet Is Broken</a> at <a title="The Impact of Emerging Technologies" href="http://www.technologyreview.com/index.aspx">Technology Review</a>.  In it, David Clark at MIT asserts that the Internet needs to be, not fixed, but replaced with something better, something designed from the start to deal with the problems and the sheer size of the current Internet.</p>

<p>I think he's right.  In fact, I've been thinking along those lines for some time, only I've had something a bit more radical in mind.</p>

<p>From 1994 to 2000 I worked with Bruce Walker and others through four different companies on single-system-image clustering software.  Our team worked on what eventually became <a href="http://www.openssi.org/">OpenSSI</a>.  The system embodied a lot of very good ideas, but unfortunately was a victim of indifference and the "good is the enemy of the best" phenomenon.  It's still alive, after a fashion, but it never really took off.  Personally, I think that our biggest mistake was in not thinking big enough.</p>

<p>Imagine a world in which your PDA, laptop computer, watch, DVR and desktop computer all talk to each other automatically, sharing not only data but computing cycles, memory, disk space and network connectivity.  One in which the environment you use at home is exactly the same one that you use at work or in your car, right down to the fonts and wallpaper (not to mention the applications).  One in which you never have to worry about a virus or trojan horse, since the authentication and authorization mechanisms necessary to prevent such things are built into the fabric of the system.</p>

<p>Imagine being notified on your laptop that your car needs gas or that it's time for its regular maintenance since it has passed 3500 miles since the last one.  Automatically, because agents that run all the time have contacted the car and gotten its state without your intervention.</p>

<p>Imagine being in the middle of some particularly important and knotty problem at work when it's time to go home, then going home and picking up exactly where you left off, without doing anything special at all.</p>

<p>Imagine never, ever being offline because the mesh that your personal devices are a part of maintains its connectivity constantly, despite failure of individual components.  Imagine building a sensor network or a compute cluster just by getting the appropriate hardware and doing some minor configuration.</p>

<p>Some of this is already happening, with meshes, grid computing, metropolitan-scale wireless networks, sensor nets and mobile ad-hoc networking.  Some of it is a bit science-fictional.  It's all possible, though, even practical.  All of the problems have been or are being solved.  The challenge is in designing the system to be robust enough to be useful and then in implementing that design.</p>

<p>And I have a few ideas about how to do it.</p>]]></content:encoded>
      <dc:subject>Operating Systems</dc:subject>
      <dc:date>2006-01-05T22:03:07-08:00</dc:date>
    </item>
    <item>
      <title>64-bit divides on Intel 32-bit machines.</title>
      <link>http://www.exit.com/blog/archives/000383.html</link>
      <description>I recently had a need to perform a divide of a 64-bit signed integer by a 32-bit unsigned integer in an Pentium IV-based realtime system (running RTLinux and Linux 2.4.22). After some grubbing around on Google I came across John...</description>
      <guid isPermaLink="false">383@http://www.exit.com/blog/technotes/</guid>
      <content:encoded><![CDATA[<p>I recently had a need to perform a divide of a 64-bit signed integer by a 32-bit unsigned integer in an Pentium IV-based realtime system (running RTLinux and Linux 2.4.22).  After some grubbing around on <a href="http://www.google.com/">Google</a> I came across John Eckerdal's <a title="The Assembly Gems Page" href="http://www.df.lth.se/~john_e/gems.html">Assembly Gems</a> page, which features (among many other routines) a generalized routine for <a title="Dividing 64-bit unsigned integers" href="http://www.df.lth.se/~john_e/gems/gem002a.html">dividing 64-bit unsigned integers</a>.</p>

<p>That routine is wonderful, but I needed to divide a <i>signed</i> 64-bit integer by an unsigned <i>32-bit</i> integer.  This is really just a special case of Eckerdal's routine, though, with a small wrapper to handle the sign of the dividend, viz:</p>

<pre>
if (dividend &lt; 0) {
	neg = 1;
	dividend = -dividend;
}
__asm__ __volatile__(
"	movl	%2, %%eax\n"	/* Dividend low into eax.     */
"	movl	%3, %%edx\n"	/* Dividend high into edx.    */
"	movl	%4, %%ebx\n"	/* Divisor into ebx.          */
"	xorl	%%ecx, %%ecx\n"	/* Clear temp quotient high.  */
"	cmpl	%%ebx, %%edx\n"	/* If dividend high < divisor,*/
				/* result will fit in 32 bits,*/
				/* so we can use one divide.  */
"	jb	1f\n"		/* Yes, skip next divide.     */
"	movl	%%eax, %%ecx\n" /* Save dividend low in ecx.  */
/* Divide &lt;dividend high&gt;:&lt;dividend low&gt; by divisor.          */
"	divl	%%ebx\n"	/* Quotient in eax, remainder */
				/* in edx.                    */
"	xchgl	%%ecx, %%eax\n"	/* Quotient high to ecx,      */
				/* dividend low to eax.       */
/* Divide &lt;high remainder&gt;:&lt;dividend low&gt; by divisor.         */
"1:	divl	%%ebx\n"	/* Quotient low to eax,       */
				/* remainder to edx.          */
"	movl	%%ecx, %1\n"	/* Quotient high to result high. */
"	movl	%%eax, %0"	/* Quotient low to result low.   */
	: "=g" (ll_low(result)), "=g" (ll_high(result))
	: "g" (ll_low(dividend)), "g" (ll_high(dividend)),
	  "g" (divisor)
	: "memory", "eax", "ebx", "ecx", "edx", "cc");
if (neg)
	result = -result;
</pre>

<p>The commentary pretty much says it all.  This is made a lot simpler by the constraints of my problem, in that I only have to divide by a 32-bit unsigned quantity, so I can skip most of the work Eckerdal does.  My only extension is to convert the dividend to unsigned beforehand and then convert the result back to signed afterwards.</p>

<p>Simple, effective, and many times faster than the default Gnu C <i>__divdi3()</i> call to do the same thing.</p>]]></content:encoded>
      <dc:subject>Tips and tricks</dc:subject>
      <dc:date>2004-09-02T08:31:39-08:00</dc:date>
    </item>
    <item>
      <title>The path of a forwarded packet.</title>
      <link>http://www.exit.com/blog/archives/000375.html</link>
      <description>I&apos;ve just had occasion to figure out (with the help of kgdb) the path a forwarded packet takes through the Linux 2.6 stack. In terms of just called functions, here it is: netif_receive_skb() from network driver ip_rcv() via packet_type.func ip_rcv_finish()...</description>
      <guid isPermaLink="false">375@http://www.exit.com/blog/technotes/</guid>
      <content:encoded><![CDATA[<p>I've just had occasion to figure out (with the help of kgdb) the path a forwarded packet takes through the Linux 2.6 stack.  In terms of just called functions, here it is:</p>

<pre>
netif_receive_skb()	from network driver
ip_rcv()		via packet_type.func
ip_rcv_finish()		via NF_HOOK(PF_INET,
				    NF_IP_PRE_ROUTING,
				    skb, dev, NULL, ip_rcv_finish);
	calls ip_route_input() (which calls ip_route_input_slow())
	to route packet internally
dst_input()		directly
ip_forward()		via skb->dst.input(); set up by
			ip_route_input_slow()
	calls xfrm4_policy_check(NULL, XFRM_POLICY_FWD, skb)
		may drop packet if blocked here
	calls xfrm4_route_forward(skb)
		may drop packet if blocked here
ip_forward_finish()	via NF_HOOK(PF_INET, NF_IP_FORWARD,
				    skb, skb->dev,
				    rt->u.dst.dev, ip_forward_finish);
dst_output()
ip_output()		via skb->dst.output(), op cit
ip_finish_output()	if no fragmenting needed
ip_finish_output2()	via NF_HOOK(PF_INET, NF_IP_POST_ROUTING,
				    skb, NULL, dev, ip_finish_output2);
...then device output routine...
</pre>
]]></content:encoded>
      <dc:subject>Documenting the undocumented</dc:subject>
      <dc:date>2004-07-15T13:48:48-08:00</dc:date>
    </item>
    <item>
      <title>Mapping the Linux networking stack.</title>
      <link>http://www.exit.com/blog/archives/000369.html</link>
      <description>The DataTAG folks have produced the final draft of &quot;A Map of the Networking Code in Linux Kernel 2.4.20. It is an excellent reference for those working in Linux networking. Especially considering that the stuff is barely documented anywhere else....</description>
      <guid isPermaLink="false">369@http://www.exit.com/blog/technotes/</guid>
      <content:encoded><![CDATA[<p>The <a href="http://datatag.web.cern.ch/datatag/">DataTAG</a> folks have produced the final draft of "<a title="A Map of the Networking Code in Linux Kernel 2.4.20" href="http://datatag.web.cern.ch/datatag/papers/tr-datatag-2004-1.pdf">A Map of the Networking Code in Linux Kernel 2.4.20</a>.  It is an excellent reference for those working in Linux networking.  Especially considering that the stuff is barely documented anywhere else.</p>

<p>They also refer to a set of lectures by Dr. Jerry Cooperstein at <a title="Axian - Software Consulting and Training" href="http://www.axian.com/">Axian</a> concerning changes made in Linux 2.6, the slides for which are available <a href="http://www.axian.com/publications.php" title="Axian - Publications">on their pubs site</a>.  These are extremely sketchy but still somewhat useful, at least in that they give one some idea of where to look for the changes.</p>]]></content:encoded>
      <dc:subject>Documenting the undocumented</dc:subject>
      <dc:date>2004-06-16T12:55:44-08:00</dc:date>
    </item>
    <item>
      <title>64-bit atomic operations.</title>
      <link>http://www.exit.com/blog/archives/000361.html</link>
      <description>At least on the Intel architecture, it is possible to do atomic operations on 64-bit entities even when running on a 32-bit processor, as long as it&apos;s a Pentium or later. This is courtesy of the cmpxchg8b instruction, which does...</description>
      <guid isPermaLink="false">361@http://www.exit.com/blog/technotes/</guid>
      <content:encoded><![CDATA[<p>At least on the Intel architecture, it is possible to do atomic operations on 64-bit entities even when running on a 32-bit processor, as long as it's a Pentium or later.  This is courtesy of the cmpxchg8b instruction, which does a compare-and-exchange of the 64-bit value in %edx:%edx with the target.  With the lock prefix, it's atomic.  This helps a lot when having to avoid locks; it turns out that the RTLinux scheduler has no provision to handle priority inversion, so  use of a spinlock in certain situations can lead pretty quickly to deadlock.  I had to implement a global 'clock' in 64 bits and this instruction saved my bacon.</p>

<p>To read the value:</p>

<pre>
__asm__ __volatile__(
"       xorl            %%eax, %%eax\n"
"       xorl            %%edx, %%edx\n"
"       xorl            %%ebx, %%ebx\n"
"       xorl            %%ecx, %%ecx\n"
"lock;  cmpxchg8b       %2\n"
"       movl            %%eax, %0\n"
"       movl            %%edx, %1"
        : "=m" (ll_low(out)), "=m" (ll_high(out))
        : "o" (target)
        : "memory", "eax", "ebx", "ecx", "edx", "cc");
</pre>

<p>The "ll_low()" and "ll_high()" macros are Linuxisms to isolate the low and high word, respectively, of a 64-bit entity.  They are defined as:</p>

<pre>
#define ll_low(x)       *(((unsigned int*)&(x))+0)
#define ll_high(x)      *(((unsigned int*)&(x))+1)
</pre>

<p>To increment the target by a given value:</p>

<pre>
__asm__ __volatile__(
"       movl            %0, %%eax\n"
"       movl            %0+4, %%edx\n"
"       movl            %1, %%ebx\n"
"       xorl            %%ecx, %%ecx\n"
"1:    addl            %%eax, %%ebx\n"
"       adcl            %%edx, %%ecx\n"
"lock;  cmpxchg8b       %0\n"
"       jnz             1b"
        : "+o" (target)
        : "m" (incr)
        : "memory", "eax", "ebx", "ecx", "edx", "cc");
</pre>

<p>I loop and reincrement if the compare failed; in that case %edx:%eax is reloaded with the target and I have to retry the operation.</p>

<p>Enjoy.  It's a <i>lot</i> faster than using spinlocks.<br />
</p>]]></content:encoded>
      <dc:subject>Tips and tricks</dc:subject>
      <dc:date>2004-05-04T22:07:39-08:00</dc:date>
    </item>
    <item>
      <title>A new weblog.</title>
      <link>http://www.exit.com/blog/archives/000351.html</link>
      <description>This is a place for me to post and keep various interesting bits and pieces of technical information I come across, derive or invent. It probably won&apos;t be very useful to anyone else but if you want to read it,...</description>
      <guid isPermaLink="false">351@http://www.exit.com/blog/technotes/</guid>
      <content:encoded><![CDATA[<p>This is a place for me to post and keep various interesting bits and pieces of technical information I come across, derive or invent.  It probably won't be very useful to anyone else but if you want to read it, feel free.</p>

<p>Otherwise you might want to take a look at my political weblog, <a href="http://www.exit.com/blog/frank/" title="I Protest.">I Protest.</a><br />
</p>]]></content:encoded>
      <dc:subject>Metadiscussion</dc:subject>
      <dc:date>2004-03-21T21:42:43-08:00</dc:date>
    </item>


  </channel>
</rss>