Other Tech Blogs
Links
Referrers
May 2004
Sun Mon Tue Wed Thu Fri Sat
            1
2 3 4 5 6 7 8
9 10 11 12 13 14 15
16 17 18 19 20 21 22
23 24 25 26 27 28 29
30 31          
Search


Powered by
Movable Type 3.34

May 4, 2004

Tips and tricks

64-bit atomic operations.

At least on the Intel architecture, it is possible to do atomic operations on 64-bit entities even when running on a 32-bit processor, as long as it's a Pentium or later. This is courtesy of the cmpxchg8b instruction, which does a compare-and-exchange of the 64-bit value in %edx:%edx with the target. With the lock prefix, it's atomic. This helps a lot when having to avoid locks; it turns out that the RTLinux scheduler has no provision to handle priority inversion, so use of a spinlock in certain situations can lead pretty quickly to deadlock. I had to implement a global 'clock' in 64 bits and this instruction saved my bacon.

To read the value:

__asm__ __volatile__(
"       xorl            %%eax, %%eax\n"
"       xorl            %%edx, %%edx\n"
"       xorl            %%ebx, %%ebx\n"
"       xorl            %%ecx, %%ecx\n"
"lock;  cmpxchg8b       %2\n"
"       movl            %%eax, %0\n"
"       movl            %%edx, %1"
        : "=m" (ll_low(out)), "=m" (ll_high(out))
        : "o" (target)
        : "memory", "eax", "ebx", "ecx", "edx", "cc");

The "ll_low()" and "ll_high()" macros are Linuxisms to isolate the low and high word, respectively, of a 64-bit entity. They are defined as:

#define ll_low(x)       *(((unsigned int*)&(x))+0)
#define ll_high(x)      *(((unsigned int*)&(x))+1)

To increment the target by a given value:

__asm__ __volatile__(
"       movl            %0, %%eax\n"
"       movl            %0+4, %%edx\n"
"       movl            %1, %%ebx\n"
"       xorl            %%ecx, %%ecx\n"
"1:    addl            %%eax, %%ebx\n"
"       adcl            %%edx, %%ecx\n"
"lock;  cmpxchg8b       %0\n"
"       jnz             1b"
        : "+o" (target)
        : "m" (incr)
        : "memory", "eax", "ebx", "ecx", "edx", "cc");

I loop and reincrement if the compare failed; in that case %edx:%eax is reloaded with the target and I have to retry the operation.

Enjoy. It's a lot faster than using spinlocks.

Posted by Frank at May 4, 2004 10:07 PM | TrackBack
Comments

I think I need code like this to deal with a 64-bit counter I have that is incremented at interrupt time and read concurrently (potentially) by other tasks or processors. However some of my co-workers see a statement by Intel that aligned 64-bit reads and writes are atomic and think there is no reason for the special code, EVEN though the compiler we are using emits 2 32-bit (back-to-back) MOVs to do the loads and stores. Opinion??

Posted by: Jim Knoke at June 16, 2004 1:02 PM

Where is that statement by Intel? As far as I know, this is the only way to do atomic 64-bit ops on 32-bit Intel x86 architecture. That is in fact what the instruction was designed for.

If you're dealing with a 64-bit processor, I have to say that I don't know; I haven't yet had the opportunity to deal with that. I would guess, though, that there are more 64-bit atomic ops available to you in that case.

Posted by: Frank at June 16, 2004 1:33 PM

We are looking at volume 3 of the IA-32 manual, section 7.1.1. Some of the confusion seems to have stemmed from the old PLOCK logic in the 486 that apparently automatically locked 2 bus cycles in some situations, but apparently that logic is gone. Also there was speculation that the CPU would see multiple memory ops that were "close" together in the instruction stream and combine them into the same bus cycle where possible.

Anyway I agree with you.

At first, I thought the read code you show didn't need the lock prefix (since the 64-bit memory read should be atomic). But then I see that the instruction is going to issue a write, even when it is writing the same value it read.

Posted by: Jim Knoke at June 21, 2004 11:25 AM

I have a sneaking suspicion that the confusion stems from Intel's use of the term "word" when they say, in that section, that reading or writing a byte, word aligned on a 16-bit boundary or doubleword aligned on a 32-bit boundary is guaranteed to be atomic. Throughout the three volumes, they use "word" to imply 16 bits, what to folks in the 32-bit world is a "short" or halfword. So their guarantee of the atomicity of "doubleword" loads and stores is only a guarantee of the atomicity of 32-bit loads and stores.

As far as I know, according to the published documentation, the locked cmpxchg8b instruction is the only way to guarantee atomicity of a 64-bit operation on any ia32 Intel processor.

Posted by: Frank at June 23, 2004 8:06 AM

The newer versions of the Intel manual also claim atomicity of "quadword" reads and writes, as long as the item is quadword aligned.

I think this means that a movq, for example, would be atomic (without any lock prefix). cmpxchg8 I think would do atomic transfers, but is not atomic overall (without the lock prefix) because it is R-M-W.

I have people here still telling me that Linux reads a 64-bit aligned quantity (timer?) *without* any special instructions and without any locking. And these people conclude that we don't need to do anything special either (for a simple 64-bit read), even though our compiler is doing the read with 2 separate 32-bit MOVs.

I have sent an e-mail to Intel, but don't really expect to get a real answer. I have perused other Intel hardware manuals and can't find the definitive answer. I don't see a great way to test it, given that I don't have hardware diagnostic tools.

Posted by: Jim Knoke at June 28, 2004 8:31 AM

Yes, I see that in Volume 3, the System Programming Guide, on page 7-3, section 7.1.1, "Guaranteed Atomic Operations," they write

The Pentium 4, Intel Xeon, and P6 family, and Pentium processors guarantee that the following additional memory operations will always be carried out atomically:
  • reading or writing a quadword aligned on a 64-bit boundary
  • 16-bit accesses to uncached memory locations that fit within a 32-bit data bus

That tells me that a 64-bit aligned access will be atomic but I have to assume that that's only if it is done in one operation. This means, yes, using a movq or one of the related operations. The problem I have with this is that movq and siblings use the "XMM" or "MMX" registers; these are associated with the SSE and MMX extended instructions and are really intended for multimedia operations. I can state categorically that your coworkers are incorrect with respect to the Linux timer. From commentary in kernel/timer.c, lines 905-906 (my line numbers may differ from the vanilla distribution):

The 64-bit jiffies value is not atomic - you MUST NOT read it without sampling the sequence number in xtime_lock.

There are occasions where one can compromise on atomicity to gain speed, but that's a different topic.

Posted by: Frank at August 5, 2004 10:41 PM

Thanks for posting this. I'm trying to get it to work in gas/gcc, but bump into this translation of %0+4... "-28(%ebp)+4" is a syntax error. What is the magic to get the half of the counter directly into a register instead of referenced via the ebp reg?

movl -28(%ebp), %eax
movl -28(%ebp)+4, %edx
movl -32(%ebp), %ebx
xorl %ecx, %ecx
1: addl %eax, %ebx
adcl %edx, %ecx
lock; cmpxchg8b -28(%ebp)
jnz 1b

Posted by: Ken at August 6, 2004 10:34 AM

This is where you want to use the ll_low()/ll_high() incantation as in the first example. The second example expects a static rather than a stack-relative address; for the latter you want to turn the second movl into

	movl %1,%%edx

(bumping the later parameters accordingly). Then for the target use ll_low(target)/ll_high(target). It will produce (in your example)

	movl -28(%ebp), %eax
	movl -24(%ebp),%edx

The ll_low()/ll_high() macros are defined in asm/system.h.

Posted by: Frank at August 6, 2004 10:57 AM

In the atomic add, doesn't ebx:ecx need to be reloaded if the exchange fails? Otherwise eax:edx will be added to ebx:ecx again.

__asm__ __volatile__(
" movl %0, %%eax\n"
" movl %0+4, %%edx\n"
"1: movl %1, %%ebx\n"
" xorl %%ecx, %%ecx\n"
" addl %%eax, %%ebx\n"
" adcl %%edx, %%ecx\n"
"lock; cmpxchg8b %0\n"
" jnz 1b"
: "+o" (target)
: "m" (incr)
: "memory", "eax", "ebx", "ecx", "edx", "cc");

Posted by: injinj at November 5, 2004 3:54 AM

All Rights Reserved