From: bmc@kiowa.eng.sun.com (Bryan Cantrill)
Subject: Re: But why?
Date: 1996/10/29
Message-ID: <5543dq$qjd@engnews2.Eng.Sun.COM>
references: <54990q$n5e@caip.rutgers.edu>
organization: Brown University
keywords: none
newsgroups: comp.sys.sun.hardware
In article <54990q$n5e@caip.rutgers.edu>,
David S. Miller wrote:
>bacon@mtu.edu (Jeff Bacon) writes:
>
>Since I have been able to find an intelligent posting in this thread,
>I will respond to it and explain what I can as chief architect of the
>SparcLinux port.
>
>> of course, then, the obvious question that comes up is WHY is it that
>> solaris has such higher overhead costs in doing things?
>>
>> obviously there's more code to work through to do any given thing.
>> someone must have thought it necessary. but...why?
>>
>> obviously it's got lots of extra crud from SVR4. why not pitch it?
>>
>
>The answer to this is pretty straight forward actually. The main
>points of interest are:
>
>1) Solaris's networking stack, in all of it's incantations (one breed
> of it was the Lochman code in 2.0, 2.1 and early 2.2 releases, then
> it was rewritten by another company for 2.3 onward) is SVR4 streams
> based. The performance penalty, even with lots of tricks, for
> using a SVR4 streams networking architecture is well known.
> Someone who happens to have a 2.2 Solaris CD around, or even a 2.3
> Solaris CD, should install that thing and run lmbench on it to see
> what "pure Streams based networking" without the tricks can really
> do.
>
> Linux on the other hand has a "no bullshit" networking architecture
> that is not streams based. Yet we also take advantage of the many
> known networking performance enhancements that exist in the
> research realm (ie. copy/checksum, the van jacobson hacks, etc.)
>
>2) Linux is light weight, Solaris is a pig.
>
> One of the most critical things that contributes to performance
> is cache/tlb footprint of the operating system. Linux being small
> (yet still provide a full POSIX unix environment!) solves the cache
> footprint problem in a big way. I've solved the TLB footprint
> using Linux's small size and a Sparc specific trick.
>
> The MMU's present on the sun4m/sun4d line of Sun machines possess a
> three level page table scheme. Using this, one has the capability
> to use the normal 4k sizes pages, and also larger 256k and 16MB
> sized pages. The average TLB on these machines has 32 or 64
> entries to cache these pte's, if the entry is not in the TLB
> hardware has to go out to the memory bus and walk the software page
> tables to "reload" the TLB so that the translation can be
> satisfied.
>
> This "miss processing" is very expensive. Under SunOS and Solaris,
> they do not take advantage of the 16MB and 256k sized pages to map
> the operating system. Therefore those two systems take many misses
> in the TLB during even the most rudimentary trap into the kernel.
> However under Linux the TLB misses for the OS are quite minimal.
> In fact I will gave an example:
>
> On your average SparcClassic with a 32 entry TLB, consider such
> a machine with 24MB of memory installed. Under Linux I can map
> the entire operating system (sans IO device register mappings
> and Lance Ethernet DMA) in 3 (count 'em, 3!) TLB entries. These
> 3 entries are enough to allow the kernel to access an arbitrary
> physical page from kernel space.
>
> Under Solaris, the OS would need 3 + (24MB / 256K) + (24MB / 4K)
> TLB entries to map this same amount of space. For a great many
> number of operations, it is quite easy for an OS with this page
> table strategy to blow the entire user context out of the
> hardware TLB. Which in turn means more processor stalls (in
> fact many) for both the user level processes and the operating
> system.
>
> Result? Severe degradation in performance for the latter
> scheme.
>
>3) Every BSD and SVR4 based system today, except for Linux, has a very
> broken System call mechanism.
>
> You'd think that when people put together function call conventions
> for a particular processor, the OS people would take a look at this
> and find a way to take advantage of this. In fact, believe it or
> not, they have not to this very day.
>
> Linux from day one, takes advantage of the procedure call
> conventions on a particular architecture so that it can process
> system calls in the most expediant way possible. I will give
> an example on the Sparc to prove this:
>
> Consider your average 3 argument system call. The user level
> code does something like this:
>
> mov %arg0, %o0
> mov %arg1, %o1
> mov %arg2, %o3
> mov SYSTEM_CALL_NUMBER, %g1
> t SYSCALL_TRAP
>
> At this point control reaches the operating system, it must
> prepare to handle this request from the user. On the Sparc, this
> is either a two step or a three step process depending upon
> whether you are doing it in the traditional broken UNIX way or the
> clean, fast, and superior Linux way. First I will show the Linux
> method for doing this:
>
> 1) Step one, jump onto the kernel stack for this task
> and make sure the kernel has a register window to
> operate in safely.
>
> For Linux the code path for this runs at ~18 instructions
> for the common case (the kernel already has a valid
> register to use so now saving needs to be done). It runs
> at ~42 instructions for the second most common case (the
> kernels needs to allocate a new register window and the
> user has a valid stack pointer) and ~82 instructions for
> the least common case (kernel needs a window, user has
> an invalid stack pointer, thus the kernel needs to save
> the user's window into a special per-task save area).
>
> 2) Take the system call number, check for valid value, use
> this to offset into a table of system call function ptrs.
> Move arguments into place and perform the syscall.
>
> Basically this is a simple operations a looks something
> like:
>
> sll %g1, 2, %l4 ! produce offset
> ld [%l7 + %g1], %l7 ! syscall ptr base was in %l7
> SAVE_ALL ! perform step #1 above
> mov %i0, %o0
> mov %i1, %o1
> mov %i2, %o2
> mov %i3, %o3
> mov %i4, %o4
> jmp %l7, %o7
> mov %i5, %o5
>
> That is it, that is the entire system call under Linux.
>
> Under Solaris/SunOS things are wildly different. Step one is
> basically the same, but step 2 is disgustingly inefficient for
> those systems. Basically they do:
>
> 2) Call common system_call() C function.
>
> 3) This routine allocates a "system call argument package"
> structure on the kernel stack. This is wasteful because
> we already have all of this information in registers or
> in guarenteed save areas.
>
> 4) Then this routine determines the function to call, and
> passed this "package" of arguments to the routine.
>
> 5) Every system call which expects arguments then must
> "unpack" this structure to get at the copy of the arguments
> again highly inefficient.
>
> For every system call the system performs, you eat this unnecessary
> overhead under SunOS/Solaris, under Linux only the bare minimum is
> performed to do the system call successfully.
>
>4) Solaris cannot even do it's own optimizations correctly because
> SunPRO is a broken compiler.
>
> I won't make such a statement without explaining this with real
> facts, here goes.
>
> A neat part of the Sparc ABI is that it leaves you with a few
> processor registers that the C compiler is not allowed to use
> in the code it produces. Two of which are "%g6" and "%g7".
> A problem in unix kernels is that you are constantly accessing
> the current tasks control structure ('proc' and 'uarea' on
> traditional UNIX's, the 'task_struct" under Linux). Hey, why
> not put these pieces of information in those "extra" registers
> and avoid the address computation all the time? Yes, very
> brilliant idea.
>
> Under Solaris the trap entry code places the uarea and proc ptr
> in %g6 and %g7. Under Linux the trap entry code places the
> current processes task_struct in %g6. Now here comes where the
> implementations differ.
>
> Under Solaris all of the so called "locore" (basically all the
> gook which has to be written in raw assembly) code can directly
> take advantage of this. However, the C code cannot do this
> because SunPRO lacks a way for you to tell the compiler that
> "hey you don't need to load things, it's already in these
> hard coded registers" So they have the C code call these little
> assembly stubs to get the values:
>
> get_uarea:
> retl
> mov %g6, %o0
>
> get_proc:
> retl
> mov %g7, %o0
>
> That is gross, why even do the optimization in the first place?
>
> Now GCC has a way to fully take advantage of such an optimization,
> basically all I have to do is put the following in a header file.
>
> register struct task_struct *current asm("g6");
>
> Tada, now GCC will fully understand what I have done for it.
> Under SparcLinux this optimization alone took away 115 instructions
> in the scheduler sources, and it took ~50 instructions out of the
> exit() handling, and it took ~65 instructions out of the fork()
> handling.
>
>So my question always is, in matters such as these. Who are these
>processor cycles for anyways, the kernel or the user? Think about
>this when you consider how much overhead is being saved from one OS to
>another, and to what scale this is occurring.
>
>I hope that explains some of it, and gives people at least some sort
>of idea of the kinds of things that makes Linux scream on just about
>any hardware. If people would like more explainations like the above,
>I'd be more than happy to chat with you via email about it or
>similar. I love talking about performance issues on various
>processors and systems.
>
>Oh, and one thing that has not been mentioned yet in this thread (and
>yes NetBSD/OpenBSD both have this as well, good work guys). That
>SparcLinux kernel that gets all of this incredible performance runs on
>both sun4c and sun4m machines. Sun Engineers way back when scratched
>their heads for months and couldn't figure out a way to pull it off
>(you need a seperate kernel image depending upon whether you are
>running on a sun4m or a sun4c, for SunOS/Solaris). And on top of that
>Linux obviously pulls it off efficiently.
>
>One final note. When you have to deal with SunSOFT to report a bug,
>how "important" do you have (ie. Fortune 500?) to be and how big of a
>customer do you have to be (multi million dollar purchases?) to get
>direct access to Sun's Engineers at Sun Quentin? With Linux, all you
>have to do is send me or one of the other SparcLinux hackers an email
>and we will attend to your bug in due time. We have too much pride in
>our system to ignore you and not fix the bug.
>
>David S. Miller
>davem@caip.rutgers.edu
>
Have you ever kissed a girl?
- Bryan
----------------------------------------------------------------------
Bryan Cantrill, Solaris Performance. bmc@eng.sun.com (415) 786-3652