Re: Avoiding OOM on overcommit...?

From: Jesse Pollard (pollard@tomcat.admin.navo.hpc.mil)
Date: Fri Mar 24 2000 - 16:21:40 EST

  • Next message: Jesse Pollard: "Re: Avoiding OOM on overcommit...?"

     Sandy Harris <sandy@storm.ca>:
    > Alan Cox wrote:
    > >
    > > > The explanation is not the issue. The issue is having programs that
    > > > act correctly (check malloc() returns, don't follow null pointers,
    > > > etc.) fail in ways that are quite difficult for their programmers
    > > > to prevent.
    > >
    > > Indeed. And its completely valid to kill them off. Right now nobody
    > > has a serious commercial requirement for non-overcommit on Linux.
    >
    > I'm seriously confused by this thread.
    >
    > If I understand correctly, programs can malloc() memory, get back a
    > valid pointer, then segfault later when they try to use the memory.
    > This strikes me as quite obviously Wrong.
    >
    > I can see no question about whether this is a problem. To me, it clearly
    > is, and a ghastly one at that. This breaks the basic behaviour of
    > important system calls, making memory allocation unreliable.
    >
    > It is not clear to me why anyone would implement overcommit in the
    > first place or, once the obvious problem is pointed out, why there
    > should be discussion of anything other than how to fix the blunder.
    >
    > On the other hand, I see several people, some of whom I know are
    > competent, defending the existing behaviour.
    >
    > Methinks not all those folk are clueless. More likely, I've missed
    > something basic here. Would someone please explain it?
    >
    > Would it make everyone happy to have tunable parameters for this?
    > Using R for RAM, S for swap, max Commitable is:
    >
    > C = aR + bS
    >
    > and sys admin sets a, b. Default is perhaps a = 1, b = .5 to keep
    > paging to reasonable levels, but you might have a = b = 10 on one
    > system to allow heavy overcommit and a = 1, b = 0 on another for
    > minimum swapping.

    This is a bit oversimplified - lets expand it somewhat:

       let p = number of concurrent processes permitted
           r = minimum guaranteed resident (in R) pages (resident set size quota)
           v = virtual pages that a user is permitted to use (virtual quota)

    Each user gets some value of v. On some systems r is a global limit
    that applies to all users.

    Let r be a global limit assigned to all processes - then:

            p = R/r (each process is guarantted this amount)

    p can now be considered a process quota for the entire system. When p gets
            exceeded you get the "can't fork - no process slots" message.

    Note: There are other formulas for this. p could be arbitrarily set, and
            r computed fro R/p. This has the problem that a process may be
            guaranteed only one page, which is an unusable system.

            A second note - since r is the guaranteed resident, the process
            may actually have more. If the system needs resident pages for
            a different (new) processs, then all processes exceeding this
            amount could be trimmed (paged out) to get resident pages for
            the new process. This does require periodic processing (daemon
            level, not kernel) to balance the resident pages fairly among
            all processes. This could be done with several ways - enhance
            CPU utilization, attempt to minimize page fault rate... different
            policies, different goals.

    Let k be the number of pages that root will use for daemons and kernel
            operation

    Let x be the current amount of memory given to users. This is computed
    by:
            x = sum (v of all users logged in) + k

    note: x <= R+S or there is an overcommit. I'll ignore mmap for
                            this discussion because this appears to affect
                            r and S (the users datafile could be considered an
                            extension of swap, the page mapping done affects the
                            number of resident pages).

    How do these new parameters fit in:
            r is a performance tool that reduces the page fault rate and hence
               a load on I/O. It also improves CPU utilization for a process, and
               general overhead.
            v is a user limit that must be shared among all processes belonging
               to the user. This imposes several problems, which is what the
               occasionally loud discussions are about.

    These additional parameters expand the formula to a closer-to-reality
    form used in large systems. None of this goes into how to implement them.
    Most systems have weight factors that are added in based on the user,
    the resource being delt with (resident vs swap may get different weights)
    CPU accounting, etc.

    The more information available to the kernel the easier it is to handle user
    processes. Some systems' linker (ld) put (or allow to be put) a request for
    an initial stack size. This allows exec to specificly allocate that amount
    of memory. If the users stack exceeds this the users program is aborted (not
    fun when it is the shell, but it happens).

    Memory is allocated to a process in sbrk, fork, exec, stack page fault.

    Most of the discussion is about fork, with references to stack page fault.

    A fork makes a copy of a processes page table for the new process. At the
    same time, each data page is marked for copy-on-write (COW). Copy on write
    is a performance optimization that eliminates the overhead of having to
    copy each page (even those stored in swap) of a potentially large program.
    The symantics for fork still call for the copy, hence the copy on write,
    a read of a page is equivalent, no matter which process does it, therefore
    it doesn't need to be copied on read. If the page is modified, then a new
    physical resident page must be allocated to the page table of whichever
    process (parent or child).

    NOTE: the fork provides the appearance of a duplicate of the entire process
    AND the assumption that any or all pages in the new process may be written
    to (minus the executable text). If the child or parent process returns from
    the routine that executed the fork and proceeds to call a different function,
    the stack is modified. Since it is part of the COW pages, it must be
    duplicated. This calls for another change in resident set, for one or the other
    process (which ever did it first). In all cases the old page is no longer
    marked COW, and new page is a copy of the old one.

    IF the fork subtracts the total number of pages (size of the parent) from
    the current value of v (users quota) then the resulting value of v may be < 0.
    This would abort the users process.

    This is the simple method of handling user memory quotas, and the most
    painfull.

    The objections are:
      1. the system may not be out of resources
      2. the users process could execute exec, which completely replaces the
         memory copied with a new executable image, then the entire allocation
         for the duplicate is unnecessary.
      3. The new process could write data out and exit. No modification, but access
         some memory. This could cause the parent to duplicate the pages since
         the parent may be updating them (it doesn't stop just because it did
         a fork).

    This was the origin of the "vfork" branch. The BSD vfork did a fork, but
    put the parent process to sleep until the child process terminated, or did
    an exec. This made COW unnecessary, and much closer to a thread - except that
    a thread has a separate stack, and is still tied into its' parent process.

    I consider objection 2 and 3 to be valid, and want some way to alleviate them.
    I think #1 is irrelevent because management has already informed the user
    (via quota v) that More Will Not Be Given. (takes an act of management to
    change that decision).

    The one I suggested:

            let the child fork, but have a limit of 5-10 pages. IF the child:
            terminates then resources are released
            does an exec then new memory limits are generated.
            IF the child uses up the minimal resource, then full resources
                    must be allocated, and the potential for v < 0 occurs again.

    This doesn't really work since the parent process could start modifying
    data pages immediately. The COW pages would/could be duplicated very quickly
    exhausting the 5-10 page limit. If the pages weren't marked COW, then the
    parent would be able to modify pages out from under the child (and vice versa).
    This is more like a thread and is not appropriate for most uses of fork.

    One system I ran across (I think it was HP, but perhaps not) tried to solve
    this by putting the new process at a significantly higher priority than the
    parent process. This had several advantages:

    1. The new process started processing very quickly - its' time slice would
       give it enough time to exec a new image quickly (at which time the
       priority dropped back down).
    2. The delay on the parent process due to the priority, prevented it from
       modifying data out from under the child.

    This works on a uniprocessor system, but less well on multi-cpu system. The
    second CPU might take the parent process and start executing it. Now it is
    back to the same case as original.

    This approach might still work IF:
     a. the scheduler knew not to resume a parent process for one time slice after
        a fork (what is one time slice for instance, the fork may happen right at
        the end of the parents' slice
     b. the child still needs a 5-10 page minimal quota (for stack) and the rest
        of the conditions from the suggestion.
     c. the parents' delay is canceled if the child terminates or executes an
        exec.

    Problem here is that the parent could end up waiting for almost 2 time
    slices.

    Possible workarounds for the 2 timeslice delay would be:

    a. do scheduling in units of 1/2 the current amount. This would allow
       the fork to set a 1/2 slice delay, which at worst case would be one
       time slice.
    b. don't know - there must be one.

    None of the current solutions look really good. They CAN guarantee the
    system doesn't go OOM, simply because the total draconian limits do
    not allow it. Users would go out of resources while the system did have
    them.

    The determination of which is better is, in the end, up to the owners
    of the system.

    I want the option.

    BTW, I beleve the "commercial requirement" is just having someone
    offer to pay for it via a support and development contract. (I can't
    do that, unfortunately)
    ---------------------------------------------------------------------
    Jesse I Pollard, II
    Email: pollard@navo.hpc.mil

    Any opinions expressed are solely my own.

    -
    To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
    the body of a message to majordomo@vger.rutgers.edu
    Please read the FAQ at http://www.tux.org/lkml/



    This archive was generated by hypermail 2b29 : Fri Mar 24 2000 - 16:42:38 EST