openwrt/staging/blogic.git
Brice Goglin [Sun, 19 Oct 2008 03:27:17 +0000 (20:27 -0700)]
mm: extract do_pages_move() out of sys_move_pages()

To prepare the chunking, move the sys_move_pages() code that is used when
nodes!=NULL into do_pages_move().  And rename do_move_pages() into
do_move_page_to_node_array().

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Brice Goglin [Sun, 19 Oct 2008 03:27:16 +0000 (20:27 -0700)]
mm: don't vmalloc a huge page_to_node array for do_pages_stat()

do_pages_stat() does not actually need any page_to_node entries.  Just pass
the pointers to the user-space page address array and to the user-space
status array, and have do_pages_stat() traverse the former and fill the
latter directly.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Brice Goglin [Sun, 19 Oct 2008 03:27:15 +0000 (20:27 -0700)]
mm: stop returning -ENOENT from sys_move_pages() if nothing got migrated

A patchset reworking sys_move_pages().  It removes the possibly large
vmalloc by using multiple chunks when migrating large buffers.  It also
dramatically increases the throughput for large buffers since the lookup
in new_page_node() is now limited to a single chunk, so the quadratic
complexity has a much smaller impact.  There is no need to use any
radix-tree-like structure to improve this lookup.
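
To illustrate why limiting the lookup to one chunk helps, here is a rough cost model, not the kernel code.  It assumes each page is located by a linear scan over the entries of its own chunk; lookup_steps() and its parameters are hypothetical names, and the chunk size of 256 entries used below is only an example value.

```c
#include <stddef.h>

/*
 * Model only: count the entries examined when each of n pages is found
 * by a linear scan that never leaves its own chunk.  Passing chunk == n
 * models the unchunked case.
 */
unsigned long long lookup_steps(size_t n, size_t chunk)
{
	unsigned long long steps = 0;
	size_t done = 0;

	if (chunk == 0)
		return 0;	/* avoid looping forever on a bad chunk size */

	while (done < n) {
		size_t len = (n - done < chunk) ? n - done : chunk;

		/* the i-th page of a chunk is found after i comparisons */
		steps += (unsigned long long)len * (len + 1) / 2;
		done += len;
	}
	return steps;
}
```

Under this model a 40MB buffer (10240 pages, assuming 4kB pages) costs about 52 million comparisons unchunked, and about 1.3 million with 256-entry chunks.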

sys_move_pages() duration on a 4-quadcore-opteron 2347HE (1.9GHz),
migrating between nodes #2 and #3:

length    move_pages (us)    move_pages+patch (us)
4kB       126                98
40kB      198                168
400kB     963                937
4MB       12503              11930
40MB      246867             11848

Patches #1 and #4 are the important ones:
1) stop returning -ENOENT from sys_move_pages() if nothing got migrated
2) don't vmalloc a huge page_to_node array for do_pages_stat()
3) extract do_pages_move() out of sys_move_pages()
4) rework do_pages_move() to work on page_sized chunks
5) move_pages: no need to set pp->page to ZERO_PAGE(0) by default

This patch:

There is no point in returning -ENOENT from sys_move_pages() if all pages
were already on the right node, while we return 0 if only 1 page was not.
Most applications don't know where their pages are allocated, so it's not
an error to try to migrate them anyway.

Just return 0 and let the status array in user-space be checked if the
application needs details.
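
As a sketch of what "checking the status array" can look like in user-space, the helper below walks the array after the call has returned 0.  It assumes the usual move_pages() convention that each status entry is either a node number (>= 0) or a negative errno; struct move_summary and summarize_status() are hypothetical names, not part of any API.

```c
#include <stddef.h>

struct move_summary {
	size_t on_target;	/* status equals the requested node */
	size_t elsewhere;	/* status is some other node (>= 0) */
	size_t failed;		/* status is a negative errno */
};

/* Classify every entry of a status array filled in by move_pages(). */
struct move_summary summarize_status(const int *status, size_t count,
				     int target)
{
	struct move_summary s = { 0, 0, 0 };
	size_t i;

	for (i = 0; i < count; i++) {
		if (status[i] < 0)
			s.failed++;
		else if (status[i] == target)
			s.on_target++;
		else
			s.elsewhere++;
	}
	return s;
}
```

An application that only cares whether everything ended up on the target node can compare on_target against the page count and ignore the rest.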

It will make the upcoming chunked-move_pages() support much easier.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nathan Fontenot [Sun, 19 Oct 2008 03:27:14 +0000 (20:27 -0700)]
memory hotplug: release memory regions in PAGES_PER_SECTION chunks

During hotplug memory remove, memory regions should be released in
PAGES_PER_SECTION size chunks.  This mirrors the code in add_memory where
resources are requested on a PAGES_PER_SECTION size.

Attempting to release the entire memory region fails because there is not
a single resource for the total number of pages being removed.  Instead
the resources for the pages are split in PAGES_PER_SECTION size chunks as
requested during memory add.
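
The splitting can be sketched in user-space as follows.  This is only a model of the loop structure: split_into_sections() is a hypothetical helper that records where each chunk would start, it assumes the region is section aligned, and the section size is passed in rather than taken from the kernel's PAGES_PER_SECTION.

```c
#include <stddef.h>

/*
 * Record the starting pfn of every section-sized chunk in
 * [start_pfn, start_pfn + nr_pages).  Returns the number of chunks
 * written to starts[], at most max_starts.
 */
size_t split_into_sections(unsigned long start_pfn, unsigned long nr_pages,
			   unsigned long pages_per_section,
			   unsigned long *starts, size_t max_starts)
{
	size_t n = 0;
	unsigned long off;

	if (pages_per_section == 0)
		return 0;

	/* one release per section, mirroring one request per section */
	for (off = 0; off < nr_pages && n < max_starts;
	     off += pages_per_section)
		starts[n++] = start_pfn + off;

	return n;
}
```

Each recorded start corresponds to one resource that was requested at add time, which is why a single release for the whole range cannot match.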

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrea Righi [Sun, 19 Oct 2008 03:27:13 +0000 (20:27 -0700)]
documentation: clarify dirty_ratio and dirty_background_ratio description

The current documentation of dirty_ratio and dirty_background_ratio is a
bit misleading.

In the documentation we say that they are "a percentage of total system
memory", but the current page writeback policy instead applies the
percentages to the dirtyable memory, that is, free pages + reclaimable
pages.

It is better to state this explicitly.
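
A small worked example of the arithmetic, assuming dirtyable memory is simply free pages plus reclaimable pages as described above.  dirty_limit_pages() is a hypothetical helper for illustration, not the kernel's own calculation.

```c
/*
 * Apply a dirty ratio (in percent) to dirtyable memory, where
 * dirtyable = free pages + reclaimable pages.
 */
unsigned long dirty_limit_pages(unsigned long free_pages,
				unsigned long reclaimable_pages,
				unsigned int ratio_percent)
{
	unsigned long long dirtyable =
		(unsigned long long)free_pages + reclaimable_pages;

	return (unsigned long)(dirtyable * ratio_percent / 100);
}
```

With 1000 free pages, 3000 reclaimable pages and a ratio of 20, the limit is 800 pages, regardless of how much total memory the system has.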

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shaohua Li [Sun, 19 Oct 2008 03:27:12 +0000 (20:27 -0700)]
memory_probe: fix wrong sysfs file attribute

This attribute just has a write operation.

[akpm@linux-foundation.org: use S_IWUSR as suggested by Randy]
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gerald Schaefer [Sun, 19 Oct 2008 03:27:11 +0000 (20:27 -0700)]
setup_per_zone_pages_min(): take zone->lock instead of zone->lru_lock

This replaces zone->lru_lock in setup_per_zone_pages_min() with zone->lock.
There seems to be no need for the lru_lock anymore, but there is a need for
zone->lock instead, because that function may call move_freepages() via
setup_zone_migrate_reserve().

Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KOSAKI Motohiro [Sun, 19 Oct 2008 03:27:10 +0000 (20:27 -0700)]
hugepage: support ZERO_PAGE()

Presently hugepage doesn't use zero page at all because zero page is only
used for coredumping and hugepage can't core dump.

However we have now implemented hugepage coredumping.  Therefore we should
implement the zero page of hugepage.

Implementation note:

o Why do we only check VM_SHARED for zero page?
  A normal page is checked as follows:

static inline int use_zero_page(struct vm_area_struct *vma)
{
        if (vma->vm_flags & (VM_LOCKED | VM_SHARED))
                return 0;

        return !vma->vm_ops || !vma->vm_ops->fault;
}

First, hugepages are never mlock()ed.  We aren't concerned with VM_LOCKED.

Second, hugetlbfs is a pseudo filesystem, not a real filesystem and it
doesn't have any file backing.  Thus ops->fault checking is meaningless.

o Why don't we use zero page if !pte?

!pte indicates that {pud, pmd} doesn't exist or that some error happened.
So we shouldn't return zero page if any error occurred.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Kawai Hidehiro <hidehiro.kawai.ez@hitachi.com>
Cc: Mel Gorman <mel@skynet.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KOSAKI Motohiro [Sun, 19 Oct 2008 03:27:08 +0000 (20:27 -0700)]
coredump_filter: add hugepage dumping

Presently hugepage's vma has a VM_RESERVED flag in order not to be
swapped.  But a VM_RESERVED vma isn't core dumped because this flag is
often used for some kernel vmas (e.g.  vmalloc, sound related).

Thus hugepages are never dumped and can't be debugged easily.  Many
developers want hugepages to be included in the core dump.

However, we can't read a generic VM_RESERVED area because such an area is
often an IO mapping, and reading it may change device state.  That is
definitely an undesirable side effect.

So adding a hugepage specific bit to the coredump filter is better.  It
enables hugepage core dumping and doesn't cause any side effects on
any i/o devices.

In addition, libhugetlb uses hugetlb private mapping pages as anonymous
pages.  Therefore hugepage private mapping pages should be core dumped by
default.

As a result, /proc/[pid]/coredump_filter has two new bits.

 - bit 5 controls whether hugetlb private mapping pages are dumped. (default: yes)
 - bit 6 controls whether hugetlb shared mapping pages are dumped.  (default: no)
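
For reference, decoding the two new bits from a filter value can be sketched like this.  The macro and function names are made up for the example; only the bit positions (5 and 6) come from the description above.

```c
/* Bit positions as described above; the names are illustrative only. */
#define FILTER_HUGETLB_PRIVATE	(1UL << 5)
#define FILTER_HUGETLB_SHARED	(1UL << 6)

/* Nonzero if hugetlb private mappings would be dumped. */
int dumps_hugetlb_private(unsigned long filter)
{
	return (filter & FILTER_HUGETLB_PRIVATE) != 0;
}

/* Nonzero if hugetlb shared mappings would be dumped. */
int dumps_hugetlb_shared(unsigned long filter)
{
	return (filter & FILTER_HUGETLB_SHARED) != 0;
}
```

The value 0x43 used in the test has bits 0, 1 and 6 set, so it turns shared hugetlb dumping on and private hugetlb dumping off.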

I tested with the following method.

% ulimit -c unlimited
% ./crash_hugepage  50
% ./crash_hugepage  50  -p
% ls -lh
% gdb ./crash_hugepage core
%
% echo 0x43 > /proc/self/coredump_filter
% ./crash_hugepage  50
% ./crash_hugepage  50  -p
% ls -lh
% gdb ./crash_hugepage core

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>

#include "hugetlbfs.h"

int main(int argc, char** argv){
	char* p;
	int ch;
	int mmap_flags = MAP_SHARED;
	int fd;
	int nr_pages;

	/* -p selects a private mapping instead of the default shared one */
	while((ch = getopt(argc, argv, "p")) != -1) {
		switch (ch) {
		case 'p':
			mmap_flags &= ~MAP_SHARED;
			mmap_flags |= MAP_PRIVATE;
			break;
		default:
			/* nothing*/
			break;
		}
	}
	argc -= optind;
	argv += optind;

	if (argc == 0){
		printf("need # of pages\n");
		exit(1);
	}

	nr_pages = atoi(argv[0]);
	if (nr_pages < 2) {
		printf("nr_pages must be >= 2\n");
		exit(1);
	}

	/* map nr_pages hugepages from an unlinked hugetlbfs file */
	fd = hugetlbfs_unlinked_fd();
	p = mmap(NULL, nr_pages * gethugepagesize(),
		 PROT_READ|PROT_WRITE, mmap_flags, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}

	sleep(2);

	/* touch the second hugepage */
	*(p + gethugepagesize()) = 1; /* COW */
	sleep(2);

	/* crash! */
	*(int*)0 = 1;

	return 0;
}

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Kawai Hidehiro <hidehiro.kawai.ez@hitachi.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: William Irwin <wli@holomorphy.com>
Cc: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yinghai Lu [Sun, 19 Oct 2008 03:27:06 +0000 (20:27 -0700)]
mm: print out meminit for memmap

Improve debuggability of memory setup problems.

Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Harvey Harrison [Sun, 19 Oct 2008 03:27:06 +0000 (20:27 -0700)]
mm: hugetlb.c make functions static, use NULL rather than 0

mm/hugetlb.c:265:17: warning: symbol 'resv_map_alloc' was not declared. Should it be static?
mm/hugetlb.c:277:6: warning: symbol 'resv_map_release' was not declared. Should it be static?
mm/hugetlb.c:292:9: warning: Using plain integer as NULL pointer
mm/hugetlb.c:1750:5: warning: symbol 'unmap_ref_private' was not declared. Should it be static?

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
Acked-by: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Sun, 19 Oct 2008 03:27:03 +0000 (20:27 -0700)]
mm: rewrite vmap layer

Rewrite the vmap allocator to use rbtrees and lazy tlb flushing, and
provide a fast, scalable percpu frontend for small vmaps (requires a
slightly different API, though).

The biggest problem with vmap is actually vunmap.  Presently this requires
a global kernel TLB flush, which on most architectures is a broadcast IPI
to all CPUs to flush the cache.  This is all done under a global lock.  As
the number of CPUs increases, so will the number of vunmaps a scaled
workload will want to perform, and so will the cost of a global TLB flush.
 This gives terrible quadratic scalability characteristics.

Another problem is that the entire vmap subsystem works under a single
lock.  It is a rwlock, but it is actually taken for write in all the fast
paths, and the read locking would likely never be run concurrently anyway,
so it's just pointless.

This is a rewrite of vmap subsystem to solve those problems.  The existing
vmalloc API is implemented on top of the rewritten subsystem.

The TLB flushing problem is solved by using lazy TLB unmapping.  vmap
addresses do not have to be flushed immediately when they are vunmapped,
because the kernel will not reuse them again (would be a use-after-free)
until they are reallocated.  So the addresses aren't allocated again until
a subsequent TLB flush.  A single TLB flush then can flush multiple
vunmaps from each CPU.
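
The batching idea can be modelled in a few lines of user-space C.  This is a toy model, not the allocator: struct lazy_vunmap, lazy_vunmap_page_range() and the max_pending threshold are all assumptions made for the example, and the "flush" is just a counter.

```c
/* Toy model of lazy TLB flushing: unmapped pages pile up until a limit. */
struct lazy_vunmap {
	unsigned long pending_pages;	/* unmapped but not yet flushed */
	unsigned long max_pending;	/* flush once this many are pending */
	unsigned long flushes;		/* number of (simulated) TLB flushes */
};

/* Record an unmap of nr_pages; flush only when enough has accumulated. */
void lazy_vunmap_page_range(struct lazy_vunmap *lz, unsigned long nr_pages)
{
	lz->pending_pages += nr_pages;
	if (lz->pending_pages >= lz->max_pending) {
		/* one flush covers every vunmap batched so far */
		lz->flushes++;
		lz->pending_pages = 0;
	}
}
```

With a limit of 64 pages, 100 unmaps of 4 pages each cost 6 flushes in this model instead of 100, and the addresses still pending are simply not handed out again until the next flush.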

XEN and PAT and such do not like deferred TLB flushing because they can't
always handle multiple aliasing virtual addresses to a physical address.
They now call vm_unmap_aliases() in order to flush any deferred mappings.
That call is very expensive (well, actually not a lot more expensive than
a single vunmap under the old scheme), however it should be OK if not
called too often.

The virtual memory extent information is stored in an rbtree rather than a
linked list to improve the algorithmic scalability.

There is a per-CPU allocator for small vmaps, which amortizes or avoids
global locking.

To use the per-CPU interface, the vm_map_ram / vm_unmap_ram interfaces
must be used in place of vmap and vunmap.  Vmalloc does not use these
interfaces at the moment, so it will not be quite so scalable (although it
will use lazy TLB flushing).

As a quick test of performance, I ran a test that loops in the kernel,
linearly mapping then touching then unmapping 4 pages.  Different numbers
of tests were run in parallel on a 4 core, 2 socket Opteron.  Results are
in nanoseconds per map+touch+unmap.

threads           vanilla         vmap rewrite
1                 14700           2900
2                 33600           3000
4                 49500           2800
8                 70631           2900

So with 8 cores, the rewritten version is already 25x faster.

In a slightly more realistic test (although with an older and less
scalable version of the patch), I ripped the not-very-good vunmap batching
code out of XFS, and implemented the large buffer mapping with vm_map_ram
and vm_unmap_ram...  along with a couple of other tricks, I was able to
speed up a large directory workload by 20x on a 64 CPU system.  I believe
vmap/vunmap is actually sped up a lot more than 20x on such a system, but
I'm running into other locks now.  vmap is pretty well blown off the
profiles.

Before:
1352059 total                                      0.1401
798784 _write_lock                              8320.6667 <- vmlist_lock
529313 default_idle                             1181.5022
 15242 smp_call_function                         15.8771  <- vmap tlb flushing
  2472 __get_vm_area_node                         1.9312  <- vmap
  1762 remove_vm_area                             4.5885  <- vunmap
   316 map_vm_area                                0.2297  <- vmap
   312 kfree                                      0.1950
   300 _spin_lock                                 3.1250
   252 sn_send_IPI_phys                           0.4375  <- tlb flushing
   238 vmap                                       0.8264  <- vmap
   216 find_lock_page                             0.5192
   196 find_next_bit                              0.3603
   136 sn2_send_IPI                               0.2024
   130 pio_phys_write_mmr                         2.0312
   118 unmap_kernel_range                         0.1229

After:
 78406 total                                      0.0081
 40053 default_idle                              89.4040
 33576 ia64_spinlock_contention                 349.7500
  1650 _spin_lock                                17.1875
   319 __reg_op                                   0.5538
   281 _atomic_dec_and_lock                       1.0977
   153 mutex_unlock                               1.5938
   123 iget_locked                                0.1671
   117 xfs_dir_lookup                             0.1662
   117 dput                                       0.1406
   114 xfs_iget_core                              0.0268
    92 xfs_da_hashname                            0.1917
    75 d_alloc                                    0.0670
    68 vmap_page_range                            0.0462 <- vmap
    58 kmem_cache_alloc                           0.0604
    57 memset                                     0.0540
    52 rb_next                                    0.1625
    50 __copy_user                                0.0208
    49 bitmap_find_free_region                    0.2188 <- vmap
    46 ia64_sn_udelay                             0.1106
    45 find_inode_fast                            0.1406
    42 memcmp                                     0.2188
    42 finish_task_switch                         0.1094
    42 __d_lookup                                 0.0410
    40 radix_tree_lookup_slot                     0.1250
    37 _spin_unlock_irqrestore                    0.3854
    36 xfs_bmapi                                  0.0050
    36 kmem_cache_free                            0.0256
    35 xfs_vn_getattr                             0.0322
    34 radix_tree_lookup                          0.1062
    33 __link_path_walk                           0.0035
    31 xfs_da_do_buf                              0.0091
    30 _xfs_buf_find                              0.0204
    28 find_get_page                              0.0875
    27 xfs_iread                                  0.0241
    27 __strncpy_from_user                        0.2812
    26 _xfs_buf_initialize                        0.0406
    24 _xfs_buf_lookup_pages                      0.0179
    24 vunmap_page_range                          0.0250 <- vunmap
    23 find_lock_page                             0.0799
    22 vm_map_ram                                 0.0087 <- vmap
    20 kfree                                      0.0125
    19 put_page                                   0.0330
    18 __kmalloc                                  0.0176
    17 xfs_da_node_lookup_int                     0.0086
    17 _read_lock                                 0.0885
    17 page_waitqueue                             0.0664

vmap has gone from being in the top 5 on the profiles and flushing the crap
out of all TLBs, to using less than 1% of kernel time.

[akpm@linux-foundation.org: cleanups, section fix]
[akpm@linux-foundation.org: fix build on alpha]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Krzysztof Helt <krzysztof.h1@poczta.fm>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Denys Vlasenko [Sun, 19 Oct 2008 03:27:01 +0000 (20:27 -0700)]
mmap.c: deinline a few functions

__vma_link_file and expand_downwards functions are not small, yet they
are marked inline.  They probably had one callsite sometime in the past,
but now they have more.  To prevent a similar situation, I also
deinlined expand_upwards, despite it having only one callsite.  Nowadays
gcc auto-inlines such static functions anyway.  In find_extend_vma, I
removed one extra level of indirection.

Patch is deliberately generated with -U $BIGNUM to make
it easier to see that functions are big.

Result:

# size */*/mmap.o */vmlinux
   text    data     bss     dec     hex filename
   9514     188      16    9718    25f6 0.org/mm/mmap.o
   9237     188      16    9441    24e1 deinline/mm/mmap.o
6124402  858996  389480 7372878  70804e 0.org/vmlinux
6124113  858996  389480 7372589  707f2d deinline/vmlinux

Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Sun, 19 Oct 2008 03:27:00 +0000 (20:27 -0700)]
fs: buffer lock use lock bitops

trylock_buffer and unlock_buffer open and close a critical section.
Hence, we can use the lock bitops to get the desired memory ordering.
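
For readers unfamiliar with the ordering in question, here is a user-space analogue written with C11 atomics.  It is not the kernel's bitops implementation; trylock_bit() and unlock_bit() are hypothetical names, and the point is only that taking the lock needs acquire ordering and dropping it needs release ordering, nothing stronger.

```c
#include <stdatomic.h>

#define LOCK_BIT 0x1UL

/* Try to take the bit lock.  Returns 1 on success, 0 if already held. */
int trylock_bit(atomic_ulong *word)
{
	/* acquire: later accesses cannot move before the lock is taken */
	unsigned long old = atomic_fetch_or_explicit(word, LOCK_BIT,
						     memory_order_acquire);

	return !(old & LOCK_BIT);
}

/* Drop the bit lock. */
void unlock_bit(atomic_ulong *word)
{
	/* release: earlier accesses cannot move past the unlock */
	atomic_fetch_and_explicit(word, ~LOCK_BIT, memory_order_release);
}
```

Anything done between a successful trylock_bit() and unlock_bit() stays inside the critical section, which is the ordering the buffer lock needs.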

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Sun, 19 Oct 2008 03:26:59 +0000 (20:26 -0700)]
mm: page lock use lock bitops

trylock_page, unlock_page open and close a critical section. Hence,
we can use the lock bitops to get the desired memory ordering.

Also, mark trylock as likely to succeed (and remove the annotation from
callers).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Sun, 19 Oct 2008 03:26:58 +0000 (20:26 -0700)]
mm: unlockless reclaim

unlock_page is fairly expensive.  It can be avoided in page reclaim
success path.  By definition if we have any other references to the page
it would be a bug anyway.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Sun, 19 Oct 2008 03:26:57 +0000 (20:26 -0700)]
mm: pagecache insertion fewer atomics

Setting and clearing the page locked when inserting it into swapcache /
pagecache when it has no other references can use non-atomic page flags
operations because no other CPU may be operating on it at this time.

This saves one atomic operation when inserting a page into pagecache.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lee Schermerhorn [Sun, 19 Oct 2008 03:26:56 +0000 (20:26 -0700)]
mlock: make mlock error return Posixly Correct

Rework Posix error return for mlock().

Posix requires specific error codes from the mlock*() system calls for some
conditions, and these differ from what low level kernel functions, such as
get_user_pages(), return for those conditions.  For more info, see:

http://marc.info/?l=linux-kernel&m=121750892930775&w=2

This patch provides the same translation of get_user_pages()
error codes to posix specified error codes in the context
of the mlock rework for unevictable lru.
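
A sketch of what such a translation can look like.  The exact mapping is an assumption here, not quoted from the patch: it assumes that a fault on an invalid address (-EFAULT) should be reported as -ENOMEM, and that running out of memory while locking (-ENOMEM) should be reported as -EAGAIN, which is how Posix describes those two conditions for mlock().  mlock_posix_error() is a hypothetical name.

```c
#include <errno.h>

/*
 * Translate a low level error code into the one Posix expects from
 * mlock().  The mapping below is an assumption made for illustration.
 */
long mlock_posix_error(long retval)
{
	if (retval == -EFAULT)
		return -ENOMEM;		/* range not fully mapped */
	if (retval == -ENOMEM)
		return -EAGAIN;		/* could not lock right now */
	return retval;			/* everything else passes through */
}
```

Keeping the translation in an mlock specific helper leaves other users of the low level functions with the original error codes.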

[akpm@linux-foundation.org: fix build]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lee Schermerhorn [Sun, 19 Oct 2008 03:26:56 +0000 (20:26 -0700)]
mlock: revert mainline handling of mlock error return

This change is intended to make mlock() error returns correct.
make_page_present() is a lower level function used by more than mlock().
Subsequent patch[es] will add this error return fixup in an mlock specific
path.

Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Sun, 19 Oct 2008 03:26:55 +0000 (20:26 -0700)]
vmscan: don't accumulate scan pressure on unrelated lists

During each reclaim scan we accumulate scan pressure on unrelated lists
which will result in bogus scans and unwanted reclaims eventually.

Scanning lists with few reclaim candidates results in a lot of rotation
and therefore also disturbs the list balancing, putting even more
pressure on the wrong lists.

In a test-case with much streaming IO, and therefore a crowded inactive
file page list, swapping started because

  a) anon pages were reclaimed after swap_cluster_max reclaim
  invocations -- nr_scan of this list has just accumulated

  b) active file pages were scanned because *their* nr_scan has also
  accumulated through the same logic.  And this in return created a
  lot of rotation for file pages and resulted in a decrease of file
  list priority, again increasing the pressure on anon pages.

The result was an evicted working set of anon pages while there were
tons of inactive file pages that should have been taken instead.

Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
KOSAKI Motohiro [Sun, 19 Oct 2008 03:26:54 +0000 (20:26 -0700)]
vmscan: kill unused lru functions

Several LRU manipulation functions are no longer used, so they can be
removed.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lee Schermerhorn [Sun, 19 Oct 2008 03:26:53 +0000 (20:26 -0700)]
mlock: count attempts to free mlocked page

Allow free of mlock()ed pages.  This shouldn't happen, but during
development, it occasionally did.

This patch allows us to survive that condition, while keeping the
statistics and events correct for debug.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lee Schermerhorn [Sun, 19 Oct 2008 03:26:53 +0000 (20:26 -0700)]
vmscan: unevictable LRU scan sysctl

This patch adds a function to scan individual or all zones' unevictable
lists and move any pages that have become evictable onto the respective
zone's inactive list, where shrink_inactive_list() will deal with them.

Adds a sysctl to scan all nodes, and per node attributes to scan
individual nodes' zones.

Kosaki: If an evictable page is found on the unevictable lru when writing
to /proc/sys/vm/scan_unevictable_pages, print the filename and file offset
of these pages.

[akpm@linux-foundation.org: fix one CONFIG_MMU=n build error]
[kosaki.motohiro@jp.fujitsu.com: adapt vmscan-unevictable-lru-scan-sysctl.patch to new sysfs API]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lee Schermerhorn [Sun, 19 Oct 2008 03:26:52 +0000 (20:26 -0700)]
swap: cull unevictable pages in fault path

In the fault paths that install new anonymous pages, check whether the
page is evictable or not using lru_cache_add_active_or_unevictable().  If
the page is evictable, just add it to the active lru list [via the pagevec
cache], else add it to the unevictable list.

This "proactive" culling in the fault path mimics the handling of mlocked
pages in Nick Piggin's series to keep mlocked pages off the lru lists.

Notes:

1) This patch is optional--e.g., if one is concerned about the
   additional test in the fault path.  We can defer the moving of
   nonreclaimable pages until when vmscan [shrink_*_list()]
   encounters them.  Vmscan will only need to handle such pages
   once, but if there are a lot of them it could impact system
   performance.

2) The 'vma' argument to page_evictable() is required to notice that
   we're faulting a page into an mlock()ed vma w/o having to scan the
   page's rmap in the fault path.   Culling mlock()ed anon pages is
   currently the only reason for this patch.

3) We can't cull swap pages in read_swap_cache_async() because the
   vma argument doesn't necessarily correspond to the swap cache
   offset passed in by swapin_readahead().  This could [did!] result
   in mlocking pages in non-VM_LOCKED vmas if [when] we tried to
   cull in this path.

4) Move set_pte_at() to after where we add page to lru to keep it
   hidden from other tasks that might walk the page table.
   We already do it in this order in do_anonymous_page().  And,
   these are COW'd anon pages.  Is this safe?

[riel@redhat.com: undo an overzealous code cleanup]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Nick Piggin [Sun, 19 Oct 2008 03:26:51 +0000 (20:26 -0700)]
vmstat: mlocked pages statistics

Add NR_MLOCK zone page state, which provides a (conservative) count of
mlocked pages (actually, the number of mlocked pages moved off the LRU).

Reworked by lts to fit in with the modified mlock page support in the
Reclaim Scalability series.

[kosaki.motohiro@jp.fujitsu.com: fix incorrect Mlocked field of /proc/meminfo]
[lee.schermerhorn@hp.com: mlocked-pages: add event counting with statistics]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rik van Riel [Sun, 19 Oct 2008 03:26:50 +0000 (20:26 -0700)]
mmap: handle mlocked pages during map, remap, unmap

Originally by Nick Piggin <npiggin@suse.de>

Remove mlocked pages from the LRU using "unevictable infrastructure"
during mmap(), munmap(), mremap() and truncate().  Try to move back to
normal LRU lists on munmap() when last mlocked mapping removed.  Remove
PageMlocked() status when page truncated from file.

[akpm@linux-foundation.org: cleanup]
[kamezawa.hiroyu@jp.fujitsu.com: fix double unlock_page()]
[kosaki.motohiro@jp.fujitsu.com: split LRU: munlock rework]
[lee.schermerhorn@hp.com: mlock: fix __mlock_vma_pages_range comment block]
[akpm@linux-foundation.org: remove bogus kerneldoc token]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamewzawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lee Schermerhorn [Sun, 19 Oct 2008 03:26:49 +0000 (20:26 -0700)]
mlock: downgrade mmap sem while populating mlocked regions

We need to hold the mmap_sem for write to initiate mlock()/munlock()
because we may need to merge/split vmas.  However, this can lead to very
long lock hold times attempting to fault in a large memory region to mlock
it into memory.  This can hold off other faults against the mm
[multithreaded tasks] and other scans of the mm, such as via /proc.  To
alleviate this, downgrade the mmap_sem to read mode during the population
of the region for locking.  This is especially the case if we need to
reclaim memory to lock down the region.  We [probably?] don't need to do
this for unlocking as all of the pages should be resident--they're already
mlocked.

Now, the callers of the mlock functions [mlock_fixup() and
mlock_vma_pages_range()] expect the mmap_sem to be returned in write mode.
Changing all callers appears to be way too much effort at this point.
So, restore write mode before returning.  Note that this opens a window
where the mmap list could change in a multithreaded process.  So, at least
for mlock_fixup(), where we could be called in a loop over multiple vmas,
we check that a vma still exists at the start address and that vma still
covers the page range [start,end).  If not, we return an error, -EAGAIN,
and let the caller deal with it.
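
A minimal sketch of the revalidation described above, written as plain
userspace C.  The struct and function names are hypothetical stand-ins and
not the kernel's own; only the [start,end) containment check and the
-EAGAIN return follow the text.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's vma structure. */
struct vma_sketch {
	unsigned long vm_start;	/* first address covered by the vma */
	unsigned long vm_end;	/* first address past the end of the vma */
};

/*
 * Called after write mode has been restored on the mmap_sem.  'vma' is
 * the result of a fresh lookup at 'start' (NULL if nothing was found).
 * Returns 0 if the vma still covers [start, end), -EAGAIN otherwise.
 */
static int revalidate_range_sketch(const struct vma_sketch *vma,
				   unsigned long start, unsigned long end)
{
	if (vma == NULL)
		return -EAGAIN;		/* the vma disappeared */
	if (start < vma->vm_start || end > vma->vm_end)
		return -EAGAIN;		/* the vma shrank or was split */
	return 0;
}
```

The caller decides what to do with -EAGAIN; the sketch only reports that
the range is no longer contained in a single vma.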

Return -EAGAIN from mlock_vma_pages_range() function and mlock_fixup() if
the vma at 'start' disappears or changes so that the page range
[start,end) is no longer contained in the vma.  Again, let the caller deal
with it.  Looks like only sys_remap_file_pages() [via mmap_region()]
should actually care.

With this patch, I no longer see processes like ps(1) blocked for seconds
or minutes at a time waiting for a large [multiple gigabyte] region to be
locked down.  However, I occasionally see delays while unlocking or
unmapping a large mlocked region.  Should we also downgrade the mmap_sem
for the unlock path?

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago doc: unevictable LRU and mlocked pages documentation
Lee Schermerhorn [Sun, 19 Oct 2008 03:26:47 +0000 (20:26 -0700)]
doc: unevictable LRU and mlocked pages documentation

Documentation for unevictable lru list and its usage.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago mlock: mlocked pages are unevictable
Nick Piggin [Sun, 19 Oct 2008 03:26:44 +0000 (20:26 -0700)]
mlock: mlocked pages are unevictable

Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.

This is achieved through various strategies:

1) add yet another page flag--PG_mlocked--to indicate that
   the page is locked for efficient testing in vmscan and,
   optionally, fault path.  This allows early culling of
   unevictable pages, preventing them from getting to
   page_referenced()/try_to_unmap().  Also allows separate
   accounting of mlock'd pages, as Nick's original patch
   did.

   Note:  Nick's original mlock patch used a PG_mlocked
   flag.  I had removed this in favor of the PG_unevictable
   flag + an mlock_count [new page struct member].  I
   restored the PG_mlocked flag to eliminate the new
   count field.

2) add the mlock/unevictable infrastructure to mm/mlock.c,
   with internal APIs in mm/internal.h.  This is a rework
   of Nick's original patch to these files, taking into
   account that mlocked pages are now kept on unevictable
   LRU list.

3) update vmscan.c:page_evictable() to check PageMlocked()
   and, if vma passed in, the vm_flags.  Note that the vma
   will only be passed in for new pages in the fault path;
   and then only if the "cull unevictable pages in fault
   path" patch is included.

4) add try_to_unlock() to rmap.c to walk a page's rmap and
   ClearPageMlocked() if no other vmas have it mlocked.
   Reuses as much of try_to_unmap() as possible.  This
   effectively replaces the use of one of the lru list links
   as an mlock count.  If this mechanism lets pages in mlocked
   vmas leak through w/o PG_mlocked set [I don't know that it
   does], we should catch them later in try_to_unmap().  One
   hopes this will be rare, as it will be relatively expensive.
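
As a rough illustration of item 4, the idea of "clear PG_mlocked only if no
other vma still has the page mlocked" can be sketched in userspace C.  The
types, the flag value and the function name below are assumptions made for
this sketch; the real try_to_unlock() walks the page's rmap instead of a
flat array.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag value, used only by this sketch. */
#define VM_LOCKED_SKETCH 0x2000ul

struct page_sketch {
	bool mlocked;	/* stands in for the PG_mlocked page flag */
};

/*
 * vm_flags[] holds the flags of every vma that still maps the page.
 * Returns true if the mlocked state was cleared.
 */
static bool try_to_unlock_sketch(struct page_sketch *page,
				 const unsigned long *vm_flags,
				 size_t nr_vmas)
{
	for (size_t i = 0; i < nr_vmas; i++) {
		if (vm_flags[i] & VM_LOCKED_SKETCH)
			return false;	/* another vma keeps it mlocked */
	}
	page->mlocked = false;
	return true;
}
```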

Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin <npiggin@suse.de>
splitlru: introduce __get_user_pages():

  The new munlock processing needs GUP_FLAGS_IGNORE_VMA_PERMISSIONS,
  because the current get_user_pages() cannot grab PROT_NONE pages and
  therefore PROT_NONE pages cannot be munlocked.

[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and several comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago SHM_LOCKED pages are unevictable
Lee Schermerhorn [Sun, 19 Oct 2008 03:26:43 +0000 (20:26 -0700)]
SHM_LOCKED pages are unevictable

Shmem segments locked into memory via shmctl(SHM_LOCK) should not be
kept on the normal LRU, since scanning them is a waste of time and might
throw off kswapd's balancing algorithms.  Place them on the unevictable
LRU list instead.

Use the AS_UNEVICTABLE flag to mark address_space of SHM_LOCKed shared
memory regions as unevictable.  Then these pages will be culled off the
normal LRU lists during vmscan.

Add new wrapper function to clear the mapping's unevictable state when/if
shared memory segment is munlocked.

Add 'scan_mapping_unevictable_page()' to mm/vmscan.c to scan all pages in
the shmem segment's mapping [struct address_space] for evictability now
that they're no longer locked.  If so, move them to the appropriate zone
lru list.

Changes depend on [CONFIG_]UNEVICTABLE_LRU.

[kosaki.motohiro@jp.fujitsu.com: revert shm change]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago Ramfs and Ram Disk pages are unevictable
Lee Schermerhorn [Sun, 19 Oct 2008 03:26:42 +0000 (20:26 -0700)]
Ramfs and Ram Disk pages are unevictable

Christoph Lameter pointed out that ram disk pages also clutter the LRU
lists.  When vmscan finds them dirty and tries to clean them, the ram disk
writeback function just redirties the page so that it goes back onto the
active list.  Round and round she goes...

With the ram disk driver [rd.c] replaced by the newer 'brd.c', this is no
longer the case, as ram disk pages are no longer maintained on the lru.
[This makes them unmigratable for defrag or memory hot remove, but that
can be addressed by a separate patch series.] However, the ramfs pages
behave like ram disk pages used to, so:

Define new address_space flag [shares address_space flags member with
mapping's gfp mask] to indicate that the address space contains all
unevictable pages.  This will provide for efficient testing of ramfs pages
in page_evictable().

Also provide wrapper functions to set/test the unevictable state to
minimize #ifdefs in ramfs driver and any other users of this facility.
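
A hedged sketch of what such a flag and its wrappers might look like when
the flag shares one word with the gfp mask.  The shift width, the bit
position and all names here are assumptions for illustration, not the
values used by the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed layout: low bits hold the gfp mask, flags live above them. */
#define GFP_BITS_SHIFT_SKETCH	20
#define GFP_MASK_SKETCH		((1ul << GFP_BITS_SHIFT_SKETCH) - 1)
#define AS_UNEVICTABLE_SKETCH	(1ul << (GFP_BITS_SHIFT_SKETCH + 3))

/* Hypothetical stand-in for struct address_space. */
struct mapping_sketch {
	unsigned long flags;	/* gfp mask and state flags share this word */
};

static void mapping_set_unevictable_sketch(struct mapping_sketch *m)
{
	m->flags |= AS_UNEVICTABLE_SKETCH;
}

static void mapping_clear_unevictable_sketch(struct mapping_sketch *m)
{
	m->flags &= ~AS_UNEVICTABLE_SKETCH;
}

/* A NULL mapping is treated as evictable in this sketch. */
static bool mapping_unevictable_sketch(const struct mapping_sketch *m)
{
	return m != NULL && (m->flags & AS_UNEVICTABLE_SKETCH) != 0;
}
```

Keeping the test to a single mask operation is what makes it cheap enough
for a hot path; the gfp bits in the low part of the word are never touched.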

Set the unevictable state on address_space structures for new ramfs
inodes.  Test the unevictable state in page_evictable() to cull
unevictable pages.

These changes depend on [CONFIG_]UNEVICTABLE_LRU.

[riel@redhat.com: undo the brd.c part]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Debugged-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago Unevictable LRU Page Statistics
Lee Schermerhorn [Sun, 19 Oct 2008 03:26:40 +0000 (20:26 -0700)]
Unevictable LRU Page Statistics

Report unevictable pages per zone and system wide.

Kosaki Motohiro added support for memory controller unevictable
statistics.

[riel@redhat.com: fix printk in show_free_areas()]
[akpm@linux-foundation.org: fix units in /proc/vmstats]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago unevictable lru: add event counting with statistics
Lee Schermerhorn [Sun, 19 Oct 2008 03:26:40 +0000 (20:26 -0700)]
unevictable lru: add event counting with statistics

Fix to unevictable-lru-page-statistics.patch

Add unevictable lru infrastructure vm events to the statistics patch.
Rename the "NORECL_" and "noreclaim_" symbols and text strings to
"UNEVICTABLE_" and "unevictable_", respectively.

Currently, both the infrastructure and the mlocked pages event are
added by a single patch later in the series.  This makes it difficult
to add or rework the incremental patches.  The events actually "belong"
with the stats, so pull them up to here.

Also, restore the event counting to putback_lru_page().  This was removed
from previous patch in series where it was "misplaced".  The actual events
weren't defined that early.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago Unevictable LRU Infrastructure
Lee Schermerhorn [Sun, 19 Oct 2008 03:26:39 +0000 (20:26 -0700)]
Unevictable LRU Infrastructure

When the system contains lots of mlocked or otherwise unevictable pages,
the pageout code (kswapd) can spend lots of time scanning over these
pages.  Worse still, the presence of lots of unevictable pages can confuse
kswapd into thinking that more aggressive pageout modes are required,
resulting in all kinds of bad behaviour.

Infrastructure to manage pages excluded from reclaim--i.e., hidden from
vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
them from vmscan.

Kosaki Motohiro added the support for the memory controller unevictable
lru list.

Pages on the unevictable list have both PG_unevictable and PG_lru set.
Thus, PG_unevictable is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.
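
The flag relationship above can be sketched as a small userspace C helper
that maps page flags to a list.  The bit values, the enum and the function
name are hypothetical; only the rule "PG_lru says the page is on some LRU,
PG_unevictable or PG_active says which one" comes from the text.

```c
/* Hypothetical flag bits, chosen only for this sketch. */
#define PGS_LRU		(1u << 0)
#define PGS_ACTIVE	(1u << 1)
#define PGS_UNEVICTABLE	(1u << 2)

enum lru_sketch {
	LRU_SKETCH_NONE,	/* page is not on any LRU list */
	LRU_SKETCH_INACTIVE,
	LRU_SKETCH_ACTIVE,
	LRU_SKETCH_UNEVICTABLE,
};

/* Decide which list a page is on from its flags alone. */
static enum lru_sketch page_lru_sketch(unsigned int flags)
{
	if (!(flags & PGS_LRU))
		return LRU_SKETCH_NONE;
	if (flags & PGS_UNEVICTABLE)
		return LRU_SKETCH_UNEVICTABLE;
	if (flags & PGS_ACTIVE)
		return LRU_SKETCH_ACTIVE;
	return LRU_SKETCH_INACTIVE;
}
```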

The unevictable infrastructure is enabled by a new mm Kconfig option
[CONFIG_]UNEVICTABLE_LRU.

A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
not a page may be evictable.  Subsequent patches will add the various
!evictable tests.  We'll want to keep these tests light-weight for use in
shrink_active_list() and, possibly, the fault path.

To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from non-evictable to evictable state,
the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
-- tests the "evictability" of a page after placing it on the LRU, before
dropping the reference.  If the page has become unevictable,
putback_lru_page() will redo the 'putback', thus moving the page to the
unevictable list.  This way, we avoid "stranding" evictable pages on the
unevictable list.

[akpm@linux-foundation.org: fix fallout from out-of-order merge]
[riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
[nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
[kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
[kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
[kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
[kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
[kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago pageflag helpers for configed-out flags
Lee Schermerhorn [Sun, 19 Oct 2008 03:26:37 +0000 (20:26 -0700)]
pageflag helpers for configed-out flags

Define proper false/noop inline functions for noreclaim page flags when
!defined(CONFIG_UNEVICTABLE_LRU)

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago more aggressively use lumpy reclaim
Rik van Riel [Sun, 19 Oct 2008 03:26:36 +0000 (20:26 -0700)]
more aggressively use lumpy reclaim

During an AIM7 run on a 16GB system, fork started failing around 32000
threads, despite the system having plenty of free swap and 15GB of
pageable memory.  This was on x86-64, so 8k stacks.

If a higher order allocation fails, we can either:
- keep evicting pages off the end of the LRUs and hope that
  we eventually create a contiguous region; this is somewhat
  unlikely if the system is under enough stress by new
  allocations
- after trying normal eviction for a bit, use lumpy reclaim

This patch switches the system to lumpy reclaim if the VM is having
trouble freeing enough pages, using the same threshold for detection as
used by pageout congestion wait.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago vmscan: add newly swapped in pages to the inactive list
Rik van Riel [Sun, 19 Oct 2008 03:26:36 +0000 (20:26 -0700)]
vmscan: add newly swapped in pages to the inactive list

Swapin_readahead can read in a lot of data that the processes in memory
never need.  Adding swap cache pages to the inactive list prevents them
from putting too much pressure on the working set.

This has the potential to help the programs that are already in memory,
but it could also be a disadvantage to processes that are trying to get
swapped in.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago vmscan: fix pagecache reclaim referenced bit check
Rik van Riel [Sun, 19 Oct 2008 03:26:35 +0000 (20:26 -0700)]
vmscan: fix pagecache reclaim referenced bit check

Moving referenced pages back to the head of the active list creates a huge
scalability problem, because by the time a large memory system finally
runs out of free memory, every single page in the system will have been
referenced.

Not only do we not have the time to scan every single page on the active
list, but since they will all have the referenced bit set, that bit
conveys no useful information.

A more scalable solution is to just move every page that hits the end of
the active list to the inactive list.

We clear the referenced bit off of mapped pages, which need just one
reference to be moved back onto the active list.

Unmapped pages will be moved back to the active list after two references
(see mark_page_accessed).  We preserve the PG_referenced flag on unmapped
pages to preserve accesses that were made while the page was on the active
list.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago vmscan: second chance replacement for anonymous pages
Rik van Riel [Sun, 19 Oct 2008 03:26:34 +0000 (20:26 -0700)]
vmscan: second chance replacement for anonymous pages

We avoid evicting and scanning anonymous pages for the most part, but
under some workloads we can end up with most of memory filled with
anonymous pages.  At that point, we suddenly need to clear the referenced
bits on all of memory, which can take ages on very large memory systems.

We can reduce the maximum number of pages that need to be scanned by not
taking the referenced state into account when deactivating an anonymous
page.  After all, every anonymous page starts out referenced, so why
check?

If an anonymous page gets referenced again before it reaches the end of
the inactive list, we move it back to the active list.

To keep the maximum amount of necessary work reasonable, we scale the
active to inactive ratio with the size of memory, using the formula
active:inactive ratio = sqrt(memory in GB * 10).
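
A worked sketch of the formula in userspace C.  Rounding the square root
down to an integer and clamping the result to at least 1 are assumptions of
this sketch, as are the function names.

```c
/* Integer square root, rounded down (simple linear search). */
static unsigned long isqrt_sketch(unsigned long x)
{
	unsigned long r = 0;

	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

/* active:inactive ratio = sqrt(memory in GB * 10), never below 1. */
static unsigned long inactive_ratio_sketch(unsigned long mem_gb)
{
	unsigned long ratio = isqrt_sketch(10 * mem_gb);

	return ratio ? ratio : 1;
}
```

Under these assumptions 1GB gives a ratio of 3, 10GB gives 10 and 16GB
gives 12, so the inactive list shrinks relative to the active list as
memory grows.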

Kswapd CPU use now seems to scale by the amount of pageout bandwidth,
instead of by the amount of memory present in the system.

[kamezawa.hiroyu@jp.fujitsu.com: fix OOM with memcg]
[kamezawa.hiroyu@jp.fujitsu.com: memcg: lru scan fix]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago vmscan: split LRU lists into anon & file sets
Rik van Riel [Sun, 19 Oct 2008 03:26:32 +0000 (20:26 -0700)]
vmscan: split LRU lists into anon & file sets

Split the LRU lists in two, one set for pages that are backed by real file
systems ("file") and one for pages that are backed by memory and swap
("anon").  The latter includes tmpfs.

The advantage of doing this is that the VM will not have to scan over lots
of anonymous pages (which we generally do not want to swap out), just to
find the page cache pages that it should evict.

This patch has the infrastructure and a basic policy to balance how much
we scan the anon lists and how much we scan the file lists.  The big
policy changes are in separate patches.

[lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
[kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
[kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
[hugh@veritas.com: memcg swapbacked pages active]
[hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
[akpm@linux-foundation.org: fix /proc/vmstat units]
[nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
[kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
[kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago define page_file_cache() function
Rik van Riel [Sun, 19 Oct 2008 03:26:30 +0000 (20:26 -0700)]
define page_file_cache() function

Define page_file_cache() function to answer the question:
is page backed by a file?

Originally part of Rik van Riel's split-lru patch.  Extracted to make
available for other, independent reclaim patches.

Moved inline function to linux/mm_inline.h where it will be needed by
subsequent "split LRU" and "noreclaim" patches.

Unfortunately this needs to use a page flag, since the PG_swapbacked state
needs to be preserved all the way to the point where the page is last
removed from the LRU.  Trying to derive the status from other info in the
page resulted in wrong VM statistics in earlier split VM patchsets.

The total number of page flags in use on a 32 bit machine after this patch
is 19.

[akpm@linux-foundation.org: fix up out-of-order merge fallout]
[hugh@veritas.com: splitlru: shmem_getpage SetPageSwapBacked sooner]
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: MinChan Kim <minchan.kim@gmail.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago vmscan: free swap space on swap-in/activation
Rik van Riel [Sun, 19 Oct 2008 03:26:23 +0000 (20:26 -0700)]
vmscan: free swap space on swap-in/activation

If vm_swap_full() (swap space more than 50% full), the system will free
swap space at swapin time.  With this patch, the system will also free the
swap space in the pageout code, when we decide that the page is not a
candidate for swapout (and just wasting swap space).
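
The "more than 50% full" test can be sketched as follows.  The function
and parameter names are hypothetical; the sketch only encodes the threshold
stated above, using free and total swap page counts.

```c
#include <stdbool.h>

/*
 * Swap is more than half full when fewer than half of the swap pages
 * are still free, i.e. free * 2 < total.
 */
static bool swap_more_than_half_full(unsigned long free_swap_pages,
				     unsigned long total_swap_pages)
{
	return free_swap_pages * 2 < total_swap_pages;
}
```

With 100 swap pages in total, 40 free pages counts as full while 50 or 60
free pages does not.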

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: MinChan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago swap: use an array for the LRU pagevecs
KOSAKI Motohiro [Sun, 19 Oct 2008 03:26:19 +0000 (20:26 -0700)]
swap: use an array for the LRU pagevecs

Turn the pagevecs into an array just like the LRUs.  This significantly
cleans up the source code and reduces the size of the kernel by about 13kB
after all the LRU lists have been created further down in the split VM
patch series.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago vmscan: Use an indexed array for LRU variables
Christoph Lameter [Sun, 19 Oct 2008 03:26:14 +0000 (20:26 -0700)]
vmscan: Use an indexed array for LRU variables

Currently we are defining explicit variables for the inactive and active
list.  An indexed array can be more generic and avoid repeating similar
code in several places in the reclaim code.
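
A minimal sketch of the idea in userspace C, assuming a hypothetical zone
structure with one counter per list.  The enum and field names are
illustrative only.

```c
/* One index per LRU list; adding a list means adding an enum entry. */
enum lru_idx_sketch {
	LRU_IDX_INACTIVE,
	LRU_IDX_ACTIVE,
	NR_LRU_IDX_SKETCH,
};

/* Hypothetical zone with per-list page counts in an indexed array. */
struct zone_sketch {
	unsigned long nr_pages[NR_LRU_IDX_SKETCH];
};

/* One loop replaces separate code for the active and inactive list. */
static unsigned long total_lru_pages_sketch(const struct zone_sketch *zone)
{
	unsigned long total = 0;

	for (int l = 0; l < NR_LRU_IDX_SKETCH; l++)
		total += zone->nr_pages[l];
	return total;
}
```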

We are saving a few bytes in terms of code size:

Before:

   text    data     bss     dec     hex filename
4097753  573120 4092484 8763357  85b7dd vmlinux

After:

   text    data     bss     dec     hex filename
4097729  573120 4092484 8763333  85b7c5 vmlinux

Having an easy way to add new lru lists may ease future work on the
reclaim code.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago vmscan: move isolate_lru_page() to vmscan.c
Nick Piggin [Sun, 19 Oct 2008 03:26:09 +0000 (20:26 -0700)]
vmscan: move isolate_lru_page() to vmscan.c

On large memory systems, the VM can spend way too much time scanning
through pages that it cannot (or should not) evict from memory.  Not only
does it use up CPU time, but it also provokes lock contention and can
leave large systems under memory pressure in a catatonic state.

This patch series improves VM scalability by:

1) putting filesystem backed, swap backed and unevictable pages
   onto their own LRUs, so the system only scans the pages that it
   can/should evict from memory

2) switching to two handed clock replacement for the anonymous LRUs,
   so the number of pages that need to be scanned when the system
   starts swapping is bound to a reasonable number

3) keeping unevictable pages off the LRU completely, so the
   VM does not waste CPU time scanning them. ramfs, ramdisk,
   SHM_LOCKED shared memory segments and mlock()ed VMA pages
   are kept on the unevictable list.

This patch:

isolate_lru_page() logically belongs in vmscan.c rather than migrate.c.

It is a tough call, because we don't need that function without memory migration
so there is a valid argument to have it in migrate.c.  However a
subsequent patch needs to make use of it in the core mm, so we can happily
move it to vmscan.c.

Also, make the function a little more generic by not requiring that it
adds an isolated page to a given list.  Callers can do that.

Note that we now have '__isolate_lru_page()', that does
something quite different, visible outside of vmscan.c
for use with memory controller.  Methinks we need to
rationalize these names/purposes. --lts

[akpm@linux-foundation.org: fix mm/memory_hotplug.c build]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago mm: cleanup to make remove_memory() arch-neutral
Badari Pulavarty [Sun, 19 Oct 2008 03:25:58 +0000 (20:25 -0700)]
mm: cleanup to make remove_memory() arch-neutral

There is nothing architecture specific about remove_memory().
remove_memory() function is common for all architectures which support
hotplug memory remove.  Instead of duplicating it in every architecture,
collapse them into a single arch-neutral function.

[akpm@linux-foundation.org: fix the export]
Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Gary Hade <garyhade@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago serial_txx9: use %lx for iobase
Atsushi Nemoto [Sun, 19 Oct 2008 03:25:53 +0000 (20:25 -0700)]
serial_txx9: use %lx for iobase

Fix a warning caused by commit 0c8946d97ae7d2d6691f8290a10faa63453b63f8
(serial: Make uart_port's ioport "unsigned long".)
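
For illustration only, a userspace sketch of the format-specifier rule
behind this fix: an unsigned long value is printed with %lx.  The helper
name is hypothetical and the driver's actual message text is not
reproduced here.

```c
#include <stdio.h>

/* Format an unsigned long I/O base as hex; %lx matches the type. */
static int format_iobase_sketch(char *buf, size_t len,
				unsigned long iobase)
{
	return snprintf(buf, len, "0x%lx", iobase);
}
```

Passing an unsigned long to a plain %x would make the compiler warn about
a mismatched argument type, which is the kind of warning the commit fixes.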

Signed-off-by: Atsushi Nemoto <anemo@mba.ocn.ne.jp>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Josip Rodin <joy@entuzijast.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago tpm: don't export static functions
Stephen Rothwell [Sun, 19 Oct 2008 03:25:46 +0000 (20:25 -0700)]
tpm: don't export static functions

Today's linux-next build (powerpc_allyesconfig) failed like this:

drivers/char/tpm/tpm.c:1162: error: __ksymtab_tpm_dev_release causes a section type conflict

Caused by commit 253115b71fa06330bd58afbe01ccaf763a8a0cf1 ("The
tpm_dev_release function is only called for platform devices, not pnp")
which exported a static function.

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Rajiv Andrade <srajiv@linux.vnet.ibm.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago mailmap: add Mark Brown
Mark Brown [Sun, 19 Oct 2008 03:25:41 +0000 (20:25 -0700)]
mailmap: add Mark Brown

A couple of commits have a broken real name - fix them up.

Signed-off-by: Mark Brown <broonie@sirena.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago anon_vma_prepare: properly lock even newly allocated entries
Linus Torvalds [Sun, 19 Oct 2008 17:32:20 +0000 (10:32 -0700)]
anon_vma_prepare: properly lock even newly allocated entries

The anon_vma code is very subtle, and we end up doing optimistic lookups
of anon_vmas under RCU in page_lock_anon_vma() with no locking.  Other
CPUs can also see the newly allocated entry immediately after we've
exposed it by setting "vma->anon_vma" to the new value.

We protect against the anon_vma being destroyed by having the SLAB
marked as SLAB_DESTROY_BY_RCU, so the RCU lookup can depend on the
allocation not being destroyed - but it might still be free'd and
re-allocated here to a new vma.

As a result, we should not do the anon_vma list ops on a newly allocated
vma without proper locking.

Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
16 years ago Merge git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6
Linus Torvalds [Fri, 17 Oct 2008 22:43:52 +0000 (15:43 -0700)]
Merge git://git./linux/kernel/git/gregkh/usb-2.6

* git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb-2.6: (94 commits)
  USB: remove err() macro from more usb drivers
  USB: remove err() macro from usb misc drivers
  USB: remove err() macro from usb core code
  USB: remove err() macro from usb class drivers
  USB: remove use of err() in drivers/usb/serial
  USB: remove info() macro from usb mtd drivers
  USB: remove info() macro from usb input drivers
  USB: remove info() macro from usb network drivers
  USB: remove info() macro from remaining usb drivers
  USB: remove info() macro from usb/misc drivers
  USB: remove info() macro from usb/serial drivers
  USB: remove warn macro from HID core
  USB: remove warn() macro from usb drivers
  USB: remove warn() macro from usb net drivers
  USB: remove warn() macro from usb media drivers
  USB: remove warn() macro from usb input drivers
  usb/fsl_qe_udc: clear data toggle on clear halt request
  usb/fsl_qe_udc: fix response to get status request
  fsl_usb2_udc: Fix oops on probe failure.
  fsl_usb2_udc: Add a wmb before priming endpoint.
  ...

16 years ago Merge branch 'drm-next' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied...
Linus Torvalds [Fri, 17 Oct 2008 22:09:20 +0000 (15:09 -0700)]
Merge branch 'drm-next' of git://git./linux/kernel/git/airlied/drm-2.6

* 'drm-next' of git://git.kernel.org/pub/scm/linux/kernel/git/airlied/drm-2.6: (44 commits)
  drm/i915: fix ioremap of a user address for non-root (CVE-2008-3831)
  drm: make CONFIG_DRM depend on CONFIG_SHMEM.
  radeon: fix PCI bus mastering support enables.
  radeon: add RS400 family support.
  drm/radeon: add support for RS740 IGP chipsets.
  i915: GM45 has GM965-style MCH setup.
  i915: Don't run retire work handler while suspended
  i915: Map status page cached for chips with GTT-based HWS location.
  i915: Fix up ring initialization to cover G45 oddities
  i915: Use non-reserved status page index for breadcrumb
  drm: Increment dev_priv->irq_received so i915_gem_interrupts count works.
  drm: kill drm_device->irq
  drm: wbinvd is cache coherent.
  i915: add missing return in error path.
  i915: fixup permissions on gem ioctls.
  drm: Clean up many sparse warnings in i915.
  drm: Use ioremap_wc in i915_driver instead of ioremap, since we always want WC.
  drm: G33-class hardware has a newer 965-style MCH (no DCC register).
  drm: Avoid oops in GEM execbuffers with bad arguments.
  DRM: Return -EBADF on bad object in flink, and return current name if it exists.
  ...

16 years ago Merge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab...
Linus Torvalds [Fri, 17 Oct 2008 22:08:47 +0000 (15:08 -0700)]
Merge branch 'for_linus' of git://git./linux/kernel/git/mchehab/linux-2.6

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6: (95 commits)
  V4L/DVB (9296): Patch to remove warning message during cx88-dvb compilation
  V4L/DVB (9294): gspca: Add a stop sequence in t613.
  V4L/DVB (9293): gspca: Separate and fix the sensor dependant sequences in t613.
  V4L/DVB (9292): gspca: Call the control setting functions at init time in t613.
  V4L/DVB (9291): gspca: Do not set the white balance temperature by default in t613.
  V4L/DVB (9290): gspca: Adjust the sensor init sequences in t613.
  V4L/DVB (9289): gspca: Other sensor identified as om6802 in t613.
  V4L/DVB (9288): gspca: Write to the USB device and not USB interface in t613.
  V4L/DVB (9287): gspca: Change the name of the multi bytes write function in t613.
  V4L/DVB (9286): gspca: Compilation problem of gspca.c and the kernel version.
  V4L/DVB (9283): Correct typo and enable setting the gain on the mt9m111 sensor
  V4L/DVB (9282): Properly iterate the urbs when destroying them.
  V4L/DVB (9281): gspca: Add hflip and vflip to the po1030 sensor
  V4L/DVB (9280): gspca: Use the gspca debug macros
  V4L/DVB (9279): gspca: Correct some copyright headers
  V4L/DVB (9278): gspca: Remove the m5602_debug variable
  V4L/DVB (9277): gspca: propagate an error in m5602_start_transfer()
  V4L/DVB (9276): videobuf-dvb: two functions are now static
  V4L/DVB (9275): dvb: input data pointer of cx24116_writeregN() should be const
  V4L/DVB (9274): Remove spurious messages and turn into debug.
  ...

16 years agoMerge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Linus Torvalds [Fri, 17 Oct 2008 22:08:11 +0000 (15:08 -0700)]
Merge branch 'for_linus' of git://git./linux/kernel/git/tytso/ext4

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: Remove automatic enabling of the HUGE_FILE feature flag
  ext4: Replace hackish ext4_mb_poll_new_transaction with commit callback
  ext4: Update Documentation/filesystems/ext4.txt
  ext4: Remove unused mount options: nomballoc, mballoc, nocheck
  ext4: Remove compile warnings when building w/o CONFIG_PROC_FS
  ext4: Add missing newlines to printk messages
  ext4: Fix file fragmentation during large file write.
  vfs: Add no_nrwrite_index_update writeback control flag
  vfs: Remove the range_cont writeback mode.
  ext4: Use tag dirty lookup during mpage_da_submit_io
  ext4: let the block device know when unused blocks can be discarded
  ext4: Don't reuse released data blocks until transaction commits
  ext4: Use an rbtree for tracking blocks freed during transaction.
  ext4: Do mballoc init before doing filesystem recovery
  ext4: Free ext4_prealloc_space using kmem_cache_free
  ext4: Fix Kconfig typo for ext4dev
  ext4: Remove an old reference to ext4dev in Makefile comment

16 years agoUSB: remove err() macro from more usb drivers
Greg Kroah-Hartman [Thu, 14 Aug 2008 16:37:34 +0000 (09:37 -0700)]
USB: remove err() macro from more usb drivers

USB should not have its own printk macros, so remove err() and
use the system-wide standard of dev_err() wherever possible.  In the
few places that will not work out, use a basic printk().

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: remove err() macro from usb misc drivers
Greg Kroah-Hartman [Thu, 14 Aug 2008 16:37:34 +0000 (09:37 -0700)]
USB: remove err() macro from usb misc drivers

USB should not have its own printk macros, so remove err() and
use the system-wide standard of dev_err() wherever possible.  In the
few places that will not work out, use a basic printk().

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: remove err() macro from usb core code
Greg Kroah-Hartman [Thu, 14 Aug 2008 16:37:34 +0000 (09:37 -0700)]
USB: remove err() macro from usb core code

USB should not have its own printk macros, so remove err() and
use the system-wide standard of dev_err() wherever possible.  In the
few places that will not work out, use a basic printk().

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: remove err() macro from usb class drivers
Greg Kroah-Hartman [Thu, 14 Aug 2008 16:37:34 +0000 (09:37 -0700)]
USB: remove err() macro from usb class drivers

USB should not have its own printk macros, so remove err() and
use the system-wide standard of dev_err() wherever possible.  In the
few places that will not work out, use a basic printk().

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: remove use of err() in drivers/usb/serial
Greg Kroah-Hartman [Wed, 20 Aug 2008 23:56:34 +0000 (16:56 -0700)]
USB: remove use of err() in drivers/usb/serial

err() is going away, so switch to dev_err() or printk() if it's really
needed.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: remove info() macro from usb mtd drivers
Greg Kroah-Hartman [Mon, 18 Aug 2008 20:21:04 +0000 (13:21 -0700)]
USB: remove info() macro from usb mtd drivers

USB should not have its own printk macros, so remove info() and
use the system-wide standard of dev_info() wherever possible.

Acked-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: remove info() macro from usb input drivers
Greg Kroah-Hartman [Mon, 18 Aug 2008 20:21:04 +0000 (13:21 -0700)]
USB: remove info() macro from usb input drivers

USB should not have its own printk macros, so remove info() and
use the system-wide standard of dev_info() wherever possible.

Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: remove info() macro from usb network drivers
Greg Kroah-Hartman [Mon, 18 Aug 2008 20:21:04 +0000 (13:21 -0700)]
USB: remove info() macro from usb network drivers

USB should not have its own printk macros, so remove info() and
use the system-wide standard of dev_info() wherever possible.

Cc: Jeff Garzik <jgarzik@pobox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: remove info() macro from remaining usb drivers
Greg Kroah-Hartman [Mon, 18 Aug 2008 20:21:04 +0000 (13:21 -0700)]
USB: remove info() macro from remaining usb drivers

USB should not have its own printk macros, so remove info() and
use the system-wide standard of dev_info() wherever possible.  In the
few places that will not work out, use a basic printk().

Clean up the remaining usages of this in the drivers/usb/ directory.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: remove info() macro from usb/misc drivers
Greg Kroah-Hartman [Mon, 18 Aug 2008 20:21:04 +0000 (13:21 -0700)]
USB: remove info() macro from usb/misc drivers

USB should not have its own printk macros, so remove info() and
use the system-wide standard of dev_info() wherever possible.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: remove info() macro from usb/serial drivers
Greg Kroah-Hartman [Mon, 18 Aug 2008 20:21:04 +0000 (13:21 -0700)]
USB: remove info() macro from usb/serial drivers

USB should not have its own printk macros, so remove info() and
use the system-wide standard of dev_info() wherever possible.

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: remove warn macro from HID core
Greg Kroah-Hartman [Wed, 15 Oct 2008 18:30:07 +0000 (11:30 -0700)]
USB: remove warn macro from HID core

Two stragglers were missed in the last merge of the HID tree: their warn() calls were never changed to dev_warn().  This patch fixes them up.

Acked-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: remove warn() macro from usb drivers
Greg Kroah-Hartman [Thu, 14 Aug 2008 16:37:34 +0000 (09:37 -0700)]
USB: remove warn() macro from usb drivers

USB should not have its own printk macros, so remove warn() and
use the system-wide standard of dev_warn() wherever possible.  In the
few places that will not work out, use a basic printk().

Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: remove warn() macro from usb net drivers
Greg Kroah-Hartman [Thu, 14 Aug 2008 16:37:34 +0000 (09:37 -0700)]
USB: remove warn() macro from usb net drivers

USB should not have its own printk macros, so remove warn() and
use the system-wide standard of dev_warn() wherever possible.  In the
few places that will not work out, use a basic printk().

Cc: Jeff Garzik <jgarzik@pobox.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: remove warn() macro from usb media drivers
Greg Kroah-Hartman [Thu, 14 Aug 2008 16:37:34 +0000 (09:37 -0700)]
USB: remove warn() macro from usb media drivers

USB should not have its own printk macros, so remove warn() and
use the system-wide standard of dev_warn() wherever possible.  In the
few places that will not work out, use a basic printk().

Cc: Mauro Carvalho Chehab <mchehab@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: remove warn() macro from usb input drivers
Greg Kroah-Hartman [Thu, 14 Aug 2008 16:37:34 +0000 (09:37 -0700)]
USB: remove warn() macro from usb input drivers

USB should not have its own printk macros, so remove warn() and
use the system-wide standard of dev_warn() wherever possible.  In the
few places that will not work out, use a basic printk().

Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agousb/fsl_qe_udc: clear data toggle on clear halt request
Li Yang [Wed, 24 Sep 2008 07:50:27 +0000 (15:50 +0800)]
usb/fsl_qe_udc: clear data toggle on clear halt request

Fix to comply with USB spec.

Signed-off-by: Li Yang <leoli@freescale.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agousb/fsl_qe_udc: fix response to get status request
Li Yang [Wed, 24 Sep 2008 07:50:26 +0000 (15:50 +0800)]
usb/fsl_qe_udc: fix response to get status request

The original code didn't respond correctly to get status requests on
device and endpoint.  Although normal operation works without the
fix, the code is not compliant with chapter 9 of the USB spec and fails
the USBCV ch9 tests.  This patch fixes that and a few style/typo problems.

Signed-off-by: Li Yang <leoli@freescale.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agofsl_usb2_udc: Fix oops on probe failure.
Will Newton [Tue, 12 Aug 2008 14:39:17 +0000 (15:39 +0100)]
fsl_usb2_udc: Fix oops on probe failure.

In some circumstances when fsl_udc_probe fails, udc_controller is freed but
the pointer remains non-NULL. fsl_udc_remove will then try to tear down
the partly initialized and freed controller structure, resulting in an oops.
This patch ensures udc_controller is either NULL or fully initialized after
fsl_udc_probe.
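
The invariant described above can be illustrated with a small user-space
sketch. The names struct udc_sketch, udc_controller_sketch, udc_probe_sketch()
and udc_remove_sketch() are hypothetical stand-ins and are not the driver's
code. The only point shown is that the global pointer is cleared on every
failure path, so a later remove call can tell there is nothing to tear down.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the driver's controller structure. */
struct udc_sketch {
    int irq_requested;
};

/* Global pointer playing the role that udc_controller has in the text. */
struct udc_sketch *udc_controller_sketch;

/* Returns 0 on success, -1 on failure. On failure the global is NULL. */
int udc_probe_sketch(int fail_late)
{
    udc_controller_sketch = calloc(1, sizeof(*udc_controller_sketch));
    if (!udc_controller_sketch)
        return -1;

    if (fail_late) {
        /* Free and clear together so no dangling pointer survives. */
        free(udc_controller_sketch);
        udc_controller_sketch = NULL;
        return -1;
    }

    udc_controller_sketch->irq_requested = 1;
    return 0;
}

/* Returns 1 if a controller was torn down, 0 if there was nothing to do. */
int udc_remove_sketch(void)
{
    if (!udc_controller_sketch)
        return 0;

    free(udc_controller_sketch);
    udc_controller_sketch = NULL;
    return 1;
}
```

With this shape, calling the remove function after a failed probe is a
harmless no-op instead of a use of freed memory.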

Signed-off-by: Will Newton <will.newton@gmail.com>
Acked-by: Li Yang <leoli@freescale.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agofsl_usb2_udc: Add a wmb before priming endpoint.
Will Newton [Tue, 12 Aug 2008 14:39:16 +0000 (15:39 +0100)]
fsl_usb2_udc: Add a wmb before priming endpoint.

Add a wmb to fsl_queue_td before priming the endpoint. This ensures that the
modifications to the QH are seen by the hardware.

Added comment as suggested by Felipe Balbi.

Signed-off-by: Will Newton <will.newton@gmail.com>
Acked-by: Li Yang <leoli@freescale.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agofsl_usb2_udc: Make fsl_queue_td return type void.
Will Newton [Tue, 12 Aug 2008 14:39:15 +0000 (15:39 +0100)]
fsl_usb2_udc: Make fsl_queue_td return type void.

fsl_queue_td always returns 0. Make it void and remove checks for non-zero
return in callers.

Signed-off-by: Will Newton <will.newton@gmail.com>
Acked-by: Li Yang <leoli@freescale.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agofsl_usb2_udc: Uninline udc_reset_ep_queue.
Will Newton [Tue, 12 Aug 2008 14:39:14 +0000 (15:39 +0100)]
fsl_usb2_udc: Uninline udc_reset_ep_queue.

Uninline udc_reset_ep_queue and remove its unused return value.

Signed-off-by: Will Newton <will.newton@gmail.com>
Acked-by: Li Yang <leoli@freescale.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agofsl_usb2_udc: Rename the arguments of the fsl_writel macro.
Will Newton [Tue, 12 Aug 2008 14:39:13 +0000 (15:39 +0100)]
fsl_usb2_udc: Rename the arguments of the fsl_writel macro.

Rename the arguments of the fsl_writel macro to match their use.
Remove a couple of unnecessary prototypes.

Signed-off-by: Will Newton <will.newton@gmail.com>
Acked-by: Li Yang <leoli@freescale.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agofsl_usb2_udc: Initialize spinlock earlier.
Will Newton [Tue, 12 Aug 2008 14:39:12 +0000 (15:39 +0100)]
fsl_usb2_udc: Initialize spinlock earlier.

Move spinlock initialization earlier so we can turn shared irq handler
debugging on safely.

Signed-off-by: Will Newton <will.newton@gmail.com>
Acked-by: Li Yang <leoli@freescale.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agofsl_usb2_udc: Clean up whitespace in /proc debugging output.
Will Newton [Tue, 12 Aug 2008 14:39:11 +0000 (15:39 +0100)]
fsl_usb2_udc: Clean up whitespace in /proc debugging output.

Missing spaces were causing the /proc debugging output to be rather
unreadable.

Signed-off-by: Will Newton <will.newton@gmail.com>
Acked-by: Li Yang <leoli@freescale.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agofsl_usb2_udc: Clean up whitespace in errors and warnings.
Will Newton [Tue, 12 Aug 2008 14:39:10 +0000 (15:39 +0100)]
fsl_usb2_udc: Clean up whitespace in errors and warnings.

VDBG always outputs a trailing \n.

Signed-off-by: Will Newton <will.newton@gmail.com>
Acked-by: Li Yang <leoli@freescale.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agofsl_usb2_udc: Fix some sparse warnings and remove redundant code.
Will Newton [Tue, 12 Aug 2008 14:39:09 +0000 (15:39 +0100)]
fsl_usb2_udc: Fix some sparse warnings and remove redundant code.

Fix some sparse "integer used as NULL pointer" warnings.
Remove some unnecessary volatiles and static initialization.
Remove some unused struct members and reorder to improve packing.
Remove a few unneeded includes.

Signed-off-by: Will Newton <will.newton@gmail.com>
Acked-by: Li Yang <leoli@freescale.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agofsl_usb2_udc: Remove check for udc == NULL in dr_controller_setup.
Will Newton [Tue, 12 Aug 2008 14:39:08 +0000 (15:39 +0100)]
fsl_usb2_udc: Remove check for udc == NULL in dr_controller_setup.

Remove check for udc == NULL in dr_controller_setup. All callers of
this function have already dereferenced udc at some point.

Signed-off-by: Will Newton <will.newton@gmail.com>
Acked-by: Li Yang <leoli@freescale.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agofsl_usb2_udc: Make dr_ep_setup function static.
Will Newton [Tue, 12 Aug 2008 14:39:07 +0000 (15:39 +0100)]
fsl_usb2_udc: Make dr_ep_setup function static.

Make dr_ep_setup function static as it's never used outside this file.

Signed-off-by: Will Newton <will.newton@gmail.com>
Acked-by: Li Yang <leoli@freescale.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: fix up problems in the vtusb driver
Stephen Ware [Wed, 8 Oct 2008 17:53:56 +0000 (10:53 -0700)]
USB: fix up problems in the vtusb driver

Add range check on buffer sizes passed in from user space
(max is 8*PAGE_SIZE) which will work for the most common
spectrometers even at pages as small as 1K.
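
A minimal sketch of such an upper-bound check, written as plain user-space C.
PAGE_SIZE_SKETCH, MAX_XFER_SKETCH and check_user_count_sketch() are
hypothetical names, and the 4096-byte page size is an assumption made only
for the example; the kernel supplies the real PAGE_SIZE and the driver's
actual check may look different.

```c
#include <stddef.h>

/* Assumed page size for this sketch only. */
#define PAGE_SIZE_SKETCH 4096u

/* The limit named in the text: 8 * PAGE_SIZE. */
#define MAX_XFER_SKETCH (8u * PAGE_SIZE_SKETCH)

/*
 * Validate a buffer size supplied by user space.
 * Returns 0 if the size is within the limit, -1 if it is too large.
 */
int check_user_count_sketch(size_t count)
{
    if (count > MAX_XFER_SKETCH)
        return -1;
    return 0;
}
```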

Add kref to vst device structure to preserve reference to the
usb object until we truly are done with it.

From: Stephen Ware <stephen.ware@eqware.net>
From: Dennis O'Brien <dennis.obrien@eqware.net>
Signed-off-by: Dennis O'Brien <dennis.obrien@eqware.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: OHCI: fix endless polling behavior
Alan Stern [Thu, 9 Oct 2008 19:40:23 +0000 (15:40 -0400)]
USB: OHCI: fix endless polling behavior

This patch (as1149) fixes an obscure problem in OHCI polling.  In the
current code, if the RHSC interrupt status flag turns on at a time
when RHSC interrupts are disabled, it will remain on forever:

The interrupt handler is the only place where RHSC status
gets turned back off;

The interrupt handler won't turn RHSC status off because it
doesn't turn off status flags if the corresponding interrupt
isn't enabled;

RHSC interrupts will never get enabled because
ohci_root_hub_state_changes() doesn't reenable RHSC if RHSC
status is on!

As a result we will continue polling indefinitely instead of reverting
to interrupt-driven operation, and the root hub will not autosuspend.
This particular sequence of events is not at all unusual; in fact
plugging a USB device into an OHCI controller will usually cause it to
occur.

Of course, this is a bug.  The proper thing to do is to turn off RHSC
status just before reading the actual port status values.  That way
either a port status change will be detected (if it occurs before the
status read) or it will turn RHSC back on.  Possibly both, but that
won't hurt anything.

We can still check for systems in which RHSC is totally broken, by
re-reading RHSC after clearing it and before reading the port
statuses.  (This re-read has to be done anyway, to post the earlier
write.)  If RHSC is on but no port-change statuses are set, then we
know that RHSC is broken and we can avoid re-enabling it.
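
The ordering described above (clear RHSC, re-read it, then read the port
statuses) can be modelled in user space roughly as follows. This is a sketch
under stated assumptions and not the ohci-hcd code: struct root_hub_sketch
and both functions are hypothetical, and the hardware registers are reduced
to three booleans.

```c
#include <stdbool.h>

struct root_hub_sketch {
    bool rhsc_status;   /* models the RHSC interrupt status flag */
    bool rhsc_stuck;    /* models broken hardware whose RHSC never clears */
    bool port_changed;  /* models any port-change status bit being set */
};

/* Models the write that clears RHSC; stuck hardware ignores it. */
void clear_rhsc_sketch(struct root_hub_sketch *hub)
{
    if (!hub->rhsc_stuck)
        hub->rhsc_status = false;
}

/*
 * Clear RHSC, re-read it, and only then look at the port statuses.
 * Returns true when RHSC is still on although no port reports a change,
 * which the text above treats as a sign that RHSC is broken.
 */
bool rhsc_looks_broken_sketch(struct root_hub_sketch *hub)
{
    bool rhsc_after_clear;

    clear_rhsc_sketch(hub);
    rhsc_after_clear = hub->rhsc_status;   /* the re-read */

    return rhsc_after_clear && !hub->port_changed;
}
```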

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Cc: stable <stable@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: option: add Pantech cards
Dan Williams [Fri, 10 Oct 2008 10:41:16 +0000 (06:41 -0400)]
USB: option: add Pantech cards

Add some Pantech mobile broadband IDs.

Signed-off-by: Dan Williams <dcbw@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: hub.c: Add initial_descriptor_timeout module parameter for usbcore
Jaroslav Kysela [Fri, 10 Oct 2008 14:24:45 +0000 (16:24 +0200)]
USB: hub.c: Add initial_descriptor_timeout module parameter for usbcore

This patch adds the initial_descriptor_timeout module parameter to usbcore.ko
to allow modifying the timeout of the initial 64-byte USB_REQ_GET_DESCRIPTOR
request for non-standard devices.

For example, the SATA8000 device from DATAST0R Technology Corp
requires about 10 seconds to send a reply (it probably waits until the
inserted disk is ready for operation).

Also, this patch adds missing usbcore parameters to
Documentation/kernel-parameters.txt.

Signed-off-by: Jaroslav Kysela <perex@perex.cz>
Acked-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: Export if an interface driver supports autosuspend.
Sarah Sharp [Mon, 6 Oct 2008 21:45:46 +0000 (14:45 -0700)]
USB: Export if an interface driver supports autosuspend.

Create a new sysfs file per interface named supports_autosuspend.  This
file returns true if an interface driver's .supports_autosuspend flag is
set.  It also returns true if the interface is unclaimed (since the USB
core will autosuspend a device if an interface is not claimed).

This new sysfs file will be useful for user space scripts to test whether
a USB device correctly auto-suspends.
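
As an illustration of how a user-space tool might consume this file, here is
a small C sketch. It assumes the attribute reads as "1" or "0" followed by a
newline, which the text above does not spell out, and the function names and
the example path in the comment are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Interpret the text read from a supports_autosuspend file.
 * Assumption: the file holds "1" or "0", optionally followed by a newline.
 * Returns 1 (true), 0 (false), or -1 if the text is not recognized.
 */
int parse_supports_autosuspend_sketch(const char *text)
{
    if (strcmp(text, "1\n") == 0 || strcmp(text, "1") == 0)
        return 1;
    if (strcmp(text, "0\n") == 0 || strcmp(text, "0") == 0)
        return 0;
    return -1;
}

/*
 * Read the attribute at the given path (for example, a hypothetical
 * /sys/bus/usb/devices/<interface>/supports_autosuspend) and parse it.
 * Returns -1 if the file cannot be opened or read.
 */
int read_supports_autosuspend_sketch(const char *path)
{
    char buf[16];
    FILE *f = fopen(path, "r");
    int result = -1;

    if (!f)
        return -1;
    if (fgets(buf, sizeof(buf), f))
        result = parse_supports_autosuspend_sketch(buf);
    fclose(f);
    return result;
}
```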

Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
Cc: Oliver Neukum <oliver@neukum.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: EHCI: fix remote-wakeup support for ARC/TDI core
Alan Stern [Mon, 6 Oct 2008 15:25:53 +0000 (11:25 -0400)]
USB: EHCI: fix remote-wakeup support for ARC/TDI core

This patch (as1147) fixes the remote-wakeup support for EHCI
controllers using the ARC/TDI "embedded-TT" core.  These controllers
turn off the RESUME bit by themselves when a port resume is complete;
hence we need to keep separate track of which ports are suspended or
in the process of resuming.

The patch also makes a couple of small improvements in ehci_irq(),
replacing reads of the command register with the value already stored
in a local variable.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Tested-by: Thomas Reitmayr <treitmayr@devbase.at>
CC: David Brownell <david-b@pacbell.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: snoop processes opening usbfs device files
Alan Stern [Mon, 6 Oct 2008 15:24:26 +0000 (11:24 -0400)]
USB: snoop processes opening usbfs device files

This patch (as1148) adds a new "snoop" message to usbfs when a device
file is opened, identifying the process responsible.  This comes in
extremely handy when trying to determine which program is doing some
unwanted USB access.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: Option / AnyData new modem, same ID
Jon K Hellan [Fri, 3 Oct 2008 08:36:16 +0000 (10:36 +0200)]
USB: Option / AnyData new modem, same ID

The AnyData ADU-310 series of wireless modems uses the same product ID as the ADU-E100 series.

Signed-off-by: Jon K Hellan <hellan@acm.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: EHCI: log a warning if ehci-hcd is not loaded first
Alan Stern [Thu, 2 Oct 2008 15:48:13 +0000 (11:48 -0400)]
USB: EHCI: log a warning if ehci-hcd is not loaded first

This patch (as1139) adds a warning to the system log whenever ehci-hcd
is loaded after ohci-hcd or uhci-hcd.  Nowadays most distributions are
pretty good about not doing this; maybe the warning will help convince
anyone still doing it wrong.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Cc: stable <stable@kernel.org> [2.6.27]
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: EHCI, OHCI, UHCI: remove version numbers
Alan Stern [Thu, 2 Oct 2008 15:47:15 +0000 (11:47 -0400)]
USB: EHCI, OHCI, UHCI: remove version numbers

This patch (as1145) removes the essentially useless driver-version
strings from ehci-hcd, ohci-hcd, and uhci-hcd.  It also unifies the
form of the banner lines they display upon loading and adds a missing
test for usb_disabled() to ehci-hcd.

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: anchor API changes needed for btusb
Oliver Neukum [Mon, 25 Aug 2008 20:40:25 +0000 (22:40 +0200)]
USB: anchor API changes needed for btusb

This extends the anchor API as needed by btusb for autosuspend.

Signed-off-by: Oliver Neukum <oneukum@suse.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: ftdi-elan: Always pass usb_bulk_msg() a timeout in milliseconds.
Sarah Sharp [Mon, 29 Sep 2008 17:58:35 +0000 (10:58 -0700)]
USB: ftdi-elan: Always pass usb_bulk_msg() a timeout in milliseconds.

The kernel doc for usb_bulk_msg() says the timeout for a bulk message should be
specified in milliseconds.  The ftdi-elan driver converts milliseconds to
jiffies before passing the timeout to usb_bulk_msg().  This is mostly harmless,
since it will just lead to very long timeouts, but was obviously not the intent
of the original author.

Signed-off-by: Sarah Sharp <sarah.a.sharp@intel.com>
Acked-by: Tony Olech <tony.olech@elandigitalsystems.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: isp1760: Use an IS_ERR test rather than a NULL test
Julien Brunel [Wed, 24 Sep 2008 16:00:36 +0000 (18:00 +0200)]
USB: isp1760: Use an IS_ERR test rather than a NULL test

In case of error, the function isp1760_register returns an ERR
pointer, but never returns a NULL pointer. So after a call to this
function, a NULL test should be replaced by an IS_ERR test. Moreover,
we have noticed that:
(1) the result of isp1760_register is assigned through the function
pci_set_drvdata without an error test,
(2) if the call to isp1760_register fails, the current function
(isp1761_pci_probe) returns 0, and if it succeeds, it returns -ENOMEM,
which seems odd.

Thus, we suggest moving the test before the call to pci_set_drvdata
to correct (1), and turning it into a non-IS_ERR test to correct (2).

The semantic match that finds this problem is as follows:
(http://www.emn.fr/x-info/coccinelle/)

// <smpl>
@bad_null_test@
expression x,E;
statement S1, S2;
@@
x =  isp1760_register(...)
... when != x = E
* if (x == NULL)
S1 else S2
// </smpl>
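
To show why a NULL test misses an error pointer, here is a user-space sketch.
The helpers err_ptr_sketch(), is_err_sketch() and ptr_err_sketch() are
simplified stand-ins for the kernel's ERR_PTR(), IS_ERR() and PTR_ERR(), and
isp_register_sketch(), isp_probe_sketch() and drvdata_sketch are hypothetical
names. Returning the encoded error code from the probe function is an
assumption of this sketch, not a description of the actual fix.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Error codes are encoded in the top MAX_ERRNO_SKETCH pointer values. */
#define MAX_ERRNO_SKETCH 4095

static inline void *err_ptr_sketch(long error)
{
    return (void *)(intptr_t)error;
}

static inline int is_err_sketch(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO_SKETCH;
}

static inline long ptr_err_sketch(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

struct hcd_sketch {
    int id;
};

static struct hcd_sketch the_hcd_sketch = { 1 };

/* Stand-in for the slot that pci_set_drvdata() would fill. */
struct hcd_sketch *drvdata_sketch;

/* Like the function in the text: on failure, an error pointer, never NULL. */
struct hcd_sketch *isp_register_sketch(int fail)
{
    if (fail)
        return err_ptr_sketch(-ENOMEM);
    return &the_hcd_sketch;
}

/* Test the result before storing it, and report failure as failure. */
int isp_probe_sketch(int fail)
{
    struct hcd_sketch *hcd = isp_register_sketch(fail);

    if (is_err_sketch(hcd))
        return (int)ptr_err_sketch(hcd);

    drvdata_sketch = hcd;
    return 0;
}
```

Note that isp_register_sketch(1) is not NULL, so a check of the form
"if (x == NULL)" would treat the failure as success.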

Signed-off-by: Julien Brunel <brunel@diku.dk>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: improve ehci_watchdog's side effect in CPU power management
Yi Yang [Thu, 25 Sep 2008 09:25:44 +0000 (17:25 +0800)]
USB: improve ehci_watchdog's side effect in CPU power management

ehci_watchdog wakes up the CPU so frequently that the CPU stays in
C3 only briefly: the average residence time is about 50 ms on an
Aspire One, whereas we expect about 1 second or more.  This kind of
periodic timer is very bad for power saving.

We can't remove this timer because of some bad USB controller
chipsets, but we should at least reduce its side effect as much
as possible.

This patch makes the CPU stay in C3 longer; the average residence
time is about twice as long as before.

Please consider applying it, thanks.

Signed-off-by: Yi Yang <yi.y.yang@intel.com>
Cc: David Brownell <dbrownell@users.sourceforge.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: UHCI: improve scheduling of interrupt URBs
Alan Stern [Thu, 25 Sep 2008 20:59:57 +0000 (16:59 -0400)]
USB: UHCI: improve scheduling of interrupt URBs

This patch (as1140) adds a little intelligence to the interrupt-URB
scheduler in uhci-hcd.  Right now the scheduler is stupid; every URB
having the same period is assigned to the same slot.  Thus a large
group of period-N URBs can fill their slot and cause -ENOSPC errors
even when all the lower-period slots are empty.

With the patch, if an URB doesn't fit in its assigned slot then the
scheduler will try using lower-period slots.  This will provide
greater flexibility.  As an example, the driver will be able to handle
more than just three or four mice, which the current driver cannot.
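
The fallback idea can be sketched with a deliberately simplified model. This
is not the uhci-hcd scheduler: here each slot merely counts URBs against a
made-up capacity, whereas a real scheduler has its own accounting, and all
names and constants below are hypothetical.

```c
#include <errno.h>

#define NUM_SLOTS_SKETCH     6   /* slot i stands for period 2^i: 1, 2, ..., 32 */
#define SLOT_CAPACITY_SKETCH 4   /* made-up number of URBs that fit in a slot */

static int slot_load_sketch[NUM_SLOTS_SKETCH];

/*
 * Try the slot matching the URB's period first; if it is full, fall back
 * to lower-period slots. Returns the index of the slot used, -EINVAL for
 * an out-of-range index, or -ENOSPC if every candidate slot is full.
 */
int schedule_urb_sketch(int slot_index)
{
    int i;

    if (slot_index < 0 || slot_index >= NUM_SLOTS_SKETCH)
        return -EINVAL;

    for (i = slot_index; i >= 0; i--) {
        if (slot_load_sketch[i] < SLOT_CAPACITY_SKETCH) {
            slot_load_sketch[i]++;
            return i;
        }
    }
    return -ENOSPC;
}
```

In this model a fifth URB for a full slot lands in the next lower-period
slot instead of failing, and -ENOSPC appears only when that slot and all
lower-period slots are full.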

Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agoUSB: ti_usb_3410_5052: removed duplicated include
Huang Weiyi [Thu, 25 Sep 2008 05:11:28 +0000 (13:11 +0800)]
USB: ti_usb_3410_5052: removed duplicated include

Removed duplicated #include <linux/firmware.h> in
drivers/usb/serial/ti_usb_3410_5052.c.

Signed-off-by: Huang Weiyi <hwy@cn.fujitsu.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
16 years agousb: vstusb.c : new driver for spectrometers used by Vernier Software & Technology...
Stephen Ware [Tue, 30 Sep 2008 18:39:38 +0000 (11:39 -0700)]
usb: vstusb.c : new driver for spectrometers used by Vernier Software & Technology, Inc.

This patch adds the vstusb driver to the drivers/usb/misc directory.
This driver provides support for Vernier Software & Technology
spectrometers, all made by Ocean Optics. The driver provides both IOCTL
and read()/write() methods for sending raw data to spectrometers across
the bulk channel. Each method allows for a configured timeout.

From: Stephen Ware <stephen.ware@eqware.net>
Signed-off-by: Dennis O'Brien <dennis.obrien@eqware.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>