openwrt/staging/blogic.git
15 years agoeCryptfs: Prevent lower dentry from going negative during unlink
Tyler Hicks [Tue, 22 Sep 2009 17:52:17 +0000 (12:52 -0500)]
eCryptfs: Prevent lower dentry from going negative during unlink

When calling vfs_unlink() on the lower dentry, d_delete() turns the
dentry into a negative dentry when the d_count is 1.  This eventually
caused a NULL pointer deref when a read() or write() was done and the
negative dentry's d_inode was dereferenced in
ecryptfs_read_update_atime() or ecryptfs_getxattr().

Placing mutt's tmpdir in an eCryptfs mount is what initially triggered
the oops and I was able to reproduce it with the following sequence:

open("/tmp/upper/foo", O_RDWR|O_CREAT|O_EXCL|O_NOFOLLOW, 0600) = 3
link("/tmp/upper/foo", "/tmp/upper/bar") = 0
unlink("/tmp/upper/foo")                = 0
open("/tmp/upper/bar", O_RDWR|O_CREAT|O_NOFOLLOW, 0600) = 4
unlink("/tmp/upper/bar")                = 0
write(4, "eCryptfs test\n"..., 14 <unfinished ...>
+++ killed by SIGKILL +++

https://bugs.launchpad.net/ecryptfs/+bug/387073

Reported-by: Loïc Minier <loic.minier@canonical.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: ecryptfs-devel@lists.launchpad.net
Cc: stable <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
15 years agoeCryptfs: Propagate vfs_read and vfs_write return codes
Tyler Hicks [Thu, 17 Sep 2009 00:04:20 +0000 (19:04 -0500)]
eCryptfs: Propagate vfs_read and vfs_write return codes

Errors returned from vfs_read() and vfs_write() calls to the lower
filesystem were being masked as -EINVAL.  This caused some confusion to
users who saw EINVAL instead of ENOSPC when the disk was full, for
instance.

Also, the actual bytes read or written were not accessible by callers to
ecryptfs_read_lower() and ecryptfs_write_lower(), which may be useful in
some cases.  This patch updates the error handling logic where those
functions are called in order to accept positive return codes indicating
success.

Cc: Eric Sandeen <esandeen@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: ecryptfs-devel@lists.launchpad.net
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
15 years agoeCryptfs: Validate global auth tok keys
Tyler Hicks [Wed, 26 Aug 2009 06:54:56 +0000 (01:54 -0500)]
eCryptfs: Validate global auth tok keys

When searching through the global authentication tokens for a given key
signature, verify that a matching key has not been revoked and has not
expired.  This allows the `keyctl revoke` command to be properly used on
keys in use by eCryptfs.

Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: ecryptfs-devel@lists.launchpad.net
Cc: stable <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
15 years agoeCryptfs: Filename encryption only supports password auth tokens
Tyler Hicks [Fri, 21 Aug 2009 09:27:46 +0000 (04:27 -0500)]
eCryptfs: Filename encryption only supports password auth tokens

Returns -ENOTSUPP when attempting to use filename encryption with
something other than a password authentication token, such as a private
token from openssl.  Using filename encryption with a userspace eCryptfs
key module is a future goal.  Until then, this patch handles the
situation a little better than simply using a BUG_ON().

Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: ecryptfs-devel@lists.launchpad.net
Cc: stable <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
15 years agoeCryptfs: Check for O_RDONLY lower inodes when opening lower files
Tyler Hicks [Wed, 12 Aug 2009 06:06:54 +0000 (01:06 -0500)]
eCryptfs: Check for O_RDONLY lower inodes when opening lower files

If the lower inode is read-only, don't attempt to open the lower file
read/write and don't hand off the open request to the privileged
eCryptfs kthread for opening it read/write.  Instead, only try an
unprivileged, read-only open of the file and give up if that fails.
This patch fixes an oops when eCryptfs is mounted on top of a read-only
mount.

Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Eric Sandeen <esandeen@redhat.com>
Cc: Dave Kleikamp <shaggy@linux.vnet.ibm.com>
Cc: ecryptfs-devel@lists.launchpad.net
Cc: stable <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
15 years agoeCryptfs: Handle unrecognized tag 3 cipher codes
Tyler Hicks [Tue, 11 Aug 2009 05:36:32 +0000 (00:36 -0500)]
eCryptfs: Handle unrecognized tag 3 cipher codes

Returns an error when an unrecognized cipher code is present in a tag 3
packet or an ecryptfs_crypt_stat cannot be initialized.  Also sets an
crypt_stat->tfm error pointer to NULL to ensure that it will not be
incorrectly freed in ecryptfs_destroy_crypt_stat().

Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: ecryptfs-devel@lists.launchpad.net
Cc: stable <stable@kernel.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
15 years agoecryptfs: improved dependency checking and reporting
Dave Hansen [Thu, 27 Aug 2009 16:47:07 +0000 (09:47 -0700)]
ecryptfs: improved dependency checking and reporting

So, I compiled a 2.6.31-rc5 kernel with ecryptfs and loaded its module.
When it came time to mount my filesystem, I got this in dmesg, and it
refused to mount:

[93577.776637] Unable to allocate crypto cipher with name [aes]; rc = [-2]
[93577.783280] Error attempting to initialize key TFM cipher with name = [aes]; rc = [-2]
[93577.791183] Error attempting to initialize cipher with name = [aes] and key size = [32]; rc = [-2]
[93577.800113] Error parsing options; rc = [-22]

I figured from the error message that I'd either forgotten to load "aes"
or that my key size was bogus.  Neither one of those was the case.  In
fact, I was missing the CRYPTO_ECB config option and the 'ecb' module.
Unfortunately, there's no trace of 'ecb' in that error message.

I've done two things to fix this.  First, I've modified ecryptfs's
Kconfig entry to select CRYPTO_ECB and CRYPTO_CBC.  I also took CRYPTO
out of the dependencies since the 'select' will take care of it for us.

I've also modified the error messages to print a string that should
contain both 'ecb' and 'aes' in my error case.  That will give any
future users a chance of finding the right modules and Kconfig options.

I also wonder if we should:

select CRYPTO_AES if !EMBEDDED

since I think most ecryptfs users are using AES like me.

Cc: ecryptfs-devel@lists.launchpad.net
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Dustin Kirkland <kirkland@canonical.com>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
[tyhicks@linux.vnet.ibm.com: Removed extra newline, 80-char violation]
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
15 years agoeCryptfs: Fix lockdep-reported AB-BA mutex issue
Roland Dreier [Wed, 1 Jul 2009 22:48:18 +0000 (15:48 -0700)]
eCryptfs: Fix lockdep-reported AB-BA mutex issue

Lockdep reports the following valid-looking possible AB-BA deadlock with
global_auth_tok_list_mutex and keysig_list_mutex:

  ecryptfs_new_file_context() ->
      ecryptfs_copy_mount_wide_sigs_to_inode_sigs() ->
          mutex_lock(&mount_crypt_stat->global_auth_tok_list_mutex);
          -> ecryptfs_add_keysig() ->
              mutex_lock(&crypt_stat->keysig_list_mutex);

vs

  ecryptfs_generate_key_packet_set() ->
      mutex_lock(&crypt_stat->keysig_list_mutex);
      -> ecryptfs_find_global_auth_tok_for_sig() ->
          mutex_lock(&mount_crypt_stat->global_auth_tok_list_mutex);

ie the two mutexes are taken in opposite orders in the two different
code paths.  I'm not sure if this is a real bug where two threads could
actually hit the two paths in parallel and deadlock, but it at least
makes lockdep impossible to use with ecryptfs since this report triggers
every time and disables future lockdep reporting.

Since ecryptfs_add_keysig() is called only from the single callsite in
ecryptfs_copy_mount_wide_sigs_to_inode_sigs(), the simplest fix seems to
be to move the lock of keysig_list_mutex back up outside of the where
global_auth_tok_list_mutex is taken.  This patch does that, and fixes
the lockdep report on my system (and ecryptfs still works OK).

The full output of lockdep fixed by this patch is:

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.31-2-generic #14~rbd2
-------------------------------------------------------
gdm/2640 is trying to acquire lock:
 (&mount_crypt_stat->global_auth_tok_list_mutex){+.+.+.}, at: [<ffffffff8121591e>] ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90

but task is already holding lock:
 (&crypt_stat->keysig_list_mutex){+.+.+.}, at: [<ffffffff81217728>] ecryptfs_generate_key_packet_set+0x58/0x2b0

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&crypt_stat->keysig_list_mutex){+.+.+.}:
       [<ffffffff8108c897>] check_prev_add+0x2a7/0x370
       [<ffffffff8108cfc1>] validate_chain+0x661/0x750
       [<ffffffff8108d2e7>] __lock_acquire+0x237/0x430
       [<ffffffff8108d585>] lock_acquire+0xa5/0x150
       [<ffffffff815526cd>] __mutex_lock_common+0x4d/0x3d0
       [<ffffffff81552b56>] mutex_lock_nested+0x46/0x60
       [<ffffffff8121526a>] ecryptfs_add_keysig+0x5a/0xb0
       [<ffffffff81213299>] ecryptfs_copy_mount_wide_sigs_to_inode_sigs+0x59/0xb0
       [<ffffffff81214b06>] ecryptfs_new_file_context+0xa6/0x1a0
       [<ffffffff8120e42a>] ecryptfs_initialize_file+0x4a/0x140
       [<ffffffff8120e54d>] ecryptfs_create+0x2d/0x60
       [<ffffffff8113a7d4>] vfs_create+0xb4/0xe0
       [<ffffffff8113a8c4>] __open_namei_create+0xc4/0x110
       [<ffffffff8113d1c1>] do_filp_open+0xa01/0xae0
       [<ffffffff8112d8d9>] do_sys_open+0x69/0x140
       [<ffffffff8112d9f0>] sys_open+0x20/0x30
       [<ffffffff81013132>] system_call_fastpath+0x16/0x1b
       [<ffffffffffffffff>] 0xffffffffffffffff

-> #0 (&mount_crypt_stat->global_auth_tok_list_mutex){+.+.+.}:
       [<ffffffff8108c675>] check_prev_add+0x85/0x370
       [<ffffffff8108cfc1>] validate_chain+0x661/0x750
       [<ffffffff8108d2e7>] __lock_acquire+0x237/0x430
       [<ffffffff8108d585>] lock_acquire+0xa5/0x150
       [<ffffffff815526cd>] __mutex_lock_common+0x4d/0x3d0
       [<ffffffff81552b56>] mutex_lock_nested+0x46/0x60
       [<ffffffff8121591e>] ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90
       [<ffffffff812177d5>] ecryptfs_generate_key_packet_set+0x105/0x2b0
       [<ffffffff81212f49>] ecryptfs_write_headers_virt+0xc9/0x120
       [<ffffffff8121306d>] ecryptfs_write_metadata+0xcd/0x200
       [<ffffffff8120e44b>] ecryptfs_initialize_file+0x6b/0x140
       [<ffffffff8120e54d>] ecryptfs_create+0x2d/0x60
       [<ffffffff8113a7d4>] vfs_create+0xb4/0xe0
       [<ffffffff8113a8c4>] __open_namei_create+0xc4/0x110
       [<ffffffff8113d1c1>] do_filp_open+0xa01/0xae0
       [<ffffffff8112d8d9>] do_sys_open+0x69/0x140
       [<ffffffff8112d9f0>] sys_open+0x20/0x30
       [<ffffffff81013132>] system_call_fastpath+0x16/0x1b
       [<ffffffffffffffff>] 0xffffffffffffffff

other info that might help us debug this:

2 locks held by gdm/2640:
 #0:  (&sb->s_type->i_mutex_key#11){+.+.+.}, at: [<ffffffff8113cb8b>] do_filp_open+0x3cb/0xae0
 #1:  (&crypt_stat->keysig_list_mutex){+.+.+.}, at: [<ffffffff81217728>] ecryptfs_generate_key_packet_set+0x58/0x2b0

stack backtrace:
Pid: 2640, comm: gdm Tainted: G         C 2.6.31-2-generic #14~rbd2
Call Trace:
 [<ffffffff8108b988>] print_circular_bug_tail+0xa8/0xf0
 [<ffffffff8108c675>] check_prev_add+0x85/0x370
 [<ffffffff81094912>] ? __module_text_address+0x12/0x60
 [<ffffffff8108cfc1>] validate_chain+0x661/0x750
 [<ffffffff81017275>] ? print_context_stack+0x85/0x140
 [<ffffffff81089c68>] ? find_usage_backwards+0x38/0x160
 [<ffffffff8108d2e7>] __lock_acquire+0x237/0x430
 [<ffffffff8108d585>] lock_acquire+0xa5/0x150
 [<ffffffff8121591e>] ? ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90
 [<ffffffff8108b0b0>] ? check_usage_backwards+0x0/0xb0
 [<ffffffff815526cd>] __mutex_lock_common+0x4d/0x3d0
 [<ffffffff8121591e>] ? ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90
 [<ffffffff8121591e>] ? ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90
 [<ffffffff8108c02c>] ? mark_held_locks+0x6c/0xa0
 [<ffffffff81125b0d>] ? kmem_cache_alloc+0xfd/0x1a0
 [<ffffffff8108c34d>] ? trace_hardirqs_on_caller+0x14d/0x190
 [<ffffffff81552b56>] mutex_lock_nested+0x46/0x60
 [<ffffffff8121591e>] ecryptfs_find_global_auth_tok_for_sig+0x2e/0x90
 [<ffffffff812177d5>] ecryptfs_generate_key_packet_set+0x105/0x2b0
 [<ffffffff81212f49>] ecryptfs_write_headers_virt+0xc9/0x120
 [<ffffffff8121306d>] ecryptfs_write_metadata+0xcd/0x200
 [<ffffffff81210240>] ? ecryptfs_init_persistent_file+0x60/0xe0
 [<ffffffff8120e44b>] ecryptfs_initialize_file+0x6b/0x140
 [<ffffffff8120e54d>] ecryptfs_create+0x2d/0x60
 [<ffffffff8113a7d4>] vfs_create+0xb4/0xe0
 [<ffffffff8113a8c4>] __open_namei_create+0xc4/0x110
 [<ffffffff8113d1c1>] do_filp_open+0xa01/0xae0
 [<ffffffff8129a93e>] ? _raw_spin_unlock+0x5e/0xb0
 [<ffffffff8155410b>] ? _spin_unlock+0x2b/0x40
 [<ffffffff81139e9b>] ? getname+0x3b/0x240
 [<ffffffff81148a5a>] ? alloc_fd+0xfa/0x140
 [<ffffffff8112d8d9>] do_sys_open+0x69/0x140
 [<ffffffff81553b8f>] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [<ffffffff8112d9f0>] sys_open+0x20/0x30
 [<ffffffff81013132>] system_call_fastpath+0x16/0x1b

Signed-off-by: Roland Dreier <rolandd@cisco.com>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
15 years agoecryptfs: Remove unneeded locking that triggers lockdep false positives
Roland Dreier [Tue, 14 Jul 2009 20:32:56 +0000 (13:32 -0700)]
ecryptfs: Remove unneeded locking that triggers lockdep false positives

In ecryptfs_destroy_inode(), inode_info->lower_file_mutex is locked,
and just after the mutex is unlocked, the code does:

  kmem_cache_free(ecryptfs_inode_info_cache, inode_info);

This means that if another context could possibly try to take the same
mutex as ecryptfs_destroy_inode(), then it could end up getting the
mutex just before the data structure containing the mutex is freed.
So any such use would be an obvious use-after-free bug (catchable with
slab poisoning or mutex debugging), and therefore the locking in
ecryptfs_destroy_inode() is not needed and can be dropped.

Similarly, in ecryptfs_destroy_crypt_stat(), crypt_stat->keysig_list_mutex
is locked, and then the mutex is unlocked just before the code does:

  memset(crypt_stat, 0, sizeof(struct ecryptfs_crypt_stat));

Therefore taking this mutex is similarly not necessary.

Removing this locking fixes false-positive lockdep reports such as the
following (and they are false-positives for exactly the same reason
that the locking is not needed):

=================================
[ INFO: inconsistent lock state ]
2.6.31-2-generic #14~rbd3
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
kswapd0/323 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (&inode_info->lower_file_mutex){+.+.?.}, at: [<ffffffff81210d34>] ecryptfs_destroy_inode+0x34/0x100
{RECLAIM_FS-ON-W} state was registered at:
  [<ffffffff8108c02c>] mark_held_locks+0x6c/0xa0
  [<ffffffff8108c10f>] lockdep_trace_alloc+0xaf/0xe0
  [<ffffffff81125a51>] kmem_cache_alloc+0x41/0x1a0
  [<ffffffff8113117a>] get_empty_filp+0x7a/0x1a0
  [<ffffffff8112dd46>] dentry_open+0x36/0xc0
  [<ffffffff8121a36c>] ecryptfs_privileged_open+0x5c/0x2e0
  [<ffffffff81210283>] ecryptfs_init_persistent_file+0xa3/0xe0
  [<ffffffff8120e838>] ecryptfs_lookup_and_interpose_lower+0x278/0x380
  [<ffffffff8120f97a>] ecryptfs_lookup+0x12a/0x250
  [<ffffffff8113930a>] real_lookup+0xea/0x160
  [<ffffffff8113afc8>] do_lookup+0xb8/0xf0
  [<ffffffff8113b518>] __link_path_walk+0x518/0x870
  [<ffffffff8113bd9c>] path_walk+0x5c/0xc0
  [<ffffffff8113be5b>] do_path_lookup+0x5b/0xa0
  [<ffffffff8113bfe7>] user_path_at+0x57/0xa0
  [<ffffffff811340dc>] vfs_fstatat+0x3c/0x80
  [<ffffffff8113424b>] vfs_stat+0x1b/0x20
  [<ffffffff81134274>] sys_newstat+0x24/0x50
  [<ffffffff81013132>] system_call_fastpath+0x16/0x1b
  [<ffffffffffffffff>] 0xffffffffffffffff
irq event stamp: 7811
hardirqs last  enabled at (7811): [<ffffffff810c037f>] call_rcu+0x5f/0x90
hardirqs last disabled at (7810): [<ffffffff810c0353>] call_rcu+0x33/0x90
softirqs last  enabled at (3764): [<ffffffff810631da>] __do_softirq+0x14a/0x220
softirqs last disabled at (3751): [<ffffffff8101440c>] call_softirq+0x1c/0x30

other info that might help us debug this:
2 locks held by kswapd0/323:
 #0:  (shrinker_rwsem){++++..}, at: [<ffffffff810f67ed>] shrink_slab+0x3d/0x190
 #1:  (&type->s_umount_key#35){.+.+..}, at: [<ffffffff811429a1>] prune_dcache+0xd1/0x1b0

stack backtrace:
Pid: 323, comm: kswapd0 Tainted: G         C 2.6.31-2-generic #14~rbd3
Call Trace:
 [<ffffffff8108ad6c>] print_usage_bug+0x18c/0x1a0
 [<ffffffff8108aff0>] ? check_usage_forwards+0x0/0xc0
 [<ffffffff8108bac2>] mark_lock_irq+0xf2/0x280
 [<ffffffff8108bd87>] mark_lock+0x137/0x1d0
 [<ffffffff81164710>] ? fsnotify_clear_marks_by_inode+0x30/0xf0
 [<ffffffff8108bee6>] mark_irqflags+0xc6/0x1a0
 [<ffffffff8108d337>] __lock_acquire+0x287/0x430
 [<ffffffff8108d585>] lock_acquire+0xa5/0x150
 [<ffffffff81210d34>] ? ecryptfs_destroy_inode+0x34/0x100
 [<ffffffff8108d2e7>] ? __lock_acquire+0x237/0x430
 [<ffffffff815526ad>] __mutex_lock_common+0x4d/0x3d0
 [<ffffffff81210d34>] ? ecryptfs_destroy_inode+0x34/0x100
 [<ffffffff81164710>] ? fsnotify_clear_marks_by_inode+0x30/0xf0
 [<ffffffff81210d34>] ? ecryptfs_destroy_inode+0x34/0x100
 [<ffffffff8129a91e>] ? _raw_spin_unlock+0x5e/0xb0
 [<ffffffff81552b36>] mutex_lock_nested+0x46/0x60
 [<ffffffff81210d34>] ecryptfs_destroy_inode+0x34/0x100
 [<ffffffff81145d27>] destroy_inode+0x87/0xd0
 [<ffffffff81146b4c>] generic_delete_inode+0x12c/0x1a0
 [<ffffffff81145832>] iput+0x62/0x70
 [<ffffffff811423c8>] dentry_iput+0x98/0x110
 [<ffffffff81142550>] d_kill+0x50/0x80
 [<ffffffff81142623>] prune_one_dentry+0xa3/0xc0
 [<ffffffff811428b1>] __shrink_dcache_sb+0x271/0x290
 [<ffffffff811429d9>] prune_dcache+0x109/0x1b0
 [<ffffffff81142abf>] shrink_dcache_memory+0x3f/0x50
 [<ffffffff810f68dd>] shrink_slab+0x12d/0x190
 [<ffffffff810f9377>] balance_pgdat+0x4d7/0x640
 [<ffffffff8104c4c0>] ? finish_task_switch+0x40/0x150
 [<ffffffff810f63c0>] ? isolate_pages_global+0x0/0x60
 [<ffffffff810f95f7>] kswapd+0x117/0x170
 [<ffffffff810777a0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff810f94e0>] ? kswapd+0x0/0x170
 [<ffffffff810773be>] kthread+0x9e/0xb0
 [<ffffffff8101430a>] child_rip+0xa/0x20
 [<ffffffff81013c90>] ? restore_args+0x0/0x30
 [<ffffffff81077320>] ? kthread+0x0/0xb0
 [<ffffffff81014300>] ? child_rip+0x0/0x20

Signed-off-by: Roland Dreier <roland@digitalvampire.org>
Signed-off-by: Tyler Hicks <tyhicks@linux.vnet.ibm.com>
15 years agoMerge branch 'drm-intel-next' of git://git.kernel.org/pub/scm/linux/kernel/git/anholt...
Linus Torvalds [Thu, 24 Sep 2009 17:30:41 +0000 (10:30 -0700)]
Merge branch 'drm-intel-next' of git://git./linux/kernel/git/anholt/drm-intel

* 'drm-intel-next' of git://git.kernel.org/pub/scm/linux/kernel/git/anholt/drm-intel: (57 commits)
  drm/i915: Handle ERESTARTSYS during page fault
  drm/i915: Warn before mmaping a purgeable buffer.
  drm/i915: Track purged state.
  drm/i915: Remove eviction debug spam
  drm/i915: Immediately discard any backing storage for uneeded objects
  drm/i915: Do not mis-classify clean objects as purgeable
  drm/i915: Whitespace correction for madv
  drm/i915: BUG_ON page refleak during unbind
  drm/i915: Search harder for a reusable object
  drm/i915: Clean up evict from list.
  drm/i915: Add tracepoints
  drm/i915: framebuffer compression for GM45+
  drm/i915: split display functions by chip type
  drm/i915: Skip the sanity checks if the current relocation is valid
  drm/i915: Check that the relocation points to within the target
  drm/i915: correct FBC update when pipe base update occurs
  drm/i915: blacklist Acer AspireOne lid status
  ACPI: make ACPI button funcs no-ops if not built in
  drm/i915: prevent FIFO calculation overflows on 32 bits with high dotclocks
  drm/i915: intel_display.c handle latency variable efficiently
  ...

Fix up trivial conflicts in drivers/gpu/drm/i915/{i915_dma.c|i915_drv.h}

15 years agoMerge branch 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes...
Linus Torvalds [Thu, 24 Sep 2009 16:57:08 +0000 (09:57 -0700)]
Merge branch 'linux-next' of git://git./linux/kernel/git/jbarnes/pci-2.6

* 'linux-next' of git://git.kernel.org/pub/scm/linux/kernel/git/jbarnes/pci-2.6: (21 commits)
  x86/PCI: make 32 bit NUMA node array int, not unsigned char
  x86/PCI: default pcibus cpumask to all cpus if it lacks affinity
  MAINTAINTERS: remove hotplug driver entries
  PCI: pciehp: remove slot capabilities definitions
  PCI: pciehp: remove error message definitions
  PCI: pciehp: remove number field
  PCI: pciehp: remove hpc_ops
  PCI: pciehp: remove pci_dev field
  PCI: pciehp: remove crit_sect mutex
  PCI: pciehp: remove slot_bus field
  PCI: pciehp: remove first_slot field
  PCI: pciehp: remove slot_device_offset field
  PCI: pciehp: remove hp_slot field
  PCI: pciehp: remove device field
  PCI: pciehp: remove bus field
  PCI: pciehp: remove slot_num_inc field
  PCI: pciehp: remove num_slots field
  PCI: pciehp: remove slot_list field
  PCI: fix VGA arbiter header file
  PCI: Disable AER with pci=nomsi
  ...

Fixed up trivial conflicts in MAINTAINERS

15 years agoMerge branch 'cputime' of git://git390.marist.edu/pub/scm/linux-2.6
Linus Torvalds [Thu, 24 Sep 2009 16:04:24 +0000 (09:04 -0700)]
Merge branch 'cputime' of git://git390.marist.edu/linux-2.6

* 'cputime' of git://git390.marist.edu/pub/scm/linux-2.6:
  [PATCH] Fix idle time field in /proc/uptime

15 years agoMerge branch 'for-linus' of git://git.monstr.eu/linux-2.6-microblaze
Linus Torvalds [Thu, 24 Sep 2009 16:01:44 +0000 (09:01 -0700)]
Merge branch 'for-linus' of git://git.monstr.eu/linux-2.6-microblaze

* 'for-linus' of git://git.monstr.eu/linux-2.6-microblaze: (24 commits)
  microblaze: Disable heartbeat/enable emaclite in defconfigs
  microblaze: Support simpleImage.dts make target
  microblaze: Fix _start symbol to physical address
  microblaze: Use LOAD_OFFSET macro to get correct LMA for all sections
  microblaze: Create the LOAD_OFFSET macro used to compute VMA vs LMA offsets
  microblaze: Copy ppc asm-compat.h for clean handling of constants in asm and C
  microblaze: Actually show KiB rather than pages in "Freeing initrd memory:"
  microblaze: Support ptrace syscall tracing.
  microblaze: Updated CPU version and FPGA family codes in PVR
  microblaze: Generate correct signal and siginfo for integer div-by-zero
  microblaze: Don't be noisy when userspace causes hardware exceptions
  microblaze: Remove ipc.h file which points to non-existing asm-generic file
  microblaze: Clear sticky FSR register after generating exception signals
  microblaze: Ensure CPU usermode is set on new userspace processes
  microblaze: Use correct kbuild variable KBUILD_CFLAGS
  microblaze: Save and restore msr in hw exception
  microblaze: Add architectural support for USB EHCI host controllers
  microblaze: Implement include/asm/syscall.h.
  microblaze: Improve checking mechanism for MSR instruction
  microblaze: Add checking mechanism for MSR instruction
  ...

15 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus
Linus Torvalds [Thu, 24 Sep 2009 16:01:05 +0000 (09:01 -0700)]
Merge git://git./linux/kernel/git/rusty/linux-2.6-for-linus

* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus:
  module: don't call percpu_modfree on NULL pointer.
  module: fix memory leak when load fails after srcversion/version allocated
  module: preferred way to use MODULE_AUTHOR
  param: allow whitespace as kernel parameter separator
  module: reduce string table for loaded modules (v2)
  module: reduce symbol table for loaded modules (v2)

15 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs...
Linus Torvalds [Thu, 24 Sep 2009 15:57:29 +0000 (08:57 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/mason/btrfs-unstable

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-unstable: (42 commits)
  Btrfs: hash the btree inode during  fill_super
  Btrfs: relocate file extents in clusters
  Btrfs: don't rename file into dummy directory
  Btrfs: check size of inode backref before adding hardlink
  Btrfs: fix releasepage to avoid unlocking extents we haven't locked
  Btrfs: Fix test_range_bit for whole file extents
  Btrfs: fix errors handling cached state in set/clear_extent_bit
  Btrfs: fix early enospc during balancing
  Btrfs: deal with NULL space info
  Btrfs: account for space used by the super mirrors
  Btrfs: fix extent entry threshold calculation
  Btrfs: remove dead code
  Btrfs: fix bitmap size tracking
  Btrfs: don't keep retrying a block group if we fail to allocate a cluster
  Btrfs: make balance code choose more wisely when relocating
  Btrfs: fix arithmetic error in clone ioctl
  Btrfs: add snapshot/subvolume destroy ioctl
  Btrfs: change how subvolumes are organized
  Btrfs: do not reuse objectid of deleted snapshot/subvol
  Btrfs: speed up snapshot dropping
  ...

15 years agoMerge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
Linus Torvalds [Thu, 24 Sep 2009 15:32:11 +0000 (08:32 -0700)]
Merge branch 'for-linus' of git://git./linux/kernel/git/viro/vfs-2.6

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
  truncate: use new helpers
  truncate: new helpers
  fs: fix overflow in sys_mount() for in-kernel calls
  fs: Make unload_nls() NULL pointer safe
  freeze_bdev: grab active reference to frozen superblocks
  freeze_bdev: kill bd_mount_sem
  exofs: remove BKL from super operations
  fs/romfs: correct error-handling code
  vfs: seq_file: add helpers for data filling
  vfs: remove redundant position check in do_sendfile
  vfs: change sb->s_maxbytes to a loff_t
  vfs: explicitly cast s_maxbytes in fiemap_check_ranges
  libfs: return error code on failed attr set
  seq_file: return a negative error code when seq_path_root() fails.
  vfs: optimize touch_time() too
  vfs: optimization for touch_atime()
  vfs: split generic_forget_inode() so that hugetlbfs does not have to copy it
  fs/inode.c: add dev-id and inode number for debugging in init_special_inode()
  libfs: make simple_read_from_buffer conventional

15 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current
Linus Torvalds [Thu, 24 Sep 2009 15:31:04 +0000 (08:31 -0700)]
Merge git://git./linux/kernel/git/viro/audit-current

* git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current:
  lsm: Use a compressed IPv6 string format in audit events
  Audit: send signal info if selinux is disabled
  Audit: rearrange audit_context to save 16 bytes per struct
  Audit: reorganize struct audit_watch to save 8 bytes

15 years agomodule: don't call percpu_modfree on NULL pointer.
Rusty Russell [Fri, 25 Sep 2009 06:32:59 +0000 (00:32 -0600)]
module: don't call percpu_modfree on NULL pointer.

The general one handles NULL, the static obsolescent
(CONFIG_HAVE_LEGACY_PER_CPU_AREA) one in module.c doesn't; Eric's
commit 720eba31 assumed it did, and various frobbings since then kept
that assumption.

All other callers in module.c all protect it with an if; this effectively
does the same as free_init is only goto if we fail percpu_modalloc().

Reported-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Eric Dumazet <dada1@cosmosbay.com>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Américo Wang <xiyou.wangcong@gmail.com>
Tested-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
15 years agomodule: fix memory leak when load fails after srcversion/version allocated
Rusty Russell [Fri, 25 Sep 2009 06:32:58 +0000 (00:32 -0600)]
module: fix memory leak when load fails after srcversion/version allocated

Normally the twisty paths of sysfs will free the attributes, but not if
we fail before we hook it into sysfs (which is the last thing we do in
load_module).

(This sysfs code is a turd, no doubt there are other issues lurking too).

Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Tested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
15 years agomodule: preferred way to use MODULE_AUTHOR
Johannes Berg [Fri, 25 Sep 2009 06:32:58 +0000 (00:32 -0600)]
module: preferred way to use MODULE_AUTHOR

For the longest time now we've been using multiple MODULE_AUTHOR()
statements when a module has more than one author, but the comment here
disagrees.

Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Luciano Coelho <luciano.coelho@nokia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
15 years agoparam: allow whitespace as kernel parameter separator
Peter Oberparleiter [Mon, 6 Jul 2009 15:11:22 +0000 (17:11 +0200)]
param: allow whitespace as kernel parameter separator

Some boot mechanisms require that kernel parameters are stored in a
separate file which is loaded to memory without further processing
(e.g. the "Load from FTP" method on s390). When such a file contains
newline characters, the kernel parameter preceding the newline might
not be correctly parsed (due to the newline being stuck to the end of
the actual parameter value) which can lead to boot failures.

This patch improves kernel command line usability in such a situation
by allowing generic whitespace characters as separators between kernel
parameters.

Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
15 years agomodule: reduce string table for loaded modules (v2)
Jan Beulich [Mon, 6 Jul 2009 13:51:44 +0000 (14:51 +0100)]
module: reduce string table for loaded modules (v2)

Also remove all parts of the string table (referenced by the symbol
table) that are not needed for kallsyms use (i.e. which were only
referenced by symbols discarded by the previous patch, or not
referenced at all for whatever reason).

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
15 years agomodule: reduce symbol table for loaded modules (v2)
Jan Beulich [Mon, 6 Jul 2009 13:50:42 +0000 (14:50 +0100)]
module: reduce symbol table for loaded modules (v2)

Discard all symbols not interesting for kallsyms use: absolute,
section, and in the common case (!KALLSYMS_ALL) data ones.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
15 years agoMerge branch 'for-linus' of git://neil.brown.name/md
Linus Torvalds [Thu, 24 Sep 2009 14:55:29 +0000 (07:55 -0700)]
Merge branch 'for-linus' of git://neil.brown.name/md

* 'for-linus' of git://neil.brown.name/md: (97 commits)
  md: raid-1/10: fix RW bits manipulation
  md: remove unnecessary memset from multipath.
  md: report device as congested when suspended
  md: Improve name of threads created by md_register_thread
  md: remove sparse warnings about lock context.
  md: remove sparse waring "symbol xxx shadows an earlier one"
  async_tx/raid6: add missing dma_unmap calls to the async fail case
  ioat3: fix uninitialized var warnings
  drivers/dma/ioat/dma_v2.c: fix warnings
  raid6test: fix stack overflow
  ioat2: clarify ring size limits
  md/raid6: cleanup ops_run_compute6_2
  md/raid6: eliminate BUG_ON with side effect
  dca: module load should not be an error message
  ioat: driver version 4.0
  dca: registering requesters in multiple dca domains
  async_tx: remove HIGHMEM64G restriction
  dmaengine: sh: Add Support SuperH DMA Engine driver
  dmaengine: Move all map_sg/unmap_sg for slave channel to its client
  fsldma: Add DMA_SLAVE support
  ...

15 years agoMerge branch 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab...
Linus Torvalds [Thu, 24 Sep 2009 14:54:16 +0000 (07:54 -0700)]
Merge branch 'for_linus' of git://git./linux/kernel/git/mchehab/linux-2.6

* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-2.6:
  V4L/DVB (13039): dib0700: not building CONFIG_DVB_TUNER_DIB0070 breaks compilation
  V4L/DVB (13038): dvbdev: Remove an anoying/uneeded warning
  V4L/DVB (13037): go7007: Revert compatibility code added at the wrong place
  media: video: Fix build in saa7164

15 years agoMerge branch 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux...
Linus Torvalds [Thu, 24 Sep 2009 14:53:22 +0000 (07:53 -0700)]
Merge branch 'hwpoison' of git://git./linux/kernel/git/ak/linux-mce-2.6

* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
  HWPOISON: Enable error_remove_page on btrfs
  HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
  HWPOISON: Add madvise() based injector for hardware poisoned pages v4
  HWPOISON: Enable error_remove_page for NFS
  HWPOISON: Enable .remove_error_page for migration aware file systems
  HWPOISON: The high level memory error handler in the VM v7
  HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
  HWPOISON: shmem: call set_page_dirty() with locked page
  HWPOISON: Define a new error_remove_page address space op for async truncation
  HWPOISON: Add invalidate_inode_page
  HWPOISON: Refactor truncate to allow direct truncating of page v2
  HWPOISON: check and isolate corrupted free pages v2
  HWPOISON: Handle hardware poisoned pages in try_to_unmap
  HWPOISON: Use bitmask/action code for try_to_unmap behaviour
  HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
  HWPOISON: Add poison check to page fault handling
  HWPOISON: Add basic support for poisoned pages in fault handler v3
  HWPOISON: Add new SIGBUS error codes for hardware poison signals
  HWPOISON: Add support for poison swap entries v2
  HWPOISON: Export some rmap vma locking to outside world
  ...

15 years agodrivers/usb/serial/sierra.c: fix CONFIG_PM=n build
Andrew Morton [Wed, 23 Sep 2009 22:57:43 +0000 (15:57 -0700)]
drivers/usb/serial/sierra.c: fix CONFIG_PM=n build

drivers/usb/serial/sierra.c: In function 'sierra_suspend':
drivers/usb/serial/sierra.c:936: error: 'struct usb_device' has no member named 'auto_pm'

Repairs

commit e6929a9020acbeb04d9a3ad9a88234c15be808fd
Author: Oliver Neukum <oliver@neukum.org>
Date:   Fri Sep 4 23:19:53 2009 +0200

    USB: support for autosuspend in sierra while online

Cc: Greg KH <greg@kroah.com>
Cc: Oliver Neukum <oliver@neukum.org>
Cc: Elina Pasheva <epasheva@sierrawireless.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoalpha: AGP update (fixes compile failure)
Ivan Kokshaysky [Wed, 23 Sep 2009 22:57:42 +0000 (15:57 -0700)]
alpha: AGP update (fixes compile failure)

This brings Alpha AGP platforms in sync with the change to struct
agp_memory (unsigned long *memory => struct page **pages).

Only compile tested (I don't have titan/marvel hardware), but this change
looks pretty straightforward, so hopefully it's ok.

Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Dave Airlie <airlied@linux.ie>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agotask_struct cleanup: move binfmt field to mm_struct
Hiroshi Shimamoto [Wed, 23 Sep 2009 22:57:41 +0000 (15:57 -0700)]
task_struct cleanup: move binfmt field to mm_struct

Because the binfmt is not different between threads in the same process,
it can be moved from task_struct to mm_struct.  And binfmt moudle is
handled per mm_struct instead of task_struct.

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoinclude/linux/unaligned/{l,b}e_byteshift.h: fix usage for compressed kernels
Albin Tonnerre [Wed, 23 Sep 2009 22:57:38 +0000 (15:57 -0700)]
include/linux/unaligned/{l,b}e_byteshift.h: fix usage for compressed kernels

When unaligned accesses are required for uncompressing a kernel (such as
for LZO decompression on ARM in a patch that follows), including
<linux/kernel.h> causes issues as it brings in a lot of things that are
not available in the decompression environment.

linux/kernel.h brings at least:
extern int console_printk[];
extern const char hex_asc[];
which causes errors at link-time as they are not available when
compiling the pre-boot environement. There are also a few others:

  arch/arm/boot/compressed/misc.o: In function `valid_user_regs':
   arch/arm/include/asm/ptrace.h:158: undefined reference to `elf_hwcap'
  arch/arm/boot/compressed/misc.o: In function `console_silent':
   include/linux/kernel.h:292: undefined reference to `console_printk'
  arch/arm/boot/compressed/misc.o: In function `console_verbose':
   include/linux/kernel.h:297: undefined reference to `console_printk'
  arch/arm/boot/compressed/misc.o: In function `pack_hex_byte':
   include/linux/kernel.h:360: undefined reference to `hex_asc'
  arch/arm/boot/compressed/misc.o: In function `hweight_long':
   include/linux/bitops.h:45: undefined reference to `hweight32'
  arch/arm/boot/compressed/misc.o: In function `__cmpxchg_local_generic':
   include/asm-generic/cmpxchg-local.h:21: undefined reference to `wrong_size_cmpxchg'
   include/asm-generic/cmpxchg-local.h:42: undefined reference to `wrong_size_cmpxchg'
  arch/arm/boot/compressed/misc.o: In function `__xchg':
   arch/arm/include/asm/system.h:309: undefined reference to `__bad_xchg'

However, those files apparently use nothing from <linux/kernel.h>, all
they need is the declaration of types such as u32 or u64, so
<linux/types.h> should be enough

Signed-off-by: Albin Tonnerre <albin.tonnerre@free-electrons.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Russell King <rmk@arm.linux.org.uk>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Phillip Lougher <phillip@lougher.demon.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agolzma/gzip: fix potential oops when input data is truncated
Phillip Lougher [Wed, 23 Sep 2009 22:57:37 +0000 (15:57 -0700)]
lzma/gzip: fix potential oops when input data is truncated

If the lzma/gzip decompressors are called with insufficient input data
(len > 0 & fill = NULL), they will attempt to call the fill function to
obtain more data, leading to a kernel oops.

Signed-off-by: Phillip Lougher <phillip@lougher.demon.co.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agodrivers/vlynq/vlynq.c: fix resource size off by 1 error
Julia Lawall [Wed, 23 Sep 2009 22:57:36 +0000 (15:57 -0700)]
drivers/vlynq/vlynq.c: fix resource size off by 1 error

In this case, the calls to request_mem_region, ioremap, and
release_mem_region all have a consistent length argument, len, but since
in other files (res->end - res->start) + 1, equivalent to
resource_size(res), is used for a resource-typed structure res, one could
consider whether the same should be done here.

The problem was found using the following semantic patch:
(http://www.emn.fr/x-info/coccinelle/)

// <smpl>
@@
struct resource *res;
@@

- (res->end - res->start) + 1
+ resource_size(res)

@@
struct resource *res;
@@

- res->end - res->start
+ BAD(resource_size(res))
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Acked-by: Florian Fainelli <florian@openwrt.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agofs/romfs: correct error-handling code
Julia Lawall [Wed, 23 Sep 2009 22:57:35 +0000 (15:57 -0700)]
fs/romfs: correct error-handling code

romfs_iget returns an ERR_PTR value in an error case instead of NULL.

A simplified version of the semantic match that finds this problem is as
follows: (http://coccinelle.lip6.fr/)

// <smpl>
@match exists@
expression x, E;
statement S1, S2;
@@

x = romfs_iget(...)
... when != x = E
(
*  if (x == NULL || ...) S1 else S2
|
*  if (x == NULL && ...) S1 else S2
)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agogru: allocation may fail in quicktest1()
Roel Kluin [Wed, 23 Sep 2009 22:57:33 +0000 (15:57 -0700)]
gru: allocation may fail in quicktest1()

The allocation may fail.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Acked-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agogru: use proc_create()
Alexey Dobriyan [Wed, 23 Sep 2009 22:57:33 +0000 (15:57 -0700)]
gru: use proc_create()

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Jack Steiner <steiner@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoaio: ifdef fields in mm_struct
Alexey Dobriyan [Wed, 23 Sep 2009 22:57:32 +0000 (15:57 -0700)]
aio: ifdef fields in mm_struct

->ioctx_lock and ->ioctx_list are used only under CONFIG_AIO.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Zach Brown <zach.brown@oracle.com>
Cc: Benjamin LaHaise <bcrl@kvack.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemstick: move dev_dbg
Jiri Slaby [Wed, 23 Sep 2009 22:57:31 +0000 (15:57 -0700)]
memstick: move dev_dbg

id_reg.if_mode might be unitialized when (*mrq)->error is nonzero.  move
dev_dbg() inside the if so that we are sure we can use id_reg values.

Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
Cc: Alex Dubov <oakad@yahoo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoadfs: remove redundant test on unsigned
Roel Kluin [Wed, 23 Sep 2009 22:57:30 +0000 (15:57 -0700)]
adfs: remove redundant test on unsigned

unsigned block cannot be less than 0.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoedac: core: remove completion-wait for complete with rcu_barrier
Jesper Dangaard Brouer [Wed, 23 Sep 2009 22:57:29 +0000 (15:57 -0700)]
edac: core: remove completion-wait for complete with rcu_barrier

Module edac_core.ko uses call_rcu() callbacks in edac_device.c, edac_mc.c
and edac_pci.c.

They all use a wait_for_completion() scheme, but this scheme it not 100%
safe on multiple CPUs.  See the _rcu_barrier() implementation which
explains why extra precausion is needed.

The patch adds a comment about rcu_barrier() and as a precausion calls
rcu_barrier().  A maintainer needs to look at removing the
wait_for_completion code.

[dougthompson@xmission.com: remove the wait_for_completion code]
Signed-off-by Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoedac: i3200 memory controller driver
Jason Uhlenkott [Wed, 23 Sep 2009 22:57:27 +0000 (15:57 -0700)]
edac: i3200 memory controller driver

A driver for the Intel 3200 and 3210 memory controllers.  It has only had
light testing so far, and currently makes no attempt to decode error
addresses at anything finer than csrow granularity.

Signed-off-by: Jason Uhlenkott <juhlenko@akamai.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoedac: fix resource size calculation
Julia Lawall [Wed, 23 Sep 2009 22:57:26 +0000 (15:57 -0700)]
edac: fix resource size calculation

Use the function resource_size, which reduces the chance of introducing
off-by-one errors in calculating the resource size.

The semantic patch that makes this change is as follows:
(http://www.emn.fr/x-info/coccinelle/)

// <smpl>
@@
struct resource *res;
@@

- (res->end - res->start) + 1
+ resource_size(res)
// </smpl>

Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoedac: mpc85xx add mpc83xx support
Ira W. Snyder [Wed, 23 Sep 2009 22:57:25 +0000 (15:57 -0700)]
edac: mpc85xx add mpc83xx support

Add support for the Freescale MPC83xx memory controller to the existing
driver for the Freescale MPC85xx memory controller.  The only difference
between the two processors are in the CS_BNDS register parsing code, which
has been changed so it will work on both processors.

The L2 cache controller does not exist on the MPC83xx, but the OF
subsystem will not use the driver if the device is not present in the OF
device tree.

I had to change the nr_pages calculation to make the math work out.  I
checked it on my board and did the math by hand for a 64GB 85xx using 64K
pages.  In both cases, nr_pages * PAGE_SIZE comes out to the correct
value.

Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Cc: Kumar Gala <galak@gate.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoedac: mpc85xx add P2020DS support
Yang Shi [Wed, 23 Sep 2009 22:57:24 +0000 (15:57 -0700)]
edac: mpc85xx add P2020DS support

Based on Kumar's new compatible types patch, add P2020 into MPC85xx EDAC
compatible lists so that EDAC can recognize P2020 meomry controller and L2
cache controller and export the relevant fields to sysfs.

EDAC MPC85xx DDR3 support is needed if DDR3 memory stick is installed on a
P2020DS board so that EDAC core can recognize DDR3 memory type.

Signed-off-by: Yang Shi <yang.shi@windriver.com>
Acked-by: Dave Jiang <djiang@mvista.com>
Signed-off-by: Doug Thompson <dougthompson@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agolinux/futex.h: place kernel types behind __KERNEL__
Mike Frysinger [Wed, 23 Sep 2009 22:57:23 +0000 (15:57 -0700)]
linux/futex.h: place kernel types behind __KERNEL__

The forward decls for some kernel types are only needed by the code behind
__KERNEL__, so don't bleed these types to userspace.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agopidns: deny CLONE_PARENT|CLONE_NEWPID combination
Sukadev Bhattiprolu [Wed, 23 Sep 2009 22:57:22 +0000 (15:57 -0700)]
pidns: deny CLONE_PARENT|CLONE_NEWPID combination

CLONE_PARENT was used to implement an older threading model.  For
consistency with the CLONE_THREAD check in copy_pid_ns(), disable
CLONE_PARENT with CLONE_NEWPID, at least until the required semantics of
pid namespaces are clear.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Roland McGrath <roland@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Cc: Oren Laadan <orenl@cs.columbia.edu>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agofork(): disable CLONE_PARENT for init
Sukadev Bhattiprolu [Wed, 23 Sep 2009 22:57:20 +0000 (15:57 -0700)]
fork(): disable CLONE_PARENT for init

When global or container-init processes use CLONE_PARENT, they create a
multi-rooted process tree.  Besides siblings of global init remain as
zombies on exit since they are not reaped by their parent (swapper).  So
prevent global and container-inits from creating siblings.

Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Oren Laadan <orenl@cs.columbia.edu>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agosysctl: remove "struct file *" argument of ->proc_handler
Alexey Dobriyan [Wed, 23 Sep 2009 22:57:19 +0000 (15:57 -0700)]
sysctl: remove "struct file *" argument of ->proc_handler

It's unused.

It isn't needed -- read or write flag is already passed and sysctl
shouldn't care about the rest.

It _was_ used in two places at arch/frv for some reason.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoMAINTAINERS: add Matt Mackall and Herbert Xu to HARDWARE RANDOM NUMBER GENERATOR
Joe Perches [Wed, 23 Sep 2009 22:57:17 +0000 (15:57 -0700)]
MAINTAINERS: add Matt Mackall and Herbert Xu to HARDWARE RANDOM NUMBER GENERATOR

Signed-off-by: Joe Perches <joe@perches.com>
Cc: David Daney <ddaney@caviumnetworks.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agobfin-otp: add writing support
Mike Frysinger [Wed, 23 Sep 2009 22:57:16 +0000 (15:57 -0700)]
bfin-otp: add writing support

The on-chip OTP may be written at runtime, so enable support for it in the
driver.  However, since writing should really be done only on development
systems, don't bend over backwards to make sure the simple software lock
is per-fd -- per-device is OK.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Signed-off-by: Bryan Wu <cooloney@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agodrivers/char/uv_mmtimer.c: add memory mapped RTC driver for UV
Dimitri Sivanich [Wed, 23 Sep 2009 22:57:15 +0000 (15:57 -0700)]
drivers/char/uv_mmtimer.c: add memory mapped RTC driver for UV

This driver memory maps the UV Hub RTC.

Signed-off-by: Dimitri Sivanich <sivanich@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agodrivers/char/rio/rioctrl.c: off by one error in rioctrl.c
Dan Carpenter [Wed, 23 Sep 2009 22:57:14 +0000 (15:57 -0700)]
drivers/char/rio/rioctrl.c: off by one error in rioctrl.c

If DownLoad.ProductCode == MAX_PRODUCT, that would be a problem when we do
RIOBootTable[DownLoad.ProductCode] a couple lines down.

Found by smatch (http://repo.or.cz/w/smatch.git).

Signed-off-by: Dan Carpenter <error27@gmail.com>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agohpet: hpet driver periodic timer setup bug fixes
Nils Carlson [Wed, 23 Sep 2009 22:57:13 +0000 (15:57 -0700)]
hpet: hpet driver periodic timer setup bug fixes

The periodic interrupt from drivers/char/hpet.c does not work correctly,
both when using the periodic capability of the hardware and while
emulating the periodic interrupt (when hardware does not support periodic
mode).

With timers capable of periodic interrupts, the comparator field is first
set with the period value followed by set of hidden accumulator, which has
the side effect of overwriting the comparator value.  This results in
wrong periodicity for the interrupts.  For, periodic interrupts to work,
following steps are necessary, in that order.

* Set config with Tn_VAL_SET_CNF bit

* Write to hidden accumulator, the value written is the time when the
  first interrupt should be generated

* Write compartor with period interval for subsequent interrupts
  (http://www.intel.com/hardwaredesign/hpetspec_1.pdf )

When emulating periodic timer with timers not capable of periodic
interrupt, driver is adding the period to counter value instead of
comparator value, which causes slow drift when using this emulation.

Also, driver seems to add hpetp->hp_delta both while setting up periodic
interrupt and while emulating periodic interrupts with timers not capable
of doing periodic interrupts.  This hp_delta will result in slower than
expected interrupt rate and should not be used while setting the interval.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Nils Carlson <nils.carlson@ericsson.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomwave: fix read buffer overflow
Roel Kluin [Wed, 23 Sep 2009 22:57:11 +0000 (15:57 -0700)]
mwave: fix read buffer overflow

Check whether index is within bounds before grabbing the element.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agofs/char_dev.c: remove useless loop
Renzo Davoli [Wed, 23 Sep 2009 22:57:10 +0000 (15:57 -0700)]
fs/char_dev.c: remove useless loop

There are two useless lines in fs/char_dev.c.

In register_chrdev there is a loop to change all '/' into '!' in the
kernel object name.
This code is useless as the same substitution is in kobject_set_name_vargs in
lib/kobject.c:
228         /* ewww... some of these buggers have '/' in the name ... */
229         while ((s = strchr(kobj->name, '/')))
230                 s[0] = '!';

kobject_set_name_vargs is called by kobject_set_name.
kobject_set_name is called just above the useless loop.

[hidave.darkstar@gmail.com: fix warning, remove the unused char *s]
Signed-off-by: Renzo Davoli <renzo@cs.unibo.it>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years ago/dev/zero: avoid repeated access_ok() checks
Nikanth Karthikesan [Wed, 23 Sep 2009 22:57:09 +0000 (15:57 -0700)]
/dev/zero: avoid repeated access_ok() checks

In read_zero, we check for access_ok() once for the count bytes.  It is
unnecessarily checked again in clear_user.  Use __clear_user, which does
not check for access_ok().

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoflat: use IS_ERR_VALUE() helper macro
Mike Frysinger [Wed, 23 Sep 2009 22:57:07 +0000 (15:57 -0700)]
flat: use IS_ERR_VALUE() helper macro

There is a common macro now for testing mixed pointer/errno values, so use
that rather than handling the casts ourself.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Acked-by: David McCullough <david_mccullough@securecomputing.com>
Acked-by: Greg Ungerer <gerg@uclinux.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agofdpic: ignore the loader's PT_GNU_STACK when calculating the stack size
David Howells [Wed, 23 Sep 2009 22:57:06 +0000 (15:57 -0700)]
fdpic: ignore the loader's PT_GNU_STACK when calculating the stack size

Ignore the loader's PT_GNU_STACK when calculating the stack size, and only
consider the executable's PT_GNU_STACK, assuming the executable has one.

Currently the behaviour is to take the largest stack size and use that,
but that means you can't reduce the stack size in the executable.  The
loader's stack size should probably only be used when executing the loader
directly.

WARNING: This patch is slightly dangerous - it may render a system
inoperable if the loader's stack size is larger than that of important
executables, and the system relies unknowingly on this increasing the size
of the stack.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Mike Frysinger <vapier@gentoo.org>
Acked-by: Paul Mundt <lethal@linux-sh.org>
Cc: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoelf: clean up fill_note_info()
Amerigo Wang [Wed, 23 Sep 2009 22:57:05 +0000 (15:57 -0700)]
elf: clean up fill_note_info()

Introduce a helper function elf_note_info_init() to help fill_note_info()
to do initializations, also fix the potential memory leaks.

[akpm@linux-foundation.org: remove NUM_NOTES]
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: David Howells <dhowells@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agosignals: inline __fatal_signal_pending
Roland McGrath [Wed, 23 Sep 2009 22:57:04 +0000 (15:57 -0700)]
signals: inline __fatal_signal_pending

__fatal_signal_pending inlines to one instruction on x86, probably two
instructions on other machines.  It takes two longer x86 instructions just
to call it and test its return value, not to mention the function itself.

On my random x86_64 config, this saved 70 bytes of text (59 of those being
__fatal_signal_pending itself).

Signed-off-by: Roland McGrath <roland@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agofcntl: add F_[SG]ETOWN_EX
Peter Zijlstra [Wed, 23 Sep 2009 22:57:03 +0000 (15:57 -0700)]
fcntl: add F_[SG]ETOWN_EX

In order to direct the SIGIO signal to a particular thread of a
multi-threaded application we cannot, like suggested by the manpage, put a
TID into the regular fcntl(F_SETOWN) call.  It will still be send to the
whole process of which that thread is part.

Since people do want to properly direct SIGIO we introduce F_SETOWN_EX.

The need to direct SIGIO comes from self-monitoring profiling such as with
perf-counters.  Perf-counters uses SIGIO to notify that new sample data is
available.  If the signal is delivered to the same task that generated the
new sample it can augment that data by inspecting the task's user-space
state right after it returns from the kernel.  This is esp.  convenient
for interpreted or virtual machine driven environments.

Both F_SETOWN_EX and F_GETOWN_EX take a pointer to a struct f_owner_ex
as argument:

struct f_owner_ex {
int   type;
pid_t pid;
};

Where type is one of F_OWNER_TID, F_OWNER_PID or F_OWNER_GID.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: stephane eranian <eranian@googlemail.com>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agosignals: send_sigio: use do_send_sig_info() to avoid check_kill_permission()
Oleg Nesterov [Wed, 23 Sep 2009 22:57:01 +0000 (15:57 -0700)]
signals: send_sigio: use do_send_sig_info() to avoid check_kill_permission()

group_send_sig_info()->check_kill_permission() assumes that current is the
sender and uses current_cred().

This is not true in send_sigio_to_task() case.  From the security pov the
sender is not current, but the task which did fcntl(F_SETOWN), that is why
we have sigio_perm() which uses the right creds to check.

Fortunately, send_sigio() always sends either SEND_SIG_PRIV or
SI_FROMKERNEL() signal, so check_kill_permission() does nothing.  But
still it would be tidier to avoid this bogus security check and save a
couple of cycles.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stephane eranian <eranian@googlemail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agosignals: introduce do_send_sig_info() helper
Oleg Nesterov [Wed, 23 Sep 2009 22:57:00 +0000 (15:57 -0700)]
signals: introduce do_send_sig_info() helper

Introduce do_send_sig_info() and convert group_send_sig_info(),
send_sig_info(), do_send_specific() to use this helper.

Hopefully it will have more users soon, it allows to specify
specific/group behaviour via "bool group" argument.

Shaves 80 bytes from .text.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: stephane eranian <eranian@googlemail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoexec: fix set_binfmt() vs sys_delete_module() race
Oleg Nesterov [Wed, 23 Sep 2009 22:56:59 +0000 (15:56 -0700)]
exec: fix set_binfmt() vs sys_delete_module() race

sys_delete_module() can set MODULE_STATE_GOING after
search_binary_handler() does try_module_get().  In this case
set_binfmt()->try_module_get() fails but since none of the callers
check the returned error, the task will run with the wrong old
->binfmt.

The proper fix should change all ->load_binary() methods, but we can
rely on fact that the caller must hold a reference to binfmt->module
and use __module_get() which never fails.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoexec: allow do_coredump() to wait for user space pipe readers to complete
Neil Horman [Wed, 23 Sep 2009 22:56:58 +0000 (15:56 -0700)]
exec: allow do_coredump() to wait for user space pipe readers to complete

Allow core_pattern pipes to wait for user space to complete

One of the things that user space processes like to do is look at metadata
for a crashing process in their /proc/<pid> directory.  this is racy
however, since do_coredump in the kernel doesn't wait for the user space
process to complete before it reaps the crashing process.  This patch
corrects that.  Allowing the kernel to wait for the user space process to
complete before cleaning up the crashing process.  This is a bit tricky to
do for a few reasons:

1) The user space process isn't our child, so we can't sys_wait4 on it
2) We need to close the pipe before waiting for the user process to complete,
since the user process may rely on an EOF condition

I've discussed several solutions with Oleg Nesterov off-list about this,
and this is the one we've come up with.  We add ourselves as a pipe reader
(to prevent premature cleanup of the pipe_inode_info), and remove
ourselves as a writer (to provide an EOF condition to the writer in user
space), then we iterate until the user space process exits (which we
detect by pipe->readers == 1, hence the > 1 check in the loop).  When we
exit the loop, we restore the proper reader/writer values, then we return
and let filp_close in do_coredump clean up the pipe data properly.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Earl Chew <earl_chew@agilent.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoexec: let do_coredump() limit the number of concurrent dumps to pipes
Neil Horman [Wed, 23 Sep 2009 22:56:56 +0000 (15:56 -0700)]
exec: let do_coredump() limit the number of concurrent dumps to pipes

Introduce core pipe limiting sysctl.

Since we can dump cores to pipe, rather than directly to the filesystem,
we create a condition in which a user can create a very high load on the
system simply by running bad applications.

If the pipe reader specified in core_pattern is poorly written, we can
have lots of ourstandig resources and processes in the system.

This sysctl introduces an ability to limit that resource consumption.
core_pipe_limit defines how many in-flight dumps may be run in parallel,
dumps beyond this value are skipped and a note is made in the kernel log.
A special value of 0 in core_pipe_limit denotes unlimited core dumps may
be handled (this is the default value).

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Earl Chew <earl_chew@agilent.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoexec: make do_coredump() more resilient to recursive crashes
Neil Horman [Wed, 23 Sep 2009 22:56:54 +0000 (15:56 -0700)]
exec: make do_coredump() more resilient to recursive crashes

Change how we detect recursive dumps.

Currently we have a mechanism by which we try to compare pathnames of the
crashing process to the core_pattern path.  This is broken for a dozen
reasons, and just doesn't work in any sort of robust way.

I'm replacing it with the use of a 0 RLIMIT_CORE value.  Since helper apps
set RLIMIT_CORE to zero, we don't write out core files for any process
with that particular limit set.  It the core_pattern is a pipe, any
non-zero limit is translated to RLIM_INFINITY.

This allows complete dumps to be captured, but prevents infinite recursion
in the event that the core_pattern process itself crashes.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Reported-by: Earl Chew <earl_chew@agilent.com>
Cc: Oleg Nesterov <oleg@tv-sign.ru>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agosignals: tracehook_notify_jctl change
Roland McGrath [Wed, 23 Sep 2009 22:56:53 +0000 (15:56 -0700)]
signals: tracehook_notify_jctl change

This changes tracehook_notify_jctl() so it's called with the siglock held,
and changes its argument and return value definition.  These clean-ups
make it a better fit for what new tracing hooks need to check.

Tracing needs the siglock here, held from the time TASK_STOPPED was set,
to avoid potential SIGCONT races if it wants to allow any blocking in its
tracing hooks.

This also folds the finish_stop() function into its caller
do_signal_stop().  The function is short, called only once and only
unconditionally.  It aids readability to fold it in.

[oleg@redhat.com: do not call tracehook_notify_jctl() in TASK_STOPPED state]
[oleg@redhat.com: introduce tracehook_finish_jctl() helper]
Signed-off-by: Roland McGrath <roland@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agowait_noreap_copyout(): check for ->wo_info != NULL
Vitaly Mayatskikh [Wed, 23 Sep 2009 22:56:52 +0000 (15:56 -0700)]
wait_noreap_copyout(): check for ->wo_info != NULL

Current behaviour of sys_waitid() looks odd.  If user passes infop ==
NULL, sys_waitid() returns success.  When user additionally specifies flag
WNOWAIT, sys_waitid() returns -EFAULT on the same conditions.  When user
combines WNOWAIT with WCONTINUED, sys_waitid() again returns success.

This patch adds check for ->wo_info in wait_noreap_copyout().

User-visible change: starting from this commit, sys_waitid() always checks
infop != NULL and does not fail if it is NULL.

Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agodo_wait: fix sys_waitid()-specific behaviour
Vitaly Mayatskikh [Wed, 23 Sep 2009 22:56:51 +0000 (15:56 -0700)]
do_wait: fix sys_waitid()-specific behaviour

do_wait() checks ->wo_info to figure out who is the caller.  If it's not
NULL the caller should be sys_waitid(), in that case do_wait() fixes up
the retval or zeros ->wo_info, depending on retval from underlying
function.

This is bug: user can pass ->wo_info == NULL and sys_waitid() will return
incorrect value.

man 2 waitid says:

waitid(): returns 0 on success

Test-case:

int main(void)
{
if (fork())
assert(waitid(P_ALL, 0, NULL, WEXITED) == 0);

return 0;
}

Result:

Assertion `waitid(P_ALL, 0, ((void *)0), 4) == 0' failed.

Move that code to sys_waitid().

User-visible change: sys_waitid() will return 0 on success, either
infop is set or not.

Note, there's another bug in wait_noreap_copyout() which affects
return value of sys_waitid(). It will be fixed in next patch.

Signed-off-by: Vitaly Mayatskikh <v.mayatskih@gmail.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agowait_consider_task: kill "parent" argument
Oleg Nesterov [Wed, 23 Sep 2009 22:56:50 +0000 (15:56 -0700)]
wait_consider_task: kill "parent" argument

Kill the unused "parent" argument in wait_consider_task(), it was never used.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ratan Nalumasu <rnalumasu@gmail.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agodo_wait-wakeup-optimization: simplify task_pid_type()
Oleg Nesterov [Wed, 23 Sep 2009 22:56:49 +0000 (15:56 -0700)]
do_wait-wakeup-optimization: simplify task_pid_type()

task_pid_type() is only used by eligible_pid() which has to check wo_type
!= PIDTYPE_MAX anyway.  Remove this check from task_pid_type() and factor
out ->pids[type] access, this shrinks .text a bit and simplifies the code.

The matches the behaviour of other similar helpers, say get_task_pid().
The caller must ensure that pid_type is valid, not the callee.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agodo_wait-wakeup-optimization: fix child_wait_callback()->eligible_child() usage
Oleg Nesterov [Wed, 23 Sep 2009 22:56:48 +0000 (15:56 -0700)]
do_wait-wakeup-optimization: fix child_wait_callback()->eligible_child() usage

child_wait_callback()->eligible_child() is not right, we can miss the
wakeup if the task was detached before __wake_up_parent() and the caller
of do_wait() didn't use __WALL.

Move ->wo_pid checks from eligible_child() to the new helper,
eligible_pid(), and change child_wait_callback() to use it instead of
eligible_child().

Note: actually I think it would be better to fix the __WCLONE check in
eligible_child(), it doesn't look exactly right.  But it is not clear what
is the supposed behaviour, and any change is user-visible.

Reported-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agodo_wait() wakeup optimization: child_wait_callback: check __WNOTHREAD case
Oleg Nesterov [Wed, 23 Sep 2009 22:56:47 +0000 (15:56 -0700)]
do_wait() wakeup optimization: child_wait_callback: check __WNOTHREAD case

Suggested by Roland.

do_wait(__WNOTHREAD) can only succeed if the caller is either ptracer, or
it is ->real_parent and the child is not traced. IOW, caller == p->parent
otherwise we should not wake up.

Change child_wait_callback() to check this. Ratan reports the workload with
CPU load >99% caused by unnecessary wakeups, should be fixed by this patch.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ratan Nalumasu <rnalumasu@gmail.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agodo_wait() wakeup optimization: change __wake_up_parent() to use filtered wakeup
Oleg Nesterov [Wed, 23 Sep 2009 22:56:46 +0000 (15:56 -0700)]
do_wait() wakeup optimization: change __wake_up_parent() to use filtered wakeup

Ratan Nalumasu reported that in a process with many threads doing
unnecessary wakeups.  Every waiting thread in the process wakes up to loop
through the children and see that the only ones it cares about are still
not ready.

Now that we have struct wait_opts we can change do_wait/__wake_up_parent
to use filtered wakeups.

We can make child_wait_callback() more clever later, right now it only
checks eligible_child().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ratan Nalumasu <rnalumasu@gmail.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Acked-by: James Morris <jmorris@namei.org>
Tested-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agodo_wait() wakeup optimization: shift security_task_wait() from eligible_child() to...
Oleg Nesterov [Wed, 23 Sep 2009 22:56:45 +0000 (15:56 -0700)]
do_wait() wakeup optimization: shift security_task_wait() from eligible_child() to wait_consider_task()

Preparation, no functional changes.

eligible_child() has a single caller, wait_consider_task(). We can move
security_task_wait() out from eligible_child(), this allows us to use it
for filtered wake_up().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Ratan Nalumasu <rnalumasu@gmail.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoptrace: __ptrace_detach: do __wake_up_parent() if we reap the tracee
Oleg Nesterov [Wed, 23 Sep 2009 22:56:44 +0000 (15:56 -0700)]
ptrace: __ptrace_detach: do __wake_up_parent() if we reap the tracee

The bug is old, it wasn't cause by recent changes.

Test case:

static void *tfunc(void *arg)
{
int pid = (long)arg;

assert(ptrace(PTRACE_ATTACH, pid, NULL, NULL) == 0);
kill(pid, SIGKILL);

sleep(1);
return NULL;
}

int main(void)
{
pthread_t th;
long pid = fork();

if (!pid)
pause();

signal(SIGCHLD, SIG_IGN);
assert(pthread_create(&th, NULL, tfunc, (void*)pid) == 0);

int r = waitpid(-1, NULL, __WNOTHREAD);
printf("waitpid: %d %m\n", r);

return 0;
}

Before the patch this program hangs, after this patch waitpid() correctly
fails with errno == -ECHILD.

The problem is, __ptrace_detach() reaps the EXIT_ZOMBIE tracee if its
->real_parent is our sub-thread and we ignore SIGCHLD.  But in this case
we should wake up other threads which can sleep in do_wait().

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Roland McGrath <roland@redhat.com>
Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: show swap usage in stat file
Daisuke Nishimura [Wed, 23 Sep 2009 22:56:43 +0000 (15:56 -0700)]
memcg: show swap usage in stat file

We now count MEM_CGROUP_STAT_SWAPOUT, so we can show swap usage.  It would
be useful for users to show swap usage in memory.stat file, because they
don't need calculate memsw.usage - res.usage to know swap usage.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: improve resource counter scalability
Balbir Singh [Wed, 23 Sep 2009 22:56:42 +0000 (15:56 -0700)]
memcg: improve resource counter scalability

Reduce the resource counter overhead (mostly spinlock) associated with the
root cgroup.  This is a part of the several patches to reduce mem cgroup
overhead.  I had posted other approaches earlier (including using percpu
counters).  Those patches will be a natural addition and will be added
iteratively on top of these.

The patch stops resource counter accounting for the root cgroup.  The data
for display is derived from the statisitcs we maintain via
mem_cgroup_charge_statistics (which is more scalable).  What happens today
is that, we do double accounting, once using res_counter_charge() and once
using memory_cgroup_charge_statistics().  For the root, since we don't
implement limits any more, we don't need to track every charge via
res_counter_charge() and check for limit being exceeded and reclaim.

The main mem->res usage_in_bytes can be derived by summing the cache and
rss usage data from memory statistics (MEM_CGROUP_STAT_RSS and
MEM_CGROUP_STAT_CACHE).  However, for memsw->res usage_in_bytes, we need
additional data about swapped out memory.  This patch adds a
MEM_CGROUP_STAT_SWAPOUT and uses that along with MEM_CGROUP_STAT_RSS and
MEM_CGROUP_STAT_CACHE to derive the memsw data.  This data is computed
recursively when hierarchy is enabled.

The tests results I see on a 24 way show that

1. The lock contention disappears from /proc/lock_stats
2. The results of the test are comparable to running with
   cgroup_disable=memory.

Here is a sample of my program runs

Without Patch

 Performance counter stats for '/home/balbir/parallel_pagefault':

 7192804.124144  task-clock-msecs         #     23.937 CPUs
         424691  context-switches         #      0.000 M/sec
            267  CPU-migrations           #      0.000 M/sec
       28498113  page-faults              #      0.004 M/sec
  5826093739340  cycles                   #    809.989 M/sec
   408883496292  instructions             #      0.070 IPC
     7057079452  cache-references         #      0.981 M/sec
     3036086243  cache-misses             #      0.422 M/sec

  300.485365680  seconds time elapsed

With cgroup_disable=memory

 Performance counter stats for '/home/balbir/parallel_pagefault':

 7182183.546587  task-clock-msecs         #     23.915 CPUs
         425458  context-switches         #      0.000 M/sec
            203  CPU-migrations           #      0.000 M/sec
       92545093  page-faults              #      0.013 M/sec
  6034363609986  cycles                   #    840.185 M/sec
   437204346785  instructions             #      0.072 IPC
     6636073192  cache-references         #      0.924 M/sec
     2358117732  cache-misses             #      0.328 M/sec

  300.320905827  seconds time elapsed

With this patch applied

 Performance counter stats for '/home/balbir/parallel_pagefault':

 7191619.223977  task-clock-msecs         #     23.955 CPUs
         422579  context-switches         #      0.000 M/sec
             88  CPU-migrations           #      0.000 M/sec
       91946060  page-faults              #      0.013 M/sec
  5957054385619  cycles                   #    828.333 M/sec
  1058117350365  instructions             #      0.178 IPC
     9161776218  cache-references         #      1.274 M/sec
     1920494280  cache-misses             #      0.267 M/sec

  300.218764862  seconds time elapsed

Data from Prarit (kernel compile with make -j64 on a 64
CPU/32G machine)

For a single run

Without patch

real 27m8.988s
user 87m24.916s
sys 382m6.037s

With patch

real    4m18.607s
user    84m58.943s
sys     50m52.682s

With config turned off

real    4m54.972s
user    90m13.456s
sys     50m19.711s

NOTE: The data looks counterintuitive due to the increased performance
with the patch, even over the config being turned off. We probably need
more runs, but so far all testing has shown that the patches definitely
help.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemory controller: soft limit reclaim on contention
Balbir Singh [Wed, 23 Sep 2009 22:56:39 +0000 (15:56 -0700)]
memory controller: soft limit reclaim on contention

Implement reclaim from groups over their soft limit

Permit reclaim from memory cgroups on contention (via the direct reclaim
path).

memory cgroup soft limit reclaim finds the group that exceeds its soft
limit by the largest number of pages and reclaims pages from it and then
reinserts the cgroup into its correct place in the rbtree.

Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long
loops in case all swap is turned off.  The code has been refactored and
the loop check (loop < 2) has been enhanced for soft limits.  For soft
limits, we try to do more targetted reclaim.  Instead of bailing out after
two loops, the routine now reclaims memory proportional to the size by
which the soft limit is exceeded.  The proportion has been empirically
determined.

[akpm@linux-foundation.org: build fix]
[kamezawa.hiroyu@jp.fujitsu.com: fix softlimit css refcnt handling]
[nishimura@mxp.nes.nec.co.jp: refcount of the "victim" should be decremented before exiting the loop]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemory controller: soft limit refactor reclaim flags
Balbir Singh [Wed, 23 Sep 2009 22:56:38 +0000 (15:56 -0700)]
memory controller: soft limit refactor reclaim flags

Refactor mem_cgroup_hierarchical_reclaim()

Refactor the arguments passed to mem_cgroup_hierarchical_reclaim() into
flags, so that new parameters don't have to be passed as we make the
reclaim routine more flexible

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemory controller: soft limit organize cgroups
Balbir Singh [Wed, 23 Sep 2009 22:56:37 +0000 (15:56 -0700)]
memory controller: soft limit organize cgroups

Organize cgroups over soft limit in a RB-Tree

Introduce an RB-Tree for storing memory cgroups that are over their soft
limit.  The overall goal is to

1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
   We are careful about updates, updates take place only after a particular
   time interval has passed
2. We remove the node from the RB-Tree when the usage goes below the soft
   limit

The next set of patches will exploit the RB-Tree to get the group that is
over its soft limit by the largest amount and reclaim from it, when we
face memory contention.

[hugh.dickins@tiscali.co.uk: CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_PREEMPT=y fails to boot]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemory controller: soft limit interface
Balbir Singh [Wed, 23 Sep 2009 22:56:36 +0000 (15:56 -0700)]
memory controller: soft limit interface

Add an interface to allow get/set of soft limits.  Soft limits for memory
plus swap controller (memsw) is currently not supported.  Resource
counters have been enhanced to support soft limits and new type
RES_SOFT_LIMIT has been added.  Unlike hard limits, soft limits can be
directly set and do not need any reclaim or checks before setting them to
a newer value.

Kamezawa-San raised a question as to whether soft limit should belong to
res_counter.  Since all resources understand the basic concepts of hard
and soft limits, it is justified to add soft limits here.  Soft limits are
a generic resource usage feature, even file system quotas support soft
limits.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemory controller: soft limit documentation
Balbir Singh [Wed, 23 Sep 2009 22:56:34 +0000 (15:56 -0700)]
memory controller: soft limit documentation

Soft limits is a new feature for the memory resource controller, something
similar has existed in the group scheduler in the form of shares.  The CPU
controllers interpretation of shares is very different though.

Soft limits are the most useful feature to have for environments where the
administrator wants to overcommit the system, such that only on memory
contention do the limits become active.  The current soft limits
implementation provides a soft_limit_in_bytes interface for the memory
controller and not for memory+swap controller.  The implementation
maintains an RB-Tree of groups that exceed their soft limit and starts
reclaiming from the group that exceeds this limit by the maximum amount.

This patch:

Add documentation for soft limits

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: add comments explaining memory barriers
KAMEZAWA Hiroyuki [Wed, 23 Sep 2009 22:56:33 +0000 (15:56 -0700)]
memcg: add comments explaining memory barriers

Add comments for the reason of smp_wmb() in mem_cgroup_commit_charge().

[akpm@linux-foundation.org: coding-style fixes]
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agomemcg: remove the overhead associated with the root cgroup
Balbir Singh [Wed, 23 Sep 2009 22:56:32 +0000 (15:56 -0700)]
memcg: remove the overhead associated with the root cgroup

Change the memory cgroup to remove the overhead associated with accounting
all pages in the root cgroup.  As a side-effect, we can no longer set a
memory hard limit in the root cgroup.

A new flag to track whether the page has been accounted or not has been
added as well.  Flags are now set atomically for page_cgroup,
pcg_default_flags is now obsolete and removed.

[akpm@linux-foundation.org: fix a few documentation glitches]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: let ss->can_attach and ss->attach do whole threadgroups at a time
Ben Blum [Wed, 23 Sep 2009 22:56:31 +0000 (15:56 -0700)]
cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time

Alter the ss->can_attach and ss->attach functions to be able to deal with
a whole threadgroup at a time, for use in cgroup_attach_proc.  (This is a
pre-patch to cgroup-procs-writable.patch.)

Currently, new mode of the attach function can only tell the subsystem
about the old cgroup of the threadgroup leader.  No subsystem currently
needs that information for each thread that's being moved, but if one were
to be added (for example, one that counts tasks within a group) this bit
would need to be reworked a bit to tell the subsystem the right
information.

[hidave.darkstar@gmail.com: fix build]
Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Dave Young <hidave.darkstar@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: change css_set freeing mechanism to be under RCU
Ben Blum [Wed, 23 Sep 2009 22:56:29 +0000 (15:56 -0700)]
cgroups: change css_set freeing mechanism to be under RCU

Changes css_set freeing mechanism to be under RCU

This is a prepatch for making the procs file writable. In order to free the
old css_sets for each task to be moved as they're being moved, the freeing
mechanism must be RCU-protected, or else we would have to have a call to
synchronize_rcu() for each task before freeing its old css_set.

Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: use vmalloc for large cgroups pidlist allocations
Ben Blum [Wed, 23 Sep 2009 22:56:28 +0000 (15:56 -0700)]
cgroups: use vmalloc for large cgroups pidlist allocations

Separates all pidlist allocation requests to a separate function that
judges based on the requested size whether or not the array needs to be
vmalloced or can be gotten via kmalloc, and similar for kfree/vfree.

Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: ensure correct concurrent opening/reading of pidlists across pid namespaces
Ben Blum [Wed, 23 Sep 2009 22:56:27 +0000 (15:56 -0700)]
cgroups: ensure correct concurrent opening/reading of pidlists across pid namespaces

Previously there was the problem in which two processes from different pid
namespaces reading the tasks or procs file could result in one process
seeing results from the other's namespace.  Rather than one pidlist for
each file in a cgroup, we now keep a list of pidlists keyed by namespace
and file type (tasks versus procs) in which entries are placed on demand.
Each pidlist has its own lock, and that the pidlists themselves are passed
around in the seq_file's private pointer means we don't have to touch the
cgroup or its master list except when creating and destroying entries.

Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: add a read-only "procs" file similar to "tasks" that shows only unique tgids
Ben Blum [Wed, 23 Sep 2009 22:56:26 +0000 (15:56 -0700)]
cgroups: add a read-only "procs" file similar to "tasks" that shows only unique tgids

struct cgroup used to have a bunch of fields for keeping track of the
pidlist for the tasks file.  Those are now separated into a new struct
cgroup_pidlist, of which two are had, one for procs and one for tasks.
The way the seq_file operations are set up is changed so that just the
pidlist struct gets passed around as the private data.

Interface example: Suppose a multithreaded process has pid 1000 and other
threads with ids 1001, 1002, 1003:
$ cat tasks
1000
1001
1002
1003
$ cat cgroup.procs
1000
$

Signed-off-by: Ben Blum <bblum@google.com>
Signed-off-by: Paul Menage <menage@google.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: revert "cgroups: fix pid namespace bug"
Paul Menage [Wed, 23 Sep 2009 22:56:25 +0000 (15:56 -0700)]
cgroups: revert "cgroups: fix pid namespace bug"

The following series adds a "cgroup.procs" file to each cgroup that
reports unique tgids rather than pids, and allows all threads in a
threadgroup to be atomically moved to a new cgroup.

The subsystem "attach" interface is modified to support attaching whole
threadgroups at a time, which could introduce potential problems if any
subsystem were to need to access the old cgroup of every thread being
moved.  The attach interface may need to be revised if this becomes the
case.

Also added is functionality for read/write locking all CLONE_THREAD
fork()ing within a threadgroup, by means of an rwsem that lives in the
sighand_struct, for per-threadgroup-ness and also for sharing a cacheline
with the sighand's atomic count.  This scheme should introduce no extra
overhead in the fork path when there's no contention.

The final patch reveals potential for a race when forking before a
subsystem's attach function is called - one potential solution in case any
subsystem has this problem is to hang on to the group's fork mutex through
the attach() calls, though no subsystem yet demonstrates need for an
extended critical section.

This patch:

Revert

commit 096b7fe012d66ed55e98bc8022405ede0cc80e96
Author:     Li Zefan <lizf@cn.fujitsu.com>
AuthorDate: Wed Jul 29 15:04:04 2009 -0700
Commit:     Linus Torvalds <torvalds@linux-foundation.org>
CommitDate: Wed Jul 29 19:10:35 2009 -0700

    cgroups: fix pid namespace bug

This is in preparation for some clashing cgroups changes that subsume the
original commit's functionaliy.

The original commit fixed a pid namespace bug which Ben Blum fixed
independently (in the same way, but with different code) as part of a
series of patches.  I played around with trying to reconcile Ben's patch
series with Li's patch, but concluded that it was simpler to just revert
Li's, given that Ben's patch series contained essentially the same fix.

Signed-off-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Matt Helsley <matthltc@us.ibm.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: allow cgroup hierarchies to be created with no bound subsystems
Paul Menage [Wed, 23 Sep 2009 22:56:23 +0000 (15:56 -0700)]
cgroups: allow cgroup hierarchies to be created with no bound subsystems

This patch removes the restriction that a cgroup hierarchy must have at
least one bound subsystem.  The mount option "none" is treated as an
explicit request for no bound subsystems.

A hierarchy with no subsystems can be useful for plain task tracking, and
is also a step towards the support for multiply-bindable subsystems.

As part of this change, the hierarchy id is no longer calculated from the
bitmask of subsystems in the hierarchy (since this is not guaranteed to be
unique) but is allocated via an ida.  Reference counts on cgroups from
css_set objects are now taken explicitly one per hierarchy, rather than
one per subsystem.

Example usage:

mount -t cgroup -o none,name=foo cgroup /mnt/cgroup

Based on the "no-op"/"none" subsystem concept proposed by
kamezawa.hiroyu@jp.fujitsu.com

Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: add a back-pointer from struct cg_cgroup_link to struct cgroup
Paul Menage [Wed, 23 Sep 2009 22:56:22 +0000 (15:56 -0700)]
cgroups: add a back-pointer from struct cg_cgroup_link to struct cgroup

Currently the cgroups code makes the assumption that the subsystem
pointers in a struct css_set uniquely identify the hierarchy->cgroup
mappings associated with the css_set; and there's no way to directly
identify the associated set of cgroups other than by indirecting through
the appropriate subsystem state pointers.

This patch removes the need for that assumption by adding a back-pointer
from struct cg_cgroup_link object to its associated cgroup; this allows
the set of cgroups to be determined by traversing the cg_links list in
the struct css_set.

Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: move the cgroup debug subsys into cgroup.c to access internal state
Paul Menage [Wed, 23 Sep 2009 22:56:20 +0000 (15:56 -0700)]
cgroups: move the cgroup debug subsys into cgroup.c to access internal state

While it's architecturally clean to have the cgroup debug subsystem be
completely independent of the cgroups framework, it limits its usefulness
for debugging the contents of internal data structures.  Move the debug
subsystem code into the scope of all the cgroups data structures to make
more detailed debugging possible.

Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: support named cgroups hierarchies
Paul Menage [Wed, 23 Sep 2009 22:56:19 +0000 (15:56 -0700)]
cgroups: support named cgroups hierarchies

To simplify referring to cgroup hierarchies in mount statements, and to
allow disambiguation in the presence of empty hierarchies and
multiply-bindable subsystems this patch adds support for naming a new
cgroup hierarchy via the "name=" mount option

A pre-existing hierarchy may be specified by either name or by subsystems;
a hierarchy's name cannot be changed by a remount operation.

Example usage:

# To create a hierarchy called "foo" containing the "cpu" subsystem
mount -t cgroup -oname=foo,cpu cgroup /mnt/cgroup1

# To mount the "foo" hierarchy on a second location
mount -t cgroup -oname=foo cgroup /mnt/cgroup2

Signed-off-by: Paul Menage <menage@google.com>
Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agocgroups: make unlock sequence in cgroup_get_sb consistent
Xiaotian Feng [Wed, 23 Sep 2009 22:56:18 +0000 (15:56 -0700)]
cgroups: make unlock sequence in cgroup_get_sb consistent

Make the last unlock sequence consistent with previous unlock sequeue.

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Paul Menage <menage@google.com>
Signed-off-by: Xiaotian Feng <dfeng@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agodocs: fix various Documentation/ paths in header files
Randy Dunlap [Wed, 23 Sep 2009 22:56:17 +0000 (15:56 -0700)]
docs: fix various Documentation/ paths in header files

Fix various Documentation/ paths in include/linux/.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Reviewed-by: Jesper Juhl <jj@chaosbits.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agopage-types: add feature for walking process address space
Wu Fengguang [Wed, 23 Sep 2009 22:56:16 +0000 (15:56 -0700)]
page-types: add feature for walking process address space

Introduce "-p|--pid <pid>" for walking the process address space.  The
default action is to walk raw memory PFNs.

Both the virtual address and physical address of each present pages will
be listed:

# ./tools/vm/page-types -lp $$ | head -3
voffset offset  len     flags
400     11bebe  1       __RU_lA____M______________________
402     11bebc  1       __RU_lA____M______________________

Note that voffset/offset/len are now showed as hex numbers.

[akpm@linux-foundation.org: coding-style fixes]
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoDocumentation/vm/.gitignore: add page-types
Josh Triplett [Wed, 23 Sep 2009 22:56:15 +0000 (15:56 -0700)]
Documentation/vm/.gitignore: add page-types

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
15 years agoincludecheck fix: Documentation, cfag12864b-example.c
Jaswinder Singh Rajput [Wed, 23 Sep 2009 22:56:14 +0000 (15:56 -0700)]
includecheck fix: Documentation, cfag12864b-example.c

fix the following 'make includecheck' warning:

  Documentation/auxdisplay/cfag12864b-example.c: string.h is included more than once.

Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>