Jan Engelhardt [Tue, 13 Apr 2010 13:32:16 +0000 (15:32 +0200)]
netfilter: ipv6: add IPSKB_REROUTED exclusion to NF_HOOK/POSTROUTING invocation
Similar to how IPv4's ip_output.c works, have ip6_output also check
the IPSKB_REROUTED flag. It will be set from xt_TEE for cloned packets
since Xtables can currently only deal with a single packet in flight
at a time.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Acked-by: David S. Miller <davem@davemloft.net>
[Patrick: changed to use an IP6SKB value instead of IPSKB]
Signed-off-by: Patrick McHardy <kaber@trash.net>
Jan Engelhardt [Tue, 13 Apr 2010 13:28:11 +0000 (15:28 +0200)]
netfilter: ipv6: move POSTROUTING invocation before fragmentation
Patrick McHardy notes: "We used to invoke IPv4 POST_ROUTING after
fragmentation as well just to defragment the packets in conntrack
immediately afterwards, but that got changed during the
netfilter-ipsec integration. Ideally IPv6 would behave like IPv4."
This patch makes it so. Sending an oversized frame (e.g. `ping6
-s64000 -c1 ::1`) will now show up in POSTROUTING as a single skb
rather than multiple ones.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Alexey Dobriyan [Tue, 13 Apr 2010 12:09:15 +0000 (14:09 +0200)]
Restore __ALIGN_MASK()
Fix lib/bitmap.c compile failure due to __ALIGN_KERNEL changes.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Bart De Schuymer [Tue, 13 Apr 2010 09:41:39 +0000 (11:41 +0200)]
netfilter: bridge-netfilter: update a comment in br_forward.c about ip_fragment()
ip_refrag isn't used anymore in the bridge-netfilter code
Signed-off-by: Bart De Schuymer <bdschuym@pandora.be>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Bart De Schuymer [Tue, 13 Apr 2010 09:40:41 +0000 (11:40 +0200)]
netfilter: bridge-netfilter: cleanup br_netfilter.c
bridge-netfilter: cleanup br_netfilter.c
- remove some of the graffiti at the head of br_netfilter.c
- remove __br_dnat_complain()
- remove KERN_INFO messages when CONFIG_NETFILTER_DEBUG is defined
Signed-off-by: Bart De Schuymer <bdschuym@pandora.be>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Zhitong Wang [Tue, 13 Apr 2010 09:25:41 +0000 (11:25 +0200)]
netfilter: fix some coding styles and remove moduleparam.h
Fix some coding styles and remove moduleparam.h
Signed-off-by: Zhitong Wang <zhitong.wangzt@alibaba-inc.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Alexey Dobriyan [Tue, 13 Apr 2010 09:21:46 +0000 (11:21 +0200)]
netfilter: xtables: make XT_ALIGN() usable in exported headers by exporting __ALIGN_KERNEL()
XT_ALIGN() was rewritten through ALIGN() by commit
42107f5009da223daa800d6da6904d77297ae829
"netfilter: xtables: symmetric COMPAT_XT_ALIGN definition".
ALIGN() is not exported in userspace headers, which created compile problem for tc(8)
and will create problem for iptables(8).
We can't export generic looking name ALIGN() but we can export less generic
__ALIGN_KERNEL() (suggested by Ben Hutchings).
Google knows nothing about __ALIGN_KERNEL().
COMPAT_XT_ALIGN() changed for symmetry.
Reported-by: Andreas Henriksson <andreas@fatal.se>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Adam Nielsen [Fri, 9 Apr 2010 14:51:40 +0000 (16:51 +0200)]
netfilter: xt_LED: add refcounts to LED target
Add reference counting to the netfilter LED target, to fix errors when
multiple rules point to the same target ("LED trigger already exists").
Signed-off-by: Adam Nielsen <a.nielsen@shikadi.net>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Patrick McHardy [Fri, 9 Apr 2010 14:42:15 +0000 (16:42 +0200)]
netfilter: remove invalid rcu_dereference() calls
The CONFIG_PROVE_RCU option discovered a few invalid uses of
rcu_dereference() in netfilter. In all these cases, the code code
intends to check whether a pointer is already assigned when
performing registration or whether the assigned pointer matches
when performing unregistration. The entire registration/
unregistration is protected by a mutex, so we don't need the
rcu_dereference() calls.
Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Tested-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Herbert Xu [Thu, 8 Apr 2010 12:54:35 +0000 (14:54 +0200)]
netfilter: only do skb_checksum_help on CHECKSUM_PARTIAL in nfnetlink_queue
As we will set ip_summed to CHECKSUM_NONE when necessary in
nfqnl_mangle, there is no need to zap CHECKSUM_COMPLETE in
nfqnl_build_packet_message.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Herbert Xu [Thu, 8 Apr 2010 12:53:40 +0000 (14:53 +0200)]
netfilter: only do skb_checksum_help on CHECKSUM_PARTIAL in ip6_queue
As we will set ip_summed to CHECKSUM_NONE when necessary in
ipq_mangle_ipv6, there is no need to zap CHECKSUM_COMPLETE in
ipq_build_packet_message.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Herbert Xu [Thu, 8 Apr 2010 12:52:28 +0000 (14:52 +0200)]
netfilter: only do skb_checksum_help on CHECKSUM_PARTIAL in ip_queue
While doing yet another audit on ip_summed I noticed ip_queue
calling skb_checksum_help unnecessarily. As we will set ip_summed
to CHECKSUM_NONE when necessary in ipq_mangle_ipv4, there is no
need to zap CHECKSUM_COMPLETE in ipq_build_packet_message.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Patrick McHardy [Thu, 8 Apr 2010 11:35:47 +0000 (13:35 +0200)]
IPVS: fix potential stack overflow with overly long protocol names
When protocols use very long names, the sprintf calls might overflow
the on-stack buffer. No protocol in the kernel does this however.
Print the protocol name in the pr_debug statement directly to avoid
this.
Based on patch by Zhitong Wang <zhitong.wangzt@alibaba-inc.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Eric Dumazet [Thu, 1 Apr 2010 12:35:56 +0000 (14:35 +0200)]
netfilter: xt_hashlimit: RCU conversion
xt_hashlimit uses a central lock per hash table and suffers from
contention on some workloads. (Multiqueue NIC or if RPS is enabled)
After RCU conversion, central lock is only used when a writer wants to
add or delete an entry.
For 'readers', updating an existing entry, they use an individual lock
per entry.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Eric Dumazet [Thu, 1 Apr 2010 10:54:09 +0000 (12:54 +0200)]
netfilter: CLUSTERIP: clusterip_seq_stop() fix
If clusterip_seq_start() memory allocation fails, we crash later in
clusterip_seq_start(), trying to kfree(ERR_PTR(-ENOMEM))
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Jiri Pirko [Thu, 1 Apr 2010 10:39:19 +0000 (12:39 +0200)]
netfilter: ctnetlink: compute message size properly
Message size should be dependent on the presence of an accounting
extension, not on CONFIG_NF_CT_ACCT definition.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Jan Engelhardt [Wed, 24 Mar 2010 21:50:01 +0000 (22:50 +0100)]
netfilter: xtables: merge registration structure to NFPROTO_UNSPEC
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Mon, 22 Mar 2010 18:39:04 +0000 (19:39 +0100)]
netfilter: xtables: remove xt_string revision 0
Superseded by xt_string revision 1 (linux
v2.6.26-rc8-1127-g4ad3f26,
iptables 1.4.2-rc1).
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Mon, 22 Mar 2010 18:35:01 +0000 (19:35 +0100)]
netfilter: xtables: remove xt_multiport revision 0
Superseded by xt_multiport revision 1 (introduction already predates
linux.git).
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Mon, 22 Mar 2010 18:28:53 +0000 (19:28 +0100)]
netfilter: xtables: remove xt_hashlimit revision 0
Superseded by xt_hashlimit revision 1 (linux
v2.6.24-6212-g09e410d,
iptables 1.4.1-rc1).
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Sun, 21 Mar 2010 03:05:56 +0000 (04:05 +0100)]
netfilter: xtables: shorten up return clause
The return value of nf_ct_l3proto_get can directly be returned even in
the case of success.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Fri, 19 Mar 2010 16:32:59 +0000 (17:32 +0100)]
netfilter: xtables: slightly better error reporting
When extended status codes are available, such as ENOMEM on failed
allocations, or subsequent functions (e.g. nf_ct_get_l3proto), passing
them up to userspace seems like a good idea compared to just always
EINVAL.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Thu, 25 Mar 2010 15:34:45 +0000 (16:34 +0100)]
netfilter: xtables: change targets to return error code
Part of the transition of done by this semantic patch:
// <smpl>
@ rule1 @
struct xt_target ops;
identifier check;
@@
ops.checkentry = check;
@@
identifier rule1.check;
@@
check(...) { <...
-return true;
+return 0;
...> }
@@
identifier rule1.check;
@@
check(...) { <...
-return false;
+return -EINVAL;
...> }
// </smpl>
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Tue, 23 Mar 2010 15:35:56 +0000 (16:35 +0100)]
netfilter: xtables: change matches to return error code
The following semantic patch does part of the transformation:
// <smpl>
@ rule1 @
struct xt_match ops;
identifier check;
@@
ops.checkentry = check;
@@
identifier rule1.check;
@@
check(...) { <...
-return true;
+return 0;
...> }
@@
identifier rule1.check;
@@
check(...) { <...
-return false;
+return -EINVAL;
...> }
// </smpl>
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Fri, 19 Mar 2010 16:16:42 +0000 (17:16 +0100)]
netfilter: xtables: change xt_target.checkentry return type
Restore function signatures from bool to int so that we can report
memory allocation failures or similar using -ENOMEM rather than
always having to pass -EINVAL back.
// <smpl>
@@
type bool;
identifier check, par;
@@
-bool check
+int check
(struct xt_tgchk_param *par) { ... }
// </smpl>
Minus the change it does to xt_ct_find_proto.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Fri, 19 Mar 2010 16:16:42 +0000 (17:16 +0100)]
netfilter: xtables: change xt_match.checkentry return type
Restore function signatures from bool to int so that we can report
memory allocation failures or similar using -ENOMEM rather than
always having to pass -EINVAL back.
This semantic patch may not be too precise (checking for functions
that use xt_mtchk_param rather than functions referenced by
xt_match.checkentry), but reviewed, it produced the intended result.
// <smpl>
@@
type bool;
identifier check, par;
@@
-bool check
+int check
(struct xt_mtchk_param *par) { ... }
// </smpl>
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Tue, 23 Mar 2010 16:40:13 +0000 (17:40 +0100)]
netfilter: xtables: untangle spaghetti if clauses in checkentry
As I'm changing the return values soon, I want to have a clear visual
path.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Tue, 23 Mar 2010 03:08:46 +0000 (04:08 +0100)]
netfilter: ipvs: use NFPROTO values for NF_HOOK invocation
Semantic patch:
// <smpl>
@@
@@
IP_VS_XMIT(
-PF_INET6,
+NFPROTO_IPV6,
...)
@@
@@
IP_VS_XMIT(
-PF_INET,
+NFPROTO_IPV4,
...)
// </smpl>
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Tue, 23 Mar 2010 03:09:14 +0000 (04:09 +0100)]
netfilter: decnet: use NFPROTO values for NF_HOOK invocation
The semantic patch used was:
// <smpl>
@@
@@
NF_HOOK(
-PF_DECnet,
+NFPROTO_DECNET,
...)
// </smpl>
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Tue, 23 Mar 2010 03:09:07 +0000 (04:09 +0100)]
netfilter: ipv6: use NFPROTO values for NF_HOOK invocation
The semantic patch that was used:
// <smpl>
@@
@@
(NF_HOOK
|NF_HOOK_THRESH
|nf_hook
)(
-PF_INET6,
+NFPROTO_IPV6,
...)
// </smpl>
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Tue, 23 Mar 2010 03:07:29 +0000 (04:07 +0100)]
netfilter: ipv4: use NFPROTO values for NF_HOOK invocation
The semantic patch that was used:
// <smpl>
@@
@@
(NF_HOOK
|NF_HOOK_COND
|nf_hook
)(
-PF_INET,
+NFPROTO_IPV4,
...)
// </smpl>
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Tue, 23 Mar 2010 03:07:21 +0000 (04:07 +0100)]
netfilter: bridge: use NFPROTO values for NF_HOOK invocation
The first argument to NF_HOOK* is an nfproto since quite some time.
Commit
v2.6.27-2457-gfdc9314 was the first to practically start using
the new names. Do that now for the remaining NF_HOOK calls.
The semantic patch used was:
// <smpl>
@@
@@
(NF_HOOK
|NF_HOOK_THRESH
)(
-PF_BRIDGE,
+NFPROTO_BRIDGE,
...)
@@
@@
NF_HOOK(
-PF_INET6,
+NFPROTO_IPV6,
...)
@@
@@
NF_HOOK(
-PF_INET,
+NFPROTO_IPV4,
...)
// </smpl>
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Fri, 19 Mar 2010 20:29:08 +0000 (21:29 +0100)]
netfilter: xt_recent: allow changing ip_list_[ug]id at runtime
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Fri, 10 Jul 2009 17:27:47 +0000 (19:27 +0200)]
netfilter: xtables: consolidate code into xt_request_find_match
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Fri, 10 Jul 2009 16:55:11 +0000 (18:55 +0200)]
netfilter: xtables: make use of xt_request_find_target
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Fri, 19 Mar 2010 20:08:16 +0000 (21:08 +0100)]
netfilter: xt extensions: use pr_<level> (2)
Supplement to
1159683ef48469de71dc26f0ee1a9c30d131cf89.
Downgrade the log level to INFO for most checkentry messages as they
are, IMO, just an extra information to the -EINVAL code that is
returned as part of a parameter "constraint violation". Leave errors
to real errors, such as being unable to create a LED trigger.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Fri, 19 Mar 2010 17:47:51 +0000 (18:47 +0100)]
netfilter: xtables: make use of caller family rather than target family
Supplement to
aa5fa3185791aac71c9172d4fda3e8729164b5d1.
The semantic patch for this change is:
// <smpl>
@@
struct xt_target_param *par;
@@
-par->target->family
+par->family
@@
struct xt_tgchk_param *par;
@@
-par->target->family
+par->family
@@
struct xt_tgdtor_param *par;
@@
-par->target->family
+par->family
// </smpl>
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Zhitong Wang [Fri, 19 Mar 2010 15:04:10 +0000 (16:04 +0100)]
netfilter: remove unused headers in net/ipv4/netfilter/nf_nat_h323.c
Remove unused headers in net/ipv4/netfilter/nf_nat_h323.c
Signed-off-by: Zhitong Wang <zhitong.wangzt@alibaba-inc.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Zhitong Wang [Fri, 19 Mar 2010 15:01:54 +0000 (16:01 +0100)]
netfilter: remove unused headers in net/ipv6/netfilter/ip6t_LOG.c
Remove unused headers in net/ipv6/netfilter/ip6t_LOG.c
Signed-off-by: Zhitong Wang <zhitong.wangzt@alibaba-inc.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Jan Engelhardt [Wed, 17 Mar 2010 15:04:40 +0000 (16:04 +0100)]
netfilter: xt extensions: use pr_<level>
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Thu, 18 Mar 2010 01:22:32 +0000 (02:22 +0100)]
netfilter: xtables: replace custom duprintf with pr_debug
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Wed, 17 Mar 2010 23:27:03 +0000 (00:27 +0100)]
netfilter: xtables: do not print any messages on ENOMEM
ENOMEM is a very obvious error code (cf. EINVAL), so I think we do not
really need a warning message. Not to mention that if the allocation
fails, the user is most likely going to get a stack trace from slab
already.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Thu, 18 Mar 2010 13:02:10 +0000 (14:02 +0100)]
netfilter: xtables: reduce holes in struct xt_target
This will save one full padding chunk (8 bytes on x86_64) per target.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Thu, 18 Mar 2010 10:03:51 +0000 (11:03 +0100)]
netfilter: xtables: remove almost-unused xt_match_param.data member
This member is taking up a "long" per match, yet is only used by one
module out of the roughly 90 modules, ip6t_hbh. ip6t_hbh can be
restructured a little to accomodate for the lack of the .data member.
This variant uses checking the par->match address, which should avoid
having to add two extra functions, including calls, i.e.
(hbh_mt6: call hbhdst_mt6(skb, par, NEXTHDR_OPT),
dst_mt6: call hbhdst_mt6(skb, par, NEXTHDR_DEST))
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Thu, 18 Mar 2010 09:30:44 +0000 (10:30 +0100)]
netfilter: update documentation fields of x_tables.h
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Wed, 17 Mar 2010 23:44:52 +0000 (00:44 +0100)]
netfilter: xtables: make use of caller family rather than match family
The matches can have .family = NFPROTO_UNSPEC, and though that is not
the case for the touched modules, it seems better to just use the
nfproto from the caller.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Tue, 16 Mar 2010 19:06:55 +0000 (20:06 +0100)]
netfilter: xtables: resort osf kconfig text
Restore alphabetical ordering of the list and put the xt_osf option
into its 'right' place again.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Tue, 16 Mar 2010 20:44:44 +0000 (21:44 +0100)]
netfilter: xtables: limit xt_mac to ethernet devices
I do not see a point of allowing the MAC module to work with devices
that don't possibly have one, e.g. various tunnel interfaces such as
tun and sit.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Tue, 16 Mar 2010 20:09:04 +0000 (21:09 +0100)]
netfilter: xtables: clean up xt_mac match routine
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Tue, 9 Mar 2010 22:27:24 +0000 (23:27 +0100)]
netfilter: xtables: do without explicit XT_ALIGN
XT_ALIGN is already applied on matchsize/targetsize in x_tables.c,
so it is not strictly needed in the extensions.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Patrick McHardy [Thu, 18 Mar 2010 11:55:50 +0000 (12:55 +0100)]
Merge branch 'master' of ../nf-2.6
Zhitong Wang [Wed, 17 Mar 2010 15:28:25 +0000 (16:28 +0100)]
netfilter: remove unused headers in net/netfilter/nfnetlink.c
Remove unused headers in net/netfilter/nfnetlink.c
Signed-off-by: Zhitong Wang <zhitong.wangzt@alibaba-inc.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Tim Gardner [Wed, 17 Mar 2010 15:18:56 +0000 (16:18 +0100)]
netfilter: xt_recent: check for unsupported user space flags
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Tim Gardner [Tue, 16 Mar 2010 18:53:13 +0000 (19:53 +0100)]
netfilter: xt_recent: add an entry reaper
One of the problems with the way xt_recent is implemented is that
there is no efficient way to remove expired entries. Of course,
one can write a rule '-m recent --remove', but you have to know
beforehand which entry to delete. This commit adds reaper
logic which checks the head of the LRU list when a rule
is invoked that has a '--seconds' value and XT_RECENT_REAP set. If an
entry ceases to accumulate time stamps, then it will eventually bubble
to the top of the LRU list where it is then reaped.
Signed-off-by: Tim Gardner <tim.gardner@canonical.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Jan Engelhardt [Mon, 1 Mar 2010 10:55:33 +0000 (11:55 +0100)]
netfilter: xt_recent: remove old proc directory
The compat option was introduced in October 2008.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Sun, 28 Feb 2010 22:22:35 +0000 (23:22 +0100)]
netfilter: xt_recent: update description
It had IPv6 for quite a while already :-)
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Sun, 28 Feb 2010 22:22:04 +0000 (23:22 +0100)]
netfilter: ebt_ip6: add principal maintainer in a MODULE_AUTHOR tag
Cc: Kuo-Lang Tseng <kuo-lang.tseng@intel.com>
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Sun, 28 Feb 2010 22:19:52 +0000 (23:19 +0100)]
netfilter: update my email address
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Tue, 16 Mar 2010 17:25:12 +0000 (18:25 +0100)]
netfilter: xtables: schedule xt_NOTRACK for removal
It is being superseded by xt_CT (-j CT --notrack).
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Fri, 26 Feb 2010 13:20:32 +0000 (14:20 +0100)]
netfilter: xtables: merge xt_CONNMARK into xt_connmark
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Sat, 28 Feb 2009 02:23:57 +0000 (03:23 +0100)]
netfilter: xtables: merge xt_MARK into xt_mark
Two arguments for combining the two:
- xt_mark is pretty useless without xt_MARK
- the actual code is so small anyway that the kmod metadata and the module
in its loaded state totally outweighs the combined actual code size.
i586-before:
-rw-r--r-- 1 jengelh users 3821 Feb 10 01:01 xt_MARK.ko
-rw-r--r-- 1 jengelh users 2592 Feb 10 00:04 xt_MARK.o
-rw-r--r-- 1 jengelh users 3274 Feb 10 01:01 xt_mark.ko
-rw-r--r-- 1 jengelh users 2108 Feb 10 00:05 xt_mark.o
text data bss dec hex filename
354 264 0 618 26a xt_MARK.o
223 176 0 399 18f xt_mark.o
And the runtime size is like 14 KB.
i586-after:
-rw-r--r-- 1 jengelh users 3264 Feb 18 17:28 xt_mark.o
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Fri, 26 Feb 2010 13:14:22 +0000 (14:14 +0100)]
netfilter: xtables: add comment markers to Xtables Kconfig
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Fri, 5 Jun 2009 13:22:15 +0000 (15:22 +0200)]
netfilter: xt_NFQUEUE: consolidate v4/v6 targets into one
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Jan Engelhardt [Wed, 10 Mar 2010 23:38:44 +0000 (00:38 +0100)]
netfilter: xt_CT: par->family is an nfproto
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
David S. Miller [Wed, 17 Mar 2010 06:36:24 +0000 (23:36 -0700)]
e1000e: Fix build with CONFIG_PM disabled.
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Perches [Wed, 17 Mar 2010 04:24:32 +0000 (21:24 -0700)]
drivers/net/e100.c: Use pr_<level> and netif_<level>
Convert DPRINTK, commonly used for debugging, to netif_<level>
Remove #define PFX
Use #define pr_fmt
Consistently use no periods for non-sentence logging messages
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Gunthorpe [Tue, 9 Mar 2010 09:17:42 +0000 (09:17 +0000)]
NET: Support clause 45 MDIO commands at the MDIO bus level
IEEE 802.3ae clause 45 specifies a somewhat modified MDIO protocol
for use by 10GIGE phys. The main change is a 21 bit address split into
a 5 bit device ID and a 16 bit register offset. The definition is designed
so that normal and extended devices can run on the same MDIO bus.
Extend mdio-bitbang to do the new protocol. At the MDIO bus level the
protocol is requested by or'ing MII_ADDR_C45 into the register offset.
Make phy_read/phy_write/etc pass a full 32 bit register offset.
This does not attempt to make the phy layer support C45 style PHYs, just
to provide the MDIO bus support.
Tested against a Broadcom 10GE phy with ID 0x206034, and several
Broadcom 10/100/1000 Phys in normal mode.
Signed-off-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rafael J. Wysocki [Sun, 14 Mar 2010 14:35:17 +0000 (14:35 +0000)]
e1000e / PCI / PM: Add basic runtime PM support (rev. 4)
Use the PCI runtime power management framework to add basic PCI
runtime PM support to the e1000e driver. Namely, make the driver
suspend the device when the link is off and set it up for generating
a wakeup event after the link has been detected again. [This
feature is disabled until the user space enables it with the help of
the /sys/devices/.../power/contol device attribute.]
Based on a patch from Matthew Garrett.
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Rafael J. Wysocki [Sun, 14 Mar 2010 14:33:51 +0000 (14:33 +0000)]
r8169 / PCI / PM: Add simplified runtime PM support (rev. 3)
Use the PCI runtime power management framework to add basic PCI
runtime PM support to the r8169 driver. Namely, make the driver
suspend the device when the link is not present and set it up for
generating a wakeup event after the link has been detected again.
[This feature is disabled until the user space enables it with the
help of the /sys/devices/.../power/contol device attribute.]
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Pirko [Mon, 1 Mar 2010 05:09:14 +0000 (05:09 +0000)]
net: convert multiple drivers to use netdev_for_each_mc_addr, part7
In mlx4, using char * to store mc address in private structure instead.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Joe Perches [Sat, 27 Feb 2010 14:43:51 +0000 (14:43 +0000)]
drivers/net/ks*: Use netdev_<level>, netif_<level> and pr_<level>
I'm not sure this is correct.
It changes logging macros from:
dev_<level>(&ks->spidev->dev,
to
netdev_<level>(ks->netdev,
Comments?
Use netdev_<level>
Use netif_<level>
Use pr_<level>
Add #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
Add missing line to message in ks8851_remove
Change kmalloc/memset(,0) to kzalloc
Remove ks_<level> macros
Consolidation code into set_media_state
Signed-off-by: David S. Miller <davem@davemloft.net>
Neil Horman [Mon, 15 Mar 2010 07:58:45 +0000 (07:58 +0000)]
tipc: Allow retransmission of cloned buffers
Forward port commit
fc477e160af086f6e30c3d4fdf5f5c000d29beb5
from git://tipc.cslab.ericsson.net/pub/git/people/allan/tipc.git
Origional commit message:
Allow retransmission of cloned buffers
This patch fixes an issue with TIPC's message retransmission logic
that prevented retransmission of clone sk_buffs. Originally intended
as a means of avoiding wasted work in retransmitting messages that
were still on the driver's outbound queue, it also prevented TIPC
from retransmitting messages through other means -- such as the
secondary bearer of the broadcast link, or another interface in a
set of bonded interfaces. This fix removes existing checks for
cloned sk_buffs that prevented such retransmission.
Origionally-Signed-off-by: Allan Stephens <allan.stephens@windriver.com>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neil Horman [Mon, 15 Mar 2010 08:02:24 +0000 (08:02 +0000)]
tipc: Increase frequency of load distribution over broadcast link
Forward port commit
29eb572941501c40ac6e62dbc5043bf9ee76ee56
from git://tipc.cslab.ericsson.net/pub/git/people/allan/tipc.git
Origional commit message:
Increase frequency of load distribution over broadcast link
This patch enhances the behavior of TIPC's broadcast link so that it
alternates between redundant bearers (if available) after every
message sent, rather than after every 10 messages. This change helps
to speed up delivery of retransmitted messages by ensuring that
they are not sent repeatedly over a bearer that is no longer working,
but not yet recognized as failed.
Tested by myself in the latest net-2.6 tree using the tipc sanity test suite
Origionally-signed-off-by: Allan Stephens <allan.stephens@windriver.com>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
bcast.c | 35 ++++++++++++++---------------------
1 file changed, 14 insertions(+), 21 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
Jan Engelhardt [Thu, 11 Mar 2010 09:57:29 +0000 (09:57 +0000)]
net: core: add IFLA_STATS64 support
`ip -s link` shows interface counters truncated to 32 bit. This is
because interface statistics are transported only in 32-bit quantity
to userspace. This commit adds a new IFLA_STATS64 attribute that
exports them in full 64 bit.
References: http://lkml.indiana.edu/hypermail/linux/kernel/0307.3/0215.html
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jan Engelhardt [Thu, 11 Mar 2010 09:57:28 +0000 (09:57 +0000)]
net: tcp: make veno selectable as default congestion module
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jan Engelhardt [Thu, 11 Mar 2010 09:57:27 +0000 (09:57 +0000)]
net: tcp: make hybla selectable as default congestion module
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 9 Mar 2010 20:03:38 +0000 (20:03 +0000)]
net: remove rcu locking from fib_rules_event()
We hold RTNL at this point and dont use RCU variants of list traversals,
we dont need rcu_read_lock()/rcu_read_unlock()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
stephen hemminger [Tue, 2 Mar 2010 13:32:09 +0000 (13:32 +0000)]
bridge: per-cpu packet statistics (v3)
The shared packet statistics are a potential source of slow down
on bridged traffic. Convert to per-cpu array, but only keep those
statistics which change per-packet.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Herbert [Tue, 16 Mar 2010 08:03:29 +0000 (08:03 +0000)]
rps: Receive Packet Steering
This patch implements software receive side packet steering (RPS). RPS
distributes the load of received packet processing across multiple CPUs.
Problem statement: Protocol processing done in the NAPI context for received
packets is serialized per device queue and becomes a bottleneck under high
packet load. This substantially limits pps that can be achieved on a single
queue NIC and provides no scaling with multiple cores.
This solution queues packets early on in the receive path on the backlog queues
of other CPUs. This allows protocol processing (e.g. IP and TCP) to be
performed on packets in parallel. For each device (or each receive queue in
a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
process packets. A CPU is selected on a per packet basis by hashing contents
of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index
into the CPU mask. The IPI mechanism is used to raise networking receive
softirqs between CPUs. This effectively emulates in software what a multi-queue
NIC can provide, but is generic requiring no device support.
Many devices now provide a hash over the 4-tuple on a per packet basis
(e.g. the Toeplitz hash). This patch allow drivers to set the HW reported hash
in an skb field, and that value in turn is used to index into the RPS maps.
Using the HW generated hash can avoid cache misses on the packet when
steering it to a remote CPU.
The CPU mask is set on a per device and per queue basis in the sysfs variable
/sys/class/net/<device>/queues/rx-<n>/rps_cpus. This is a set of canonical
bit maps for receive queues in the device (numbered by <n>). If a device
does not support multi-queue, a single variable is used for the device (rx-0).
Generally, we have found this technique increases pps capabilities of a single
queue device with good CPU utilization. Optimal settings for the CPU mask
seem to depend on architectures and cache hierarcy. Below are some results
running 500 instances of netperf TCP_RR test with 1 byte req. and resp.
Results show cumulative transaction rate and system CPU utilization.
e1000e on 8 core Intel
Without RPS: 108K tps at 33% CPU
With RPS: 311K tps at 64% CPU
forcedeth on 16 core AMD
Without RPS: 156K tps at 15% CPU
With RPS: 404K tps at 49% CPU
bnx2x on 16 core AMD
Without RPS 567K tps at 61% CPU (4 HW RX queues)
Without RPS 738K tps at 96% CPU (8 HW RX queues)
With RPS: 854K tps at 76% CPU (4 HW RX queues)
Caveats:
- The benefits of this patch are dependent on architecture and cache hierarchy.
Tuning the masks to get best performance is probably necessary.
- This patch adds overhead in the path for processing a single packet. In
a lightly loaded server this overhead may eliminate the advantages of
increased parallelism, and possibly cause some relative performance degradation.
We have found that masks that are cache aware (share same caches with
the interrupting CPU) mitigate much of this.
- The RPS masks can be changed dynamically, however whenever the mask is changed
this introduces the possibility of generating out of order packets. It's
probably best not change the masks too frequently.
Signed-off-by: Tom Herbert <therbert@google.com>
include/linux/netdevice.h | 32 ++++-
include/linux/skbuff.h | 3 +
net/core/dev.c | 335 +++++++++++++++++++++++++++++++++++++--------
net/core/net-sysfs.c | 225 ++++++++++++++++++++++++++++++-
net/core/skbuff.c | 2 +
5 files changed, 538 insertions(+), 59 deletions(-)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tina Yang [Thu, 11 Mar 2010 13:50:07 +0000 (13:50 +0000)]
RDS: Enable per-cpu workqueue threads
Create per-cpu workqueue threads instead of a single
krdsd thread. This is a step towards better scalability.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Grover [Thu, 11 Mar 2010 13:50:06 +0000 (13:50 +0000)]
RDS: Do not call set_page_dirty() with irqs off
set_page_dirty() unconditionally re-enables interrupts, so
if we call it with irqs off, they will be on after the call,
and that's bad. This patch moves the call after we've re-enabled
interrupts in send_drop_to(), so it's safe.
Also, add BUG_ONs to let us know if we ever do call set_page_dirty
with interrupts off.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Sherman Pun [Thu, 11 Mar 2010 13:50:05 +0000 (13:50 +0000)]
RDS: Properly unmap when getting a remote access error
If the RDMA op has aborted with a remote access error,
in addition to what we already do (tell userspace it has
completed with an error) also unmap it and put() the rm.
Otherwise, hangs may occur on arches that track maps and
will not exit without proper cleanup.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Grover [Thu, 11 Mar 2010 13:50:04 +0000 (13:50 +0000)]
RDS: only put sockets that have seen congestion on the poll_waitq
rds_poll_waitq's listeners will be awoken if we receive a congestion
notification. Bad performance may result because *all* polled sockets
contend for this single lock. However, it should not be necessary to
wake pollers when a congestion update arrives if they have never
experienced congestion, and not putting these on the waitq will
hopefully greatly reduce contention.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tina Yang [Thu, 11 Mar 2010 13:50:03 +0000 (13:50 +0000)]
RDS: Fix locking in rds_send_drop_to()
It seems rds_send_drop_to() called
__rds_rdma_send_complete(rs, rm, RDS_RDMA_CANCELED)
with only rds_sock lock, but not rds_message lock. It raced with
other threads that is attempting to modify the rds_message as well,
such as from within rds_rdma_send_complete().
Signed-off-by: Tina Yang <tina.yang@oracle.com>
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Grover [Thu, 11 Mar 2010 13:50:02 +0000 (13:50 +0000)]
RDS: Turn down alarming reconnect messages
RDS's error messages when a connection goes down are a little
extreme. A connection may go down, and it will be re-established,
and everything is fine. This patch links these messages through
rdsdebug(), instead of to printk directly.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Grover [Thu, 11 Mar 2010 13:50:01 +0000 (13:50 +0000)]
RDS: Workaround for in-use MRs on close causing crash
if a machine is shut down without closing sockets properly, and
freeing all MRs, then a BUG_ON will bring it down. This patch
changes these to WARN_ONs -- leaking MRs is not fatal (although
not ideal, and there is more work to do here for a proper fix.)
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Tina Yang [Thu, 11 Mar 2010 13:50:00 +0000 (13:50 +0000)]
RDS: Fix send locking issue
Fix a deadlock between rds_rdma_send_complete() and
rds_send_remove_from_sock() when rds socket lock and
rds message lock are acquired out-of-order.
Signed-off-by: Tina Yang <Tina.Yang@oracle.com>
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Grover [Thu, 11 Mar 2010 13:49:59 +0000 (13:49 +0000)]
RDS: Fix congestion issues for loopback
We have two kinds of loopback: software (via loop transport)
and hardware (via IB). sw is used for 127.0.0.1, and doesn't
support rdma ops. hw is used for sends to local device IPs,
and supports rdma. Both are used in different cases.
For both of these, when there is a congestion map update, we
want to call rds_cong_map_updated() but not actually send
anything -- since loopback local and foreign congestion maps
point to the same spot, they're already in sync.
The old code never called sw loop's xmit_cong_map(),so
rds_cong_map_updated() wasn't being called for it. sw loop
ports would not work right with the congestion monitor.
Fixing that meant that hw loopback now would send congestion maps
to itself. This is also undesirable (racy), so we check for this
case in the ib-specific xmit code.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Grover [Thu, 11 Mar 2010 13:49:58 +0000 (13:49 +0000)]
RDS/TCP: Wait to wake thread when write space available
Instead of waking the send thread whenever any send space is available,
wait until it is at least half empty. This is modeled on how
sock_def_write_space() does it, and may help to minimize context
switches.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Grover [Thu, 11 Mar 2010 13:49:57 +0000 (13:49 +0000)]
RDS: update copy_to_user state in tcp transport
Other transports use rds_page_copy_user, which updates our
s_copy_to_user counter. TCP doesn't, so it needs to explicity
call rds_stats_add().
Reported-by: Richard Frank <richard.frank@oracle.com>
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Grover [Thu, 11 Mar 2010 13:49:56 +0000 (13:49 +0000)]
RDS: sendmsg() should check sndtimeo, not rcvtimeo
Most likely cut n paste error - sendmsg() was checking sock_rcvtimeo.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Andy Grover [Thu, 11 Mar 2010 13:49:55 +0000 (13:49 +0000)]
RDS: Do not BUG() on error returned from ib_post_send
BUGging on a runtime error code should be avoided. This
patch also eliminates all other BUG()s that have no real
reason to exist.
Signed-off-by: Andy Grover <andy.grover@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 Mar 2010 21:37:47 +0000 (14:37 -0700)]
bridge: Make first arg to deliver_clone const.
Otherwise we get a warning from the call in br_forward().
Signed-off-by: David S. Miller <davem@davemloft.net>
YOSHIFUJI Hideaki / 吉藤英明 [Mon, 15 Mar 2010 21:51:18 +0000 (21:51 +0000)]
bridge br_multicast: Don't refer to BR_INPUT_SKB_CB(skb)->mrouters_only without IGMP snooping.
Without CONFIG_BRIDGE_IGMP_SNOOPING,
BR_INPUT_SKB_CB(skb)->mrouters_only is not appropriately
initialized, so we can see garbage.
A clear option to fix this is to set it even without that
config, but we cannot optimize out the branch.
Let's introduce a macro that returns value of mrouters_only
and let it return 0 without CONFIG_BRIDGE_IGMP_SNOOPING.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Vitaliy Gusev [Tue, 16 Mar 2010 01:07:51 +0000 (01:07 +0000)]
route: Fix caught BUG_ON during rt_secret_rebuild_oneshot()
route: Fix caught BUG_ON during rt_secret_rebuild_oneshot()
Call rt_secret_rebuild can cause BUG_ON(timer_pending(&net->ipv4.rt_secret_timer)) in
add_timer as there is not any synchronization for call rt_secret_rebuild_oneshot()
for the same net namespace.
Also this issue affects to rt_secret_reschedule().
Thus use mod_timer enstead.
Signed-off-by: Vitaliy Gusev <vgusev@openvz.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
YOSHIFUJI Hideaki / 吉藤英明 [Mon, 15 Mar 2010 19:26:56 +0000 (19:26 +0000)]
bridge br_multicast: Fix skb leakage in error path.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
YOSHIFUJI Hideaki / 吉藤英明 [Mon, 15 Mar 2010 19:27:00 +0000 (19:27 +0000)]
bridge br_multicast: Fix handling of Max Response Code in IGMPv3 message.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Slaby [Tue, 16 Mar 2010 05:29:54 +0000 (05:29 +0000)]
NET: netpoll, fix potential NULL ptr dereference
Stanse found that one error path in netpoll_setup dereferences npinfo
even though it is NULL. Avoid that by adding new label and go to that
instead.
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Daniel Borkmann <danborkmann@googlemail.com>
Cc: David S. Miller <davem@davemloft.net>
Acked-by: chavey@google.com
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Neil Horman [Tue, 16 Mar 2010 08:14:33 +0000 (08:14 +0000)]
tipc: fix lockdep warning on address assignment
So in the forward porting of various tipc packages, I was constantly
getting this lockdep warning everytime I used tipc-config to set a network
address for the protocol:
[ INFO: possible circular locking dependency detected ]
2.6.33 #1
tipc-config/1326 is trying to acquire lock:
(ref_table_lock){+.-...}, at: [<
ffffffffa0315148>] tipc_ref_discard+0x53/0xd4 [tipc]
but task is already holding lock:
(&(&entry->lock)->rlock#2){+.-...}, at: [<
ffffffffa03150d5>] tipc_ref_lock+0x43/0x63 [tipc]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #1 (&(&entry->lock)->rlock#2){+.-...}:
[<
ffffffff8107b508>] __lock_acquire+0xb67/0xd0f
[<
ffffffff8107b78c>] lock_acquire+0xdc/0x102
[<
ffffffff8145471e>] _raw_spin_lock_bh+0x3b/0x6e
[<
ffffffffa03152b1>] tipc_ref_acquire+0xe8/0x11b [tipc]
[<
ffffffffa031433f>] tipc_createport_raw+0x78/0x1b9 [tipc]
[<
ffffffffa031450b>] tipc_createport+0x8b/0x125 [tipc]
[<
ffffffffa030f221>] tipc_subscr_start+0xce/0x126 [tipc]
[<
ffffffffa0308fb2>] process_signal_queue+0x47/0x7d [tipc]
[<
ffffffff81053e0c>] tasklet_action+0x8c/0xf4
[<
ffffffff81054bd8>] __do_softirq+0xf8/0x1cd
[<
ffffffff8100aadc>] call_softirq+0x1c/0x30
[<
ffffffff810549f4>] _local_bh_enable_ip+0xb8/0xd7
[<
ffffffff81054a21>] local_bh_enable_ip+0xe/0x10
[<
ffffffff81454d31>] _raw_spin_unlock_bh+0x34/0x39
[<
ffffffffa0308eb8>] spin_unlock_bh.clone.0+0x15/0x17 [tipc]
[<
ffffffffa0308f47>] tipc_k_signal+0x8d/0xb1 [tipc]
[<
ffffffffa0308dd9>] tipc_core_start+0x8a/0xad [tipc]
[<
ffffffffa01b1087>] 0xffffffffa01b1087
[<
ffffffff8100207d>] do_one_initcall+0x72/0x18a
[<
ffffffff810872fb>] sys_init_module+0xd8/0x23a
[<
ffffffff81009b42>] system_call_fastpath+0x16/0x1b
-> #0 (ref_table_lock){+.-...}:
[<
ffffffff8107b3b2>] __lock_acquire+0xa11/0xd0f
[<
ffffffff8107b78c>] lock_acquire+0xdc/0x102
[<
ffffffff81454836>] _raw_write_lock_bh+0x3b/0x6e
[<
ffffffffa0315148>] tipc_ref_discard+0x53/0xd4 [tipc]
[<
ffffffffa03141ee>] tipc_deleteport+0x40/0x119 [tipc]
[<
ffffffffa0316e35>] release+0xeb/0x137 [tipc]
[<
ffffffff8139dbf4>] sock_release+0x1f/0x6f
[<
ffffffff8139dc6b>] sock_close+0x27/0x2b
[<
ffffffff811116f6>] __fput+0x12a/0x1df
[<
ffffffff811117c5>] fput+0x1a/0x1c
[<
ffffffff8110e49b>] filp_close+0x68/0x72
[<
ffffffff8110e552>] sys_close+0xad/0xe7
[<
ffffffff81009b42>] system_call_fastpath+0x16/0x1b
Finally decided I should fix this. Its a straightforward inversion,
tipc_ref_acquire takes two locks in this order:
ref_table_lock
entry->lock
while tipc_deleteport takes them in this order:
entry->lock (via tipc_port_lock())
ref_table_lock (via tipc_ref_discard())
when the same entry is referenced, we get the above warning. The fix is equally
straightforward. Theres no real relation between the entry->lock and the
ref_table_lock (they just are needed at the same time), so move the entry->lock
aquisition in tipc_ref_acquire down, after we unlock ref_table_lock (this is
safe since the ref_table_lock guards changes to the reference table, and we've
already claimed a slot there. I've tested the below fix and confirmed that it
clears up the lockdep issue
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Allan Stephens <allan.stephens@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
James Chapman [Tue, 16 Mar 2010 06:29:20 +0000 (06:29 +0000)]
l2tp: Fix UDP socket reference count bugs in the pppol2tp driver
This patch fixes UDP socket refcnt bugs in the pppol2tp driver.
A bug can cause a kernel stack trace when a tunnel socket is closed.
A way to reproduce the issue is to prepare the UDP socket for L2TP (by
opening a tunnel pppol2tp socket) and then close it before any L2TP
sessions are added to it. The sequence is
Create UDP socket
Create tunnel pppol2tp socket to prepare UDP socket for L2TP
pppol2tp_connect: session_id=0, peer_session_id=0
L2TP SCCRP control frame received (tunnel_id==0)
pppol2tp_recv_core: sock_hold()
pppol2tp_recv_core: sock_put
L2TP ZLB control frame received (tunnel_id=nnn)
pppol2tp_recv_core: sock_hold()
pppol2tp_recv_core: sock_put
Close tunnel management socket
pppol2tp_release: session_id=0, peer_session_id=0
Close UDP socket
udp_lib_close: BUG
The addition of sock_hold() in pppol2tp_connect() solves the problem.
For data frames, two sock_put() calls were added to plug a refcnt leak
per received data frame. The ref that is grabbed at the top of
pppol2tp_recv_core() must always be released, but this wasn't done for
accepted data frames or data frames discarded because of bad UDP
checksums. This leak meant that any UDP socket that had passed L2TP
data traffic (i.e. L2TP data frames, not just L2TP control frames)
using pppol2tp would not be released by the kernel.
WARNING: at include/net/sock.h:435 udp_lib_unhash+0x117/0x120()
Pid: 1086, comm: openl2tpd Not tainted 2.6.33-rc1 #8
Call Trace:
[<
c119e9b7>] ? udp_lib_unhash+0x117/0x120
[<
c101b871>] ? warn_slowpath_common+0x71/0xd0
[<
c119e9b7>] ? udp_lib_unhash+0x117/0x120
[<
c101b8e3>] ? warn_slowpath_null+0x13/0x20
[<
c119e9b7>] ? udp_lib_unhash+0x117/0x120
[<
c11598a7>] ? sk_common_release+0x17/0x90
[<
c11a5e33>] ? inet_release+0x33/0x60
[<
c11577b0>] ? sock_release+0x10/0x60
[<
c115780f>] ? sock_close+0xf/0x30
[<
c106e542>] ? __fput+0x52/0x150
[<
c106b68e>] ? filp_close+0x3e/0x70
[<
c101d2e2>] ? put_files_struct+0x62/0xb0
[<
c101eaf7>] ? do_exit+0x5e7/0x650
[<
c1081623>] ? mntput_no_expire+0x13/0x70
[<
c106b68e>] ? filp_close+0x3e/0x70
[<
c101eb8a>] ? do_group_exit+0x2a/0x70
[<
c101ebe1>] ? sys_exit_group+0x11/0x20
[<
c10029b0>] ? sysenter_do_call+0x12/0x26
Signed-off-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>