Discussion:
[rsyslog] Stuck omfwd connections
Tim Smith
10 years ago
Hi,

I have a pair of Linux/RHEL servers (RHEL 6.x), A and B, that forward logs
to multiple destinations:
- one copy to Splunk syslog listener
- one copy to local flume process over TCP
- one copy to a remote RSyslog receiver, X and Y (RHEL 6.x)

Forwarding copies to Splunk and Flume works fine. However, forwarding to
the remote RSyslog receivers gets stuck in a strange way. The forwarding is
set up as:
RSyslog-Server-A -> RSyslog-Server-X
RSyslog-Server-B -> RSyslog-Server-Y

All four (A, B, X and Y) are running exactly the same version of RSyslog,
8.6.0-2, from the Adiscon repo:
rsyslog-8.6.0-2.el6.x86_64

What happens is A/B stop sending logs to X/Y. Looking at the send/receive
TCP queues at both ends, the receive queue on X/Y is clear but the sendQ on
A/B gets stuck. As an example, this connection lingers forever (extracted
with netstat -an | grep EST):
tcp 0 103660 10.24.62.9:47081 10.2.1.2:514 ESTABLISHED
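
To spot such connections without eyeballing the full netstat listing, a small filter along these lines may help. This is only a sketch: the function name `sendq_backlog` and its arguments are made up for illustration, and it assumes the six-column Linux `netstat -an` layout shown above (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State).

```shell
# Hypothetical helper: read `netstat -an` output on stdin and print
# ESTABLISHED TCP connections to the given remote port whose Send-Q
# is at least MIN bytes (default 1).
sendq_backlog() {
    port=$1
    min=${2:-1}
    awk -v port="$port" -v min="$min" '
        $1 ~ /^tcp/ && $6 == "ESTABLISHED" && $3 + 0 >= min + 0 {
            # The remote port is whatever follows the last colon.
            n = split($5, parts, ":")
            if (parts[n] == port) print $4, $5, $3
        }'
}
```

Usage would be something like `netstat -an | sendq_backlog 514`, run twice a minute or so apart. A connection that shows the same non-zero Send-Q both times is a candidate for the stuck state described here.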

Observations:
==========
- The connection remains established with the same number of bytes in the
sendQ
- No data is transferred over the "stuck" connection, looking at tcpdump
- Re-starting the receive end, X/Y, does not help
- I don't see an action suspended error in the rsyslog logs
- Running the send side in debug mode did not help: I easily ended up with
100+ GB of debug logs without the issue manifesting itself. The A/B pair
handles lots of traffic, and running rsyslogd in debug mode reduces its
throughput, so perhaps the issue does not manifest at lower EPS.
- Only re-starting the send side, A/B, resolves the issue.

I tweaked the omfwd action to change TCP_Framing from the default
("traditional") to "octet-counted".
Here is the send side omfwd config on A/B:
--------------------
action (name="it_tcp_X" type="omfwd" Target="X.abc.com" Port="514"
Protocol="tcp" TCP_Framing="octet-counted" queue.filename="it_tcp_X"
queue.maxdiskspace="10G" queue.Size="8640000"
queue.dequeuebatchsize="4096" queue.type="LinkedList"
queue.timeoutenqueue="0" queue.maxfilesize="1G" queue.saveonshutdown="on"
queue.workerThreads="4" RebindInterval="10000000" template="fwdformat" )
--------------------
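
For reference, octet-counted framing (RFC 6587) prefixes each message with its length in bytes and a single space, instead of relying on a trailing line feed to mark the end of the message. A rough illustration of what goes on the wire follows. The function name `octet_frame` is invented for this sketch and is not part of rsyslog.

```shell
# Hypothetical helper: print MSG framed as "<byte-length> <MSG>",
# the octet-counted layout described in RFC 6587.
octet_frame() {
    msg=$1
    # wc -c counts bytes; the arithmetic expansion strips any padding.
    len=$(( $(printf '%s' "$msg" | wc -c) ))
    printf '%s %s' "$len" "$msg"
}
```

For example, `octet_frame '<13>hello'` prints `9 <13>hello`.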


The receive side, X/Y, config:
--------------------
module(load="imptcp" threads="16") # needs to be done just once

global (
workdirectory="/data/rsyslog/queues"
maxmessagesize="64K"
debug.logfile="/data/rsyslog/debug/debug.log"
net.enabledns="off"
)

$DebugLevel 0

main_queue (
queue.FileName="globalqueue"
queue.Type="LinkedList"
queue.MaxDiskSpace="250g"
queue.maxfilesize="5g"
queue.Size="864000000"
queue.dequeuebatchsize="1000"
queue.TimeoutEnqueue="0"
queue.workerThreads="4"
queue.SaveOnShutdown="on"
)

ruleset(name="aggregate") {
action (name="to_flume"
type="omfwd"
Target="localhost"
Port="5614"
Protocol="tcp"
queue.filename="to_flume"
queue.size="360000000"
queue.maxdiskspace="360G"
queue.highwatermark="216000000" # 60% of queue.size
queue.discardmark="288000000" # 80% of queue.size
queue.type="LinkedList"
queue.dequeuebatchsize="4096"
queue.timeoutenqueue="0"
queue.maxfilesize="4G"
queue.saveonshutdown="on"
queue.workerThreads="4"
RebindInterval="10000000"
template="rawfwd"
) stop
}

input(type="imptcp" port="514" ruleset="aggregate")
--------------------
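
The high watermark and discard mark in the to_flume action are given as percentages of queue.size in the comments. A quick way to recompute them when queue.size changes, as a sketch with a made-up function name:

```shell
# Hypothetical helper: print PCT percent of SIZE using integer arithmetic.
queue_mark() {
    size=$1
    pct=$2
    echo $(( size * pct / 100 ))
}
```

With the values above, `queue_mark 360000000 60` gives 216000000 and `queue_mark 360000000 80` gives 288000000, matching the configured marks.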

Any pointers to troubleshoot and smoke out the bug will be highly
appreciated :)

Thanks
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT.
Tim Smith
10 years ago
As I was typing out the email, it occurred to me that the issue is OS
related:

Looking at a sending server, A, I saw these messages in dmesg:
TCP: Peer 10.2.1.2:514/47081 unexpectedly shrunk window 861404336:861405796 (repaired)

The local TCP port, 47081, is the same one used by the stuck
connection.
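
To correlate these kernel messages with netstat output on a busy host, the peer and local port can be pulled out of dmesg mechanically. This is a sketch only: `shrunk_window_ports` is a hypothetical name, and it assumes the message layout shown above, where the number after the slash is the local port.

```shell
# Hypothetical helper: read dmesg output on stdin and print
# "peer_ip peer_port local_port" for each shrunk-window message.
shrunk_window_ports() {
    sed -n 's/.*Peer \([0-9.]*\):\([0-9]*\)\/\([0-9]*\) unexpectedly shrunk window.*/\1 \2 \3/p'
}
```

Running `dmesg | shrunk_window_ports` on server A would print `10.2.1.2 514 47081` for the message above, and the last field can then be matched against the local port of any connection with a stuck Send-Q.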

Now I know what the problem is :) However, I cannot seem to find a fix :(
...
Tim Smith
10 years ago
I tweaked a few OS/kernel parameters, such as Ethernet driver options, but
in the end this seems to have done the trick:
sysctl -w net.ipv4.tcp_window_scaling=0
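
Note that `sysctl -w` only changes the running kernel, so the setting is lost on reboot unless it is also written to a sysctl configuration file (on RHEL 6, /etc/sysctl.conf is read at boot). A minimal sketch of doing that idempotently follows. The function name `persist_sysctl` is hypothetical, and the file path is passed in so it can be tried on a scratch copy first.

```shell
# Hypothetical helper: set KEY = VALUE in FILE, replacing any existing
# line for KEY so that repeated runs do not pile up duplicates.
persist_sysctl() {
    key=$1
    value=$2
    file=$3
    tmp=$(mktemp) || return 1
    # Escape the dots in KEY so grep treats them literally.
    pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
    if [ -f "$file" ]; then
        # Keep every line that does not already set this key.
        grep -v -E "^[[:space:]]*${pattern}[[:space:]]*=" "$file" > "$tmp" || true
    fi
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

As root, `persist_sysctl net.ipv4.tcp_window_scaling 0 /etc/sysctl.conf` would make the change survive a reboot. Bear in mind that without window scaling the advertised TCP receive window cannot exceed 65,535 bytes, which may limit throughput on high-latency paths, so this is better treated as a workaround than a fix.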
...
David Lang
10 years ago
This makes me think that you have a firewall between the two that doesn't
understand window scaling and is stripping it out of the packets (breaking
things when scaling is in use).

ISPs do not normally do this, but if you have an old firewall somewhere in
the path, check it out. It probably needs to be updated anyway, both to patch
security holes and to get it onto a supported version; this is an old problem.

David Lang
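
One way to test this hypothesis is to capture the TCP handshake on both the sender and the receiver and compare the window scale option in the SYN and SYN-ACK packets. If the `wscale` value seen on one side differs from, or is missing compared to, what the other side sent, something in the path is rewriting it. The filter below is a sketch that assumes tcpdump's usual one-line text output. The name `syn_wscale` is made up.

```shell
# Hypothetical helper: read tcpdump text output on stdin and, for each
# SYN or SYN-ACK, print "src > dst wscale=N" (or wscale=none).
syn_wscale() {
    awk '/Flags \[S\.?\]/ {
        conn = "?"
        if (match($0, /[0-9.]+ > [0-9.]+:/))
            conn = substr($0, RSTART, RLENGTH - 1)   # drop trailing colon
        ws = "none"
        if (match($0, /wscale [0-9]+/))
            ws = substr($0, RSTART + 7, RLENGTH - 7) # skip "wscale "
        print conn, "wscale=" ws
    }'
}
```

For example, `tcpdump -n -i eth0 'tcp port 514 and tcp[tcpflags] & tcp-syn != 0' | syn_wscale` on each end (the interface name is an assumption), followed by a restart of the sending rsyslog to force a fresh connection.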
...
Tim Smith
10 years ago
Thanks David.

Issues with Layer-3 network devices (routers, firewalls, etc.) did occur to
me, but honestly, my priority was to stabilize the log stream between the
two ends without involving others, since that usually slows down the
process. Also, unless it is a bug causing a major outage, network engineers
are very reluctant to cooperate closely on troubleshooting such issues :)
That said, now that my log stream is stable, I will open a case with
our network engineering team. I will post back if I find something useful.

As a side note, my sending rsyslog servers are running on hardware with tg3
Ethernet drivers. Googling around, people seem to have had lots of trouble
with tg3 drivers, and I saw several recommendations to turn off a bunch of
TCP features that are offloaded to the Ethernet card: "gso off tso off sg
off gro off"
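
Those offload settings are applied per interface with `ethtool -K`. The sketch below first identifies the driver from `ethtool -i` output and then prints (rather than runs) the command that would disable the four offloads. Both function names and the interface name `eth0` are assumptions for illustration.

```shell
# Hypothetical helper: read `ethtool -i IFACE` output on stdin and
# print the driver name (for example "tg3").
nic_driver() {
    awk -F': *' '$1 == "driver" { print $2 }'
}

# Hypothetical helper: print the ethtool command that would turn off
# the offload features mentioned above for IFACE (a dry run).
offload_off_cmd() {
    iface=$1
    printf 'ethtool -K %s gso off tso off sg off gro off\n' "$iface"
}
```

For instance, `ethtool -i eth0 | nic_driver` confirms whether the interface really uses tg3, and the output of `offload_off_cmd eth0` can be reviewed and then run as root. Like `sysctl -w`, changes made with `ethtool -K` do not persist across reboots on their own.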
...