Discussion:
[rsyslog] clarification on queues and forwarding messages
Matt Garman via rsyslog
2018-11-13 21:33:52 UTC
Using rsyslog 8.34 on CentOS 6 and 7 (Adiscon RPMs). We are working
on implementing a central logging server.

We somehow made a mistake when configuring the remote forward rule: we
forgot to think about queues (in particular, what happens when the
remote log server is unreachable). This is the rule we were using:

*.* action(type="omrelp" target="central-log-server" port="20514" tls="on")

This worked just fine while everything was healthy. We were rolling
the above directive out to more and more systems when we started
experiencing issues that resembled a network outage. We stumbled on
this message, which gave us the "ah-hah" moment:

https://lists.gt.net/rsyslog/users/7949

With the directive I specified above, rsyslog works in "direct queue"
mode (a fancy way to say "no queue"). And apparently, this can lead
to effectively crippling the network interface if the remote server is
unavailable.

With that in mind, clearly I need a better forwarding config.
Here's what I want, in English: "Try really hard not to lose any
messages by queueing as much as you can; after that wait until the
network comes back." In other words, I'd rather lose log messages
than have the network soft lock.

I may be missing something, but I can't find how to tell rsyslog to
"wait until the network comes back when queue is full". Maybe that's
implicit?

Here's the revised rule I'm working on - am I missing anything?

*.* action(type="omrelp" target="central-log-server" port="20514" tls="on"
    # params for in-memory queue
    queue.type="LinkedList"
    queue.size="1000000"
    # params for disk assisted queue, i.e. spillover for in-memory queue
    queue.saveOnShutdown="on"
    queue.maxDiskSpace="5g"
    queue.filename="rsyslog.central-log-server"
    # parameters specific to this action
    action.reportSuspension="on"
    action.reportSuspensionContinuation="on"
)

Thanks!
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT.
David Lang
2018-11-13 22:21:38 UTC
Post by Matt Garman via rsyslog
> With the directive I specified above, rsyslog works in "direct queue"
> mode (a fancy way to say "no queue"). And apparently, this can lead
> to effectively crippling the network interface if the remote server is
> unavailable.
>
> That in mind, clearly I need to have a better forwarding config.
> Here's what I want, in English: "Try really hard not to lose any
> messages by queueing as much as you can; after that wait until the
> network comes back." In other words, I'd rather lose log messages
> than have the network soft lock.
>
> I may be missing something, but I can't find how to tell rsyslog to
> "wait until the network comes back when queue is full". Maybe that's
> implicit?

When the connection starts working again, logs should start flowing again
without any action on your part.

There is a default main queue. This results in one worker thread that goes
through all actions, in order, and if any of them blocks, the worker thread
waits until it clears (either by delivering the log, or by giving up, at which
point that log will never go to that destination).

You can add additional queues on either actions or rulesets (if you want one
queue for all network-based things, put them in one ruleset and put the queue on
that ruleset).
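
A minimal sketch of the "queue on a ruleset" variant, assuming the omrelp target from the original post. The ruleset name and queue file name are made up for illustration, and the queue sizes are left at their defaults:

```
module(load="omrelp")

# "remote_forward" is a hypothetical name. The queue belongs to the ruleset,
# so every action placed inside it shares this one queue.
ruleset(name="remote_forward"
        queue.type="LinkedList"
        queue.filename="remote_forward"
        queue.saveOnShutdown="on") {
    action(type="omrelp" target="central-log-server" port="20514" tls="on")
}

# In the default ruleset: hand each message over to the queued ruleset.
call remote_forward
```

With a single network action, putting the queue.* parameters directly on the action (as in the revised rule) should behave much the same; the ruleset form mainly pays off when several network outputs are meant to share one queue.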

Adding a queue makes it so that when the main queue worker thread gets to that
action, instead of performing the action/ruleset, it copies the message to the
queue. If that queue is full, it stops and waits, just like it would if the
action were there and blocking.

Post by Matt Garman via rsyslog
> Here's the revised rule I'm working on - am I missing anything?
> *.* action(type="omrelp" target="central-log-server" port="20514" tls="on"

Note: the *.* is redundant; you can just start the line with action(

Post by Matt Garman via rsyslog
> # params for in-memory queue
> queue.type="LinkedList"

Personally, I use array (queue.type="FixedArray"); I find it a smidge faster
than LinkedList in my own testing.

Post by Matt Garman via rsyslog
> queue.size="1000000"

This is a fairly large queue. Run a test where you block the network output (say,
with iptables) and fill up the queue with average or large messages, to make sure
this doesn't eat up too much RAM.

Post by Matt Garman via rsyslog
> # params for disk assisted queue, i.e. spillover for in-memory queue
> queue.saveOnShutdown="on"
> queue.maxDiskSpace="5g"

Why limit it? Is your disk really that small? (If it's a VM/container, it may be.)

Post by Matt Garman via rsyslog
> queue.filename="rsyslog.central-log-server"
> # parameters specific to this action
> action.reportSuspension="on"
> action.reportSuspensionContinuation="on"
> )

You may want to set retries. Also look at the high/low watermark settings, which
control what happens when the queue starts filling up.
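
A rough sketch of what setting retries and watermarks could look like on the revised rule. The parameter names are standard rsyslog action and queue parameters; the values are illustrative assumptions only and should be sized against your own message rates and RAM:

```
action(type="omrelp" target="central-log-server" port="20514" tls="on"
       queue.type="LinkedList"
       queue.size="1000000"
       queue.filename="rsyslog.central-log-server"
       queue.maxDiskSpace="5g"
       queue.saveOnShutdown="on"
       # -1 = keep retrying the destination instead of giving up on it
       action.resumeRetryCount="-1"
       # seconds to wait between retries
       action.resumeInterval="30"
       # illustrative values: start writing to disk at 800k queued messages,
       # stop again once the in-memory queue has drained to 200k
       queue.highWatermark="800000"
       queue.lowWatermark="200000"
)
```

Without action.resumeRetryCount, the action eventually gives up on a message, which is the "that log will not ever go to that destination" case described earlier.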

I also want all the logs, but don't want to risk hurting the production servers.
What I do is I make my syslog relays redundant (sharing an IP one way or
another), on the same subnet as the production systems (greatly reducing the
number of things that can block the network), and then use UDP for the first
hop. I use TCP/RELP from the relays to the central collectors, but I'd rather
lose logs than have an unexpected buildup of logs on the servers.
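
As a sketch only, that two-hop layout might be written along these lines. The relay host name, the UDP port, and the queue file name are assumptions made for illustration; the central target is the one from the original post:

```
# --- on each production server ---
# Fire-and-forget UDP to a relay on the same subnet ("syslog-relay" is
# a hypothetical name); a dead relay cannot back up into this host.
action(type="omfwd" target="syslog-relay" port="514" protocol="udp")

# --- on the relay ---
module(load="imudp")
module(load="omrelp")
input(type="imudp" port="514")

# Reliable, queued transport from the relay to the central collector.
action(type="omrelp" target="central-log-server" port="20514" tls="on"
       queue.type="LinkedList"
       queue.filename="to_central"
       queue.saveOnShutdown="on")
```

The trade-off is the one described above: messages can be lost on the first hop, but any backlog builds up on the relays rather than on the production servers.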

David Lang
Matt Garman via rsyslog
2018-11-14 16:09:47 UTC
Hi David,

Thank you for the helpful reply! A few follow-up questions...

First, in our original forwarding config:
*.* action(type="omrelp" target="central-log-server" port="20514" tls="on")
There is no queue, which means there is only the one main worker
thread. So if that main worker thread can't send, it will keep
trying, blocking all other operations. I can see how this will
effectively cause rsyslog to hang, but it appears to actually cause
the whole interface to hang. Why is that?

More detail for our case: we have an NFS server that exports a
filesystem to over 100 nodes. We want to capture NFS server logs on
our central rsyslog server. The rsyslog server lives in a subnet that
has lower bandwidth (1gbps) than the NFS subnet (10gbps or greater).
The same interface on the NFS server is used for NFS and rsyslog.
What we found is that when that rsyslog network becomes saturated,
rsyslog on the NFS server effectively thinks the central server is
down. But that also kills all NFS traffic, i.e. all the NFS client
nodes have issues.

So that's fundamentally what I'm trying to understand: why does a hung
rsyslog (hung due to blocking on send) effectively kill the whole
network interface?

Post by David Lang
> This then makes it so that in the main queue worker thread, when it gets to that
> action, instead of performing the action/ruleset, it copies the message to the
> queue. If that queue is full, it stops and waits, just like it would if the
> action was there and blocking.

So in my proposed "new" config (where I explicitly configure a DA
memory queue), I will automatically have at least one worker thread
(based on my understanding of the "Worker Thread Pools" section of
https://www.rsyslog.com/doc/v8-stable/concepts/queues.html).

In other words, the main thread should no longer hang: the message
should get passed to the queue I've defined, and that queue's worker
thread(s) should take over, leaving the main thread to go about its
business.

But, if both the memory queue and the disk queue get 100% full, is it
possible for the queue worker threads to hang the whole network
interface the same way the main thread can (in the queue-less config
above)?

Post by David Lang
> you may want to set retries. Also look at the high/low watermark configs for
> what to do when you start filling up the queue.

The defaults for the high/low watermarks look reasonable to me.

I'm looking at discardMark and discardSeverity - the defaults
seem reasonable there too.

But I still don't understand what happens when the in-memory and disk
queues are full: are all messages dropped, or do I run the risk of the
whole interface hanging?

Post by David Lang
> I also want all the logs, but don't want to risk hurting the production servers.
> What I do is I make my syslog relays redundant (sharing an IP one way or
> another), on the same subnet as the production systems (greatly reducing the
> number of things that can block the network), and then use UDP for the first
> hop. I use TCP/RELP from the relays to the central collectors, but I'd rather
> lose logs than have an unexpected buildup of logs on the servers.

I agree, that is a good architecture, and one we would like to use.
However, I'm not sure we can get away with any UDP, as management has
security concerns.

Thanks again!
David Lang
2018-11-14 19:09:55 UTC
Post by Matt Garman via rsyslog
> Thank you for the helpful reply! A few follow-up questions...
> *.* action(type="omrelp" target="central-log-server" port="20514" tls="on")
> There is no queue, which means there is only the one main worker
> thread. So if that main worker thread can't send, it will keep
> trying, blocking all other operations. I can see how this will
> effectively cause rsyslog to hang, but it appears to actually cause
> the whole interface to hang. Why is that?

When you try to write to syslog, the spec says that the write should block until
the message is written to disk. This has been relaxed so that the writer is
blocked only until the message is processed and put on the main queue, but if the
main queue is full, you won't be able to do anything (including logging into the
machine, as it tries to write a log to record your login).

Post by Matt Garman via rsyslog
> More detail for our case: we have an NFS server that exports a
> filesystem to over 100 nodes. We want to capture NFS server logs on
> our central rsyslog server. The rsyslog server lives in a subnet that
> has lower bandwidth (1gbps) than the NFS subnet (10gbps or greater).
> The same interface on the NFS server is used for NFS and rsyslog.
> What we found is that when that rsyslog network becomes saturated,
> rsyslog on the NFS server effectively thinks the central server is
> down. But that also kills all NFS traffic, i.e. all the NFS client
> nodes have issues.
>
> So that's fundamentally what I'm trying to understand: why does a hung
> rsyslog (hung due to blocking on send) effectively kill the whole
> network interface?

It's not killing the network interface. What's happening is that the NFS
daemon hangs when it tries to write to syslog.

Putting a queue on the action (disk-assisted, so you aren't limited to the
available RAM) will let it keep writing logs to rsyslog, even if rsyslog can't
deliver the logs.

Post by Matt Garman via rsyslog
> Post by David Lang
>> This then makes it so that in the main queue worker thread, when it gets to that
>> action, instead of performing the action/ruleset, it copies the message to the
>> queue. If that queue is full, it stops and waits, just like it would if the
>> action was there and blocking.
> So in my proposed "new" config (where I explicitly configure a DA
> memory queue), I will automatically have at least one worker thread
> (based on my understanding of the "Worker Thread Pools" section of
> https://www.rsyslog.com/doc/v8-stable/concepts/queues.html).
>
> In other words, the main thread should no longer hang: the message
> should get passed to the queue I've defined, and that queue's worker
> thread(s) should take over, leaving the main thread to go about its
> business.
>
> But, if both the memory queue and the disk queue get 100% full, is it
> possible for the queue worker threads to hang the whole network
> interface the same way the main thread can (in the queue-less config
> above)?

Yes, you will have a separate worker thread for the action queue, and a separate
worker thread for the disk queue.

But if both of those fill up, it will then block the main queue, and once that
fills up, it will block processes from writing logs.
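
If dropping messages is preferable to blocking, the queue parameters offer a few knobs for that. The following is a sketch under the assumption that queue.timeoutEnqueue, queue.discardMark and queue.discardSeverity behave as described in the rsyslog queue documentation; the values are illustrative and worth verifying with a blocked-network test before relying on them:

```
action(type="omrelp" target="central-log-server" port="20514" tls="on"
       queue.type="LinkedList"
       queue.size="1000000"
       queue.filename="rsyslog.central-log-server"
       queue.maxDiskSpace="5g"
       queue.saveOnShutdown="on"
       action.resumeRetryCount="-1"
       # When this queue is completely full, discard the new message
       # immediately rather than making the main queue worker wait for space.
       queue.timeoutEnqueue="0"
       # Illustrative: once 950k messages are queued, start discarding
       # messages of severity 5 (notice) and less important ones.
       queue.discardMark="950000"
       queue.discardSeverity="5"
)
```

The intent is that a full action queue costs log messages instead of stalling the main queue, which matches the stated preference for losing logs over hanging the server.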
Post by Matt Garman via rsyslog
> Post by David Lang
>> you may want to set retries. Also look at the high/low watermark configs for
>> what to do when you start filling up the queue.
> I think the defaults for high/low watermarks look reasonable to me.
> I'm looking at discardMark and discardSeverity - the defaults also
> seem reasonable here too.

Also, enable impstats; that will show you the status of all the queues.
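
A minimal sketch of enabling impstats, assuming a 60-second interval and a dedicated stats file (the path is a placeholder, not something from this thread):

```
module(load="impstats"
       interval="60"        # emit counters every 60 seconds
       severity="7"         # log the stats at debug severity
       log.syslog="off"     # keep stats out of the normal log stream
       log.file="/var/log/rsyslog-stats.log")   # hypothetical path
```

Each queue reports counters such as its current size, how many times it was full, and how many messages it discarded, which makes it possible to see whether the action queue or the main queue is the one backing up.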
Post by Matt Garman via rsyslog
> But I still don't understand what happens when the in-memory and disk
> queues are full: are all messages dropped, or do I run the risk of the
> whole interface hanging?

See above.

Post by Matt Garman via rsyslog
> Post by David Lang
>> I also want all the logs, but don't want to risk hurting the production servers.
>> What I do is I make my syslog relays redundant (sharing an IP one way or
>> another), on the same subnet as the production systems (greatly reducing the
>> number of things that can block the network), and then use UDP for the first
>> hop. I use TCP/RELP from the relays to the central collectors, but I'd rather
>> lose logs than have an unexpected buildup of logs on the servers.
> I agree, that is a good architecture, and one we would like to use.
> However, not sure if we can get away with any UDP, as management has
> security concerns.

Concerns because it is not going to be encrypted? Or concerns because it's
'unreliable'?

David Lang