Eric Campusano via rsyslog
2018-11-15 04:14:55 UTC
I have dozens of c4.xlarge HVM CentOS 6 Linux instances running in AWS.
They're part of a distributed measurement network that collects data and
forwards it to a central location using rsyslog. All instances are running
within the same VPC and it's a flat /24 network. The version of rsyslog
that I'm running is 8.36.0-2 and I'm using the syslog protocol, not RELP or
TLS or any of the other fancier transport protocols that rsyslog supports.
Each measurement node forwards dozens of events per second to the central
collector. All of this works great except that once every 3-4 months the
measurement data on a small number of instances will become corrupted, with
a single character in a field replaced by a nearby one. For example, one of
the fields of data that these instances forward to the central collector
includes the hostname of the instance. This instance is named
"tagserver-prod-us-west-2-i-0d16b7e4c51e85697.internal.domain", but here
are the erroneous values that it used for its hostname when this problem
occurred during a period of 15 minutes:
tagserver-ppod-us-west-2-i-0d16b7e4c51e85697.internal.domain
tagserver-prod-uq-west-2-i-0d16b7e4c51e85697.internal.domain
tagserver-prod-us-west-0-i-0d16b7e4c51e85697.internal.domain
tagserver-prod-us-west-2-i-0d14b7e4c51e85697.internal.domain
tagserver-prod-us-west-2-i-0d16b5e4c51e85697.internal.domain
tagserver-prod-us-west-2-i-0d16b7e4a51e85697.internal.domain
tagserver-prod-us-west-2-i-0d16b7e4c51e85697.internal.domain
tagserver-prod-us-west-2-i-0d16b7e6c51e85697.internal.domain
tagserver-prod-us-west-2-i-0f16b7e4c51e85697.internal.domain
tagserver-prod-us-wgst-2-i-0d16b7e4c51e85697.internal.domain
tagserver-prof-us-west-2-i-0d16b7e4c51e85697.internal.domain
tagserver-rrod-us-west-2-i-0d16b7e4c51e85697.internal.domain
Each erroneous value differs from the correct hostname by a single
character, and each altered character has been shifted by two positions
either backwards or forwards. Here are some examples taken from the above
list:
r -> p
s -> q
2 -> 0
6 -> 4
7 -> 5
c -> a
d -> f
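
One way to check whether these substitutions are consistent with single-bit
flips is to XOR each corrupted hostname against the correct one, byte by
byte. The sketch below is illustrative only: the function name diff_bytes is
hypothetical, and it assumes the hostnames are plain ASCII and of equal
length.

```python
# Compare the correct hostname with each corrupted variant, byte by byte.
GOOD = "tagserver-prod-us-west-2-i-0d16b7e4c51e85697.internal.domain"

# Corrupted values copied from the list above.
BAD = [
    "tagserver-ppod-us-west-2-i-0d16b7e4c51e85697.internal.domain",
    "tagserver-prod-uq-west-2-i-0d16b7e4c51e85697.internal.domain",
    "tagserver-prod-us-west-0-i-0d16b7e4c51e85697.internal.domain",
    "tagserver-prod-us-west-2-i-0d14b7e4c51e85697.internal.domain",
    "tagserver-prod-us-west-2-i-0d16b5e4c51e85697.internal.domain",
    "tagserver-prod-us-west-2-i-0d16b7e4a51e85697.internal.domain",
    "tagserver-prod-us-west-2-i-0d16b7e6c51e85697.internal.domain",
    "tagserver-prod-us-west-2-i-0f16b7e4c51e85697.internal.domain",
    "tagserver-prod-us-wgst-2-i-0d16b7e4c51e85697.internal.domain",
    "tagserver-prof-us-west-2-i-0d16b7e4c51e85697.internal.domain",
    "tagserver-rrod-us-west-2-i-0d16b7e4c51e85697.internal.domain",
]


def diff_bytes(good, bad):
    """Return (index, good_char, bad_char, xor_mask) for each differing position."""
    if len(good) != len(bad):
        raise ValueError("hostnames must have the same length")
    return [
        (i, g, b, ord(g) ^ ord(b))
        for i, (g, b) in enumerate(zip(good, bad))
        if g != b
    ]


for name in BAD:
    for index, g, b, mask in diff_bytes(GOOD, name):
        print(f"pos {index:2d}: {g!r} -> {b!r}  xor=0x{mask:02x}")
```

For the values listed above, every corrupted hostname differs from the
correct one in exactly one byte, and the XOR mask is 0x02 in each case, which
is what a single flipped bit in that byte would look like.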
This problem was observed on the central rsyslog collector when the
aggregated data underwent validation before being sent off to another
server to be processed, so I cannot say definitively where the
corruption occurred: on the measurement node where rsyslog is running, on
the network when rsyslog forwarded the data to the collector, or on the
collector itself. There were two measurement nodes that experienced this
issue during the same time period.
I'm at a bit of a loss to determine where this bit shifting is occurring.
I've looked through my kernel logs and system logs trying to correlate the
times when the bit shifting occurred to some issue with the system but
haven't been able to find anything; everything else on the system appears
to have been working fine. I've read that bit shifting can occur as a
result of bad memory, but as I understand it all of the hardware in AWS
runs ECC memory which should prevent this from happening due to faulty
memory. Is there anything in Linux or rsyslog that I should be looking at
to determine the cause of this issue or to prevent its recurrence? I've
read that there's a kernel module called EDAC in Linux that can help detect
memory errors but I'm not sure that it would function in a VM; I believe it
would need direct access to physical memory. Does anyone have any
suggestions on what might be causing this issue or where I should focus my
troubleshooting?
_______________________________________________
rsyslog mailing list
http://lists.adiscon.net/mailman/listinfo/rsyslog
http://www.rsyslog.com/professional-services/
What's up with rsyslog? Follow https://twitter.com/rgerhards
NOTE WELL: This is a PUBLIC mailing list, posts are ARCHIVED by a myriad of sites beyond our control. PLEASE UNSUBSCRIBE and DO NOT POST if you DON'T LIKE THAT.