LMS0: terminating the instance due to error 481

15 December 2016

A production RAC database was upgraded to 12c and ran fine for a couple of months. Then, without warning, one instance was terminated with the following error messages:

LMS2 (ospid: 17238): terminating the instance due to error 481 
ORA-00481: LMON process terminated with error 

 

The analysis was not easy. We checked all the logs but could not find the root cause in the short time available, so we opened a Priority 1 Service Request.

Oracle Support could not find it quickly either, because OSWatcher was not installed on that system. So we installed OSWatcher and waited for the issue to recur.

Cause

The instance crashed because a key cluster process, LMS, could not communicate with the other instance. This is commonly caused by underlying network issues. The netstat data gathered by OSWatcher confirmed that there was indeed a problem in the network layer: a drastic increase in the number of packet reassembly failures at the time of the issue.

Packet reassembly failures are known to cause instance/node evictions.
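
On Linux, the reassembly counters that netstat reports can also be read from /proc/net/snmp, where the ReasmFails field of the "Ip:" lines counts failed reassemblies. The following is only a minimal sketch of how two snapshots could be compared, not the method OSWatcher uses. The function names are hypothetical.

```python
import time


def parse_ip_counters(snmp_text):
    """Parse the two "Ip:" lines (names, then values) of /proc/net/snmp text."""
    ip_lines = [line.split() for line in snmp_text.splitlines()
                if line.startswith("Ip:")]
    if len(ip_lines) < 2:
        raise ValueError("expected an 'Ip:' header line and an 'Ip:' value line")
    names, values = ip_lines[0][1:], ip_lines[1][1:]
    return {name: int(value) for name, value in zip(names, values)}


def reasm_fail_delta(before_text, after_text):
    """Return how many reassembly failures occurred between two snapshots."""
    before = parse_ip_counters(before_text)["ReasmFails"]
    after = parse_ip_counters(after_text)["ReasmFails"]
    return after - before


def watch_reasm_fails(interval_seconds=10, path="/proc/net/snmp"):
    """Take two snapshots of the live counters and return the difference."""
    with open(path) as handle:
        before = handle.read()
    time.sleep(interval_seconds)
    with open(path) as handle:
        after = handle.read()
    return reasm_fail_delta(before, after)
```

Calling watch_reasm_fails() on each RAC node during a busy period gives a rough idea of whether the counter is climbing. A value that keeps growing on the private interconnect would match the pattern described above.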

Solution

We increased the kernel parameters net.ipv4.ipfrag_high_thresh and net.ipv4.ipfrag_low_thresh according to Doc ID 2008933.1:

net.ipv4.ipfrag_high_thresh = 16777216
net.ipv4.ipfrag_low_thresh = 15728640
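
To check whether a node is already running with these values, the current settings can be read from /proc/sys, where each dot in the parameter name becomes a directory separator. The sketch below is one possible way to do that; the helper names are hypothetical, and the targets are simply the two values listed above.

```python
# Target values copied from the two settings listed above.
TARGETS = {
    "net.ipv4.ipfrag_high_thresh": 16777216,
    "net.ipv4.ipfrag_low_thresh": 15728640,
}


def sysctl_path(name, root="/proc/sys"):
    """Map a sysctl name such as net.ipv4.ipfrag_low_thresh to its /proc/sys file."""
    return "/".join([root] + name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Read one integer-valued kernel parameter from the running system."""
    with open(sysctl_path(name, root)) as handle:
        return int(handle.read().split()[0])


def find_low_settings(current, targets=TARGETS):
    """Return {name: (current, target)} for every parameter below its target."""
    return {name: (current[name], wanted)
            for name, wanted in targets.items()
            if current[name] < wanted}


def check_live_system():
    """Compare the running kernel's values against TARGETS."""
    current = {name: read_sysctl(name) for name in TARGETS}
    return find_low_settings(current)
```

An empty result from check_live_system() means both thresholds are already at or above the values shown. To survive a reboot, such settings are usually placed in /etc/sysctl.conf and loaded with sysctl -p.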

 

Update on 12-Jan-2017:

According to the Red Hat Knowledge Base, these two parameters can also be increased:

net.core.netdev_max_backlog = 2000
net.core.netdev_budget = 600

You can find more details here:

IP fragmentation fails and fragmented packets get dropped

How to tune `net.core.netdev_max_backlog` and `net.core.netdev_budget` sysctl kernel tunables?
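
One way to judge whether the two net.core tunables matter on a given node is to look at /proc/net/softnet_stat. Each row holds hexadecimal counters for one CPU: the second column counts packets dropped because the input backlog was full, and the third counts how often packet processing stopped because the budget or time slice ran out. The sketch below only parses those first three columns; the function names are hypothetical.

```python
def parse_softnet_stat(text):
    """Parse the first three hex columns of /proc/net/softnet_stat, one dict per row."""
    rows = []
    lines = [line for line in text.splitlines() if line.strip()]
    for index, line in enumerate(lines):
        fields = line.split()
        if len(fields) < 3:
            continue  # not a softnet_stat row
        processed, dropped, time_squeeze = (int(value, 16) for value in fields[:3])
        rows.append({
            "row": index,                  # one row per online CPU
            "processed": processed,        # packets handled
            "dropped": dropped,            # backlog queue was full
            "time_squeeze": time_squeeze,  # budget or time slice exhausted
        })
    return rows


def softnet_totals(rows):
    """Sum the drop and squeeze counters over all rows."""
    return {
        "dropped": sum(row["dropped"] for row in rows),
        "time_squeeze": sum(row["time_squeeze"] for row in rows),
    }


def read_live_softnet(path="/proc/net/softnet_stat"):
    """Read and summarize the counters of the running system."""
    with open(path) as handle:
        return softnet_totals(parse_softnet_stat(handle.read()))
```

A growing "dropped" total points towards net.core.netdev_max_backlog, while a growing "time_squeeze" total points towards net.core.netdev_budget. The counters are cumulative since boot, so it is the change over time that matters, not the absolute number.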

 

Hope it helps.

