One production RAC database had been upgraded to 12c, and everything went fine for a couple of months. Then one instance suddenly terminated with the following error messages:
LMS2 (ospid: 17238): terminating the instance due to error 481
ORA-00481: LMON process terminated with error
The analysis was not easy. We checked all the logs but could not find the root cause quickly, so we opened a Priority 1 Service Request.
Oracle Support could not find it quickly either, because OSWatcher was not installed on that system. So we installed OSWatcher and waited for the issue to recur.
The instance was crashing because a key cluster process, LMS, could not communicate with the other instance. This is commonly caused by underlying network issues. The netstat data gathered by OSWatcher confirmed that there was indeed a problem in the network layer: a drastic increase in the number of packet reassembly failures at the time of the issue.
Packet reassembly failures are known to cause instance/node eviction.
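If OSWatcher is not available, the same counter can be read directly from the kernel. The following is only a minimal sketch, assuming a Linux host where /proc/net/snmp exposes the "Ip:" header and value lines with a ReasmFails column. The function names are made up for illustration and are not part of OSWatcher or any Oracle tool.

```python
def read_ip_counters(snmp_text):
    """Parse the 'Ip:' header/value line pair from /proc/net/snmp text."""
    ip_lines = [line.split() for line in snmp_text.splitlines()
                if line.startswith("Ip:")]
    if len(ip_lines) < 2:
        raise ValueError("no Ip: header/value pair found")
    header, values = ip_lines[0][1:], ip_lines[1][1:]
    return dict(zip(header, (int(v) for v in values)))


def reasm_fail_delta(before_text, after_text):
    """Increase in ReasmFails between two snapshots of /proc/net/snmp."""
    before = read_ip_counters(before_text)
    after = read_ip_counters(after_text)
    return after["ReasmFails"] - before["ReasmFails"]


def snapshot(path="/proc/net/snmp"):
    """Read the raw counter file (Linux only)."""
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    try:
        counters = read_ip_counters(snapshot())
        print("ReasmReqds:", counters.get("ReasmReqds"))
        print("ReasmFails:", counters.get("ReasmFails"))
    except (OSError, ValueError) as exc:
        print("could not read IP counters:", exc)
```

Taking two snapshots a few seconds apart and passing them to reasm_fail_delta shows whether the failure count is still climbing. A steadily growing value during interconnect traffic would be consistent with the problem described here.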
We increased the kernel parameters net.ipv4.ipfrag_high_thresh and net.ipv4.ipfrag_low_thresh according to Doc ID 2008933.1:
net.ipv4.ipfrag_high_thresh = 16777216
net.ipv4.ipfrag_low_thresh = 15728640
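To verify that the running kernel actually uses these values, the current settings can be compared with the targets. This is a rough sketch, assuming Linux, where a sysctl name such as net.ipv4.ipfrag_high_thresh maps to the file /proc/sys/net/ipv4/ipfrag_high_thresh. The helper names are hypothetical.

```python
from pathlib import Path

# Target values copied from the settings quoted above.
TARGETS = {
    "net.ipv4.ipfrag_high_thresh": 16777216,
    "net.ipv4.ipfrag_low_thresh": 15728640,
}


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root).joinpath(*name.split("."))


def read_from_proc(name):
    """Read the current integer value of a sysctl parameter (Linux only)."""
    return int(sysctl_path(name).read_text().split()[0])


def below_target(targets, read_value=read_from_proc):
    """Return {name: (current, target)} for parameters below their target."""
    result = {}
    for name, target in targets.items():
        current = read_value(name)
        if current < target:
            result[name] = (current, target)
    return result


if __name__ == "__main__":
    try:
        low = below_target(TARGETS)
        for name, (current, target) in low.items():
            print(f"{name}: current {current}, target {target}")
        if not low:
            print("all parameters are at or above their targets")
    except (OSError, ValueError) as exc:
        print("could not read sysctl values:", exc)
```

The check only reads values. To keep the settings across reboots they are usually placed in /etc/sysctl.conf and loaded with sysctl -p.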
Update on 12-Jan-2017:
According to the Red Hat Knowledge Base, these two parameters can also be increased:
net.core.netdev_max_backlog = 2000
net.core.netdev_budget = 600
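One way to judge whether these two parameters matter on a given host is to look at /proc/net/softnet_stat. The sketch below assumes the usual Linux layout: one line per CPU, hexadecimal columns, with the first three columns being packets processed, packets dropped because the backlog queue was full, and the number of times processing stopped because the budget ran out. That column interpretation is my understanding of the kernel counters rather than something taken from the Red Hat article, and the function names are invented for illustration.

```python
def parse_softnet_stat(text):
    """Return one (processed, dropped, time_squeeze) tuple per CPU line."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or unexpected lines
        processed, dropped, squeezed = (int(f, 16) for f in fields[:3])
        rows.append((processed, dropped, squeezed))
    return rows


def totals(rows):
    """Sum the three counters over all CPUs."""
    return {
        "processed": sum(r[0] for r in rows),
        "dropped": sum(r[1] for r in rows),
        "time_squeeze": sum(r[2] for r in rows),
    }


if __name__ == "__main__":
    try:
        with open("/proc/net/softnet_stat") as f:
            summary = totals(parse_softnet_stat(f.read()))
        # A growing "dropped" count points at netdev_max_backlog,
        # a growing "time_squeeze" count at netdev_budget.
        for key, value in summary.items():
            print(f"{key}: {value}")
    except (OSError, ValueError) as exc:
        print("could not read softnet statistics:", exc)
```

The counters are cumulative since boot, so it is the change between two readings that is informative, not the absolute numbers.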
You can find more details here:
Hope it helps.