
Issues with IWStack Milano DC1

Posted: Sat Jan 30, 2016 1:21 pm
by Admin
We are seeing unexpected node reboots in the Milano IWStack clusters.
The orchestrator is restarting the affected VMs on other nodes, but because of the large number of events the restarts are taking a long time, up to 30-50 minutes.
We are investigating the issue.

Re: Issues with IWStack Milano DC1

Posted: Sun Jan 31, 2016 12:21 pm
by Admin
After extensive investigation, we believe there is a problem with the Fibre Channel links: one of them is probably misbehaving and affecting the other links or the switches.
When the nodes lose connectivity, they are automatically rebooted to protect the data.
Many such events at the same time create a backlog, which is why the VMs restart slowly.

Re: Issues with IWStack Milano DC1

Posted: Mon Feb 01, 2016 12:01 am
by Admin
We believe we have found the issue: the FC cards were overwhelmed by the number of commands. This caused them to stop accepting new commands while they cleared the backlog, which increased the iowait to the point where the orchestrator considered the node dead and started its VMs on another node. That node then ran into the same problem in turn, bringing the whole cluster down in a cascading failure, like the ones that take down power grids across whole countries or regions.
As a result, we will rebalance the clusters to take this issue into account as well, which should hopefully stop the reboots.
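
For anyone curious how a single overloaded node can take a whole cluster with it, here is a toy model of the cascade. It is only a sketch and not the orchestrator's actual logic: the function name, node names, load units, and the capacity threshold are all made up for illustration, and it assumes the VMs of a dead node are spread evenly over the survivors.

```python
def simulate_cascade(loads, capacity, failed_first):
    """Return the order in which nodes fail in a toy cascade model.

    loads        -- dict of node name -> I/O load (arbitrary, hypothetical units)
    capacity     -- load above which a node is treated as dead (assumed threshold)
    failed_first -- name of the node that fails initially
    """
    loads = dict(loads)          # work on a copy, leave the caller's dict alone
    failed = []
    pending = [failed_first]
    while pending:
        node = pending.pop(0)
        if node not in loads:
            continue
        orphaned = loads.pop(node)
        failed.append(node)
        if not loads:
            break                # nothing left to restart the VMs on
        # Assumption: the orphaned load is restarted evenly on the survivors.
        share = orphaned / len(loads)
        for other in loads:
            loads[other] += share
        # Any survivor pushed past its capacity is the next to be declared dead.
        for other, load in loads.items():
            if load > capacity and other not in pending:
                pending.append(other)
    return failed


# Nodes already running close to the limit: one failure takes down all four.
busy = {"n1": 80, "n2": 80, "n3": 80, "n4": 80}
print(simulate_cascade(busy, capacity=100, failed_first="n1"))

# Same cluster with more headroom per node: the failure stays contained.
relaxed = {"n1": 60, "n2": 60, "n3": 60, "n4": 60}
print(simulate_cascade(relaxed, capacity=100, failed_first="n1"))
```

In this model the first call reports all four nodes failing, while the second reports only the initial node. That is the idea behind rebalancing: leave enough headroom on every node that absorbing a failed neighbour's VMs does not push it over the edge.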