Sunday 10th February 2013: 11.47pm. Sigh ... since I relocated one of my big heavy VMs from the house cloud node to the French cloud node last weekend, I've been seeing some terrible disc i/o problems within the VMs. Witness this:
root@plone1:~# ./pveperf
CPU BOGOMIPS: 3990.19
REGEX/SECOND: 531012
HD SIZE: 4.00 GB (/dev/simfs)
FSYNCS/SECOND: 40.93
DNS EXT: 40.92 ms
DNS INT: 31.92 ms (nedland)
Yeah that's just forty fsyncs/sec when the disc is easily capable of doing 1000 fsyncs/sec - the entire node is seeing a constant 60-80% i/o delay. Last night one of the nodes took twelve hours to delete some files instead of doing its work, which made it appear to have hung. My email VM also spontaneously hangs from time to time, which isn't helpful when IMAP keeps timing out either. So I had to do something about it today really ...
Now the evil way to fix this is to write zero into /proc/sys/fs/fsync-enable, which turns off all fsyncing in all containers but leaves it turned on for the host. Yet I already have ext4 barriers disabled, and that's about as far down the path of risking data loss as I'm willing to go. So something else is obviously at work.
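For reference, the evil option would look roughly like this on the host - the /var/lib/vz mount point is just the usual Proxmox/OpenVZ default and is an assumption, adjust to taste:

echo 0 > /proc/sys/fs/fsync-enable       # stop honouring fsync inside all containers (host unaffected)
mount -o remount,barrier=0 /var/lib/vz   # ext4 barriers already off here, done with something like this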
Four hours of fiddling later I found it: it's DRBD adding latency to the entire i/o system, and the extra i/o load introduced by the extra VM tipped it over the edge. Disconnect the replicated devices and voila:
root@plone1:~# ./pveperf
CPU BOGOMIPS: 3990.19
REGEX/SECOND: 637584
HD SIZE: 4.00 GB (/dev/simfs)
FSYNCS/SECOND: 1028.90
DNS EXT: 39.27 ms
DNS INT: 33.85 ms (nedland)
While connected, DRBD was even halving fsync speed for writes to non-replicated devices. So my solution: a cron job now disconnects DRBD for most of the day, and reconnects it for one hour in the middle of the night whereupon it ought to replicate whatever has changed. Hopefully problem solved!
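The cron job itself is nothing fancy - a sketch along these lines, where the file name and the exact times are illustrative and I just let drbdadm act on all resources:

# /etc/cron.d/drbd-nightly-sync (illustrative name)
# reconnect DRBD at 3am so it can catch up, cut it loose again at 4am
0 3 * * * root /sbin/drbdadm connect all
0 4 * * * root /sbin/drbdadm disconnect all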