darth_spacey wrote in linuxsupport 😯 confused

FIXED: Linux HA 1.2.3 -- split-brain every time

<edit>

Mischief managed, and boy do I feel like a dork now. I had different subnet masks on each end of the drbd link, and the serial ports turned out to be unreliable when tested with more than a few bytes of data. I've turned off the serial heartbeat for now (pending a BIOS update or something; I have yet to figure that out) and corrected the subnet masks, and everything seems to be working.

</edit>
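
For anyone chasing the same problem, a mismatched netmask is easy to miss by eye. Below is a minimal sketch of a check using Python's standard ipaddress module. The function name same_subnet and the 10.0.0.x addresses are made up for illustration; substitute whatever ifconfig reports for each end of the drbd link.

```python
import ipaddress


def same_subnet(local: str, peer: str) -> bool:
    """Return True if two 'address/netmask' strings describe the same network.

    Hypothetical helper: both ends of a point-to-point link should
    compute the same network from their own address and mask.
    """
    a = ipaddress.ip_interface(local)
    b = ipaddress.ip_interface(peer)
    return a.network == b.network


# Made-up example addresses for the two ends of the link.
print(same_subnet("10.0.0.1/255.255.255.0", "10.0.0.2/255.255.255.0"))  # masks agree
print(same_subnet("10.0.0.1/255.255.255.0", "10.0.0.2/255.255.0.0"))    # masks differ
```

Note that ping can still succeed in both directions with mismatched masks, as it did here, so a passing ping test does not rule this out.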

I'm running two SLES 9 boxes, with heartbeat 1.2.3 <edit>1.2.3-2.6</edit> and drbd 0.7.5 <edit>0.7.5-0.14</edit>.

Everything works fine when one machine is powered on and the other is powered off. However, if I power the partner machine on, it takes over the heartbeat services as though the first machine were down. When it tries to steal the drbd device, it fails, so all the services that rely on data on the drbd device fail to start.

I end up with two machines both claiming the virtual IP address, both running some of the services, one running the drbd device, and neither running the services that rely on the drbd device.

If I power one machine off and run /etc/rc.d/heartbeat restart on the machine that is still powered on, everything goes back to a working state.

I have manually tested the serial heartbeat cable (inasmuch as I can echo test >/dev/ttyS0 on one end and cat /dev/ttyS0 on the other, in both directions), and I'm also running a redundant heartbeat over the drbd ethernet interface, and I can ping in both directions over that link, too.
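
A one-word echo only exercises a few bytes, and as the edit at the top notes, the serial ports misbehaved with longer data. Here is a rough sketch of a longer test: generate numbered text lines that each carry a CRC32, send them across, and check each line on the far end. The helper names make_lines and check_line are invented for this example, and the commented usage assumes both ports are already set to the same baud rate.

```python
import zlib


def make_lines(count: int = 200, width: int = 64) -> list:
    """Build numbered test lines, each ending in the CRC32 of its body."""
    lines = []
    for i in range(count):
        body = f"{i:06d} " + "x" * width
        crc = zlib.crc32(body.encode("ascii"))
        lines.append(f"{body} {crc:08x}")
    return lines


def check_line(line: str) -> bool:
    """Return True if the trailing CRC32 matches the rest of the line."""
    body, sep, crc = line.rstrip("\r\n").rpartition(" ")
    if not sep:
        return False
    try:
        expected = int(crc, 16)
    except ValueError:
        return False
    return zlib.crc32(body.encode("ascii", "replace")) == expected


# Assumed usage, one script per machine (device path as in the post):
#   sender:   with open("/dev/ttyS0", "w") as port:
#                 port.write("\n".join(make_lines()) + "\n")
#   receiver: with open("/dev/ttyS0", errors="replace") as port:
#                 for line in port:
#                     print(check_line(line), line[:6])
```

Missing sequence numbers or failed checks on the receiving side would point at the serial link rather than at the heartbeat configuration.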

In /etc/ha.d/ha.cf the auto_failback setting is set to off, and I am officially out of ideas.

What config settings should I double-check? Is there something I've missed on the Linux HA website? Are there any obvious Google search terms that I've overlooked?


Thanks,