This action requires saving a cookie to remember the page's state. You can read more about that on the cookies page.

 

Example docs: DB

Released on: 2023-09-25

This is example documentation for the fairytale blog post. It is for an imaginary legacy infrastructure and is not intended to be use.


Table of contents

Quick overview

Heading Description
What MySQL database containing users and drivers.
Fault-tolerance Master-slave. Slave can be promoted to master.
Fail-over method Manual.
Hostnames (PROD|STAG|DEV)-DB-(M|S)-01A eg: PROD-DB-M-01A
Serves (PROD|STAG|DEV)-APP-[0-9][0-9](A|B) eg: PROD-APP-01A
Connect via (PROD|STAG|DEV)-DBPROXY-01(A|B) eg: PROD-DBPROXY-01A
Upstream documentation https://dev.mysql.com/doc/

Current considerations

  • Failover being manual: This is temporary while we solve a recurring replication issue that could lead to split brain.
  • Topology: Was previously master-master. Currently master-slave to simplify replication while we debug what’s wrong with it. We’ll revisit this once we know more.

Alerts

Replication lag

Possible causes

  • High load.
    • Normal, but insufficient capacity.
    • App bug.
    • Malicious
  • Broken replication.

Determining the replication problem

  1. Connect to the VPN.
  2. Connect to MySQL using the MySQL client
  3. Check the replication status and see what state Exec_Source_Log_Pos and Seconds_Behind_Source are in.

Outcomes:

  • If Exec_Source_Log_Pos is increasing, and Seconds_Behind_Source is high: Only high load.
  • If Exec_Source_Log_Pos is not increasing: Replication is broken.

Fixing broken replication

  1. Connect to the VPN.
  2. Connect to MySQL using the MySQL client
  3. Find out what the current query is that is breaking.
  4. Skip it.

How to

Fail-over

  1. Connect to the VPN.
  2. Force the health check to fail.
  3. Wait for the connections to drain.
  4. Promote the other server to master.
  5. Check that the connections start coming in.
  6. Reset the health check.

Connect to MySQL using the MySQL client

This is deeper than I want to go for an example. I’ve left this in here to show what the structure would look like.

Recover from a replication failure

This is deeper than I want to go for an example. I’ve left this in here to show what the structure would look like.

Overriding the health check

Force a failure:

  1. Connect to the VPN.

  2. SSH to the relevant server.

  3. Run

    touch /tmp/forceFailure
  4. Check that HAProxy has marked the machine as failed..

Allow the health check to operate normally:

  1. Connect to the VPN.

  2. SSH to the relevant server.

  3. Run

    rm /tmp/forceFailure
  4. Check that HAProxy has marked the machine as online..

Background: If /tmp/forceFailure exists, the health check that HAProxy uses will fail. If you are unable to SSH to the machine, you can get similar results by manipulating HAProxy.

Promote or demote a server

This is deeper than I want to go for an example. I’ve left this in here to show what the structure would look like.

How to reboot the slave

  1. Connect to the VPN.

  2. Check that there are no connections to the server..

  3. SSH to the server.

  4. Run

    sudo shutdown -r 0
  5. Watch for the server to come back..

How to reboot the master

NOTE: If you need to do the acting slave as well, do it first, to minimise the number of times you need to drain the connections.

  1. Connect to the VPN.
  2. Promote the slave to master and demote the current master.
  3. Check that connections have moved to the original slave..
  4. Follow the slave reboot steps.
  5. Promote the server back to master and demote the current master back to slave.
  6. Check that connections have moved to the original slave..

Check the status of the replication

  1. Connect to the VPN.

  2. Connect to MySQL using the MySQL client

  3. Run this a few times over a few seconds:

    SHOW REPLICA STATUS\G
    • Look at Exec_Source_Log_Pos and Seconds_Behind_Source are doing. You’ll need this information for what ever instructions sent you here.

Determine which query is currently being replicated

This is deeper than I want to go for an example. I’ve left this in here to show what the structure would look like.

Skip the the current query

  1. Connect to the VPN.

  2. Connect to MySQL using the MySQL client

  3. Check the replication status and see what state Exec_Source_Log_Pos and Seconds_Behind_Source are in.

  4. Run

    SET GLOBAL sql_replica_skip_counter = 2;START SLAVE;
  5. Check the replication status and see what state Exec_Source_Log_Pos and Seconds_Behind_Source are in over the space of the next minutes.

At this point, Exec_Source_Log_Pos should have now increased, and Seconds_Behind_Source should start decreasing over time.

This post references

If your technical documentation would sound amazing when narrated by someone like Morgan Freeman, or Judi Dench, it's not documentation.

Posts using the same tags

If your technical documentation would sound amazing when narrated by someone like Morgan Freeman, or Judi Dench, it's not documentation.
Home | About | Contact | Cookies | Site map