Configuring StateStore for High Availability
With a pair of StateStore instances in primary/standby mode, the primary StateStore instance will send the cluster's state update and propagate metadata updates. It periodically sends heartbeat to the standby StateStore instance, CatalogD, coordinators and executors. The standby StateStore instance also sends heartbeats to the CatalogD, and coordinators and executors. RPC connections between daemons and StateStore instances are kept alive so that broken connections usually won't result in false failure reports between nodes. The standby StateStore instance takes over the primary role when the service is needed in order to continue to operate when the primary instance goes down.
Enabling StateStore High Availability
- Restart two StateStore instances with the following additional
flags:
enable_statestored_ha: true state_store_ha_port: RPC port for StateStore HA service (default: 24020) state_store_peer_host: Hostname of the peer StateStore instance state_store_peer_ha_port: RPC port of high availability service on the peer StateStore instance (default: 24020)
- Restart all subscribers (including CatalogD, coordinators, and executors) with the
following additional
flags:
state_store_host: Hostname of the first StateStore instance state_store_port: RPC port for StateStore registration on the first StateStore instance (default: 24000) enable_statestored_ha: true state_store_2_host: Hostname of the second StateStore instance state_store_2_port: RPC port for StateStore registration on the second StateStore instance (default: 24000)
By setting these flags, the Impala cluster is configured to use two StateStore instances for high availability, ensuring high availability and fault tolerance.
Disabling StateStore High Availability
- Remove one StateStore instance from the Impala cluster.
- Restart the remaining StateStore instance along with the coordinator, executor, and
CatalogD nodes, ensuring they are restarted without the
enable_statestored_ha
flag.
StateStore Failure Detection
The primary StateStore instance continuously sends heartbeat to its registered clients, and
the standby StateStore instance. Each StateStore client registers to both active and standby
StateStore instances, and maintains the following information about the StateStore servers:
the server IP and port, service role - primary/standby, the last time the heartbeat request
was received, or number of missed heartbeats. A missing heartbeat response from the
StateStor’s client indicates an unhealthy daemon. There is a flag that defines
MAX_MISSED_HEARTBEAT_REQUEST_NUM
as the consecutive number of missed
heartbeat requests to indicate losing communication with the StateStore server from the
client's point of view so that the client marks the StateStore server as down. Standby
StateStore instance collects the connection states between the clients (CatalogD,
coordinators and executors) and primary StateStore instance in its heartbeat messages to the
clients. If the standby StateStore instance misses MAX_MISSED_HEARTBEAT_REQUEST_NUM
of heartbeat requests from the primary StateStore instance and the majority of
clients lose connections with the primary StateStore, it takes over the primary role.