Asterisk High Availability Design

High Availability (HA) is normally achieved through "clustering", which means two machines acting as one for a specific purpose. There are many ways to create a cluster, each with its own benefits, risks, costs, and trade-offs. The terms "High Availability" (HA) and "Clustering" are often overused, so beware of the hype: clustering and HA have specific (and different!) meanings. If you are responsible for creating a high availability cluster for Asterisk, the issues and concepts you should be aware of are outlined below. This page is intended as a starting point in the design, creation, or selection of a High Availability or Clustering solution for Asterisk.

Note that if you are designing a call center for a PSAP (Public Safety Answering Point) / 911 then there are specific requirements you must consider. Some are noted below; others are specified by rules/orders from the FCC (USA), the CRTC (Canada), and similar country-specific organizations (eg: FCC 05-116 order 10). Even if you are not designing for a PSAP, these guidelines are excellent best practices often applied by large commercial call centers anyway.

Please do not add specific product names/links to this page; it is intended to be product neutral. Don't say "this is the best" just because your product (or your favorite product) uses it. Stick to facts please.

Co-Dependence and Autonomy

This criterion is among the most important (if not THE most important) when designing/selecting/building a high availability telephony environment. In order to be a true cluster, the machines (or "peers") must be autonomous. Some HA solutions involve sharing hardware, software, a logical device, etc. The problem with this approach is that it creates a single point of failure. For example, if a cluster shares a hardware channel bank (eg: connecting to 2 machines via 2 USB cables), then if the channel bank fails the entire cluster fails. As another example, if a cluster shares a disk (eg: DRBD), then corruption of the disk content by a failing peer immediately corrupts the disk content of the other peer. In a true cluster the peers must be autonomous; i.e. they must not share any hardware, software, logical devices, etc.

Telephony devices in true high availability environments do not share any logical/physical resources. For example, in emergency call centers/PSAPs nothing on the call path is shared: from clustered PBXs, to separate switches, to clustered routers (HSRP/VRRP), to the trunks. Each peer (whether PBX, router, or other) must survive the destruction of its peer. (NG911 Section IV.C)

Data Synchronization and Scalability

In order for a cluster to remain useful, the data on the peers must remain in sync. This allows one peer to pick up where the other left off in the event of a failure. However, synchronization is one of the greatest challenges for clusters. This is the next most important criterion in designing/selecting/building a high availability telephony environment.

There are 2 approaches to solving this problem. The simplest approach is letting the 2 peers share a disk (i.e. 'shared data'); however, this violates the first objective of peer autonomy (since a failing peer corrupting its disk immediately corrupts the disk of the other peer). The more complex approach is synchronizing data between the peers (i.e. 'synchronized data'). There are a variety of tools to help synchronize; however, it is important that synchronization only occur if the peers are healthy (otherwise you risk synchronizing corrupt data, just like the first approach).

Synchronized Data

The following technologies synchronize data between peers (either under the control of the HA software or independently): file copy, rsync, database log roll-forward, and proprietary replication. One of the key benefits of these technologies is that should one peer fail, synchronization can be suspended (to avoid corrupting the healthy peer). As well, these technologies allow changes to be applied to data as it is synchronized; for example, network routes and ITSP data can be changed when one peer synchronizes with a data center on a different continent. The downside of this approach is the complexity/cost of implementing it properly.
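
As a rough illustration of gating synchronization on peer health, the sketch below pushes Asterisk configuration and voicemail to a standby peer with rsync over SSH, but only when the local node passes a basic health check. The peer name, paths, and the specific health command are assumptions for the example, not part of any particular product:

    #!/usr/bin/env python3
    # Minimal sketch: push Asterisk state to the standby peer only when the
    # local node looks healthy, so a deteriorating peer never propagates
    # corrupt data. Host and paths are placeholders.
    import subprocess
    import sys

    PEER = "peer-b.example.com"          # hypothetical standby peer
    SYNC_PATHS = ["/etc/asterisk/", "/var/spool/asterisk/voicemail/"]

    def locally_healthy() -> bool:
        """Treat this node as healthy only if the Asterisk CLI answers."""
        try:
            result = subprocess.run(
                ["asterisk", "-rx", "core show uptime"],
                capture_output=True, text=True, timeout=5,
            )
        except (OSError, subprocess.TimeoutExpired):
            return False
        return result.returncode == 0

    def push_to_peer() -> None:
        for path in SYNC_PATHS:
            # rsync over SSH; --delete keeps the peer an exact mirror
            subprocess.run(
                ["rsync", "-az", "--delete", path, f"{PEER}:{path}"],
                check=True,
            )

    if __name__ == "__main__":
        if not locally_healthy():
            # Never propagate data from a node that may already be corrupt.
            sys.exit("local node unhealthy; skipping synchronization")
        push_to_peer()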

Shared Data

The following technologies share a logical/physical disk: DRBD, iSCSI, NFS, SAMBA, USB disk. One of the key benefits of these technologies is simplicity/cost; you (or a vendor) can implement a shared disk quickly using open source software. The major downside is that these technologies violate the first principle of peer autonomy. If a failing peer corrupts the files/data on the shared disk, then the corruption immediately applies to the other peer's disk as well. Some shared data software, such as DRBD, also creates a local cache, which can result in "split brain" syndrome when the peers re-establish a connection and the hosts cannot determine which copy of the data is correct (or all data on one peer is automatically wiped out). These technologies also tend to break down over wide area networks (and consume enormous bandwidth), which is why they are normally only seen when the peers are located next to one another.

Detecting Failures

In order for a cluster to remain productive, it must know when a peer is healthy, when it is deteriorating, and when it has failed. The most simplistic failure scenario (and the easiest to handle) is a peer that completely shuts down and disappears from the cluster. However, this is rarely the case in real life, where peers slowly deteriorate: introducing issues, corrupting data, no longer bridging calls, etc. It is imperative that any clustering solution monitor and detect degrees of deterioration and make an intelligent determination of when a peer has failed. Consider the following scenarios (a minimal detection sketch follows the list):

  • System powered down: This is the easiest to detect. Normally accomplished by monitoring the link state of a direct crossover cable, or by lack of heartbeat responses.
  • Calls failing: This is a common scenario found with Asterisk systems. The hardware is running fine, the Asterisk process is alive, the SIP/IAX channels respond to devices, but calls are not completing.
  • External trunks not available: This commonly occurs during a network failure, an ITSP failure, a local/regional datacom or power outage, a firewall/gateway failure, etc.
  • NOC component failure: Failure of a router, firewall, etc. at one network operations center will leave Asterisk alive and running, but telephony services will be down.
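
The sketch below illustrates the idea of a health profile rather than a simple alive/dead check: each layer that responds adds to a score, and a supervisor can declare failure below a threshold. The CLI commands are standard Asterisk commands, but the string matches and thresholds are assumptions and vary between Asterisk versions:

    #!/usr/bin/env python3
    # Minimal sketch of an Asterisk-aware health profile rather than a simple
    # alive/dead check. Output formats vary between Asterisk versions, so the
    # string matches below are assumptions to adapt.
    import subprocess

    def cli(command: str) -> str:
        """Run an Asterisk CLI command; return its output, or '' on failure."""
        try:
            result = subprocess.run(
                ["asterisk", "-rx", command],
                capture_output=True, text=True, timeout=5,
            )
        except (OSError, subprocess.TimeoutExpired):
            return ""
        return result.stdout if result.returncode == 0 else ""

    def health_score() -> int:
        """Return 0-3; each layer that responds adds a point."""
        score = 0
        # Layer 1: the Asterisk process answers on its control socket at all.
        if cli("core show uptime"):
            score += 1
        # Layer 2: a SIP stack is loaded and reporting its peers/endpoints
        # (chan_pjsip and chan_sip use different commands and output).
        if "Endpoint" in cli("pjsip show endpoints") or "sip peers" in cli("sip show peers"):
            score += 1
        # Layer 3: at least one outbound trunk registration is up. The exact
        # command and status text depend on the channel driver in use.
        registry = cli("pjsip show registrations") + cli("sip show registry")
        if "Registered" in registry:
            score += 1
        return score

    if __name__ == "__main__":
        print(f"health score: {health_score()}/3")
        # A supervising cluster manager could declare the node failed below a
        # threshold (eg: < 2) instead of waiting for it to disappear entirely.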

Some solutions use the open source 'Heartbeat' package, which does a good job of simplistic detection of a process being alive. There are add-ons for Heartbeat which try to extend it with a few more telephony-type features (eg: SIP INFO), but it is not Asterisk-aware: it is not aware of Asterisk connectivity, not aware whether Asterisk can bridge calls, etc. This is also an 'all or none' type of solution, since Heartbeat does not build any kind of health profile - the monitored process is either a pass or a fail. (So Heartbeat may leave a crippled Asterisk process as active even though it is not bridging any calls.)

Some solutions use proprietary detection of Asterisk, the environment, etc. These solutions look at the environment as a whole, including upstream firewalls, ITSPs, response times, etc. They also look at Asterisk capabilities, including successful registration of trunks, calls per minute, T1/PRI health, etc.

Up & Downstream Transparency

Although not strictly essential, up and downstream transparency of the cluster interface makes life a lot easier, especially in large or complex telephony environments. In this case transparency refers to up and downstream devices not being aware that the active peer has changed. This is most often accomplished by the cluster moving an IP address from one peer to another and notifying neighboring network devices of the change. The result is that upstream trunks and downstream phone sets continue to connect to the single IP address and are not aware that a change has taken place.
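
A rough sketch of this floating-IP approach on Linux is shown below. The interface name and addresses are hypothetical, and it assumes the iproute2 'ip' and 'arping' utilities are available and that the script runs as root:

    #!/usr/bin/env python3
    # Minimal sketch: take over a floating (virtual) IP address on the
    # surviving peer and announce the move with gratuitous ARP so neighboring
    # devices update their ARP caches. Interface and address are placeholders.
    import subprocess

    FLOATING_IP = "192.0.2.10"   # hypothetical service address used by trunks/phones
    PREFIX_LEN = 24
    INTERFACE = "eth0"

    def take_over_address() -> None:
        # Add the floating address to the local interface.
        subprocess.run(
            ["ip", "addr", "add", f"{FLOATING_IP}/{PREFIX_LEN}", "dev", INTERFACE],
            check=True,
        )
        # Send a few gratuitous ARP replies so switches, routers, and phones
        # learn the new MAC behind the service address immediately instead of
        # waiting for their ARP entries to expire.
        subprocess.run(
            ["arping", "-U", "-c", "3", "-I", INTERFACE, FLOATING_IP],
            check=True,
        )

    if __name__ == "__main__":
        take_over_address()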

Physical Peer Separation

Although not essential for small office / home office telephony environments, a common request is that peers be geographically separated by significant distances. The benefit of physical separation is that disasters which strike one data center / city / region will likely not affect the other. This is something to be aware of when designing your HA cluster. Solutions which use 'data sharing' struggle with physically separated peers as disk/synchronization latency grows. Solutions using 'data synchronization' do better, as each peer maintains its own independent storage.

Encryption

Another consideration for physically separated peers is protecting the Asterisk nodes from corruption and man-in-the-middle attacks. Any data which must travel over exposed (internet) connections should be encrypted if possible (note that DRBD, NFS, and iSCSI do not encrypt traffic). To meet PSAP requirements, all control/handshaking between peers must be encrypted and protected. (NG911 Section IV.D.6)
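
As an illustration of protecting peer control traffic, the sketch below wraps a cluster heartbeat/handshake connection in TLS with certificate verification. The hostnames, port, certificate paths, and the one-line "protocol" are all assumptions made for the example:

    #!/usr/bin/env python3
    # Minimal sketch: wrap the cluster's control/heartbeat connection in TLS
    # so peer handshaking crossing an exposed network is encrypted and
    # authenticated. Certificate paths, host, and port are placeholders.
    import socket
    import ssl

    PEER_HOST = "peer-b.example.com"
    PEER_PORT = 7654                     # hypothetical control-channel port

    context = ssl.create_default_context(ssl.Purpose.SERVER_AUTH,
                                         cafile="/etc/cluster/ca.pem")
    # Present our own certificate as well, so the peer can verify us in turn.
    context.load_cert_chain(certfile="/etc/cluster/node-a.pem",
                            keyfile="/etc/cluster/node-a.key")

    with socket.create_connection((PEER_HOST, PEER_PORT), timeout=5) as raw:
        with context.wrap_socket(raw, server_hostname=PEER_HOST) as tls:
            tls.sendall(b"HEARTBEAT node-a\n")
            reply = tls.recv(1024)
            print("peer answered:", reply.decode(errors="replace"))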

Speed of Fail Over

Simplistic clusters cannot fail over quickly - most commonly because they do not properly detect deterioration of a peer. In fact, they wait until the health of a peer has deteriorated so substantially that the peer is almost dead and unresponsive before they declare a failure. Detection speed must be near immediate - and this is separate from failover speed, which must also be quick to prevent a no-service scenario.
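
To see why both matter, the worst-case detection delay of a simple polling monitor is roughly its check interval multiplied by the number of consecutive failures it tolerates. The sketch below makes that explicit; the probe itself is a placeholder:

    #!/usr/bin/env python3
    # Minimal sketch: worst-case detection time for a polling monitor is about
    # CHECK_INTERVAL * FAILURE_THRESHOLD, so both must be small for fast
    # fail over. The peer_is_healthy() probe is a placeholder.
    import time

    CHECK_INTERVAL = 1.0      # seconds between probes
    FAILURE_THRESHOLD = 3     # consecutive failed probes before declaring failure

    def peer_is_healthy() -> bool:
        """Placeholder for a real probe (heartbeat, SIP OPTIONS, CLI check...)."""
        return True

    def monitor() -> None:
        failures = 0
        while True:
            if peer_is_healthy():
                failures = 0
            else:
                failures += 1
                if failures >= FAILURE_THRESHOLD:
                    # Worst case this fires ~CHECK_INTERVAL * FAILURE_THRESHOLD
                    # (3 seconds here) after the peer actually failed; the
                    # failover actions triggered from here must also be fast.
                    print("peer declared failed; initiating fail over")
                    return
            time.sleep(CHECK_INTERVAL)

    if __name__ == "__main__":
        monitor()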

Fault Tolerance

An HA system should have a combination of fault tolerance AND clustering. For example, you should have redundant network interfaces, power supplies, and disks. This will allow the most common failures to happen transparently to the users, without an outage. But when a critical system error occurs (such as power completely removed from a rack, or a kernel panic), the cluster must be able to detect this and start the services up on another node, with as little downtime as possible.

Load Balancing

Load balancing and high availability are different but related functions. To help distinguish the two, consider the following:
  • The SUM OF ALL load-balanced servers must be able to handle the load of an entire deployment, whereas EACH high availability server must be able to handle the load of an entire deployment.
  • A load-balanced deployment must contain m+n servers (m = minimum number to handle call load, n = number of permitted server failures), which can become expensive. For example, if 3 servers are needed to handle peak call load and 1 server failure must be tolerated, 4 servers are required.
  • Load-balanced servers cannot share configuration/data, as each must remain completely autonomous. This prevents load balancing from being considered 'HA', since one peer cannot really pick up where the other left off (voicemails, configurations, etc).

Ten years ago load balancing was necessary to handle high call volumes and introduce a degree of fault tolerance. However, with low-cost off-the-shelf servers now able to bridge hundreds of calls simultaneously, load balancing is rarely implemented.

For large deployments (eg: 20 call centers) a load balancer may be deployed centrally, along with high availability at EACH call center. It's important to note that load balancers have little/no awareness of the health of the PBX they are sending calls to. So if an Asterisk server is no longer bridging calls (a failure) but its SIP stack is up, then the load balancer will keep sending calls to that dead Asterisk server. The role of HA is to detect the failed PBX and transfer control to its peer (or shut the failed node down completely, removing it from the load balancer's destination list).

Stability and Support

In an environment where the phone system (Asterisk) may be free, some integrators resent any commercial product. Just as Digium charges for a commercial version of Asterisk and for support, add-on software and hardware vendors do too. That is how you get product stability and support. The cost of many hardware and software HA solutions is easily justified by clients running call centers, large commercial enterprises, etc.

Commercial and Free Products

You must decide, based on the per-minute cost of an outage, what commercial features and support are worth to you. If designing a SOHO solution, then using free components and assembling a do-it-yourself solution makes the most sense (or look for a "free" version of a commercial product). If designing large-scale solutions, then a system with the right balance of features, price, and 24/7 support is essential.

There are a number of commercial and free products which claim to be HA or clustering solutions (see Asterisk High Availability Solutions for examples). Some issues to consider when evaluating these products:

  • Do they meet the criteria above (and offer trade-offs that are right for your deployment)
  • Are they too simplistic (e.g.: Adding a RAID disk and calling it a HA solution)
  • Are they too complex (e.g.: Building a full Linux HA with heartbeat solution without deep Asterisk awareness)
  • Are they too generic (e.g.: Using a VM cluster which doesn't detect real life failures, external failures, etc)
  • Is there support and assistance if/when needed
  • Am I buying a commercial 'module' that is really just a couple of free Linux packages put together (eg: DRBD + Heartbeat)

If you are targeting the SOHO market then you may have no choice but to go with homebrew scripts, or a simple 'HA' module of a commercial distribution. If you are targeting large-scale telephony installations then you may have to go with more expensive commercial products; however, be sure you can try before you buy! (In particular, ask the vendors questions based on the criteria above to really understand what you are buying.) Note that commercial products are not "better" than free ones: it depends on your criteria, and no single product will score perfectly on every criterion.