High Availability
A central process scheduler is, by its nature, a critical component: when it is not operating, background jobs on many computer systems are affected. This section explains how the software operates, how it can be set up so that it is highly available, and how it can inter-operate with third-party products that ensure high availability.
High availability depends on a number of factors, some of which are common to all components (like host availability) and some of which are component specific (like how much processing can continue during a network failure). The factors common to all components include highly available infrastructure and eliminating single points of failure. Those that are component specific include how particular failures (host, node, network and software) are handled by individual components.
Supported High Availability Solutions
The central Redwood server and platform agents support any high availability solution that fulfills the following requirements:
All central Redwood servers in a high availability cluster must use the same port, SharedSecret, FQDN, and DataRootDirectory for their platform agents. The HA solution must ensure that the failing central Redwood server or platform agent(s) are stopped on the failing host(s) and started on the respective failover host(s), and that all network traffic that went to the failing host(s) is redirected to the respective failover host(s).
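For illustration only (all values below are hypothetical), a two-node cluster could configure both central servers identically for their platform agents: port 10180, the same SharedSecret, the FQDN scheduler.example.com, and the DataRootDirectory /var/opt/redwood/datarootdir, with the HA solution pointing the FQDN (and its traffic) at whichever host is currently active.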
Redwood Server High Availability Considerations
The central Redwood Server is unlike many standard web applications: its primary role is not to serve content to a great number of concurrent users but to process background tasks, mostly processes and database queries. This means that traditional web application server high availability scenarios do not apply to Redwood Server. The more threads Redwood Server can get, the better; the same applies to memory, as Redwood Server caches as much as possible to keep the number of database lookups low. Consequently, having two web application servers on the same host yields no advantage over one web application server with twice the amount of memory; worse, not only is memory wasted on duplicate caches, the two servers must also remain in sync, which consumes additional threads.
The heavy caching alone means that multiple concurrent application servers are a waste of resources. For optimal performance it is recommended to use high-throughput servers with many CPUs/CPU cores and as much memory as possible in an active-passive high availability setup for the central server.
Factors Common to all Components
Many high availability factors are common to all components. The Redwood Server architecture has been designed to isolate the effect of many types of failure to the components dependent on the failed resource. These effects can be reduced or eliminated by implementing highly available solutions. For example:
- Host failures will only affect the components on that host. Using Active-Active platform agents can ensure that processing occurs without interruption.
- Network failures only affect components communicating over that link. Processing can continue on both sides of the link, and in many cases processing can complete and report back results when the network returns.
Single Points of Failure (SPOF)
A single point of failure (SPOF) is a resource whose failure will cause a larger system failure. It is important to be aware of potential SPOFs when designing a highly available system. The SPOFs in a system depend not only on the architectures of the products involved, but how they are deployed and the hardware dependencies involved. For example a disk drive may be a point of failure, but not if it is provided by a Storage Area Network (SAN).
When analyzing a system for SPOFs it is important to consider both the impact of the SPOF and the cost of eliminating it. In some cases it may be more expensive to remove an SPOF than to deal with its potential downtime. For example, satellite links are often the only connectivity to remote areas. In these cases it is important to raise awareness of these failure points and to have contingency plans for operation during failures.
Redwood Server can be configured to operate without any SPOF if the high availability features are configured correctly, and it is run on appropriate infrastructure. This includes high availability for:
- the central Redwood Server, including the application server.
- the database.
- the remote systems being managed.
Network Dependencies
High availability setups should do their best to reduce dependencies on network resources, or where this is not possible, to ensure that the network resources themselves are highly available. The most important of these are the Domain Name Servers (DNS) and the Network Time Protocol (NTP) servers. If DNS or NTP are in use, then multiple servers should be available.
Another major area where network dependencies should be avoided is network file systems like the UNIX Network File System (NFS) and Windows file shares (using Server Message Block or SMB). Such file systems are not sufficiently fast or reliable for many scenarios, including:
- Storing the software itself. Software should always be deployed locally.
- Storing data intended to be stored on a local disk.
The DataRootDirectory, where configuration and output files are stored, as well as the installation directory, must not reside on a NAS (NFS or SMB share); a SAN is considered local if it is mounted via iSCSI, for example.
note
Redwood recommends strongly against installing the software on a networked file system. If this recommendation is ignored, and you have random errors that Redwood believes are caused by the NAS (NFS or SMB share), that Redwood cannot reproduce on local storage, you will be required to demonstrate that the issue can be reproduced when installed on local storage. The resolution to this issue may require that you reinstall on local storage.
Data Storage
Each component of Redwood Server has a 'data root directory' where important files like tracing, persistent state and output are stored. It is important that this directory is on a local disk or SAN (not on a network file system) and that it is appropriately sized. Similarly, the tables in the database should have sufficient space to grow, particularly when significant amounts of processing may occur: overnight, over the weekend, or during period end processing.
All components of Redwood Server have an active tracing facility that logs errors (particularly fatal errors) as they are running. In general these trace files do not take up much space (10-100 MB), but when debugging is activated, or a catastrophic failure occurs, they can grow much more quickly.
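As a minimal sketch (the default path and the 5 GB threshold are assumptions, not Redwood defaults), a check like the following could be run periodically to warn before the data root directory fills up:

```java
// Warn when the data root directory is running low on free space.
import java.io.File;

public class DataRootSpaceCheck {
    public static void main(String[] args) {
        // Hypothetical default path; pass the real data root directory as an argument.
        File dataRoot = new File(args.length > 0 ? args[0] : "/var/opt/redwood/datarootdir");
        long freeGb = dataRoot.getUsableSpace() / (1024L * 1024L * 1024L);
        long totalGb = dataRoot.getTotalSpace() / (1024L * 1024L * 1024L);
        System.out.printf("%s: %d GB free of %d GB%n", dataRoot, freeGb, totalGb);
        if (freeGb < 5) { // 5 GB threshold is an arbitrary example value
            System.err.println("WARNING: data root directory is nearly full");
            System.exit(1);
        }
    }
}
```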
Adequate Sizing
Adequate sizing is an important part of any high availability solution, as many outages are caused by sizing problems. The most common are:
- Disks, logical volumes or file systems filling up. These prevent data from being written and will eventually stop all processing until the condition is corrected.
- Undersized primary and backup servers. If insufficient CPU or memory is allocated and machines are overloaded, they can appear to be 'down' when they are in fact very busy.
Testing
High availability setups should be tested regularly, or at least whenever the HA or software configuration changes. Without testing, a small error like typing ".con" instead of ".com" in a host name may not be noticed until fail-over occurs, at which point a small, easily correctable error can cost many hours (or thousands of dollars).
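A small sketch of such a test, assuming hypothetical primary and failover hostnames, could simply verify that every hostname used in the HA configuration resolves:

```java
// Pre-failover sanity check: verify that configured hostnames resolve in DNS.
import java.net.InetAddress;
import java.net.UnknownHostException;

public class HostnameCheck {
    public static void main(String[] args) {
        // Hostnames below are hypothetical; pass your own as arguments.
        String[] hosts = args.length > 0 ? args
                : new String[] { "scheduler-primary.example.com", "scheduler-failover.example.com" };
        int failures = 0;
        for (String host : hosts) {
            try {
                InetAddress addr = InetAddress.getByName(host);
                System.out.println(host + " resolves to " + addr.getHostAddress());
            } catch (UnknownHostException e) {
                System.err.println("ERROR: " + host + " does not resolve (typo such as '.con' instead of '.com'?)");
                failures++;
            }
        }
        System.exit(failures == 0 ? 0 : 1);
    }
}
```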
Redwood Server Support for Networked File Systems
When you use Redwood Server in a networked environment, there are occasions when you may want to use networked file systems in your infrastructure. The term networked file system in this context designates NAS systems, also known as SMB/NFS/SSHFS file shares; note that SAN file systems, for example those mounted via iSCSI, are considered local storage.
This section outlines the support of the different installable components for networked file systems.
Redwood Platform
Redwood Platform is the web hosting environment that Redwood delivers explicitly for hosting Redwood RunMyJobs and Finance Automation.
Redwood Platform is supported on UNIX when installed on a networked file share.
Note that when Redwood Platform is installed on a networked drive, you increase the risk that the system will be unavailable due to infrastructure outages.
Redwood Platform is not supported on a networked drive on Windows. The main issue here is ensuring that the shared drives are available before the service starts.
Platform Agent
Platform agents are responsible for the actual execution of processes at the OS level (OS jobs). Platform Agents are not required for all types of processes; notably, RedwoodScript, SAP, reports, and chains do not require a Platform Agent. When a platform agent is required, it must be installed on the target system on which the jobs need to be executed. Platform Agents can also monitor file systems for changes in directories and raise file events when such changes are detected.
On Microsoft Windows servers the installation directory and DataRootDirectory must be located on a local file-system.
As stated earlier, Redwood does not recommend using a networked file system for the DataRootDirectory. Reliable communication between the Platform Agent processes depends on the reliability of the underlying file system, especially the persistent message store used in case one of the processes crashes. Experience shows that NFS is not always reliable when a job-processor writes a message to the persistent store and informs the network-processor about it: these messages must be available to the network-processor at that same moment for reliable job processing.
Other restrictions that may limit the use of network file systems are:
- The network file share should support UNIX file modes such as the setuid and setgid bits, or restrictions may apply to user switching.
- The network file system should support file locking. File locking checks are used by the network-processor to verify whether or not a job-processor is still alive (see the sketch below).
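The following sketch illustrates the file-locking technique itself; it is not the agent's actual implementation (which is native code), and the status file path is hypothetical:

```java
// Illustration of using a file lock as a liveness check between processes.
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class LockLivenessCheck {
    // A live process holds an exclusive lock on its status file. Another process can
    // use tryLock() to test whether the owner is still running: if the lock can be
    // acquired, the owner has gone away.
    public static boolean ownerStillAlive(String statusFile) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(statusFile, "rw");
             FileChannel channel = raf.getChannel()) {
            FileLock lock = channel.tryLock();
            if (lock == null) {
                return true;   // lock held by another process: owner is alive
            }
            lock.release();
            return false;      // we could take the lock: owner has exited
        }
    }

    public static void main(String[] args) throws Exception {
        String file = args.length > 0 ? args[0] : "/tmp/job-processor.status"; // hypothetical path
        System.out.println("Owner alive: " + ownerStillAlive(file));
    }
}
```

On a network file system without reliable locking, such checks can fail or give misleading results, which is exactly why this restriction applies.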
It is always possible to link files from outside of the DataRootDirectory to processes, if large files need to be stored and managed with the process.
Performance on a network file system will be substantially lower, due to the additional networking overhead when accessing files over the network.
High Availability of the Central Application Server
Redwood Server is not traditional JEE application server software, in that it is not merely a light user interface to a backend system; the application server is both the user interface and the backend system. Production systems do not usually have great numbers of concurrent users; instead, the application server schedules many processes across your data center. Productive systems can have hundreds or thousands of processes starting, running, or completing at any given time and fewer than 5 concurrent users. Development systems have more users, if heavy development is being done, and a much lower workload.
HTTP load balancing to scale the number of Redwood Server users is largely pointless as a single node can handle far more users than there are ever likely to be. In active-active clusters the primary server handles the bulk of the background operations and there is communication overhead to keep data synchronized. Redwood Server benefits from more RAM and CPU cores. Load balancing only makes sense for failover between machines, to switch from an unresponsive to a healthy system. If you choose to balance HTTP load in active-active clusters (not recommended), Redwood Server needs sticky sessions.
Redwood Platform
You use the integrated high availability of the central Redwood Server by configuring two or more nodes to use the same database connection settings. When additional application servers access the database (same database and user/schema), the first to access the database becomes the primary server and additional application servers automatically become secondary servers. When the primary node fails, a secondary server is immediately promoted to primary. This type of setup requires licenses for each application server.
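The following is a highly simplified sketch of the general pattern of database-based primary election, not Redwood's internal implementation; the JDBC URL, credentials, and the cluster_lock table are assumptions for illustration only:

```java
// Sketch of "first node to claim the database becomes primary" using a row lock.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class PrimaryElectionSketch {
    public static void main(String[] args) throws Exception {
        // All nodes use the same connection settings (hypothetical URL and credentials).
        try (Connection con = DriverManager.getConnection(
                "jdbc:postgresql://dbhost/scheduler", "scheduler", "secret")) {
            con.setAutoCommit(false);
            try (Statement st = con.createStatement()) {
                // Whoever holds this row lock acts as primary; secondaries block here
                // and are promoted automatically when the primary's session ends.
                st.execute("SELECT id FROM cluster_lock WHERE id = 1 FOR UPDATE");
                System.out.println("This node is now the primary server");
                runPrimaryDuties();
            }
        }
    }

    private static void runPrimaryDuties() throws InterruptedException {
        // Placeholder for the time-critical operations a primary performs.
        Thread.sleep(Long.MAX_VALUE);
    }
}
```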
You can use third-party fail-over software with Redwood Server using one single license for both nodes as long as both nodes use the same hostname and port. On Windows, using MSCS, you configure Redwood Platform as a service and use the cluster software to control that service on all nodes. On UNIX, you use the platform-specific clustering solution and configure Redwood Platform to start with init or the platform equivalent.
As the central Redwood Server runs in a Java Enterprise Edition (JEE) application server stack, availability of this stack is paramount. For this reason Redwood Server can run in multiple nodes if the application server is configured to do so. This is often called clustered or multi-node mode. When this is the case then:
- Users are load-balanced across all the nodes by the JEE Application server, so users can end up on any node.
- All nodes cache read data for improved performance, and participate in cache coherence.
- One node (the primary) performs certain time-critical operations. If this node fails, one of the secondary nodes will automatically take over this functionality.
The time-critical operations are performed on a single node as this is more accurate (global locks can take 10ms or more to acquire), and allows the use of both aggressive caching and fast (nanosecond) local locking. These significantly boost performance in all scenarios (single and multi-node), and provide faster, more reliable fail-over.
Application Server Setups
There are three classes of application server setup:
- single node, single host - a single application server node is running the software, on a single host.
- multiple node, single host - multiple application server nodes run the software, but still on a single host.
- multiple node, multiple host - multiple application server nodes run the software, on multiple hosts.
The "single node, single host" and "multiple node, single host" setups both require a separate host fail-over mechanism to handle the host itself failing (and thus all nodes failing). This is possible, but not required in a "multiple-node, multiple host" setup.
The advantages and disadvantages of the different setups are:
Setup | Advantages | Disadvantages |
---|---|---|
Single node, single host | Simple setup. Fits in with existing host-based fail-over mechanisms. Most efficient use of memory, network and CPU. | Relies entirely on host-based fail-over. |
Multiple node, single host | Fits in with existing host-based fail-over mechanisms. | More complex setup. Some duplication of software and memory usage. Increased CPU usage. |
Multiple node, multiple host | No host-level high availability scenario needed. | Most complex setup. May require an external load balancer. Some duplication of software and memory usage. Increased CPU and network usage. |
The multiple node setups may allow some reconfiguration of the application server (for example JVM parameters) while the application is running, depending on the application server vendor. This does include patching or upgrading Redwood Server while it is running.
The bindAddress can potentially be the same for all nodes on the same host, as it is set to the IP address of the server.
So the registry paths will be as follows:
/configuration/boot/cluster/<host>/<instance ID>/<node ID>/bindAddress
/configuration/boot/cluster/<host>/<instance ID>/<node ID>/port
/configuration/boot/cluster/<host>/<instance ID>/bindAddress
Where:
- <host> - the hostname as defined by InetAddress.getLocalHost().getHostName() on Redwood Platform.
- <instance ID> - the identifier of the instance in the cluster, usually 0; on Redwood Platform, the second and third digits of the port number by default.
- <node ID> - the identifier of the node in the cluster, usually 1.
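For example, on a host named rwhost running a Redwood Platform instance on port 15000 (so instance ID 50) with node ID 1 - all values hypothetical - the entries would be:
/configuration/boot/cluster/rwhost/50/1/bindAddress
/configuration/boot/cluster/rwhost/50/1/port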
You can also set JVM parameters for the above, which take precedence over the registry entries. To set system properties, you replace /configuration/boot/cluster with com.redwood.scheduler.boot.cluster, for example:
com.redwood.scheduler.boot.cluster.<host>.<instance ID>.bindAddress
will override /configuration/boot/cluster/<host>/<instance ID>/<node ID>/bindAddress.
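With the same hypothetical host and instance ID as above, the bind address could be overridden by passing a standard Java system property in the JVM options of the node, for example:
-Dcom.redwood.scheduler.boot.cluster.rwhost.50.bindAddress=10.0.1.15
(the IP address is illustrative).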
note
You can use the System_Info process to gather the information you need to set these parameters and inspect their values.
Performance in Multiple Node Setups
Workload in a multiple node setup is not homogeneous: at a given point in time some nodes may be doing very little while others are processing large, complex tasks. This can result in different performance profiles on different nodes.
Multiple node setups require more hardware resources, particularly memory. In order to operate the Redwood Server software, its data (including cache), the application server and operating system all need to be in memory. Having a single node per host leaves the most memory available for data and cache, as there is only one copy of the other items (application server, Redwood Server software and OS). With multiple nodes per host additional copies of the application server, Redwood Server software are required. Hence the nodes will end up having to share the remaining memory for their data and cache. This means that there is less memory for each node's cache, and that some items may be cached twice (once on each node). This affects performance (more database accesses) and is less efficient (due to the duplication of the application server, Redwood Server software and cache).
Node & Host Failure
Both node and host failures are dealt with by the same mechanism. If a node or host fails, the other nodes continue processing their current tasks; depending on the type of node that failed (primary or secondary), other activities may take place:
- If a secondary node fails - it can simply be restarted with no impact on the time critical operations on the primary server.
- If the primary node fails - a single node can quickly take over as there is no need to co-ordinate all nodes.
Network Failure
The server can deal with multiple modes of network failure, including failures between the central Redwood Server and a managed system. More information can be found in the 'Network Failure' sections of the individual components.
Software Failure
If a node fails because of a software induced error (for example out of memory, or a bug in the software stack) then the effect is generally limited to that node. In many cases the failure will not recur as it was the result of a specific action, or specific circumstances. In all setups (but particularly for single node setups) the application server should be configured to restart nodes automatically if they fail, to alert an operator, and to implement a brief delay between restarts.
High Availability of the Database
High availability of the database used by the application server is also important, as all data is stored in the database. You should follow the guidelines from your application server vendor for setting up high availability between the application server and the database. Without access to the database many components will wait for it to become available again before continuing, since they cannot write any data.
Database operations are automatically retried if they fail and the failure code indicates that retrying may fix the error. These retries are done at the transaction level to ensure data integrity.
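The sketch below illustrates the general pattern of transaction-level retry with back-off using plain JDBC; it is not Redwood's internal retry code, and the initial delay and cap are illustrative values:

```java
// Retry a whole transaction when the failure looks recoverable, backing off over time.
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.SQLRecoverableException;
import java.sql.SQLTransientException;

public class RetryingTransaction {
    @FunctionalInterface
    public interface TransactionBody {
        void run(Connection con) throws SQLException;
    }

    public static void runWithRetry(Connection con, TransactionBody body) throws Exception {
        long delayMs = 100;                                 // start with a short delay
        while (true) {
            try {
                con.setAutoCommit(false);
                body.run(con);
                con.commit();
                return;                                     // committed successfully
            } catch (SQLTransientException | SQLRecoverableException e) {
                con.rollback();                             // retry the whole transaction, preserving integrity
                Thread.sleep(delayMs);
                delayMs = Math.min(delayMs * 2, 60_000);    // back off up to one minute
            }
        }
    }
}
```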
Database Host Failure
Should the database host fail, then the database HA solution is responsible for making the database available again. Common solutions include:
- Restarting the database host.
- Fail-over to another host that has access to the same database files.
- Active-Active setups like Oracle Real Application Clusters.
In all cases the database HA and the application server are responsible for ensuring that the database is available, and that the data from transactions that have previously been committed is available to Redwood Server. Should this not be the case then data inconsistencies may result.
One important case of database host failure is when the disks (or SAN, etc) that store the database files become unavailable or full. You should ensure that sufficient disk space is available for database data, particularly before significant period end or overnight processing is likely to occur.
Network Failure
Transient network failures will be handled by the automatic retry system. This will retry operations using a back-off system that allows brief outages to be recovered from quickly (seconds or less), but will retry less and less often as the length of the network outage increases.
Software Failure
The automatic retry system can cope with some software failures, particularly where the failure did not result in a transaction being committed to the database and where retrying the same transaction will result in it being committed successfully. This includes handling errors related to connection pools, some driver errors, and deadlocks in the database caused by row level locking.
Should the database software fail in a more permanent manner then Redwood Server will act as if the database host failed.
High Availability of the SAP Connector
SAP System Failure
The way that the SAP connector has been designed means that it is as resilient as possible against host failure. If the central Redwood Server goes down, it will not receive further jobs, but the jobs that are already running on the SAP system can finish. The results of these job terminations will be picked up by the SAP connector when the central Redwood Server becomes available and will be applied to the repository (database).
If the SAP system itself goes down, the central Redwood Server will keep trying to contact the SAP system. There is no timeout or such; it will keep doing this until the SAP system is back up. The central Redwood Server then examines which processes were running according to its information, and asks the SAP system what happened to each of these SAP jobs. The SAP connector then in turn determines whether any of these processes finished. If no information is found, the outcome is undetermined and the status is set to Unknown. This can then be used to trigger escalation procedures or automatic restarts.
It is recommended to connect the SAP connector to an SAP system via the message server rather than connecting it to a specific ABAP application server as this gives you the following advantages:
- resilience to ABAP application server failures
- automatic load balancing of RFC connections across ABAP application servers
To make the SAP system itself highly available, you have to consider all of its components:
- database
- network
- hardware
- operating system
- software - ABAP application servers, message server, enqueue server, etc
Please refer to SAP documentation and support for more details on making the SAP system highly available.
The site http://help.sap.com/ contains information on how to make your specific SAP release highly available. It also lists all SPOFs that you have to consider.
Network Failure
If the network goes down but the hosts on both sides stay up, the central Redwood Server is effectively disconnected from the SAP system. It can no longer send new job start requests to the SAP system, nor can status updates be transmitted to the central Redwood Server. The server will keep trying to reconnect until the connection is available again. Once the connection is re-established, operations are picked up where they were left off. Some delay will have been introduced, but no status has been lost.
Software Failure
Experience so far with the SAP connector shows that it is very reliable, with the highest incidence of customer problems related to configuration issues.
SAP Connector Fail-Over Scenarios
The SAP connectors run inside the central Redwood Server, so making the JEE server highly available will also make the SAP connectors highly available.
High Availability of Platform Agents
When examining the availability of platform agents, which monitor and control the execution of operating system jobs on the computer systems under the scheduler's span of control, there are three types of failure that can occur:
- Host failure - what happens if either side of the connection goes down.
- Network failure - what happens if only the network goes down.
- Software failure - crash or kill of the platform agent or just a single component of the platform agent.
Some setups that guard against the above require redundancy: having a fall-back or backup platform agent available. This can be accomplished in (again) three different ways.
Host Failure
The way that the platform agent has been designed means that it is as resilient as possible against host failure. If the central Redwood Server goes down, it will not receive further jobs, but the jobs that are already running on the remote system can finish. The results of these job terminations are stored on disk until the central Redwood Server is ready to pick this data up and apply it to the repository (database).
If the host itself goes down, the central Redwood Server will keep trying to contact the platform agent. There is no timeout or such; it will keep doing this until the server that the platform agent runs on is back up and the platform agent has restarted. The central Redwood Server then examines which processes were running according to its information, and asks the platform agent what happened to each of these OS jobs. The platform agent then in turn determines whether any of these jobs finished. If no information is found, the outcome is undetermined and the process status is set to Unknown. This can then be used to trigger escalation procedures or automatic restarts.
Network Failure
If the network goes down but the hosts on both sides stay up, the central Redwood Server is effectively disconnected from the platform agent. It can no longer send new job start requests to the agent host, nor can job termination messages be transmitted to the central Redwood Server. The server will keep trying to reconnect until the connection is available again. Once the connection is re-established, operations are picked up where they were left off. Some delay will have been introduced, but no status has been lost.
Software Failure
The platform agent consists of four different parts, some of which have extended run-times. These parts are:
- A monitor component, called the platform-agent process, whose only task is to monitor the network-processor.
- The network-processor, which is a long running process that is responsible for communicating with the central Redwood Server over the network.
- The job-processor, which is started for every job. It starts, monitors and controls a single job.
- The job can call various command-line tools such as jftp, mail etc.
The command-line tools all serve a single purpose, so there is a low chance of software failures causing long-term effects: every run is stand-alone. If a failure occurs, it only affects that single execution. All tools are designed so that they are easily scriptable, and return reliable error codes.
The job-processor is a little more important, but if it fails this still affects only a single OS job. The job-processor tries to perform as much as it can before telling the network-processor that it has indeed set the job up as far as possible. It has been designed to do as little as possible after the OS job has finished. In this way the exposure to software failure that cannot be recovered has been reduced.
The platform-agent process also has a single, well-defined and very simple job: monitoring the network-processor. As long as the network-processor keeps running, it does nothing. Only when the network-processor stops, for any reason, does it verify that the network-processor stopped in a controlled manner. If that is not the case, this indicates an abnormal failure; the platform-agent process then sleeps for a number of seconds (currently fixed at 10 seconds) and restarts the network-processor. This way it guards against memory errors and other recoverable problems in the network-processor.
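As an illustration of this supervision pattern (not the platform-agent's actual native implementation; only the 10-second pause is taken from the text, and the command name is a placeholder):

```java
// Supervisor sketch: keep a child process running, restarting it after abnormal exits.
import java.util.concurrent.TimeUnit;

public class SupervisorSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical command; pass the real one as arguments.
        String[] command = args.length > 0 ? args : new String[] { "network-processor" };
        while (true) {
            Process child = new ProcessBuilder(command).inheritIO().start();
            int exitCode = child.waitFor();
            if (exitCode == 0) {
                break;                       // controlled shutdown: do not restart
            }
            // Abnormal exit: pause briefly (the agent uses a fixed 10 seconds), then restart.
            TimeUnit.SECONDS.sleep(10);
        }
    }
}
```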
The network-processor itself is the most complicated part of the platform agent. It runs multi-threaded, with one task per thread, for an indefinite amount of time. As such there is a higher risk involved with a software failure of the network-processor. To keep the actual effect of such a failure low, the network-processor has been designed so that a failure will not cause a total breakdown of the agent. When the network-processor process aborts completely, it is restarted by the platform-agent process. When this happens, already executing jobs are not affected, as the messages that need to be sent from the job-processor to the central Redwood Server are stored on disk. The network-processor simply picks these up when it is restarted.
Experience so far with the platform agent shows that it is very reliable, with the highest incidence of customer problems related to configuration issues.
Platform Agent Fail-Over Scenarios
The platform agents normally run on a particular system. If you need redundancy in process execution, you are not forced to use a clustered operating system or hardware solution.
The central Redwood Server connects to platform agents using the following criteria:
- hostname - should be set to an FQDN
- port
- sharedSecret
The port and sharedSecret must be the same for each server group; this way you simply change DNS entries to switch to another server. Ideally, the software should be installed on SAN storage attached to all servers in a group; this eases seamless transition and keeps output accessible regardless of the active server. Each server is configured to start the platform agent at startup: on Windows, you create a Windows service for the Scheduler Service Manager (servicemanager.exe); on UNIX, you use the platform-specific functionality (SMF, init, or launchd).
There are (at least) three scenarios for handling fail-over between platform agents:
- Use a clustering solution provided by the hardware or operating system vendor. In such a case you have a single process server and multiple platform agents on the clustered nodes. The clustering software provides virtual IP and file system handover.
- A single process server, with two platform agents (active and warm) on two different machines. Fail-over from the active to the warm node is handled either by changing the RemoteHostName in the process server (and restarting it) or virtual IP handover.
- Using two (or more) process servers each contacting a single platform agent. This provides redundancy and enhanced throughput, since it is a warm-warm scenario where both systems can perform useful work if both are functioning properly.
Let us examine these three scenarios in more detail.
Clustering Software by 3rd Party Vendor
The platform agent can run on a system that is part of a hardware cluster. Popular solutions that provide this are IBM HACMP, Sun Cluster, HP ServiceGuard, Microsoft Cluster Service, and Microsoft Failover Clustering.
Operation of this scenario is as follows: the cluster software determines where the software should run. If this changes, it moves resources to the (winner of the) surviving node(s). Typical resources that are available are virtual IP addresses, hostnames, file systems and software components.
When you intend to use such a solution you must store the agent instance data on a file system that is locally available to the active node in the cluster, and this file system must move with the agent service. The reason that this is important is that the file system is used to persist data about OS jobs. If the file system is not available, knowledge about jobs that finished on node A will be unavailable once the service moves to node B.
Custom Active-Warm Setup
Instead of a vendor solution you can also prepare a custom setup where a 'warm standby' platform agent is already installed and configured. When problems occur with the active node, it is only a question of either changing the RemoteHostName or changing the IP address of the warm standby host. This does provide a solution for starting new processes, but if you are not able to make the file system holding the status files of the old active node available to the new active node, then the status of those processes is lost. Whether this is serious depends on your environment; note that it is not the outcome itself that is lost, but only the status as known to the process scheduler.
Active-Active Setup
In many circumstances it is possible to set up two or more systems that can each run a particular class of processes, for instance OS reporting jobs that connect to a common database, or OS jobs that perform computations.
Redwood Server supports such scenarios using the queue mechanism. Create a single queue where all processes of the class of processes that can run on these systems are sent. Attach multiple process servers, each with a single platform agent, to this queue.
This gives great redundancy characteristics: if a system becomes unusable or unreachable then all remaining jobs will be executed on the remaining process servers. The status of the processes that were executing on those systems is determined once the connectivity is restored; either the system has crashed in which case the status becomes Unknown or the job has finished successfully and the status was written to (local) disk where it can be retrieved after the failure is repaired.
Note that you can have as many classes of processes as required, as you can create as many queues as you like, and every process server can work for as many queues as desired. For more information see the Queue mechanism in the Concepts section.
Licensing
As configuring high availability often involves allocating and configuring additional resources, licensing may be affected. Areas where licensing becomes a consideration are:
- When the application is run on multiple server nodes - the license must be valid for all server nodes.
- Where multiple platform agents are being used in an active-active setup - the license must contain a sufficient number of process servers.
If you configure a multi-node system and the license is not valid on all server nodes you will get a warning in the user interface indicating which nodes are not covered by the license. Should this occur, contact your account representative to get an appropriate license.
See Also
- Configuring Platform Agents for High Availability
- Configuring Web Application Clusters for High Availability