1. High Availability


In Platform Release 1 (PR1), a PI Server can become a collective composed of several machines supporting the same data services.  The main goal of this High Availability (HA) release is to maximize the user’s access to PI data while keeping details of the implementation as transparent as possible.  The most dramatic contribution of the PI-SDK to this goal is the support of failover.  When an application is connected to a collective server with the 1.3.4 PI-SDK and the server becomes unavailable, the Server object is automatically reconnected to another collective member server.

 

Primary and Secondary Servers

A limitation of the PR1 release is that not all member servers in a collective are equal.  Each collective has a single primary server which supports all the calls a non-replicated PI 3.4.370 server was capable of.  In addition, configuration changes made to a primary server are replicated to other members of the collective.  A collective also contains one or more secondary servers which have limited functionality.  The limitations include:

o       Writing “configuration data” is not supported.  This includes:

        o       Point attributes

        o       State Sets

        o       Users and Groups

        o       Module Database

o       Reading and writing batch data is not supported by default

o       Writing time series data using the SDK is not supported by default

Because of these limitations, an application that successfully connects to a primary server can function as it did with earlier versions, but if it connects or becomes connected to a secondary server it may be subject to the limitations above.  In general, the PI-SDK determines whether a feature required by a call is unavailable on the currently connected member server and automatically changes the connection to a server that supports it.  The exception is when no available server can support the requested function.  See Failover below for more details.

Initial Connection - Server.Open

To support the choice among member servers of a collective, the syntax of the connection string passed to Server.Open has been enhanced.  A new name value pair, ServerRole is added with the following values:

o       PreferPrimary (default)

o       RequirePrimary

o       Any

These values are used to influence the PI-SDK's selection when choosing a collective member server. 

“PreferPrimary” checks whether the primary server is available and, if so, selects it.  If the connection attempt to the primary fails for any reason other than a permission error, a secondary server is selected and another connection is attempted.  This continues until a connection is made or each server in the collective has been tried.

Specifying “RequirePrimary” instructs the SDK to only attempt a connection to the single primary server.  The result of the connection attempt is returned.  Once connected this way, if the primary server becomes unavailable, no attempt at failover is made.  Calls requiring server support will fail until the server becomes available.  Specifying RequirePrimary can be used to guarantee the behavior of earlier versions but it prevents client failover and does not support the high availability goal.

Using a ServerRole set to “Any” allows the SDK to choose the target member server based on its own criteria.  Currently this is based on a configurable priority property but may change to use dynamic load criteria as this becomes available.  With PR1, a user can alter the priority settings programmatically or using the PI-SDK Connection Manager.  By varying priority settings among a user base, customers can achieve static load balancing of applications that use the “Any” connection preference. 

If no ServerRole is specified the Open command will use a setting of “PreferPrimary”.  This default typically provides an application with behavior consistent with earlier servers but still allows initial connection or failover to a secondary when the primary is unavailable, enabling high availability in existing applications.

Another method of specifying a ServerRole is using a property of the Server’s IPICollective interface called ConnectionPreference.  This property, a string, takes the same syntax as the connection string passed to Open and can be used to specify the preference at any time.  This is useful if the initial Open uses one preference but for a subsequent failover you want to use a different preference.  An application may force a switch to a different collective member using the IPICollective::SwitchMember method.
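The effect of the ServerRole default can be sketched as follows.  This is an illustration only and does not call the PI-SDK.  It assumes the connection string is a semicolon-separated list of name=value pairs and that names and values are matched without regard to case; neither detail is stated in this section.  The UID and PWD entries in the usage note are placeholders, and parse_server_role is a hypothetical helper.

```python
# Illustrative sketch only; not part of the PI-SDK.
VALID_ROLES = ("PreferPrimary", "RequirePrimary", "Any")
DEFAULT_ROLE = "PreferPrimary"   # used when no ServerRole pair is present


def parse_server_role(connection_string):
    """Return the ServerRole named in a connection string, or the default.

    Assumes "name=value" pairs separated by ";" (an assumption, see above).
    """
    for pair in connection_string.split(";"):
        name, sep, value = pair.partition("=")
        if sep and name.strip().lower() == "serverrole":
            role = value.strip()
            for valid in VALID_ROLES:
                if role.lower() == valid.lower():
                    return valid
            raise ValueError("unknown ServerRole: " + role)
    return DEFAULT_ROLE
```

For example, parse_server_role("UID=demo;PWD=") yields "PreferPrimary", while parse_server_role("UID=demo;ServerRole=Any") yields "Any".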

 

Automatic Connection and Reconnection

With any server version, when an SDK call requires server access and a connection has not yet been established or has been lost, the SDK automatically attempts a connection.  When the target is a server collective, the HA PI-SDK considers the ConnectionPreference setting described above.  On a reconnection it reuses the previous connection string (if one was entered), so the new attempt follows the same server selection rules as the initial open.
 

Secondary Server Limitations

In a collective, the secondary servers do not support all the functions of a regular PI server.  In particular, general configuration changes (managing PIPoints, state sets, users and groups, and the module database) are not supported on a secondary.  Other functions are disabled by default on secondary servers (reading and writing batch data, and writing time series values with the PI-SDK).  These defaults can be modified by changing the PITimeout table on the PI server.  The HA SDK provides a view into these limitations through the IPICollective members Behaviors and AvailableCollectiveBehaviors.  These behaviors are primarily determined by querying the server to determine what has been disabled or enabled.  The server flag names and their corresponding representations in the behavior collections are shown below:

 

Name                  Primary Default   Secondary Default   Override Setting

AllowBatchReads       True              False               Replication_EnableBDBAccess

AllowBatchWrites      True              False               Replication_EnableBDBAccess

AllowSDKWriteValues   True              False               Replication_EnableSDKWriteValues

AllowConfigWrites     True              False               None

 

When a call encounters one of these restrictions, the HA SDK attempts to switch to the primary server (see Automatic Failover below).  If the primary is unavailable, an error is returned without placing a server call. 
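The defaults in the table above can be expressed as a small lookup.  This sketch does not query a server or use the IPICollective interface.  It assumes that enabling an override setting turns the corresponding behavior on for a secondary, which the setting names suggest but this section does not spell out.  The function name and its parameters are hypothetical.

```python
# Illustrative sketch only; the values are copied from the table above.
# name: (primary default, secondary default, override setting or None)
BEHAVIORS = {
    "AllowBatchReads": (True, False, "Replication_EnableBDBAccess"),
    "AllowBatchWrites": (True, False, "Replication_EnableBDBAccess"),
    "AllowSDKWriteValues": (True, False, "Replication_EnableSDKWriteValues"),
    "AllowConfigWrites": (True, False, None),
}


def is_allowed(behavior, is_primary, enabled_overrides=frozenset()):
    """Report whether a behavior is available on a member server.

    enabled_overrides holds the override settings switched on for the
    secondary (assumption: an enabled override allows the behavior).
    """
    primary_default, secondary_default, override = BEHAVIORS[behavior]
    if is_primary:
        return primary_default
    if override is not None and override in enabled_overrides:
        return True
    return secondary_default
```

Because AllowConfigWrites has no override setting, it stays False on a secondary no matter which overrides are enabled.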

Client Collective Configuration

The configuration information for a collective is more extensive than for a single server, but the process is simple: every member of a collective can provide the details of the collective structure, so the SDK queries for this information and configures the client accordingly.

The behavior of the Servers.Add method has been modified so that when the confirm flag is set to true, a connection is made to the specified path and the collective structure information is retrieved and added to the Known Servers Table (KST).  The confirm flag has always been available, but only the current version of the connection manager sets it to true by default. 

When Server.Open opens an existing server entry that was formerly a non-collective server and has since become a collective, or opens an unconfirmed server that is a collective, the initial connection process detects the change in structure, queries the server for the collective structure, and updates the KST with the required information. 

The new connection manager supports this process through the use of these methods.

Collective member servers are in constant communication, synchronizing configuration changes and status information.  When the PI-SDK establishes an initial connection with a collective member, it retrieves availability information for the whole collective.  While connected, the member server alerts each client connection to any change in the availability of the member servers.  Knowing the availability of each member server allows the PI-SDK to avoid trying to connect to servers that are marked unavailable, except as a last resort.  For example, in a collective consisting of 1 primary and 3 secondaries, an application initially establishes a connection with the primary.  During the connection it learns that two of the secondary servers, with local priority settings of 1 and 2, are down (perhaps being upgraded).  When the primary later goes down, the PI-SDK attempts a failover to the third secondary member server first, even though the priority settings would suggest one of the other secondaries, because it has been told those servers are unavailable.  Note that if this attempt fails, the PI-SDK will still attempt to connect to each member in case its status information is out of date. 

A new Windows event has been added to the Servers collection to notify applications when an availability change has occurred.  An application can use calls to the CollectiveMember StatusInfo and the IPICollective Behaviors and AvailableCollectiveBehaviors properties in conjunction with this event to enable and disable user interface features to reflect the availability of functionality it provides. 

Server Selection Algorithm

The rules described below are those followed in the initial release of HA.  As the platform evolves and servers supply more information (CPU usage, connection count, etc.), these rules will likely change to promote more efficient resource allocation through dynamic load balancing.

The connection preference, described above, governs the connection logic.  A setting of “RequirePrimary” explicitly sets the server to the single primary.  A setting of “PreferPrimary” also returns the primary server unless the primary is unavailable, at which point the condition is equivalent to a connection preference of “Any” and the following selection rules apply.

The member server with the smallest priority setting (1 is first, 2 is second, and so on) is chosen unless the collective has indicated that the member is unavailable.  A connection attempt is made to the chosen server.  If it succeeds, the process ends.  If it fails, the available member server with the next smallest priority is chosen.  If all available servers have been tried without success, the servers marked unavailable are tried in order of priority as above.  The one exception is a member that the user has marked with a priority of -1, which indicates that connections to that member should not be made.  This process continues until a connection to every member of the collective has been attempted.  If no connection has been established, an error is returned.  The next PI-SDK call that requires server access begins the process again, starting with the connection preference. 
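The selection rules above can be sketched as an ordering function.  This is a model of the described rules, not PI-SDK code: the Member class, candidate_order, connect and the try_connect callback are all hypothetical names.  The text does not say how a priority of -1 interacts with the primary, so the sketch assumes it excludes a member under “PreferPrimary” and “Any” but not under “RequirePrimary”.

```python
# Illustrative sketch only; models the rules in this section.
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    is_primary: bool
    priority: int            # smaller is tried first; -1 means never connect
    available: bool = True   # availability as reported by the collective


def candidate_order(members, preference="PreferPrimary"):
    """Return members in the order a connection would be attempted."""
    primary = [m for m in members if m.is_primary]
    if preference == "RequirePrimary":
        return primary                      # only the primary is ever tried
    eligible = [m for m in members if m.priority != -1]
    by_priority = sorted(eligible, key=lambda m: m.priority)
    # Available members first, then members marked unavailable as a last resort.
    ordered = [m for m in by_priority if m.available]
    ordered += [m for m in by_priority if not m.available]
    if preference == "PreferPrimary":
        # An available primary goes to the front; otherwise behave like "Any".
        head = [m for m in primary if m.available and m.priority != -1]
        ordered = head + [m for m in ordered if m not in head]
    return ordered


def connect(members, try_connect, preference="PreferPrimary"):
    """Try each candidate in turn; try_connect(member) returns True on success."""
    for member in candidate_order(members, preference):
        if try_connect(member):
            return member
    raise ConnectionError("no collective member could be reached")
```

Applied to the earlier example of one primary and three secondaries, where the primary and the secondaries with priorities 1 and 2 are marked unavailable, candidate_order places the third secondary first and then falls back to the unavailable members in priority order.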

Failover

When an application using the HA PI-SDK and attached to a collective loses its connection (e.g. network failure, software crash, hardware failure), or the user initiates a member switch with the Connection dialog, or the server is shut down, the PI-SDK automatically attempts a reconnection to another member.  In a typical case, the application is attached to the primary and, upon losing its connection, tries a connection to each available secondary member server of the collective until a new connection is achieved or all have been tried.  With the primary down, the secondary member servers are tried in order of their configured priority and availability (see Server Selection Algorithm above). 

During the failover process, when a server connection is lost, a Disconnect event is fired and when reconnected, an Open event is fired. 

Note that during a failover, signups for change notification (EventPipes) on one server need to be reestablished on the next server.  During the transition period some events can be missed.  Also, because replication is not instantaneous, because data delivery is fanned from input sources, and because there is latency (or even disconnection and buffering) between the data source and the server, the data streams of the initial connection and the new connection after a failover may not be perfectly synchronized.  For example, after failing over, the new server may not yet have seen data that was already delivered to the first server, resulting in duplicate data arriving in EventPipes.  When the new server is ahead of the original server, events that had not yet been delivered to the original EventPipe are already on the new server and are not sent again.  This can result in a gap in the event stream in an EventPipe.  However, the archives of both servers eventually hold the same complete history.  This behavior can be observed in ProcessBook when a gap is displayed during a failover but, on refreshing the data from the new server, a continuous trace is displayed. 
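The duplicate and gap cases can be illustrated with a toy model.  It assumes the data source numbers its events 1, 2, 3 and so on, that the original EventPipe had delivered events up to last_delivered, and that the new server had already received events up to new_server_position when the signup was reestablished, so it sends only later events.  These names and the numbering are modelling assumptions, not PI-SDK behavior or API.

```python
# Illustrative toy model only; not PI-SDK code.
def failover_effect(last_delivered, new_server_position):
    """Classify what an EventPipe client sees after a failover.

    last_delivered:      last event number received from the original server.
    new_server_position: last event number the new server had already
                         received when the signup was reestablished.
    """
    if new_server_position < last_delivered:
        # New server is behind: it later receives and sends events the
        # client has already seen, so they arrive a second time.
        duplicates = list(range(new_server_position + 1, last_delivered + 1))
        return {"duplicates": duplicates, "gap": []}
    # New server is ahead (or level): events it already holds are not sent
    # again, so the client never sees them through the EventPipe.
    gap = list(range(last_delivered + 1, new_server_position + 1))
    return {"duplicates": [], "gap": gap}
```

With last_delivered of 10, a new server at position 7 produces duplicates 8 through 10, while a new server at position 13 leaves a gap of 11 through 13.  In both cases the archives converge, which is why refreshing from the new server shows a continuous trace.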

Automatic Failover

When a call is made to use a server feature that is disabled on the currently connected server (see Secondary Server Limitations above), the PI-SDK will attempt to automatically fail over to the primary server if it is available before placing the call.  This allows client applications unaware of restrictions to continue to work properly when the primary server in a collective is available.
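The decision made before a call is placed can be sketched as follows.  This models the rule described above and does not use the PI-SDK; route_call, FeatureUnavailable and the secondary_allows mapping are hypothetical names introduced for illustration.

```python
# Illustrative sketch only; not PI-SDK code.
class FeatureUnavailable(Exception):
    """Raised when no reachable member supports the requested feature."""


def route_call(feature, connected_to, primary_available, secondary_allows):
    """Return which member role should service a call, before placing it.

    connected_to:     "primary" or "secondary" (the current connection).
    secondary_allows: mapping of feature name to True/False on the secondary.
    """
    if connected_to == "primary" or secondary_allows.get(feature, False):
        return connected_to            # current member supports the call
    if primary_available:
        return "primary"               # switch members first, then place the call
    # No server call is placed when the feature cannot be serviced.
    raise FeatureUnavailable(feature + " is not available on any reachable member")
```

An application connected to a secondary that requests a disabled feature is therefore moved to the primary when it is up, and receives an error without a server call when it is not.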

 
