Application Continuity for Oracle workloads with Amazon RDS Custom for Oracle

Customers running Oracle workloads choose Amazon Relational Database Service (Amazon RDS) Custom for Oracle to access their database environment and operating system, while maintaining the administrative ease, durability, and scalability of a managed service. Access through RDS Custom for Oracle enables the customizations that your workloads demand. Application Continuity is one such customization that you can implement on RDS Custom for Oracle. Application Continuity helps critical workloads such as banking applications with strict Recovery Point Objective (RPO) and Recovery Time Objective (RTO) meet their availability requirements.

In this post, we show you how to implement Application Continuity in an RDS Custom for Oracle environment using a sample application. We also show you how to test the implementation to see that, when an outage occurs at the database tier, the application recovers and resumes without any data loss—automatically and transparently—along with the database failover. Finally, we show you how to verify the results before cleaning up the environment.

What is Application Continuity?

Application Continuity (AC) is a feature that enables replay of a request against an Oracle database after a recoverable error that makes the database session unavailable. As a result, the outage appears to the user as no more than a delay in the request. Client requests can contain transactional and non-transactional work. After a successful replay, the application can continue where that database session left off. With AC, the user experience is improved by masking many outages—planned and unplanned—without the application developer needing to attempt to recover the request. Oracle Active Data Guard supports AC, and RDS Custom for Oracle lets you configure Oracle Active Data Guard. See Ensuring Application Continuity and Application Continity for Java to learn more about AC. On the client side, AC supports thin Java-based applications, OCI and ODP.NET based applications with support for open source drivers, such as Node.js and Python. See the Application Checklist for Continuous Service for MAA Solutions to know more about configuring clients.

Solution overview

For this solution, we have two RDS Custom for Oracle instances, one configured as the primary and the other as the physical standby, in an Oracle Active Data Guard configuration. The instances are running in two different Availability Zones within the same AWS Region to provide high availability. The observer instance is running in a third Availability Zone. The observer instance also acts as the database client, and hosts the sample Java application, so we don’t need another Amazon Elastic Compute Cloud (Amazon EC2) instance. Also, we open the standard ports for Oracle Notification Service (ONS) to propogate database events only between the database on the database servers and the connection pool on the client host. Auto-ONS is turned on as a result, so no manual configuration of ONS is necessary.

There is no restriction on the placement of these instances. All of them could be in the same Availability Zone, or spread out in different Regions. It depends on the high availability and disaster recovery protection you are looking for and your constraints.

The solution has the following components:

The Active Data Guard configuration runs in the maximum availability mode
The Data Guard broker runs on all the three instances
The observer process runs on the observer instance
The observer instance also has Java JDK 11, along with the JAR files needed for AC (ojdbc11.jar, ucp11.jar, and ons.jar)

The following diagram illustrates the architecture.

Architecture

To implement and test this solution, you complete the following high-level steps:

Configure an Oracle database service to support AC.
Create database triggers to make sure the AC service runs automatically and only on the primary DB instance.
Configure the database connect string to support AC.
Enable Fast Application Notification for all three instances.
Verify that the Data Guard configuration is up and running with Fast-Start Failover enabled and observer running.
Create necessary database objects for the test.
Create a sample Java application incorporating AC enablement.
Run the Java application, and as it progresses, initiate a database failover by crashing the primary database (using shutdown abort), interrupting the program’s progress.
Watch the log file of the observer process as the failover happens.
Watch the application as it resumes automatically after the failover without error.
Connect to the database and verify that the transactions are complete and error free.

Prerequisites

To follow along, you must have the following:

Two RDS Custom for Oracle instances in an Oracle Active Data Guard configuration and an EC2 instance with Oracle 19c client software to serve as the observer as well as the client. A Data Guard broker is running with Fast-Start Failover enabled, and the observer is started. For instructions, see Build high availability for Amazon RDS Custom for Oracle using read replicas. Alternatively, to build it using AWS CloudFormation templates, see Automate high availability setup in Amazon RDS Custom for Oracle. The preceding procedures help you create a standby in the same Region as the primary. To use different Regions, you need to build the standby manually. RDS Custom for Oracle is a bring your own license (BYOL) service. You should make sure you have Active Data Guard option along with an Oracle EE license. See Amazon RDS Custom pricing to learn about AWS charges.
Java JDK 11 installed on the EC2 instance created for the observer in the preceding step (the sample program given here works with Java JDK 17 as well). The code uses a thin JDBC driver and an Oracle Universal Connection Pool (UCP) to connect to the database.
JAR files ojdbc11.jar, ucp11.jar, and ons.jar on the observer EC2 instance (downloaded from Oracle Database JDBC driver and Companion Jars Downloads).
Port 6200 open for inbound traffic in the security groups of all three EC2 instances, allowing traffic only from those three instances. Auto-configured Oracle Notification Service (ONS) uses this port for intra-node communication. If you want to use a different port, you also have to start ONS explicitly and keep it running on all the three instances.

Create the database services

Connect to the primary database as an admin user through SQL*Plus, and using the DBMS_SERVICE package, create a database service that supports AC. We set the service name to DGACO in this example. The network name, the name of the service as used in SQLNet connect descriptors for client connections, can be different, but for ease of use, we use the same name. Although the values of the first six parameters are necessary to enable AC, you can adjust the numerical values of the following parameters as needed. See the following code:

execute dbms_service.create_service ( -
  service_name => 'DGACO', network_name => 'DGACO');

DECLARE
  params dbms_service.svc_parameter_array;
BEGIN
  params('FAILOVER_RESTORE'):='LEVEL1';
  params('COMMIT_OUTCOME'):='TRUE';
  params('FAILOVER_TYPE'):='TRANSACTION';
  params('AQ_HA_NOTIFICATIONS'):='TRUE';
  params('STOP_OPTION'):='IMMEDIATE';
  params('SESSION_STATE_CONSISTENCY'):= 'DYNAMIC';
  params('REPLAY_INITIATION_TIMEOUT'):=600;
  params('RETENTION_TIMEOUT'):=86400;
  params('DRAIN_TIMEOUT'):=300;
  params('FAILOVER_DELAY'):=5;
  params('FAILOVER_RETRIES'):=30;
  dbms_service.modify_service('DGACO',params);
END;
/

execute dbms_service.start_service (service_name => 'DGACO');

Create a database trigger

From the same session, create a database trigger on the primary database and enable it. This trigger verifies that the AC service created in the preceding step runs automatically and only on the primary DB instance at a given time. See the following code:

CREATE OR REPLACE TRIGGER "SYS"."START_AC_SERVICE"
AFTER STARTUP ON DATABASE

DECLARE
    CURRENT_DATABASE_ROLE       VARCHAR2(16)    :='' ;

BEGIN
--  Find out if the database is the primary database.
    SELECT DATABASE_ROLE
      INTO CURRENT_DATABASE_ROLE
      FROM SYS.V_$DATABASE ;
--  Ensure that the SERVICE is stopped
    BEGIN
       DBMS_SERVICE.STOP_SERVICE('DGACO') ;
    EXCEPTION
       WHEN OTHERS THEN
         NULL ;
    END ;
--  Start the SERVICE for PRIMARY role
    IF CURRENT_DATABASE_ROLE = 'PRIMARY' THEN
       DBMS_SERVICE.START_SERVICE('DGACO') ;
    END IF ;
END ;
/

ALTER TRIGGER "SYS"."START_AC_SERVICE" ENABLE ;

Edit the tnsnames.ora file to define a connect string for the AC service

Use the following code to edit the tnsnames.ora file:

DGACO =
  (DESCRIPTION =
    (CONNECT_TIMEOUT=90)
    (RETRY_COUNT=30)
    (RETRY_DELAY=5)
    (TRANSPORT_CONNECT_TIMEOUT=5)
    (ADDRESS_LIST=
      (LOAD_BALANCE=off) (FAILOVER=on)
      (ADDRESS = (PROTOCOL = TCP)(HOST = <PriDBHostIP>)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = <SbyDBHostIP>)(PORT = 1521))
    )
    (CONNECT_DATA = (SERVER = DEDICATED) (SERVICE_NAME = DGACO))
  )

Enable Fast Application Notification and ONS

Fast Application Notification (FAN), the feature that enables the application to respond to the FAN events, needs the ONS ports to be opened on the database and the client hosts. Modify the inbound rules of the security groups of all three EC2 instances to allow port 6200.

Security group settings for ONS

There are no code changes to use FAN. FAN is automatically configured and enabled out of the box. You need to confirm appropriate settings on the client side. See the appendix of the FAN whitepaper for details. When the client connects, the Oracle database reads the URL or TNS connect string and automatically configures FAN at the client.

Verify a valid Data Guard configuration

If you have fulfilled the prerequisites discussed earlier, the Data Guard configuration should be valid and running. If necessary, use DGMGRL to restart the observer to capture the logs in a file:

DGMGRL> stop observer;
DGMGRL> start observer ObsOnCli in background 
file is '/home/oracle/gramacha/observer.dat' 
logfile is '/home/oracle/gramacha/observer.log' 
connect identifier is '<Pri_StaticConnectIdentifier>';

DGMGRL> show configuration verbose;

The output should be similar to the following:

DGMGRL> show configuration verbose; 

Configuration - nyc-lax

  Protection Mode: MaxAvailability
  Members:
  lax_a - Primary database
    nyc_a - (*) Physical standby database 

  (*) Fast-Start Failover target

  Properties:
    FastStartFailoverThreshold      = '15'
    OperationTimeout                = '30'
    TraceLevel                      = 'USER'
    FastStartFailoverLagLimit       = '30'
    CommunicationTimeout            = '180'
    ObserverReconnect               = '0'
    FastStartFailoverLagGraceTime   = '0'
    FastStartFailoverAutoReinstate  = 'TRUE'
    FastStartFailoverPmyShutdown    = 'TRUE'
    BystandersFollowRoleChange      = 'ALL'
    ObserverOverride                = 'FALSE'
    ExternalDestination1            = ''
    ExternalDestination2            = ''
    PrimaryLostWriteAction          = 'CONTINUE'
    ConfigurationWideServiceName    = 'NYC_CFG'
    FastStartFailoverLagType        = 'APPLY'

Fast-Start Failover: Enabled in Zero Data Loss Mode
  Lag Limit:          30 seconds (not in use)
  Lag Type:           APPLY
  Threshold:          15 seconds
  Active Target:      nyc_a
  Potential Targets:  "nyc_a"
    nyc_a      valid
  Observer:           obsoncli
  Shutdown Primary:   TRUE
  Auto-reinstate:     TRUE
  Observer Reconnect: (none)
  Observer Override:  FALSE
  Lag Grace Time:     0 seconds

Configuration Status:
SUCCESS

DGMGRL>

Create database objects for the sample application

Connect to the primary database as an admin user in SQL*Plus and run the following:

create user ACDEMO identified by “<password>” 
     default tablespace USERS 
     temporary tablespace TEMP 
     quota UNLIMITED on USERS;

grant create session, create table to ACDEMO;

Connect to the primary database as the user ACDEMO in SQL*Plus and run the following:

create table transactions (
     xact_id    number(8),
     xact_date  date,
     host_submitted_on  varchar2(24));

create table steps (
     xact_id    number(8),
     xact_step  number(8),
     step_date  date,
     host_executed_on   varchar2(24));

Create a sample application

In the observer instance (which is also the database client in this case), create a Java program and save it as ACDemo.java using the following sample code. When compiled and run, the program connects to the database user ACDEMO and runs a nested loop of transactions. The outer loop inserts a parent transaction with a single INSERT statement and commits, and the inner loop step incorporates a series of INSERT statements without a commit in between, representing individual steps of the parent transaction referenced earlier. When testing, you will simulate a failure during the running of the inner loop, so you will have some INSERT statements that have run but not been committed at the time of the database failure.

The code also incorporates other requirements for enabling AC:

Draining – For planned maintenance, the recommended approach is to provide time for current work to complete before maintenance is started. You do this by draining work. Using an Oracle connection pool is one of the methods available for draining. You will use Oracle’s Universal Connection Pool (UCP) here. Always return the connections to the connection pool, and use connection tests (ValidateConnectionOnBorrow = true).
Fast Connection Failover (FCF) – The FCF feature is a FAN client implemented through the connection pool. Enable FCF (setFastConnectionFailoverEnabled = true).
Connect wait timeout – Set connect wait timeout (ConnectionWaitDuration = 3).
Auto-commit – Disable auto-commit (CONNECTION_PROPERTY_AUTOCOMMIT = false).
JDBC statement cache – Use the JDBC statement cache (CONNECTION_PROPERTY_IMPLICIT_STATEMENT_CACHE_SIZE = 100).

Create your application with the following code:

import java.sql.Connection;
import java.sql.Statement;
import java.util.Properties;
import java.util.Random;
import java.time.Duration;
 
// We define the Pool Data Source here
import oracle.jdbc.OracleConnection;
import oracle.ucp.jdbc.PoolDataSourceFactory;
import oracle.ucp.jdbc.PoolDataSource;
 
public class ACDemo {
 
    public static void main(String[] args) {
        usePoolDataSource();
    }
     
    public static void usePoolDataSource() {
        try {           
            // Choose the UCP
            PoolDataSource pds = PoolDataSourceFactory.getPoolDataSource();
            pds.setConnectionFactoryClassName("oracle.jdbc.datasource.impl.OracleDataSource");
            pds.setConnectionPoolName("JDBC_UCP");
            pds.setMinPoolSize(4);
            pds.setMaxPoolSize(20);
             
            // Configure the database connection
            pds.setURL("jdbc:oracle:thin:@(DESCRIPTION=(CONNECT_TIMEOUT=90)(RETRY_COUNT=30)(RETRY_DELAY=5)(TRANSPORT_CONNECT_TIMEOUT=5)" +
                       "(LOAD_BALANCE=off)(FAILOVER=on)" +
                       "(ADDRESS=(PROTOCOL=TCP)(HOST=10.0.0.36)(PORT=1521))(ADDRESS=(PROTOCOL=TCP)(HOST=10.0.0.168)(PORT=1521))" +
                       "(CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=DGACO)))");
            pds.setUser("ACDEMO");
            pds.setPassword("<password>");
             
            // Application Continuity prerequisites
            pds.setConnectionWaitDuration(Duration.ofSeconds(3));
            pds.setFastConnectionFailoverEnabled(true);
            pds.setValidateConnectionOnBorrow(true);
             
            // Disable the auto-commit and set the cache size
            pds.setConnectionProperty(OracleConnection.CONNECTION_PROPERTY_AUTOCOMMIT, "false");
            pds.setConnectionProperty(OracleConnection.CONNECTION_PROPERTY_IMPLICIT_STATEMENT_CACHE_SIZE, "100");
            
            Random rand1 = new Random();
            int xid = rand1.nextInt(2)+3; 
            System.out.println("Total transactions " + xid);
            for (int i=1; i<=xid; i++) {
                Random rand2 = new Random();
                int xstep = rand2.nextInt(2)+3;

                //get first connection from the pool
                Connection dbconn01 = pds.getConnection();
                //dbconn01.beginRequest(); // Unnecessary since getConnection() sets the explicit Begin Request boundary
                dbconn01.setAutoCommit(false);
                Statement insstmt1 = dbconn01.createStatement();                                
                // Insert the transaction
                String insert01 = "insert into transactions (xact_id, xact_date, host_submitted_on) "
                                  + "values (" + i + ", sysdate, "
                                  + "sys_context('userenv','server_host'))";
                insstmt1.executeUpdate(insert01);
                dbconn01.commit() ;
                //dbconn01.endRequest(); // Unnecessary since close() sets the explicit End Request boundary
                dbconn01.close();
                dbconn01=null;          

                // Get second connection from the pool
                Connection dbconn02 = pds.getConnection();
                //dbconn02.beginRequest(); // Unnecessary since getConnection() sets the explicit Begin Request boundary
                dbconn02.setAutoCommit(false);
                Statement insstmt2 = dbconn02.createStatement();                                
                // Insert the steps 
                // We will simulate the failure during this loop
                for (int j=1; j<=xstep; j++) { 
                    String insert02 = "insert into steps (xact_id, xact_step, step_date, host_executed_on)"
                                      + "values (" + i + ", " + j + ", sysdate, "
                                      + "sys_context('userenv','server_host'))";
                    insstmt2.executeUpdate(insert02);
                    Random rand3 = new Random();
                    int steptime = rand3.nextInt(6000);
                    Thread.sleep(steptime);
                    System.out.println("Executing Transaction ID " + i + " Step: " + j + " of " + xstep + "; Duration: " + steptime + "ms");  
                }
                dbconn02.commit();
                //dbconn02.endRequest(); // Unnecessary since close() sets the explicit End Request boundary
                dbconn02.close();
                dbconn02=null;          
            }
        }
        catch (Exception e) {
            e.printStackTrace();
        }  
    }
}

This is a sample program (a modified version of code from database-heartbeat.com), employing a minimalistic approach to demonstrate AC in its basic form. For other production uses, you should develop your own approach to employ all checks and features as dictated by your enterprise standards and business use case. Also, do not use hard-coded passwords in the code, and follow necessary policies while choosing the passwords. Finally, you should explore other associated features such as Fast Application Notification, Oracle Notification Service, Fast Connection Failover, Transaction Guard, Database Draining, and the extended feature of AC called Transparent Application Continuity (TAC) before proceeding further.

Compile the code using the following commands:

$ JAVA_HOME=<JDK11-Dir>
$ CLASSPATH="/<path>/ojdbc11.jar:/<path>/ucp11.jar:/<path>/ons.jar"
$ export JAVA_HOME
$ export CLASSPATH
$ $JAVA_HOME/bin/javac ACDemo.java

Run the application and invoke a primary database failure

From the same terminal (T1), run the application:

$ $JAVA_HOME/bin/java ACDemo

While the application is running, go to another terminal (T2) connected to the primary database as SYSDBA in SQL*Plus and run the following command:

SQL> shutdown abort

The following screenshots show the T1 terminal (left) as the application waits, and the T2 terminal (right) of the forced database shutdown.

Application waits as the database failover is initiated

Monitor the database failover

In another terminal (T3), connected to the observer node, monitor the observer log file using the following command:

$ tail -f observer.log

The failover starts after FastStartFailoverThreshold property value, which is set to 15 seconds here.

Observer log: Fast-start Failover starting

After a few seconds, the failover will be complete.

Observer log: Fast-start Failover completing

Watch the application resume and finish

Now that the failover has succeeded, go back to terminal T1. In the following screenshots, the application resumes and finishes (left) and the original primary database is still down (right).

Application resumes and finishes as the database failover is completed

Verify transactions

Now you can connect to the ACDEMO user in SQL*Plus and issue the following commands:

break on xact_id on xact_date on host_submitted_on skip 1
select a.xact_id, a.xact_date, 
       a.host_submitted_on, b.host_executed_on,
       b.xact_step, b.step_date
from   transactions a, steps b
where  a.xact_id=b.xact_id
order by b.xact_id, b.xact_step;

Verifying the transactions

The result should show that all transactions, including all their steps, are complete with a delay of under 1 minute between them, but without any errors. In our example, the second transaction was submitted when the session was connected to the primary database (on ip-10-0-0-36), and the replay actually ran on the new primary database (on ip-10-0-0-168). You can see when the database failover occurred from the time stamps of the second transaction’s step 2 (14:56:07) and step 3 (14:56:58), and that the application resumed within a minute on the new primary DB instance.

Clean up

The services and components that you used for this solution are limited to the two RDS Custom for Oracle instances and the third EC2 instance for the observer. To clean up the resources you created, delete the RDS Custom instances and stop the EC2 instance with the observer.

Summary

In this post, you learned how to implement Application Continuity for your Oracle workloads in an RDS Custom for Oracle environment. You also learned how to test the implementation using a sample Java application. In the test, when you simulated a database failure, not only did the database fail over, but the application also failed over—automatically and transparently. Lastly, you verified that the application ran successfully without any interruptions or errors.

If you have questions or comments, please leave them in the comments section.

About the Author

Govardhanan Ramachandran is an Enterprise Support Lead Technical Account Manager at AWS. As part of AWS Energy, he helps his customers in the energy sector drive their adoption and use of AWS services. He provides them technical guidance to plan and build solutions using best practices, and to strive for maximum value from AWS. Govardhanan likes reading books and watching movies in his spare time.

AWS Database Blog