Real Time Analytics DW DR Architecture (Part II) - DR Recovery Operations
Introduction
In the first part of the blog Real Time Analytics DW DR Architecture (Part I), I described how to enable a disaster recovery architecture for a Real-Time Data Warehouse solution in OCI as depicted in the picture below:
In summary, with this architecture configuration, you have:
In RegionA:
- An ADW primary instance (OCIDBTARGET01) that, with Autonomous Data Guard, continuously aligns the Standby remote instance OCIDBTARGET01_Remote in RegionB.
- An OCI GG deployment (ocigg-RegionA) up and running that replicates data to the target ADW (OCIDBTARGET01).
- An oci-cli command scheduled on a Linux VM that periodically creates a manual backup of the OCI GG deployment and stores the backup files in the bucket GG-backup-RegionA.
- The OS bucket GG-backup-RegionA with cross-region replication enabled has as a target the bucket cross-GG-backup-RegionA in RegionB.
In RegionB:
- An ADW “Standby” instance (OCIDBTARGET01_Remote) aligned with the primary instance OCIDBTARGET01 in RegionA.
- An OCI GG secondary deployment (ocigg-RegionB) that can be restored with the backup files stored in the bucket GG-backup-RegionB.
- The OS bucket cross-GG-backup-RegionA that contains the OCI GG backup files (ocigg-backup) replicated from RegionA.
- An OCI Function triggered by an Event Rule whenever a backup file is created or updated in the bucket cross-GG-backup-RegionA, that copies the backup files in the bucket GG-backup-RegionB.
In the second part of the blog, I simulate a disaster in RegionA and describe the operational tasks that you need to perform to properly start the workload in RegionB.
Real-Time DW failure simulation
Initial State
Before a failure occurs, the workload is up and running properly in the primary OCI Region (RegionA, eu-milan-1). OCI GoldenGate processes are running and ready to update in real-time the primary ADW with changes that occur in the SourceDB. In the secondary OCI Region (RegionB, eu-frankfurt-1) the Standby ADW instance is active and constantly aligned with the primary instance, while the OCI GoldenGate secondary deployment is inactive.
RegionA, Autonomous Data Warehouse Primary instance:
RegionB, Autonomous Data Warehouse Standby instance:
RegionA, OCI GoldenGate primary deployment:
Fig.5: RegionB, OCI GoldenGate secondary deployment:
Backup of the OCI GoldenGate primary deployment in RegionA is periodically created in an Object Storage bucket in RegionA (GG-backup-RegionA), replicated in a bucket in RegionB (cross-_GG-backup-RegionA_) and finally copied in the Object Storage bucket where secondary OCI GoldenGate stores the manual backup files (GG-backup-RegionB).
RegionA, OCI GoldenGate manual backup (triggered by the oci-cli command script):
RegionA, OCI GoldenGate manual backup file in the Object Storage bucket:
RegionB, OCI GoldenGate manual backup file from RegionA replicated in the Object Storage bucket in RegionB:
RegionA, OCI GoldenGate backup file copied (by an OCI Function) in the Object Storage bucket reserved for OCI GG manual backup in RegionB:
Simulation of failure in RegionA
To simulate a failure in RegionA, I just turn off the primary OCI GoldenGate deployment and the primary ADW instance in RegionA.
RegionA, OCI GoldenGate deployment inactive:
RegionA, Autonomous Data Warehouse primary instance stopped:
The picture below shows how the architecture deployment would appear in real-life immediately after the failure in RegionA.
After the failure simulation, I insert a record in a table in the SourceDB that should be replicated in the target ADW. This is to show how, after OCI GoldenGate and ADW will be recovered in RegionB, it will also be possible to recover the transactions that have taken place in the source database in the meantime.
As an example, I insert a row in the table SRC_REGION of SourceDB configured in the OCI GoldenGate extract parameter file.
Connecting with SQL Developer Web to the SourceDB, I insert a record and commit the transaction:
The new record is then visible in the SRC_REGION table of SourceDB:
Real-Time DW Recovery Operational Tasks
You need to execute some operational tasks to properly start the workload in RegionB:
- Execute ADW Switchover.
- Start OCI GoldenGate deployment.
- Restore OCI GoldenGate deployment.
- Update the connection assigned to the OCI GoldenGate deployment.
- Start OCI GoldenGate Extract and Replicat process.
Steps to recover the workload in RegionB:
1) Execute ADW Switchover: From the OCI console, you can execute the ADW switchover. In this way, the ADW instance in RegionB (OCIDBTARGET01_Remote) changes its role to Primary:
Switchover of the ADW instance in RegionB:
ADW instance in RegionB is now primary:
By querying the SRC_REGION table in the replicated schema (SRCMIRROR_OCIGGLL) you can see that the new record inserted in the source DB during the downtime, as expected, is not yet present in the target ADW:
RegionB, OCI_REGION table not yet aligned with SourceDB after switchover:
2) Start OCI GoldenGate secondary deployment: From the OCI Console, start the OCI GoldenGate deployment (oci-GG-RegionB).
OCI GoldenGate deployment in RegionB is now active:
Once active, you can see that, since it’s never been used so far, at this moment the deployment has no extract or replicat process.
RegionB, OCI GoldenGate extract and replicat processes before restoring:
And it has two assigned connections. Connection to the source database (SourceDB) and the connection to the target database (TargetOCIDB) in RegionB (eu-frankfurt-1) which is the target ADW database that has now become primary.
RegionB, OCI GoldenGate associated connections before restoring:
3) Restore OCI GoldenGate secondary deployment
You must restore the OCI GoldenGate deployment in RegionB using the manual backup replicated from RegionA. You need to browse the deployment backups to find the manual backup that has the file bkp-ocigg as Object name and the bucket GG-backup-RegionB as Backup location. Then you can restore from that file.
OCI GoldenGate deployment backup to restore:
RegionB, OCI GoldenGate deployment restoring:
4) Update the connection assigned to the OCI GoldenGate secondary deployment
After restoring, you need to update the connections in the OCI GoldenGate configuration because now:
- The OCI GoldenGate backup from RegionA has the connection to Source DB (SourceDB) with the COLOCATION and WALLET_DIRECTORY parameters pointing to RegionA (eu-milan-1)
- The OCI GoldenGate backup from RegionA has the connection to Target DB (TargetOCIDB) pointing to ADW in RegionA (eu-milan-1), in addition to the wrong COLOCATION and WALLET_DIRECTORY parameters_.
RegionB, OCI GoldenGate connections after restoring:
To update connections, you can reassign (Unassign and then Assign again) one of the two connections to the OCI GoldenGate deployment using the OCI Console.
Update OCI GoldenGate connections, first step: unassign the connection:
Update OCI GoldenGate connections, second step: reassign the same connection:
Reassigning a connection to the OCI GoldenGate deployment forces a refresh of the connections used by the OCI GoldenGate processes. As you can see from the GoldenGate console, the OCI GoldenGate deployment again has the proper connection for RegionB:
OCI GoldenGate updated connections:
Note: Alternatively, to update the connections, you can also manually edit the connections using the OCI GoldenGate console (in the Configuration section). To do so, you need to manually enter the connection string (keep a copy of connection strings and passwords for ggadmin users before restoring).
5) Start OCI GoldenGate Extract and Replicat process
Now you are ready to start the OCI GoldenGate Extract and Replicat processes. Before starting those processes, you need to check the current CSN (Commit Sequence Number) in the checkpoint table of the replicat process. In order not to lose any transactions that occurred in the meantime in the source DB, Extract and Replicat processes must start from CSN immediately following the one specified in the CHECKTABLE of the target ADW.
You need to verify the CSN in the checkpoint table of the target ADW for your replicat process (REP in this case):
Checking the CSN of the target ADW:
Then you can start the Extract and Replicat processes with the proper CSN:
Start Extract process with the start point being after the CSN of the target ADW:
Start Replicat process with the start point being after the CSN of the target ADW:
The OCI GoldenGate deployment is now active in RegionB with all the processes properly updating the target ADW with changes starting from the ones that occurred during the failure.
RegionB, OCI GoldenGate processes up and running:
As an example, with a query against the SRC_REGION table in the schema SRC_OCIGGLL of the target ADW, we can see the record inserted in the source database during the unavailability of the workload processes.
RegionB, record inserted during unavailability in the SourceDB, recovered in the target ADW table:
Possible Enhancements
Automation Available
The five workload recovery steps listed above can be automated by leveraging the OCI and GoldenGate APIs.
Using Rest API, SDKs or oci-cli commands you can automatically:
1) Perform ADW Switchover
2) Start OCI GoldenGate deployment
3) Restore OCI GoldenGate deployment
4) Update the connection assigned to the OCI GoldenGate deployment
For more information, please refer to this Oracle documentation:
Regarding Steps #5 (Start OCI GoldenGate Extract and Replicat process), you need to perform a couple of operations:
- Check the CSN of the Replicat process in the checkpoint table of the target DB. To do this, you can query directly the DB table or call an Oracle REST API created on that purpose with Oracle Rest Data Service.
- Start the Extract or Replicat processes by leveraging GoldenGate REST API. The GoldenGate REST API endndpont can be found by running the following command:
curl -i -X GET -u username:password -H request-header:value https://OCI GG host/path/resource-path
For more information, please refer to this Oracle documentation:
Credits
Valuable ideas, suggestions, and reviews for this post have been kindly provided by:
- Claudia Filippini, Oracle Account Cloud Engineer
- Eloi Lopez, Oracle Domain Specialist Cloud Engineer - Data Management
- Alessandro Stella, Oracle Account Cloud Engineer