Friday, February 05, 2016

No Data Loss without Sync Data Guard

A few months ago I wrote about an exciting product that allows you to achieve no data loss in a disaster even without using the maximum protection Data Guard that relies on super expensive synchronous networks. Recently I had a chance to actually work on that system and replicate the scenarios. Here I explain how that actually works, with commands and actual outputs.

Background

First, you may want to refresh your memory on the concept of that product in the earlier blog http://arup.blogspot.com/2015/04/no-data-loss-without-synchronous-network.html The product is Phoenix system from Axxana. To reiterate, I am not an employee of that company nor in its payroll; I just want to make others aware of this innovative out of the box solution to a problem we are all familiar with. I usually don't write about specific products unless I feel strongly about its effectiveness to the user community.

Let's examine the issue we are facing:
  1. In a Data Guard environment, some amount of redo information may be left behind at the primary site.
  2. If the primary site meets with a disaster where nothing is left, this data is lost for ever.
  3. If this data loss is not acceptable then you have to run the Data Guard in maximum protection mode which guarantees the updates occurring on the primary to be reflected on the standby database immediately. That way the data is updated at standby site and is not affected by the loss of the primary site.
  4. However, this also means that the network has to be super fast and super reliable to ensure the updates at the standby site. Unfortunately this also means the network is super expensive. And this could also become technically challenging for networks across very large distances.
  5. This is why many corporations that can't afford the expensive network infrastructure do not implement the maximum protection Data Guard and merely accept the data loss as an unavoidable problem.
Well, until now. Axxana's product addresses the issue in a non-traditional manner. The only thing that stands between the full recovery on the standby site in case of a disaster and an expensive network is the redo information that has not been transmitted yet to the standby server. If this information were available to the standby host, you would have been able to recover it completely. The Phoenix system from Axxana is a solid state drive enclosed inside a case protected from common disaster elements such as flood, fire, bomb, etc. You put this special box at the primary site and create one member of the redo log group on its storage, as shown in the figure below. This box has a protected port of network, etc. that allows you to access its contents even if the normal ports are all destroyed by the disaster. It even has a cellular modem that allows it to be connected remotely. In fact the Axxana software can pull the necessary data from this special box remotely from the standby host and apply to the standby database. And the best thing about it is that it is the plain vanilla Data Guard you are familiar with; not any special magic that goes under the hood.
Axxana Blackbox
Fig 1. Axxana Blackbox Internals



Let's see how it works with a hands-on approach inside a lab. I have two hosts:



Host
Database
Primary
e02lora adb
Standby
e02rora1 adbsb


The storage is ASM; not a filesystem. Replication in a filesystem is relatively simple; so I deliberately chose it to see if any ASM nuances coming up. The Data Guard is configured with maximum performance, so there is no need for an expensive fast low latency network infrastructure. The setup has an Axxana Phoenix system which consists of three components:
  1. Blackbox - the "special" protected storage system.
  2. Collector - a small server at the primary site that collects the changes to the files under the Axxana protection. In this case, they are controlfile, redo logs and archived logs. I don't need to place any database files there because they are useless for recovery. Keeping this small set of files also makes it possible to put the protected files of many databases--not just one--inside a single blackbox.
  3. Recoverer - a small server at the standby site that receives the changes from the blackbox and keeps the files up to date when disaster occurs.

Setting up

First I ensured that the managed recovery process is running and the standby database is in MOUNTED mode.

Then I created a test schema in the production database.
e02lora$ sqlplus / as sysdba
SQL> grant create session, create table, create procedure, unlimited tablespace to arup identified by arup;
Then I created some objects and data inside this schema.
SQL> connect arup/arup
SQL> create table T (c1 number, c2 timestamp);
SQL> create index IN_T ON T (C1);
SQL> create or replace procedure INS_T (p_c1 in number, p_commit in number) as
2 begin
3 for i in 1..p_c1 loop
4 insert into T values (i,current_timestamp);
5 if mod(i,p_commit)=0 then
6 commit;
7 end if;
8 end loop;
9 end;
10 /

SQL> create or replace procedure DEL_T (p_commit in number) as
2 begin
3 delete T where mod(C1, p_commit)=5;
4 commit;
5 end;
6 /
SQL> create or replace procedure UPD_T (p_commit number) as
2 begin
3 update T set C1 =-1 where mod(C1, p_commit)=3;
4 commit;
5 end;
6 /

SQL> exec INS_T (400000, 5000);

SQL> commit;
After creating the test data, I switched a few log files and waited a bit to make sure the changes are replicated to the standby.

Simulating a Failure


Now to simulate a disconnected network. I didn't have any special tools with me. So, I did the next best thing: I disabled the ethernet interface.

e02rora$ su -
Password:
e02rora$ ifdown eth0

At this point the standby host will not receive the updates. All the changes made to the primary database will be located at the primary site only. I canceled the recovery process first.

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;

I made a few changes to the primary data.

SQL> connect arup/arup
SQL> exec DEL_T (500);
SQL> exec UPD_T (500);

SQL> conn / as sysdba
SQL> alter system switch logfile;

SQL> conn arup/arup
SQL> select count(*) from T

COUNT(1)
----------
399200

The table has only 399,200 records. Since the Managed Recovery Process is stopped these final changes will not be replicated to the standby. I can confirm that by opening the standby in read only mode and checking the actual tables.


On the standby host:

SQL> conn / as sysdba
SQL> ALTER DATABASE OPEN READ ONLY;

Database altered.

SQL> SELECT COUNT (1) FROM ARUP.T;

 COUNT(1)
----------
400000


The standby has all 400,000 rows, compared to 399,200 rows in primary. The discrepancy is due to the unpropagated changes to the standby. At this point I simulate a failure in the primary by killing the pmon process.

e02lora$ ps -aef|grep pmon
UID PID PPID TTY STIME COMMAND
oracle 10184 10564 pty0 16:26:17 pmon
e02lora$ kill -9 10184


The primary database is down. The data in the redo logs is lost as well since they have not been propagated to the standby system yet.

Performing Recovery

Now that we have simulated a disaster, let's see how to recover from it. Remember, since the Data Guard configuration is Maximum Performance, the database can only be recovered up to the most recent log entry. With Axxana software, however, there is additional data that can be pulled from the Blackbox (the storage unit at the primary site that would not be destroyed).  But how do you--the DBA--know what files are available at the standby site, which are still left at the primary site, and, most important, how to get those into the standby host? It gets even more complicated since the Blackbox is just a storage device; you have to mount the volumes, and mount the ASM disks, etc. These may be fun when creating a brand new database; definitely not so when you are under the gun to recover the database and bring your business online.



No worries. This is where the Axxana software comes into rescue. I spun up the provided shell script at the standby site. This script contacts the blackbox at the primary, pulls the needed data and completes the creation of necessary files at the standby site. Once the files are at the standby site, all you have to do is to perform the typical managed standby database recovery to complete the recovery process. The best part of all? The script even gives you a step by step instruction sheet along with specific files names which you can copy and paste when Oracle prompts for it. Here is how I call the script and the resultant output.

root@e02rora1 AxxRecovery# ./AxxRecovery.sh
Logging to '/home/oracle/AxxRecovery/logs/axxana.debug.log'
Calculating Optimization Parameters [done]
Attempting to connect to Axxana's Recoverer [done]

Perform Failover?

Warning!
This operation terminates the Axxana protection for all DBs.

1) OK
2) Cancel & Exit

At this point the script pauses and asks me for confirmation. I enter "1" and the script continues:


Requesting Axxana's Recoverer to 'FAILOVER' [done]

Recoverer communicates with the Black Box (ENTER_FAILOVER_START) [done]
Recoverer communicates with the Black Box (ENTER_FAILOVER_BBX) [done]
Recoverer communicates with the Black Box (ENTER_FAILOVER_CONNECT) [done]

Recoverer is in FAILOVER mode

Transferring required logs files of the database 'adb'.

The script shows me the names of the files along with their respective transfer status and the %age progress.

FILE NAME | Size in MBytes | Stage | Progress
==================================|=================|==============|===================
ADB-group_3.258.902686913 | 50 (0) | Completed | 100%
ADB-thread_1_seq_111.266.902852033| 46 (46) | Completed | 100%
ADB-group_2.257.902686911 | 50 (50) | Completed | 100%
ADB-1454516139530A12435456XT_112.C| 11 (11) | Completed | 100%
ADB-group_1.256.902686901 | 50 (50) | Completed | 100%
==================================|=================|==============|===================
Total: | 208 (159) | 5 of 5 files recovered

All required log files exist at '/home/oracle/AxxRecovery/axx_recovery_files'

Preparing user recovery scripts ...

You can safely perform the Oracle RPO=0 recovery process

Please follow the RPO=0 recovery instructions that are described in the file '/home/oracle/AxxRecovery/axx_required_files/recovery_instructions.txt'


As I mentioned, the script creates a detailed step-by-step instruction sheet to be followed for the standby recovery. I am actually glad that it does not perform a recovery automatically. That is one process you want to watch and proceed with caution. You probably have only one chance at it; and rushing through it may force you to take missteps. But at the same time you want to think as less as possible under those stressful conditions. So, the detailed instruction sheet comes handy. The last line shows the location of the instructions files. Here is how the file looks like in my case, after removing some banner items:

|---------------------------------------------------------------------------------------------------------------|
| RECOVERY INSTRUCTIONS (03/02/2016 16:26:19) |
|===============================================================================================================|
===============================================================================================================

Step 1) Output setup
====================

Please log into the standby database (as sys user)
and then run the following commands at the SQL prompt:

SQL>
SET SERVEROUTPUT ON
SET LINES 999
SET PAGES 0

Step 2) Recovery using archive log files
========================================

Please run the following commands at the SQL prompt:
* (You can safely ignore the following ORA Error 'Managed Standby Recovery not active')

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;

SQL> RECOVER STANDBY DATABASE UNTIL CANCEL;

-- when requested, feed the following file(s):

+DATA/adbsb/archivelog/2016_02_03/thread_1_seq_110.382.902852027
/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ARCHIVELOG/2016_02_03/thread_1_seq_111.266.902852033

-- finally enter 'Cancel'

SQL> CANCEL

Step 3) Switching to primary control file
=========================================
Please run the following commands at the SQL prompt:

SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP NOMOUNT
SQL> ALTER SYSTEM SET CONTROL_FILES='/home/oracle/AxxRecovery/axx_required_files/1454516139530A12435456XT_112.C' SCOPE=SPFILE;
SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP MOUNT

Step 4) Renaming data and redo log file names
=============================================
Please run the following SQL statement

SQL> @/home/oracle/AxxRecovery/axx_required_files/logAndDateFileRename.sql

Step 5) Recovery using primary control file
===========================================
Please run the following command at the SQL prompt:

SQL> RECOVER DATABASE UNTIL CANCEL USING BACKUP CONTROLFILE
-- when requested, feed the following file(s):

/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_1.256.902686901
/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_2.257.902686911
/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_3.258.902686913

-- You should now see a message saying 'Media Recovery complete'

Step 6) Open the Standby Database
=====================================================
Please run the following commands at the SQL prompt:

SQL> ALTER DATABASE OPEN RESETLOGS;

Well, it's dumbed down enough for those stressful moments associated with a standby database recovery, down to even the set pagesize commands in SQL*Plus--quite easy to forget in stressful situations. Note it shows the time it got generated at the very beginning, non-US date format as dd/mm/yyyy. Anyway, I followed the instructions step by step. Step 4 requires some attention. It shows the how to change the names of the redo logs and data files after the switchover.

For the sake of brevity I don't want to show the entire output. Here is an excerpt from the tail end of the activity:

...
... output truncated ...
...
Specify log: {=suggested | filename | AUTO | CANCEL}
/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_2.257.902686911
ORA-00279: change 5172890 generated at 02/03/2016 16:16:56 needed for thread 1
ORA-00289: suggestion : +DATA
ORA-00280: change 5172890 for thread 1 is in sequence #114
ORA-00278: log file '/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_2.257.902686911' no longer needed for this recovery

Specify log: {=suggested | filename | AUTO | CANCEL}
/home/oracle/AxxRecovery/axx_recovery_files/E02LORA1/BBX/ADB/ONLINELOG/group_3.258.902686913
Log applied.
Media recovery complete.

SQL> ALTER DATABASE OPEN RESETLOGS;

Database altered.

SQL> SELECT COUNT (1) FROM ARUP.T;

COUNT(1)
----------
399200


Voila! The output shows "Media recovery complete". The count is 399,200--the same as the number in the production database. The recovery process got those last changes. I accomplished my objective without a synchronous replication.

Summary

Recovery without data loss has always been dependent on a high throughput, low latency, ultra-reliable network. Unfortunately the cost of this infrastructure often precludes the use of the no data loss setups in organizations. The Axxana solution is innovative in the sense that it addresses the issue with a completely out of the box solution slashing costs dramatically, in fact so dramatically that most corporations will be able accomplish no data loss in Oracle Data Guard setups. At the heart of the setup is the "blackbox" which is just solid state storage with protection to withstand water, fire and bombing. When a disaster obliterates the primary site, this blockbox likely survives and through its cellular modem allows its content to be pulled to the standby site to reconstruct the important files that have not been replicated yet. Once those files are retrieved, it's quite simple to recover the standby database completely. In this blog you saw an example of how I did exactly that as a user.

Hope this helps in making some decisions about no data loss setup in Data Guard with respect to cost.

Translate