Here are some thoughts and a couple of issues raised by the handling of a 10TB data base migration. This 10gR2 data warehouse had to be migrated from a non RAC, Solaris UFS file system environment towards a 3 nodes ASM RAC. What makes this move a not-so-easy endeavor is not primarily its size, but its redo log generation rate of 1GB every minute 24 hrs/ 7 days a week, though some loading is done in NO LOGGING mode. Storage replication is not an option here because of distance and budget. So I went to Data Guard, and found out a couple of things worth sharing:
Data base duplication
The key issue I tried to workaround was the estimated 30 hours of full backup/restore coupled to the 1.8TB of redo log to transfer generated in this time frame I would have to recover on the standby data base. The archived logs were furthermore backed up and purged every 3hrs to make room.
I found this little trick to trigger the Remote File Server (RFS) to start writing on the target standby platform even BEFORE the backup process starts. Here are the steps:
- After generating a standby control file, I just started mount a dummy standby database on the target server. That’s enough for the RFS to start the log file copy.
- I ran the rman backup on the source
- When the backup was ready, I had to bounce in nomount mode the target DB to allow an rman “duplicate for standby”.
The alert file shows the RFS every 5’ the RFS restart attempts :
PING[ARC0]: Heartbeat failed to connect to standby ‘RACSTANDBY’. Error is 12528.
This is where it gets interesting, because the “duplicate for standby” mounts the standby data base, allowing the RFS to restart. The rman memory script shows :
set until scn 19415505108;
restore clone standby controlfile;
sql clone ‘alter database mount standby database’;
and into the alert file on the target system:
Redo Shipping Client Connected as PUBLIC
– Connected User is Valid
RFS: Assigned to RFS process 27224
RFS: Identified database type as ‘physical standby’
- Allow enough space for the copied log file on the target to allow the restoration and the recovery to complete.
- A Log archive gap was created in step 3. rman mounted the standby and started copying but some archive log files had to be manually copied. (Although the unchanged FAL configuration worked fine after the data guard setup).
ASM restoration bug
I used Metalink note 273015.1 as a starting point. You have to add to the primary (non RAC data base) 2 new logfile threads, 2 additional undo tablespaces for the second and third RAC instances and run catclust.sql. What this note does not say is the
ORA-00600: internal error code, arguments: [kjrfr7], , , , , , , 
you get in the ASM instance when you attempt to restore the data base (usually after several hours – grrr -). This came down as the unpublished bug 4750469, fixed in 10.2.0.3 which only went out on August 13th on Solaris x86-64, the platform I was working on. The bug can occur when oracle does a sanity checking of a resource whose memory has been released to the shared pool. I worked around the bug by “unclustering” the DB (cluster_database =false) both at the ASM and at the DB level during the restoration phase, and then put back the cluster_database to true.