
ABSTRACT

 

 

INTRODUCTION

 

 

Database failures are common and can have severe consequences for any organization. A database failure is a nightmare for CTOs and IT managers, especially when the database serves as the organization's main data store. No matter how meticulous your DBA is or how reliable your system is, database failures can still occur. The risk can be carefully managed, but it is impossible to rule out failure entirely. Broadly speaking, there are several types of database failure; each is explained below with scenarios illustrating what can go wrong and how.

 

File Corruption

Databases may fail at the file level, which means one or more files in the database have become damaged, causing corruption. Corrupted files represent logical damage to the database and to the data on the hard drive. Do-it-yourself data recovery may overwrite the damaged data and result in permanent data loss, so if you have no experience with data recovery, it is best to call an expert. Logical data recovery may cost less than you expect, and you could get your data back in less than 24 hours.

 

Instance Failures

Instance failures occur when an instance shuts down without the database files being synchronized to the same system change number (SCN), so recovery operations are required the next time the instance starts. These failures are usually outside the DBA's direct control. Common causes include power outages, server hardware failure and the failure of an Oracle background process. In this case, database startup automatically triggers instance recovery.
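
The SCN comparison described above can be sketched in a few lines of Python. This is a simplified model, not Oracle's actual startup logic: the DataFile class, the function name and the file names are hypothetical, and a single checkpoint SCN per file stands in for the richer header information Oracle keeps.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    name: str
    checkpoint_scn: int  # SCN recorded in the file header (simplified model)


def needs_instance_recovery(control_file_scn: int, datafiles: list) -> bool:
    """Return True when any data file is not synchronized to the control file SCN."""
    return any(f.checkpoint_scn != control_file_scn for f in datafiles)


# Hypothetical state after a power outage: users01.dbf lags behind.
files = [DataFile("system01.dbf", 1050), DataFile("users01.dbf", 1042)]
print(needs_instance_recovery(1050, files))  # True
```

When the check returns True, the model corresponds to the situation in which startup has to run recovery before the database can be opened.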

 

Media Failures

Media failures are regarded as the most dangerous: data may be lost outright if proper backup procedures are not followed, and recovery takes much longer than for other failure types. A typical example is a disk controller failure or disk head crash, which causes all databases residing on that disk to be lost. Essential files that can be lost in this type of failure include data files, control files and redo log files. The database can become corrupt in this way for several reasons:

·         Failure of a disk drive

·         Failure of a disk controller

·         Deletion or corruption of a database file

Because this is a more serious kind of failure, all data can be lost or corrupted if an appropriate backup process is not followed. When a disk head crashes, the data also becomes inconsistent and corrupt, possibly to the point where it is not recoverable.

 

Network Failures

Network failures can occur in a client-server configuration or in a distributed database system where multiple database servers are connected by communication networks. Network failures such as communication software failures or aborted asynchronous connections interrupt the normal operation of the database system. For example, the listener process on an Oracle 12c server can fail, or the network card on the server itself can fail.

 

Transaction/Statement Failures

A transaction or statement failure is the inability of the database to execute a SQL transaction. A transaction can contain multiple statements, and any one of them may fail to execute for a variety of reasons. Typical examples are selecting from a table that does not exist, inserting data into a table whose structure does not match, or running out of space in a tablespace. Statement failures normally cause error codes and messages to be reported through the application software. A failed statement is rolled back and has no effect. Recovery from these failures is automatic, and control is returned to the user, who can re-execute the statement. They are typically caused by logical errors in the program accessing the database, which cause one or more transactions to fail.

These failures are less critical than network or architectural failures, because the database recovers from them immediately and simply does not apply the failed transaction.
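
As a rough illustration of a statement failure, the sketch below uses Python's built-in sqlite3 module with an in-memory database. SQLite only stands in for Oracle here, and the table names and the run_transaction helper are hypothetical. Note one assumption in particular: in this sketch the application chooses to roll back the whole transaction when one statement fails, whereas Oracle itself rolls back only the failing statement and leaves the decision about the rest of the transaction to the application.

```python
import sqlite3


def run_transaction(conn, statements):
    """Run all statements as one transaction; roll back if any of them fails."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            for sql in statements:
                conn.execute(sql)
        return "committed"
    except sqlite3.Error as exc:
        return f"rolled back: {exc}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100)")
conn.commit()

result = run_transaction(conn, [
    "UPDATE accounts SET balance = balance - 50 WHERE id = 1",
    "INSERT INTO missing_table VALUES (1)",  # fails: the table does not exist
])
balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
print(result)
print(balance)  # 100, because the UPDATE was rolled back as well
```

The error message returned to the caller plays the role of the error code that the application would report to the user before the statement is retried.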

 

 

 

 

 

Database Backup and Recovery

Backup and recovery are a vital part of business continuity: any database in use needs a backup and recovery strategy so that it can recover from corruption or failure. The sections below cover topics relating to data loss and the types of backup and recovery available in an Oracle database, along with the components used within a backup and recovery structure.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Backup and Recovery Scenarios and Techniques

Database administrators are primarily responsible for ensuring that a comprehensive backup plan for the database is in place. The backup plan for an Oracle 12c RDBMS should cover the following areas and techniques:

 

List of components that need to be backed up:

o   OS software – An event such as a hardware failure will require a complete system restore, starting with the OS, so the database server OS should be backed up initially and after any system updates or configuration changes.

o   RDBMS software – The RDBMS software should be backed up initially and after any patches or upgrades.

o   Application software, where applicable – This applies especially to Oracle E-Business Suite, Oracle Application Server and Oracle Enterprise Manager (OEM). The application DBA should complete an initial full backup of the applications to disk using an appropriate OS command and then schedule future incremental backups, e.g., after any patches or upgrades. These backups should also be transferred to tape.

o   Passwords – All superuser passwords that may be required during recovery should be preserved. It is also good practice to change the default passwords that came with the initial installation of the RDBMS.

o   All components of Oracle databases:

§  Database parameter file – A parameter file or server parameter file (SPFILE) defines the persistent initialization parameters of a database, including information about the database control files.

§  Database control file(s) – The control file stores the status of the physical structure of the database. If it becomes unavailable, the database cannot operate. It is imperative that these files be backed up along with the other components of the database. In later versions of Oracle (9i onward), the DBA can configure automatic backup of the parameter file and the control file to ensure that they are backed up after each backup and after any structural change in the database.

§  Database data files – These should be backed up during cold backups as well as during online backups, using Oracle’s Recovery Manager (RMAN) or, in Oracle versions that predate RMAN, by putting tablespaces in backup mode. The DBA should try to run all production databases in archive log mode so that recovery to the point of failure is possible.

§  Redo log files and archived redo logs – While making a cold backup, the DBA needs to back up the redo logs. When the database is running in archive log mode and an online backup is being made, the DBA needs to archive the redo logs manually or automatically and then back up all archived redo logs.

§  Oracle network files – It is important to back up all Oracle network files initially and after any change.

Determine the appropriate backup type to use for your data.

Oracle databases:

Logical backups – The whole database, individual schemas, tables or tablespaces can be backed up. Restore is done using “imp” or Data Pump. With such backups, recovery to the point of failure is not possible.
Physical offline or cold backups – The database must be shut down and a copy must be made of all essential data files and other components of the database.
Physical online or hot backups – This method enables the database to be backed up while it is up and running. The following points should be kept in mind while doing online backups:

Either put the tablespaces in backup mode and back up the associated data files using an OS copy command, or use RMAN, a robust tool that Oracle has provided for backup and recovery from version 8.x onward. Oracle adds new functionality to this tool with each version. RMAN can use the database control file to keep its catalog information, or the DBA can set up a catalog schema for each database in a separate database used for RMAN catalogs.

 

 

 

Establish an appropriate backup schedule and window – It is good practice to select a backup window when database activity is at its lowest, so that the backup does not consume database server resources and slow down users' activity. The DBA can shorten the backup window by parallelizing backups across multiple channels; however, the DBA must check the version and edition of the database to confirm that this option is available. In most cases, it is best to set up a weekly backup cycle starting with a full backup on Friday night or Saturday morning, followed by incremental/differential backups throughout the weekdays. Archive/transaction log backups can be scheduled every few hours, depending on the volatility of the database.
Decide where to store backups – Both Oracle and MS SQL Server databases can be backed up directly to tape or to disk (locally or over the network), after which the disk backups can be archived to tape. It is good practice to back up to disk, transfer to tape and store the tapes offsite for disaster recovery (DR). Backups to disk are faster, and DBAs have more control over them and can monitor them better; with this method, DBAs also hold two sets of backups, one on disk and the other on tape. If the backups are still on disk at restore time, the restore will be faster, reducing mean time to recover (MTTR).
Develop a backup retention policy – The backup retention policy relates to both the disk and tape rotation schedule and should be decided on the basis of the SLA established with the business-user community. The data owner should specify the retention period for the data. The retention period may vary from months to years, depending on local laws. The DBA should delete old backups accordingly to create space for current backups. The data retention policy should be chosen carefully, making sure that it complements the retention policy of the backup media subsystem and the requirements of the backup and recovery strategy. If a recovery catalog is not used, the DBA must ensure that the CONTROL_FILE_RECORD_KEEP_TIME initialization parameter is at least as long as the retention period.
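
A retention policy of this kind can be sketched as follows. This is a simplified model of a recovery-window rule, assuming that everything older than the most recent full backup taken on or before the start of the window is no longer needed; it is not RMAN's actual algorithm. The Backup class, the function names and the dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Backup:
    taken_on: date
    kind: str  # "full" or "incremental"


def obsolete_backups(backups, today, window_days):
    """Return backups not needed to recover to any point inside the window."""
    window_start = today - timedelta(days=window_days)
    fulls = [b for b in backups if b.kind == "full" and b.taken_on <= window_start]
    if not fulls:
        return []  # no full backup old enough: nothing can be dropped safely
    anchor = max(fulls, key=lambda b: b.taken_on)
    return [b for b in backups if b.taken_on < anchor.taken_on]


def keep_time_is_sufficient(keep_time_days, window_days):
    """Control file records must be kept at least as long as the retention window."""
    return keep_time_days >= window_days


history = [
    Backup(date(2024, 3, 8), "full"), Backup(date(2024, 3, 12), "incremental"),
    Backup(date(2024, 3, 15), "full"), Backup(date(2024, 3, 19), "incremental"),
    Backup(date(2024, 3, 22), "full"), Backup(date(2024, 3, 26), "incremental"),
]
old = obsolete_backups(history, today=date(2024, 3, 29), window_days=7)
print([b.taken_on.isoformat() for b in old])
print(keep_time_is_sufficient(keep_time_days=7, window_days=30))  # False
```

With a seven-day window, the full backup of 22 March is the oldest one still required, so the four earlier backups are candidates for deletion.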

Effective Backup Management

After making a solid backup plan and completing initial work, the DBA should properly manage backups, keeping the following points in mind:

Automating backups – For Oracle, either schedule backups through OEM or use an OS scheduling tool, and spool the output to a log file that can be reviewed for errors. In SQL Server, use Maintenance Plans for scheduling backups.
Monitoring backups – Set up monitoring using appropriate tools so that the DBA receives an e-mail or an alert through a pager or cell phone for any failed backup, which should be rerun as soon as possible.
Backup logs and catalogs – Review backup logs and backup catalog information periodically for any issues. Use RMAN reporting to show backup status. For Oracle, back up the RMAN catalog database by exporting all catalog schemas periodically, as well as by doing an export backup of the RMAN catalog schema at the end of each backup. For SQL Server, back up the system databases, especially master and msdb.
Database catalog maintenance – With Oracle databases, use “delete obsolete” to remove backups that fall outside the organization’s retention policy. If obsolete backups are not deleted, the catalog will continue to grow and performance will become an issue. Cross-checking (“crosscheck backup”) verifies that the catalog or control file records match the physical backups.
Validating backups – Validate and verify backups without doing actual restores.
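
The idea behind cross-checking can be sketched in Python as a comparison between the catalog's list of backup pieces and what is actually on disk. This is only an illustration of the concept, not how RMAN is implemented; the crosscheck function and the file names are hypothetical, while the AVAILABLE and EXPIRED labels follow the status names RMAN reports.

```python
import os
import tempfile


def crosscheck(catalog, exists=os.path.exists):
    """Map each cataloged backup piece to AVAILABLE or EXPIRED."""
    return {path: ("AVAILABLE" if exists(path) else "EXPIRED") for path in catalog}


with tempfile.TemporaryDirectory() as backup_dir:
    present = os.path.join(backup_dir, "piece_01.bkp")
    missing = os.path.join(backup_dir, "piece_02.bkp")  # cataloged but never written
    open(present, "wb").close()
    status = crosscheck([present, missing])
    print(status[present], status[missing])  # AVAILABLE EXPIRED
```

Pieces marked EXPIRED are the ones whose catalog records no longer correspond to a physical backup and therefore need attention.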

 

 

 

 

 

RMAN (Oracle Recovery Manager)

RMAN (Oracle Recovery Manager) is used to perform backup, restore and recovery operations on an Oracle 12c database. It is a utility built into Oracle databases to automate backup and recovery, and it includes features that are unique to the program. RMAN is required in order to perform incremental backups.

RMAN is used predominantly by database administrators to protect data in the Oracle 12c database, rather than requiring backup administrators to initiate data protection.

What does RMAN do?

RMAN automates administration of backup strategies and ensures database integrity. Block-level corruption detection is provided during backup and restore. Backup techniques such as parallelization of backup/restore data streams, a backup files retention policy and a detailed history of backup operations are supported.
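
Block-level corruption detection generally relies on a checksum stored with each block. The sketch below shows the principle only, assuming a CRC32 checksum stored in a 4-byte prefix; Oracle's actual block format and checksum algorithm are different, and the function names are hypothetical.

```python
import zlib


def write_block(payload: bytes) -> bytes:
    """Prefix the payload with a 4-byte CRC32 checksum (assumed layout)."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def is_corrupt(block: bytes) -> bool:
    """Recompute the checksum and compare it with the stored one."""
    stored = int.from_bytes(block[:4], "big")
    return zlib.crc32(block[4:]) != stored


good = write_block(b"row data")
bad = good[:-1] + bytes([good[-1] ^ 0x01])  # flip one bit in the payload
print(is_corrupt(good), is_corrupt(bad))  # False True
```

A backup tool that verifies such a checksum while reading each block can report corruption at backup time instead of discovering it during a restore.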

RMAN handles the underlying maintenance tasks that must be performed before or during any database backup or recovery. The types of backup and recovery tasks it can carry out include:

–          Incremental backups

–          Block media recovery

–          Binary compression

–          Encrypted Backups
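
The first item in the list, incremental backup, amounts to copying only the blocks that changed after the previous backup. A minimal sketch, assuming each block carries the SCN of its last change; the dictionary layout and the function name are hypothetical.

```python
def incremental_backup(blocks, since_scn):
    """Return only the blocks whose last-change SCN is newer than since_scn."""
    return {
        block_id: (scn, data)
        for block_id, (scn, data) in blocks.items()
        if scn > since_scn
    }


# block id -> (SCN of last change, block contents)
datafile = {1: (900, b"a"), 2: (1010, b"b"), 3: (1025, b"c")}
level1 = incremental_backup(datafile, since_scn=1000)
print(sorted(level1))  # [2, 3]
```

Because block 1 has not changed since SCN 1000, it is left out, which is what keeps incremental backups smaller and faster than full ones.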

RMAN works with a client and a target database. The backups are taken of the target database, while the client is the application that manages the backup and recovery process. RMAN copies files to the destination specified by the user, using an API to communicate with the backup hardware. Backups can be created as backup sets on disk or tape.

The important features of RMAN

Backup sets – Oracle RMAN stores data in image copies or backup sets, which are made up of backup pieces. A backup piece is an RMAN-specific binary file that only RMAN can create or restore. Backup pieces are grouped into a backup set, allowing DBAs to protect multiple data files, control files, server parameter files and archive logs together. RMAN can encrypt and decrypt data written to backup sets.

Archived redo logs – Redo logs are another key piece of an RMAN backup. A redo log stores all changes made to a database, and every Oracle database has an associated redo log. Filled groups of redo log files can be saved in an archived redo log, which can be stored off-site. Archived redo logs make it possible to restore a database from an inconsistent backup, that is, a backup taken while the database is open or after it was not shut down normally. Because inconsistent backups are allowed, backups can be taken while the database is open. Oracle RMAN can also take consistent backups, which are made after the database has been shut down normally. A consistent backup does not require media recovery when the database is restored.

Flash recovery – RMAN backups are created in the Oracle database flash recovery area (FRA) on disk. The FRA is a directory that contains online and archived redo logs, flashback logs, control files and image copies. When disk space is required for new backups, the Oracle database removes backups that are no longer needed to make room. A DBA sets policies to determine which FRA files are obsolete and can be safely deleted. Files that have been moved to tape are also candidates for removal when disk space is needed. Using a flash recovery area can save time because a DBA does not have to manually delete files to make room for new backups.
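
The space-reclamation behaviour described above can be modelled roughly as follows. This is a sketch under stated assumptions: files are considered in list order, a file is eligible for deletion if it is obsolete under the retention policy or has already been copied to tape, and deletion stops once enough space has been freed. The FraFile class, the reclaim function and the file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FraFile:
    name: str
    size_mb: int
    obsolete: bool = False  # no longer needed under the retention policy
    on_tape: bool = False   # already copied to tape


def reclaim(files, needed_mb):
    """Delete eligible files until needed_mb is freed; return (kept, deleted names)."""
    kept, deleted, freed = [], [], 0
    for f in files:
        if freed < needed_mb and (f.obsolete or f.on_tape):
            deleted.append(f.name)
            freed += f.size_mb
        else:
            kept.append(f)
    return kept, deleted


area = [
    FraFile("full_old.bkp", 500, obsolete=True),
    FraFile("full_current.bkp", 300),
    FraFile("arch_0412.arc", 400, on_tape=True),
    FraFile("incr_old.bkp", 200, obsolete=True),
]
kept, deleted = reclaim(area, needed_mb=800)
print(deleted)  # ['full_old.bkp', 'arch_0412.arc']
```

The current full backup is never touched, and the last obsolete file survives because the first two deletions already freed the space that was needed.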

RMAN Oracle Flashback Database and Media Recovery – RMAN can restore data through Oracle Flashback or media recovery. Flashback enables point-in-time recovery, which lets DBAs return the database to a previous time, and is used for data corruption and user errors. Media recovery is used to recover from media failures: archived redo logs and online redo logs are applied to restored data files to bring them up to date. Datafile media recovery can recover a single lost or damaged file; block media recovery can restore a few blocks of data while the rest of the database files remain available.
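
Rolling a restored data file forward with redo can be illustrated with a toy model. The sketch assumes that a data file is a dictionary of rows stamped with the SCN of its backup and that each redo record sets one key to a new value; the recover function and its until_scn parameter (which stops recovery at an earlier point in time) are hypothetical and far simpler than real redo application.

```python
def recover(restored, redo, until_scn=None):
    """Roll a restored data file forward by applying redo records in SCN order."""
    backup_scn = restored["scn"]
    rows = dict(restored["rows"])
    scn = backup_scn
    for change in sorted(redo, key=lambda c: c["scn"]):
        if change["scn"] <= backup_scn:
            continue  # change is already contained in the backup
        if until_scn is not None and change["scn"] > until_scn:
            break  # stop at the requested point in time
        rows[change["key"]] = change["value"]
        scn = change["scn"]
    return {"scn": scn, "rows": rows}


backup = {"scn": 100, "rows": {"a": 1}}
redo_log = [
    {"scn": 90, "key": "a", "value": 0},
    {"scn": 110, "key": "a", "value": 2},
    {"scn": 120, "key": "b", "value": 5},
    {"scn": 130, "key": "a", "value": 9},
]
print(recover(backup, redo_log))                 # complete recovery to SCN 130
print(recover(backup, redo_log, until_scn=120))  # recovery stopped at SCN 120
```

Complete recovery applies every change made after the backup, while stopping at SCN 120 leaves out the last change, which is the behaviour wanted when that change was the user error being undone.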

Why choose RMAN?

RMAN is the preferred backup tool for all Oracle 12c databases. It is, however, possible to back up a cold database without using RMAN, where cold means that the database is shut down and not mounted. An unmounted database can still be backed up at the file level without any database-level tools. Database administrators are responsible for managing RMAN through its commands, which are used for backup, conversion of data files, data manipulation, recovery catalog maintenance, backup encryption and other recovery tasks.

 

 

 

 

 

 

 
