Recovering from Logical Corruption: Thinking outside the data page

By Tim Peer, eNVy Systems, Inc.

On a Monday not long ago we received an urgent call from the systems manager of a MANMAN/VMS site regarding possible corruption in their MANMAN/MANDB database. Inventory reports and utilities were failing (bug-checking), and MANMAN transaction data displayed inconsistent values. The systems manager believed the corruption had probably occurred during or following a database recovery performed two weeks earlier. Apparently an operator had encountered a “database corruption” error and decided to use established database recovery procedures: a DBO/Restore and After Image Journals (AIJs). The manager was uncertain as to the exact recovery steps that had been taken; the only concern now was that the database was broken and needed to be repaired.

The client’s system, a MicroVAX running CODASYL DBMS 4.0-3, had insufficient capacity to run the database Verify utility (verification under V4.0-3 does not support set optimization). The client estimated that, under the best conditions, the Verify would take many days to complete. The database backup file was placed in a save-set and uploaded to eNVy. We then converted the broken database to version 7.0 and began preparations for the analysis/Verify.

Let the verification begin. The DBO/Verify uncovered corruption in PRTREC, QOHREC, FORREC, INVREC, LOCREC, LOTREC, WARREC and FORREC. We identified the underlying corrupt data and provided a report of the affected inventory accounts, parts, and locations.

We observed set pointers that pointed to non-existent database lines, and member records that did not belong to their listed owners (the owners’ chains were complete and did not include the “member” in question). The combination of bad data, broken chain sets, orphaned records and confused record ownership required analysis beyond that of the corrupt pages. It would be necessary to check the consistency of all records that participated in the questionable sets, matching every owner, next and prior pointer to see whether it pointed back correctly. We found that the corrupt data appeared to be localized to MFG records only.

A database recovery plan is often predicated on the type of corruption present. Most corruption is handled using the recovery steps outlined in the Oracle CODASYL Database Administration Guide, wherein a database backup is restored and transactions are rolled forward from the database’s AIJ file. We were not able to use this procedure in this case, and when we discussed the recovery plan options with the systems manager, it was soon determined that a manual recovery would be needed.

The manual repair option, combining data entry with a database repair, was accepted by the systems manager as the best approach for this corruption event. MFG users no longer trusted the values reported by MANMAN transactions, so our first priority was to restore that confidence.

We advised the cost accountant to create an alternate inventory account for each location that was affected by the corruption. Inventory locations would be counted and inventory transactions executed to move the inventory from the corrupt location to the alternate location. The remaining records in the set chain should then include only those records requiring repair. Once repaired, a complete physical inventory of all warehouses and locations would be performed. The physical inventory would provide a baseline for valid inventory data and an inventory quiet-point from which inventory adjustments could be reconciled.

We were unable to pinpoint the cause of the corruption to a specific activity, hardware, software or operator error, and since we suspected that we had a mix of recent and old corruption, analysis time would be required to resolve this event. We found over 300 corrupt FORREC records alone from an earlier corruption event. According to the DBO/Verify output, corruption affected thousands of dbkeys. Historically, we have used automated procedures to apply database repairs. In this case, however, the nature of the corruption required not only manual analysis but manual repair as well.

A database de-corruption strategy. We were tasked with analyzing output from the DBO/Verify utility. Although it reported various areas of corruption, we required more detail than the Verify itself could provide. Corrupt records would require various tests in which each record and all connected records could be checked, and DBO/Verify alone would have taken far more time to analyze and reconstruct broken record relationships. We therefore developed a Consistency Check tool that thoroughly analyzes DBMS records and chain sets. For long chains consisting of hundreds or more records, broken records were identified along with “corrective action hints” used to formulate the repair.

The DBO/Verify output:

%DBO-F-BADOPTREC, area LOCAREA, page 6117, line 4 (4:6117:4)
    pointer cluster for set 31 (LOCLOTSET)
    its owner dbk 4:6117:1 may be wrong
    not pointed back by prior record 4:6117:1
    owner dbk of prior record may be wrong
    prior record 4:6117:1 may be corrupted
    not pointed back by next record 4:6117:1
    owner dbk of next record may be wrong
    next record 4:6117:1 may be corrupted
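Each 4:6117:n value above is a dbkey written in area:page:line form. As a minimal illustration only (this is not eNVy's code, and the type and function names are ours), a dbkey in that textual form could be parsed ahead of any consistency testing roughly as follows, in C++ to match the tooling described later:

#include <cstdio>
#include <string>

// Illustrative dbkey holder: area, data page, and line (record slot) on the page.
struct Dbkey {
    unsigned area = 0;   // storage area number
    unsigned page = 0;   // data page within the area
    unsigned line = 0;   // line (record slot) on the page
};

// Parse a string such as "4:6117:4" into its three components.
// Returns false if the text does not match the area:page:line layout.
bool ParseDbkey(const std::string& text, Dbkey& out)
{
    return std::sscanf(text.c_str(), "%u:%u:%u",
                       &out.area, &out.page, &out.line) == 3;
}

int main()
{
    Dbkey k;
    if (ParseDbkey("4:6117:4", k))
        std::printf("area %u, page %u, line %u\n", k.area, k.page, k.line);
    return 0;
}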

Constructing Consistency Check. The program was written in Microsoft Visual C++ on the Microsoft Windows NT/2000 platform. The following general steps were performed:

1. Build consistency logic into Consistency Check to test and report broken dbkeys (a sketch of these checks follows this list):
   a. Inconsistent owner keys
   b. Inconsistent next and prior keys
   c. Orphaned keys
2. The GUI permits navigation of adjacent (owner, next and prior) keys via point-and-click of the mouse.
3. Generate chain-set reports from which broken chains could be reconstructed.
4. Open and process the entire DBO/DUMP output of the storage area:
   a. Extract all dbkeys for the questionable set types.
   b. Implement an algorithm to support hashed-unique key access to hundreds of thousands of keys (nine set types were affected in this event); Consistency Check needed to be optimized to execute on very large storage areas.
5. Generate specific messages for Consistency Check failures.
6. Use a sort key to order the dbkeys and simulate a CODASYL DBMS REALM walk of the storage area; keys were analyzed in REALM storage order.
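To make steps 1 and 4 concrete, the sketch below (in C++, matching the Visual C++ implementation described above) loads records into a hashed-unique key table and runs the three basic pointer tests over it. The structure, field widths, names and messages are assumptions for illustration; this is not eNVy's actual Consistency Check code.

#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Hypothetical record image extracted from a DBO/DUMP listing: each member
// of a chain set carries its own dbkey plus owner, next and prior dbkeys.
struct SetPointers {
    uint64_t self;    // this record's dbkey
    uint64_t owner;   // dbkey of the set owner
    uint64_t next;    // dbkey of the next member in the chain
    uint64_t prior;   // dbkey of the prior member in the chain
};

// Pack area:page:line into one 64-bit value usable as a hashed-unique key
// (the field widths here are an assumption, not the on-disk format).
uint64_t PackDbkey(unsigned area, unsigned page, unsigned line)
{
    return (uint64_t(area) << 48) | (uint64_t(page) << 16) | uint64_t(line);
}

// Step 1 tests: inconsistent owner keys, inconsistent next/prior keys,
// and orphaned keys, run over every record extracted from the dump.
void CheckConsistency(const std::unordered_map<uint64_t, SetPointers>& keys)
{
    for (const auto& entry : keys) {
        const SetPointers& rec = entry.second;

        // Orphaned key: the record's owner is missing from the dump entirely.
        if (keys.find(rec.owner) == keys.end())
            std::printf("%llx: owner key not found (orphaned key)\n",
                        (unsigned long long)rec.self);

        // Inconsistent next key: the next record must point back via prior.
        auto nextIt = keys.find(rec.next);
        if (nextIt != keys.end() && nextIt->second.prior != rec.self)
            std::printf("%llx: next record does not point back to this one\n",
                        (unsigned long long)rec.self);

        // Inconsistent prior key: the prior record must point forward via next.
        auto priorIt = keys.find(rec.prior);
        if (priorIt != keys.end() && priorIt->second.next != rec.self)
            std::printf("%llx: prior record does not point back to this one\n",
                        (unsigned long long)rec.self);
    }
}

int main()
{
    std::unordered_map<uint64_t, SetPointers> keys;

    // Two-member chain mirroring the flag message in the report below:
    // record b's prior pointer does not point back to record a.
    SetPointers a{PackDbkey(5, 9026, 1),   // self
                  PackDbkey(5, 9026, 1),   // owner (owner of its own chain)
                  PackDbkey(2, 12034, 3),  // next  -> b
                  PackDbkey(2, 12034, 3)}; // prior -> b
    SetPointers b{PackDbkey(2, 12034, 3),  // self
                  PackDbkey(5, 9026, 1),   // owner
                  PackDbkey(5, 9026, 1),   // next  -> a
                  PackDbkey(2, 139, 1)};   // prior -> wrong dbkey
    keys[a.self] = a;
    keys[b.self] = b;

    CheckConsistency(keys);
    return 0;
}

A production pass of this kind would also attach the corrective-action hints mentioned earlier to each failure and order the keys to simulate the REALM walk of step 6.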

eNVy CODASYL Check Report:

Copyright (c) eNVy Systems Inc. (2002), All Rights Reserved, Worldwide
CODASYL DBMS DBKEY Report                           Date: June 29 2001

Current Dbkey: 5:9026:1    Set Type: 51

Flag Messages:
1) Next record (2:12034:3) does not point back to this one.

=======================================================================
   Prior Ptr        DBKey Page       Next Ptr        Owner Ptr
=======================================================================

Errant Key List:
1) 2:139:1          5:104:2          2:139:1         2:139:1
2) 2:175:5          5:131:1          2:175:5         5:131:2
3) 5:131:15         5:131:16         2:175:5         2:175:5
4) 2:249:1          5:187:1          2:249:1         5:187:2

Corruption identified. On Saturday we executed Consistency Check, which processed an ASCII dump of the storage area in approximately 10 hours. The number of owners per set type varied; in all, over 200,000 owner records spanning nine set types were examined, and only the dbkeys that failed a consistency check were reported. An estimated 2,000,000 rows of data were processed.

Problem dbkeys were written to a list box in the Consistency Check application. From there, we were able to navigate the problem records by selecting a problem dbkey, and we walked the bad chains by selecting the associated PRIOR, NEXT and OWNER dbkeys. At each point in the navigation, messages about the corruption and potential corrective action were displayed. When this pass was complete, we were able to examine and navigate the dbkeys and reconstruct all broken records.

On the following day, we repaired our copy of the client’s database. We notified the system manager and scheduled the repair of their production database. Based on the previous repair session, we estimated that six hours would be needed to repair the database. We were given 13 hours to complete the repair and perform the post-repair operations. We completed the repair, the database backups and the DBO/Verify of all affected sets in about 13.5 hours. The system manager was advised to perform an application unload/load (MANMAN ULE) when time permitted.

The scope of the corruption spanned nine set types and eight record types and included over 500 incidents of corruption. Using the temporary fix, manufacturing operations were essentially restored within three days of our having been notified. Within eight days, the database was repaired and the plant resumed normal operations. Because of the Consistency Check application, we were able to effect the repair very quickly.

Our final recommendation was that, should they encounter another “database anomaly” in the future, they might benefit more by contacting a database recovery specialist at the outset, instead of first attempting a recovery in-house.

Tim Peer works with eNVy Systems, Inc., located in San Jose, California. eNVy Systems provides software and services for Oracle CODASYL DBMS, Oracle Rdb and MANMAN. eNVy is the sole provider of data warehousing and data replication tools for MANMAN and CODASYL DBMS. eNVy services include remote DBA, system management, OpenVMS support, and database recovery and repair for VAX and Alpha systems. Tim can be reached at peert@envysys.com, voice: (408) 363-8896.