Database Repair-A Case Study, Oracle Codasyl DBMS

Oracle Codasyl DBMS for OpenVMS MANMAN software provides a host of features to restore data integrity if it is every compromised. After-image journaling insures roll-forward of changes to a restored copy of the last backup. DBO/Backup allows saving a full or incremental copy of the database a quiet-point of the administrator’s choosing, and after-images journals that follow.

The path to recovery is simple and foolproof, so long as you have backups. If backups are not available, a corrupted database becomes little more than pages of costly and useless bytes on a disk.

Data corruption is often introduced by hardware, (caching) software failure and Operator error. The only tool provided by Oracle for salvaging what is left of your data when backups do not exist is the dreaded DBO/ALTER command.

Case History. In preparation for a database reload, an Operator at MANMAN shop X issued a DBO/INITIALIZE command on the forecast (FORAREA) area of their production database. Unfortunately, forecast records and planned orders existed in the area and each of which, where owned by a part record in the PRTAREA storage area.

The Problem. By initializing the FORAREA storage area, all the set clusters on part records into the FORAREA were left “dangling”, i.e., pointing to nonexistent records. As users tried to access these missing records, the MANMAN application would bug-check; this included writing a huge file to disk, which described, absolutely everything going on with the application session at that moment.

From a business perspective the error had crippled manufacturing only. Because of the way MANMAN executes, users could not use planning or purchasing functions. Users could still list and report everything except forecast and planned orders. Other MANMAN applications worked as before.

The solution to the problem, under most circumstances, would be to perform a DBO/RESTORE from a prior backup of the database and DBO/RECOVER from the database after image journal file all transactions prior to the time the DBO/INIT command was executed. Unfortunately, for shop X, no useful database backups where available to restore. The MIS administrator was new to the company and assumed the automatic backup procedure was doing its job. The backup jobs ran, but did not alert the operators that backup save-set was being overwritten. After-image journals had not been saved since the last successful backup, taken a few weeks earlier.

This database corruption created an urgent situation for business X, which demanded the fastest, resolution. X’s database is kept available 24 hours per day, 7 days each week, to users around the world. X’s management insisted that users be allowed to access the database until the problem was resolved.

Calling Compaq and Oracle provided X’s MIS staff a suggestion or two on how to proceed, including how to use DBO/ALTER. DBO/ALTER is provided with Oracle DBMS but requires arcane knowledge of the internal workings of DBMS data pages. Decimal and hexadecimal calculations must be performed on offsets within the pages to rearrange data within it and leave the page clean. DBO/ALTER is primarily an interactive tool, requiring analysis and computations of which often changes from page to page. While DBO/ALTER will verify that a page is internally consistent, it provides no assurances of leaving you with a clean, working database when you have finished.

Fortunately for business X, they had a copy of eNVy Systems’ DBMS ToolKit. They knew of eNVy’s expertise with Oracle DBMS, and knew that eNVy would be the likeliest team to respond to this crisis.

The eNVy Team has over 30 years combined experience in administration and operations of Oracle Codasyl DBMS production environments. ENVy appreciates the needs and demands of client management, and works flexibly within client company demands and restraints. They also have experience solving problems on this order. They have de-corrupted Oracle Codasyl DBMS for several clients, most of whom where restored to running production within a few days. At most sites production was allowed to continue with few interruptions while eNVy did the work.

The eNVy team uses certain disciplines in every job they take. Necessary interruption to production are kept as brief as possible. They take precautions so problems are never made worse, like taking intermittent backups as is prudent, and testing fix procedures on offline copies of the database. They automate their methodologies wherever feasible and are consistently creative, fast and accurate.

Analysis. Only a few options where available for dealing with this problem:

  • Restore a several-week-old backup and re-enter transactions by hand
  • Unload/reload the clients data to a new database, or
  • DBO/ALTER the live production database

Whichever solution was chosen, exclusive access to the database was required while fixing it was underway. A key rule in this case was that user downtime must be kept to a minimum.

The demands of up-time and the likely speed of completion suggested that we try DBO/ALTER first, use the unload/reload as a fallback, and that we use the restore option as a fallback to that. The cost, staffing and downtime required for the restore option would make it the roughest one for the company to survive.

Unload/reload would also require some significant downtime, but is assures the client of a clean, working database when finished. Standard unload routines will bug-check on a corrupt database, so a customized unload would need be developed which avoids walking the broken sets. Development time for the custom program could be expensive, too. Modifying the MANMAN standard unload is another option, but the program as it stands is a block box; research would be required before development time to modify it could be estimated.

The best first approach was to alter the database. It was easy to estimate the time required and to get immediate feedback from the process. We were able to test very quickly if we should continue or fall back to another option, and we could drop the alter process at any time with no further harm to the database.

Interlude: DBMS page internals

A DBMS database data area is a file containing user data on pages sized by the database administration during a DBO/CREATE, /RESTORE or /MOVE with the /BLOCKS_PER_PAGE qualifier.

Each page has its own internal addressing scheme using the bottom of the page as offset 0, up to the top with a hexadecimal offset equal to 512 x block_per_page – 1. A four-block page ends at offset ((512×4)-1) or 2047 decimal, 7FF hex.

At the bottom (offset 0) the page contains a header, which is a count of line indices and free-bytes on the page. A line index is a descriptor (offset and length) of a data line; lines can contain records, fragments of records and b-tree nodes.

Lines are packed against the top of the page like stalactites; low number lines typically have the highest offsets, and higher numbered lines layered just below. Free space on the page is the gap between header items at the bottom and data lines at the top.

Individual records begin with a count and length of predefined dynamic items, such as set clusters and variable-length data items. Each of these items begins with its internal lengths followed by the static-record data length and the fixed-length data items.

Lines are repacked against the top of the page (high offset) whenever any data length changes. This happens when a user adds or deletes a record, fills or empties a set cluster, or changes the length of a variable-length item. All affected offsets are recalculated as the page is remapped and a new checksum is calculated for the newly mapped pages.

Solution definition. The steps to repair the X’s data base was as follows:

  • Identify pages in the part area with part records with dangling set clusters
  • DBO/ALTER identified pages to eliminate dangling set cluster by:

 – Locating the dangling cluster in it record
 – Shifting all data below the cluster address upward by the length of the cluster
 – Updating all storage counts and pointers affected by move
 – Verify the page and compute its new checksum
 – Commit the change

  • Perform a DBO/BACKUP of the database immediately
  • unload and reload the data to a new database as soon as time allowed

DBQ, the interactive database query tool for Oracle Codasyl DBMS, was used to identify pages with records with dangling set clusters. A DBQ loop walked the PRTAREA realm to find each record owning an occurrence of the corrupted sets, and tested if the occurrence was empty. If it was not empty, i.e., it pointed to what should have been a forecast or order record, DBQ displayed the database key (DBKEY) of the CURRENT part record. Page numbers from the listed DBKEYs indicated which pages in the PRTAREA needed alteration.

A DBO/DUMP was made of each page in the list of DBKEYs, and the dump was scanned for offsets and lengths of dangling set clusters and the lines they occupied. By locating the lowest data byte and locating the start and the length of the set cluster to be eliminated, the data to be shifted was from the lowest byte up to the last data byte before the set cluster. DBO/ALTER could then move that block of data up the page by the length of the set cluster, thus overwriting the broken set cluster.

Updating all storage counts and pointers affected by the move required first changing the count and length of dynamic items on the affected line. This included decrementing the count of items in the line – there was now one less item on the line – and decreasing the line length by the length of the overwritten cluster. Second, in the indices of the line and all moved lines, offsets needed to be increased by the move distance, the length of the eliminated set cluster. Lastly, the free space on the page was increased as determined by the move distance.

Upon verifying the altered page, the only acceptable errors where checksum mismatches and bad SPAM values. Other errors required further analysis and corruption for unique situations. When only acceptable errors where seen, the displayed expected checksum value from the verify step was deposited as the new checksum and the change was committed.

Once all identified pages had the dangling set cluster removed and DBQ verified that there where no more occurrences of the corrupted set, the de-corruption effort was complete. A DBO/BACKUP of the database was taken immediately and users were allowed to resume activity. The site administrator was encouraged to perform a MANMAN unload of the database and load the data into a new database as soon as possible.

Resolution. The crisis was first described to eNVy Systems on a Sunday evening. Development and testing of repair DCL procedures were done using an off-line copy of the database. On Monday evening, procedures were ready for use on the production database.

Users were locked out and we executed the fix procedure. The count of dangling set clusters dropped from 1950 to 260. Users were then (Tuesday morning) allowed to resume work while further analysis was done. Starting late that evening, users where locked out for a few hours and the count was further dropped, down to two dangling set clusters. These set clusters were eliminated manually the following afternoon.

The job was complete with no further bug-checks. All work was completed within three days with users were able to continue working most of each of those days. In follow-up discussions with X’s database administrator, we learned that the crisis was over and we had one more relieved and satisfied customer.

 ENVy Systems provides extensive software solutions and services for corporate data warehousing, systems integration, database administration & repair and systems management.