Oplog divergence madness

I never thought I’d ever blog about ClearCase, but I have to write about the latest issue I had.

One of the sites I maintain has a couple of VOBs (Versioned Object Base; other version control systems call it a repository) which got corrupted. Someone from the NAS team (no, not your home NAS) tweaked the NFS parameters and locked out the VOB server for a couple of seconds. Not nice.

To add more spice, the server also reboots every now and then.

"On the eighth day, God said: I need a drink." Now I need one.

To serve users better, the problem report reached me two days later (a normal-severity problem report should be fixed within 5 business days). You know how it goes: you have to fight with Level 1, then comes Level 2, where they think they can solve an issue by putting it aside.

Fortunately, I keep a week’s worth of nightly snapshots around, and I was able to restore the damaged vobs. I had no issues with the first one, but the second vob had local changes since the last good backup.

Let me digress a bit to tell you how ClearCase keeps replicated data up to date between nodes.

First off, the core principle is quite simple: you can write your own changes, but you can only read others’ changes. It comes down to ClearCase’s core change principle: you have to check out the files you want to modify, which makes them read-only for everyone else.

How can you support multiple users, then? The answer is branching. Everybody creates his or her own topic branch, which then gets merged back to the baseline. To extend the solution to multisite usage, every version, branch, and label is mastered by one replica, the only place where it can be written or modified. This simple principle is the core of base ClearCase, and it causes trouble in various places.

OK, we have a couple of replicas, now update them. For this, the vob maintains a log of operations (a.k.a. oplog) for every replica. The local replica keeps its oplog tidy, but it also stores all the other replicas’ oplogs, in case we have to update a replica with another’s changes. Yes, you can create your own update topology.

To make things more quirky, the local replica also keeps track of the oplog counter (also called the epoch number) that it thinks each of the other replicas has reached. This is how a replica can update a replica with another one’s changes.

So far it’s easy to follow, but here comes the problem: when a replica updates another, it assumes the update will be imported, and it increases the appropriate counters whether or not the packets reached their destination. This fragile system causes the most common problem with ClearCase: synchronization issues.
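To make the failure mode concrete, here is a toy model of the bookkeeping described above. It is only a sketch and has nothing to do with the real implementation: the `Replica` class, its method names, and the replica names are all made up for illustration, and a real oplog carries far more than a list of strings.

```python
from collections import defaultdict


class Replica:
    """Toy replica (hypothetical model, not ClearCase's real data structures)."""

    def __init__(self, name):
        self.name = name
        self.oplog = []  # operations that originated at this replica
        # believed[r]: how many of our operations we THINK replica r has
        self.believed = defaultdict(int)
        # imported[origin]: how many of origin's operations we really applied
        self.imported = defaultdict(int)

    def change(self, description):
        """Record a local operation in our own oplog."""
        self.oplog.append(description)

    def export_to(self, other_name):
        """Build an update packet and optimistically bump the counter."""
        start = self.believed[other_name]
        packet = (self.name, start, self.oplog[start:])
        # The weak spot: we assume the packet will arrive and be imported.
        self.believed[other_name] = len(self.oplog)
        return packet

    def import_packet(self, packet):
        """Apply a packet; refuse it if earlier operations never arrived."""
        origin, start, ops = packet
        have = self.imported[origin]
        if start > have:
            raise RuntimeError(
                f"missing operations {have}..{start - 1} from {origin}"
            )
        new_ops = ops[have - start:]  # skip anything already applied
        self.imported[origin] += len(new_ops)
        return len(new_ops)
```

If the first packet from one replica gets lost, the next one is rejected by the receiver, because the sender has already moved its counter forward. Setting `believed[...]` back to what the receiver really has and exporting again is the toy equivalent of fixing the epoch list by hand.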

You can fight them by updating your copy of a remote replica’s epoch list, either manually or by using restorepackets. In the latest version there is even a diagnostic tool which shows you what command should be run on the other replica. These methods don’t solve the problem once and for all, but they provide job security for a ClearCase admin.

Now I can be a bit more technical: the second vob’s epoch number was higher than the restored version’s. This means we have to switch to restorereplica mode, where we ask the others to give our changes back. Fortunately ClearCase has an admin guide which describes what to do in a situation like this: check the integrity of the database (where relations are stored), check the vob source pools (where file differences are stored), fix all issues, then switch to restorereplica and wait for the updates from all other replicas.

Nice. The source pool check reported 6 missing containers. Sure enough, all of them were in place. Never mind, I fixed them, and then checkvob returned no errors. Let’s jump right into restorereplica!

I started with France. Packets back and forth, looks right… or not? Importing says there is still a missing container.

At this point the server crashed, and got rebooted.

I started all the tests over. The database check went well, but checkvob showed 6 missing containers again, this time with files that were really missing. Oh my. Switch back from normal mode, fix the holes, turn on restorereplica again, then send out updates to the replicas. Done.

I started with France again. Packets back and forth. Now there’s another issue during import: the source container I just restored has some versions which are not in the database.

How can I remove a version from a source container?

It turned out the format is not too difficult. A line that starts with a caret (^) is a command. There are a handful of commands:

  • ^E: element. It gives the element’s UUID.
  • ^V: version. The version’s UUID, followed by the branch number and the version number (both in hex).
  • ^B: branch. The branch UUID, the branch number, and some other data.
  • ^I: insert lines in a version. The branch number, the version number, and the number of lines, followed by the actual lines.
  • ^D: delete lines from a version. The branch number, the version number, and the number of lines to be deleted.

It has some other commands as well, but they’re not important if you want to hack a container.
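For the curious, a container with these commands can be walked with a few lines of Python. This is a minimal sketch under loud assumptions: I assume the fields on a command line are separated by whitespace and that every line not starting with a caret belongs to the preceding command. The `Record` type and the function names are made up for this post, and none of this is a documented format.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    command: str  # the letter after the caret: "E", "V", "B", "I", "D", ...
    fields: list  # whitespace-separated fields (separator is an assumption)
    payload: list = field(default_factory=list)  # text lines that follow


def parse_container(text):
    """Split container text into command records and their payload lines."""
    records = []
    for line in text.splitlines():
        if line.startswith("^") and len(line) > 1:
            records.append(Record(line[1], line[2:].split()))
        elif records:
            # Assumption: non-caret lines belong to the previous command.
            records[-1].payload.append(line)
    return records


def versions(records):
    """Return (branch, version) numbers named by ^V records, decoded from hex."""
    found = []
    for record in records:
        if record.command == "V" and len(record.fields) >= 3:
            _uuid, branch, version = record.fields[:3]
            found.append((int(branch, 16), int(version, 16)))
    return found
```

Comparing the output of `versions()` with what the database knows is one way to spot the versions that should not be in the container. Work on a copy of the container, never on the original.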

I solved all my restoring needs, but this hack was the tip of the iceberg.

And it happened, soon enough. Every packet import now reports an oplog divergence: the replica thinks the change belongs to a different oplog entry. Now every replica is stuck; no one can update anyone else.
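In toy terms, a divergence means two replicas disagree about what a given oplog entry contains. A minimal sketch, assuming each replica’s view can be flattened into a dictionary keyed by (originating replica, epoch number); the function name and the data layout are invented for illustration:

```python
def find_divergence(local_view, incoming_view):
    """Return the oplog slots on which two views disagree.

    Each view maps (origin_replica, epoch) -> operation identifier.
    Slots known to only one side are not a divergence, just a gap.
    """
    shared = local_view.keys() & incoming_view.keys()
    return sorted(key for key in shared if local_view[key] != incoming_view[key])
```

A gap can be filled by resending packets. A slot where both sides hold different operations cannot, which is why the import refuses to go on.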

Now you can either call IBM or put an axe on the table.

After a long meeting we decided on option two. We had more than 1000 changes in one replica waiting to be sent to the other sites, and we didn’t have weeks to figure out which bit went to the wrong place.

When you have a divergence, the easiest way out is to get rid of the whole oplog. Just choose a replica which will survive and remove the other replicas one by one. When the last remote replica is removed, your vob returns to its original state. No oplogs, no export_sync records (they are a way to reset a replica’s epoch matrix, but may I ask why we have two different methods for this?), no mastership.

To do this, all you have to do is to make other replicas obsolete, and then remove them at the selected site.

Yes, sure.

At this point I learned that a vob database’s journal can hold at most 2GB of transaction data, and sometimes that’s not enough to remove the last replica. Why? Because of the enormous amount of oplogs and export_sync records.

Sadly it never told me. It just hung. After a couple of hours I got suspicious. I traced all the processes in question with truss, but all of them were just sleeping. Aha, deadlock. Then came cleaning up the corrupted transaction file droppings, all of which was described in a technote.

At this point I felt that 2.7GB was a bit too much for a database. I dumped and reloaded the database, and it shrank to under 1GB. Well, ten years is a long time to fragment a database.

Here comes another fancy feature of ClearCase: over time a lot of unnecessary data piles up. Nobody really cares who locked the vob and when (we have to lock the database every day during backup), or who applied labels. To remove this data we have the vob_scrubber utility, which weeds our garden.

Oplogs are kept forever. Export sync records are kept forever too. I changed the rules to keep only 7 days, but the scrubber failed miserably.

In a technote I had never seen before I found the solution: peel off the oplogs and export sync records in stages. Start with 120 days, then go on to 90, 60, and 30 days. Then finish off by keeping only 7 days’ worth of export sync records, and then you’ll be able to remove the last replica.
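If you want to script the staged cleanup, the schedule itself is easy to write down. This is a sketch only: `peel_steps` and `run_peel` are names I made up, the two callbacks stand for whatever you use to change the retention rules and to run the scrubber, and keeping the oplog retention at 30 days in the final step is my assumption, since the technote only called for 7 days of export sync records there.

```python
def peel_steps(stages=(120, 90, 60, 30), final_export_sync_days=7):
    """Return the retention settings (in days) for each scrubbing pass."""
    steps = [{"oplog": days, "export_sync": days} for days in stages]
    # Final pass: only the export sync retention drops further (assumption:
    # the oplog retention stays at the last stage's value).
    steps.append({"oplog": stages[-1], "export_sync": final_export_sync_days})
    return steps


def run_peel(apply_retention, scrub):
    """Apply each retention step, then scrub, one pass at a time.

    apply_retention and scrub are hypothetical callbacks supplied by the
    caller; this function only enforces the order of the passes.
    """
    for step in peel_steps():
        apply_retention(step)
        scrub()
```

The point of going in stages is that each pass removes only a slice of the records, so no single pass has to push everything through the journal at once.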

That’s where I am now, but another 9 hours have passed, and I need a cigar.