Migrating Sun Grid Engine qmaster from lx24-x86 to lx24-amd64

We shifted our Sun Grid Engine qmaster (queue master) from an old server to a new virtual machine. The old server was so old that it had a different architecture from the VM, so a little extra tweaking was needed to get things working.

Essentially, when the SGE qmaster was started on the new host it would immediately quit.

Running the qmaster under strace -f showed that there was a problem reading the database. The root of the problem was traced to the Berkeley DB that the qmaster uses to store its information: it was stored in a native format, and the architectures of the old server (x86) and the new VM (amd64) were different.
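
For reference, the diagnosis looked roughly like this (the binary path under /opt/sge follows the standard SGE layout and is an assumption here, as is the trace file name):

export SGE_ROOT=/opt/sge
export SGE_CELL=uoa-dos
# trace the daemon and anything it forks, writing the trace out to a file
strace -f -o /tmp/qmaster.trace $SGE_ROOT/bin/lx24-amd64/sge_qmaster
# the failing reads showed up against the Berkeley DB files in the spool directory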

To confirm that this really was the problem, the first step was to hobble the architecture detection and see whether the lx24-x86 qmaster would start on the new host and read the database. It did.
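
How the detection was hobbled isn't recorded here, but one way to do it, assuming the stock $SGE_ROOT/util/arch script that the startup scripts use to choose the binary directory, is to make it report the old architecture unconditionally:

cd /opt/sge/util
cp arch arch.orig
# force the startup scripts to pick the lx24-x86 binaries on the amd64 host
printf '#!/bin/sh\necho lx24-x86\n' > arch
chmod 755 arch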

From x86 to amd64

As the qmaster was now the only x86 node on the network, there wasn't much point in keeping it running on x86, so a way to convert the database files was needed. This was found in the db_dump and db_load utilities from db4-util.

On the old qmaster the database was dumped:

cd /opt/sge/uoa-dos/spool/spooldb
db_dump sge > sge.dump
db_dump sge_job > sge_job.dump

These dump files were then copied across to sge-qmaster.stat and reinstated:

# on sge-qmaster.stat, run from inside the new spool directory
db_load -f sge.dump sge
db_load -f sge_job.dump sge_job
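
The copy itself was just a transfer of the two dump files from the old host; something like this would do it (scp as the tool, and the destination mirroring the old spool path, are assumptions):

# run on the old qmaster, from the old spooldb directory
scp sge.dump sge_job.dump sge-qmaster.stat:/opt/sge/uoa-dos/spool/spooldb/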

Note that in the original directory there was a gigantic mess of files taking up roughly 16MB:

total 16M
 12K __db.001
452K __db.002
264K __db.003
 40K __db.004
236K __db.005
 12K __db.006
 11M log.0000000303
176K sge
4.7M sge_job

Only the sge and sge_job files were created by db_load. When the qmaster started for the first time, the missing files were initialised:

total 2.2M
 12K __db.001
500K __db.002
228K __db.003
 36K __db.004
380K __db.005
 12K __db.006
852K log.0000000001
120K sge
 76K sge_job

qstat showed that the job and queue information had come across correctly.
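
A couple of quick checks like these, using nothing beyond the standard SGE client tools, are enough to confirm the migrated data is being served:

qstat -f     # full queue listing, now served by the new qmaster
qconf -sel   # execution hosts the new qmaster knows about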

Changing qmaster on the exec hosts

A Puppet file resource to change act_qmaster was created. It was only a brief manifest; a more sophisticated one would softstop the daemon and then start it again with the changed setting.

file { 'act_qmaster':
  ensure  => present,
  path    => '/opt/sge/uoa-dos/common/act_qmaster',
  owner   => 'sgeadmin',
  mode    => '0644',
  content => 'sge-qmaster.stat.auckland.ac.nz',
}

Procedure for updating each exec host (a rough shell equivalent follows the list):

  1. sge_execd softstop (this leaves jobs running)
  2. puppet apply
  3. sge_execd start
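
On a typical install those three steps come down to something like this, run as root on each exec host (the init script location and the standalone puppet apply invocation are assumptions, and act_qmaster.pp is a hypothetical name for the manifest above):

/etc/init.d/sgeexecd softstop    # stop execd but leave running jobs alone
puppet apply act_qmaster.pp      # rewrite act_qmaster to point at sge-qmaster.stat
/etc/init.d/sgeexecd start       # execd comes back up and registers with the new qmaster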

As this was applied to each host, the host would disappear from the list maintained by the old qmaster and appear on the list maintained by sge-qmaster.stat. Our submit host was changed first, so that users of that system would read from the new qmaster and submit jobs to the right place.

Once we were satisfied, the qmaster was stopped on the old server, and we could prepare that system for turning off.

That worked - or did it?

Now that SGE had been switched over to the new qmaster (although not using the recommended method), the jobs continued to run.

There were some small wrinkles: not everything was reflected accurately in the accounting that the qmaster was keeping. Jobs would complain, and some wouldn't be culled off the list, remaining as phantom "running" jobs for days afterwards because they expected to report to the old qmaster. That took a bit of cleaning up. The parallel environment (PE) accounting was also off, leaving SGE thinking more slots were occupied than there really were.
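
The cleanup isn't recorded in detail, but the usual tool for clearing a job the scheduler can no longer account for is a forced delete (the job id here is a placeholder):

qdel -f 12345    # force removal of a phantom "running" job from the qmaster's list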

Restarting the execd on each host helped. Not long after this a grid-wide upgrade was pushed through, during which each host was restarted. That fixed the PE accounting problem.

Summary

Urgh.


Stephen Cope 2011-10-27
http://www.stat.auckland.ac.nz/~kimihia/sun-grid-qmaster