ResourceMgrs/Cobalt
Cobalt is the Argonne-developed resource manager based on the Scalable Systems Software components. It is currently online on the Argonne BGL machine.
Where to ask questions, request help, report problems or bug fixes
support@bgl.mcs.anl.gov
Download
- Download Binary RPMs and source tarballs.
Also find the latest release and information at the Cobalt BGL web site.
Documentation
- Installation Documentation on the installation and configuration of Cobalt.
- Using Cobalt Commands, options and examples.
- MCS BGL Tutorial Running Jobs on the MCS BG/L System (BGL).
Conventions for the requested features/bugs/design and conceptual issues
- To create a new bug/feature request/design: use the next # on the specific item (i.e. bug, feature request, design) list; create a link to a new page by putting the type of item (bug, feature, design) followed by the new # followed by a pipe followed by the # again inside double square brackets; save this page; click on the new item's page link.
- Do NOT delete items (i.e. bugs, feature requests, etc). If you wish to take back an item, use the strike out.
- Sign all additions with [[User:<username>|<real name>]]
- Put all details and comments in the page associated with the bug.
- Number bugs, feature requests and design/conceptual issues with '#'.
- Place code snippets in their own section with a space at the start.
- Comments on a request, bug or design issue should be italicized with 2 single quotes and put under the item by adding a '*' to the item marker.
Requested Features List
#1 ability to specify kernel and ramdisk images on the qsub line --Susan Coghlan
#2 add queues --Susan Coghlan
#3 add #processors to the qstat output
#4 qstat -f should give long output
#5 add elapsed time for running jobs (I one-offed a perl script to do this, for anybody interested: /bgl/home1/treddy/bgl_scripts/timqstat.pl) --Tim Reddy
#6 give an example of using mpirun with qsub. For example, is the -partition argument necessary? If so, what name should be used? Are the other arguments, such as -cwd, still required? Look at the cqsub man page.
#7 /usr/local/bin/qboot (resets and flushes everything and makes things work magically)
#8 cqstat of a job id that does not (or no longer) exist should exit with a non-zero status so that things like 'while cqstat $id ; do sleep 1 ; done' will work.
Bug List
Unfixed bugs are red. Fixed bugs are green.
#1 The machine is not running obviously runnable jobs in the queue.
#2 A job submitted that can't possibly be fulfilled with the current configuration should give a warning, both after submission, and in the qstat output. -- Pete Beckman
#3 Some sort of watchdog process or timer or handler needs to be added to catch and correct things when they hang. -- Pete Beckman
Even if it can't correct the problem, detecting and then reporting the problem via the qstat command will let users know the problem has been detected, and folks notified. Either it should detect this hung state, or send email, or restart itself. -- Pete Beckman
An issue is that components hang. There isn't an easy place to put a watchdog timer at the moment, but I will try to find a place for one. -- Narayan Desai
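One way the watchdog idea discussed above could look is a heartbeat table: each component reports in periodically, and anything that has been silent for longer than a timeout is flagged so that it can be reported or restarted. This is only a sketch under assumptions, not Cobalt's design. The `Watchdog` class, its method names, and the component names used in the usage note are all hypothetical.

```python
import time


class Watchdog:
    """Track heartbeats and report components that have gone quiet.

    Hypothetical helper for illustration; not part of Cobalt.
    """

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = {}         # component name -> time of last heartbeat

    def heartbeat(self, component):
        """Record that the component is alive right now."""
        self._last_seen[component] = self._clock()

    def hung_components(self):
        """Return the sorted names whose last heartbeat is older than the timeout."""
        now = self._clock()
        return sorted(
            name
            for name, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

A monitoring loop could call `hung_components()` every few seconds and, for each name returned, send email or mark the state in the qstat output, as requested in the bug report. Where the heartbeats would come from in the real components is left open here.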
- qstat -a does not work and prints an error message. -- Bill Gropp
qstat doesn't take a -a option. What behavior are you looking for? -- Narayan Desai
The documentation on http://www.bgl.mcs.anl.gov/bgl/Documentation/ApplicationDocumentation/resourcemgr/bgl.php mentions this as the way to list all queued jobs, not just your own. The problem here may be that qstat isn't qstat - that is, a well-known and widely distributed queueing system uses qstat, and users will expect a command with a name that they know to behave in the same way that it does on other systems (see http://www.nas.nasa.gov/Groups/SciCon/Origins/Cluster/PBS/Man/qstat.html). Also see http://www.opengroup.org/onlinepubs/009695399/utilities/qstat.html. Given all of these, it may be best to think about unique names for these commands, at least within the space of queueing systems for HPC systems. -- Bill Gropp
Bill, the incorrect documentation was removed. The only ones that should be looked at are the New User's document and the Job resource manager tutorial, both accessible from the Application Documentation page. --Susan Coghlan.
That said, your point about command names is well taken. We will have the package renamed, with associated client name changes, soon. All client utilities have been renamed, prefixed with a c (for cobalt) -- Narayan Desai
Design/Conceptual Issues
- In the past, we've built versions of mpirun that understood the queuing system. For people who only need to run programs, or who are in development, this is much simpler than manually constructing the script to give to the batch system. It would be nice if there was one, system-wide such program, rather than one per group or user. Of course, the general batch system commands would be around for those needing more capabilities than a simple "run an MPI program" command. -- Bill Gropp
- One thing is that many people have scripts that know how to run MPI programs using the "conventional" syntax; it would be good if these could still run. The "wrapper" that provides the mapping from mpirun (or even better, mpiexec) to qsub allows such scripts to continue working. What I'd like to avoid is having each group write their own version of this. -- Bill Gropp
- Also, I believe qsub is non-blocking; when used in a script, mpirun should return only when the process is complete. This is fairly simple to handle (at least with busy polling with qstat), but again, it would be great if there was one solution instead of n. -- Bill Gropp
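The busy-polling approach mentioned above could be sketched as follows. This is a minimal illustration, not an existing Cobalt tool: `job_is_listed` and `wait_for_job` are hypothetical names, and the sketch assumes that `cqstat` lists a job only while it is queued or running and prints the job ID in the first column, as in the qstat listings elsewhere on this page.

```python
import subprocess
import time


def job_is_listed(jobid, run=subprocess.run):
    """Return True if jobid appears in the first column of the cqstat output.

    Assumption: one job per line, job ID first. Header and separator
    lines never match because their first field is not the job ID.
    """
    result = run(["cqstat"], capture_output=True, text=True)
    for line in result.stdout.splitlines():
        fields = line.split()
        if fields and fields[0] == str(jobid):
            return True
    return False


def wait_for_job(jobid, is_listed=job_is_listed, poll_seconds=1.0, sleep=time.sleep):
    """Block until the job no longer appears in the queue listing.

    Returns the number of polls that still found the job listed.
    """
    polls = 0
    while is_listed(jobid):
        polls += 1
        sleep(poll_seconds)
    return polls
```

A system-wide mpirun wrapper could submit the job, read the job ID that the submit command prints, and then call `wait_for_job` before returning. If cqstat gains the non-zero exit status requested in feature #8, the output parsing could be replaced by a check of the return code.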
Old Bug List, resolved or removed items
- 'man qsub' gives information about some other program entirely. This is bad, because there is no way to get help on parameters that the ANL qsub command really expects, and what they mean. It would be better to remove the current man page than show the wrong one. -- Pete Beckman
- I removed all the non-RM associated man pages. -- Susan Coghlan
- qstat has no man page -- Pete Beckman
SUSE appears to install pbs man pages by default. The right man pages are installed in /soft/apps/rm-0.90/man, and should be included in softenv shortly. -- Narayan Desai
- The qstat 'wall time' does not report correct time -- Pete Beckman
%qsub -t 1 -n 32 `pwd`/allred ; date
43
Mon Mar 14 08:07:00 CST 2005
....
%qstat ; date
JobID User WallTime Nodes State
============================================
43 beckman 00:00:01 32 running
Mon Mar 14 08:10:36 CST 2005
Walltime is the requested amount of wallclock time from job submission. A 5 minute grace period is provided before jobs are killed, so the job probably executed successfully, despite supplying too little wallclock time. Process startup is really slow on BG/L, so even cpi takes 90 seconds. I could add job start time to qstat output if it would be useful. -- Narayan Desai
- There is no way to kill the correct job. Either there needs to be a qkill command that takes a job ID, or qstat has to provide a mapping between job ID and user process ID. Currently, if I submit two MPI jobs, and then for whatever reason I realize that one needs to be killed, I can't map it back to the job ID. This will be especially important once the file output bug reported above is fixed. Then, when the output of a job shows it is doing the wrong thing, I want to kill it. But at this point, there is no associated process ID. Example below: -- Pete Beckman
%qsub -t 1 -n 32 `pwd`/allred ; qsub -t 1 -n 32 `pwd`/allred
44
45
%qstat
JobID User WallTime Nodes State
============================================
44 beckman 00:00:01 32 running
45 beckman 00:00:01 32 running
%ps -xU beckman
PID TTY STAT TIME COMMAND
10085 ? S 0:00 mpirun -np 32 -partition R000_J203 -cwd /bgl/home1/beckman/
10091 ? S 0:00 mpirun -np 32 -partition R000_J205 -cwd /bgl/home1/beckman/
Which process is job 44 and which is 45?
Try out qdel -- Narayan Desai
- qsub without arguments says 'qsub.py' not 'qsub'. This is confusing. -- Pete Beckman
This has been fixed -- Narayan Desai
- qstat fails on login2, login3 and login4 with the following error: -- Susan Coghlan
login2> qstat
Traceback (most recent call last):
  File "/soft/apps/rm-0.90/bin/qstat.py", line 8, in ?
    from xml.dom.minidom import parseString
ImportError: No module named xml.dom.minidom
- A block was left allocated after a job finished. That block is preventing any jobs from running because RM tries to use it but can't. RM needs to check to see if a block is Free before trying to start a job on it.
The scheduler now checks if blocks are free (in the db2 database) before allocating them for jobs -- Narayan Desai
- qsub returns conflicting 'Usage:' statements: -- Pete Beckman
%qsub
Command required
Usage: qsub [-d] [-v] -p project -t time (in minutes) -n (number of nodes) -c (processor count) -m (mode co/vn) <command> <args>
%qsub --help
option --help not recognized
qsub Usage: -d debug -v verbose -p <project> -c <proccount> -m <mode> -t <time> -n <nodecount>
These two must match. The second does not specify that a 'command' or 'args' is required. Variable names are different, in different orders, etc. To reduce confusion for users such as myself, one Usage: is needed, that provides all the information.
I don't see how these are inconsistent. They are just messages reported from different points in the code.
This has now been fixed. -- Narayan Desai
- Sometimes, I get an error that the partition is busy, possibly with someone else's job? Either folks are not using qsub, or there is a race condition somewhere in the allocation. Here is the output from my jobid.error file -- Pete Beckman
<Mar 14 08:06:01> BRIDGE (Info) : The machine serial number (alias) is BGL
<Mar 14 08:06:01> MPIRUN (Info) : Initializing Stand-Alone Job...
<Mar 14 08:06:01> MPIRUN (Info) : Specified partition id: R000_J203
<Mar 14 08:06:01> MPIRUN (Info) : Examining partition R000_J203...
<Mar 14 08:06:01> CMNLIB (Info) : Partition R000_J203 - There is at least one active BG/L job assigned to this partition
<Mar 14 08:06:01> MPIRUN (ERROR): Partition is busy - Aborting
<Mar 14 08:06:01> MPIRUN (ERROR): Job cycle failed with code -5
<Mar 14 08:06:01> MPIRUN (ERROR): ------------------------------------
<Mar 14 08:06:01> MPIRUN (ERROR): - Job cycle completed abnormally -
<Mar 14 08:06:01> MPIRUN (ERROR): ------------------------------------
We weren't checking that partitions were properly deallocated after job execution had completed. Some partitions had become stuck and required Susan to manually fix them. We will start checking partition status before allocating partitions. We now check that partitions are free before assigning jobs to them. -- Narayan Desai
- It seems that "done" jobs get stuck in the data structure for qstat until the lowest job-id item terminates. While it may not strictly be a bug, it is very confusing, since to a user, it seems like a done job or something just got stuck in the queue, or your queued jobs are somehow "behind" a done job in the queue. In reality, it is stuck in the data structure (it seems), but has completed. An example is below: -- Pete Beckman
% qstat
JobID User WallTime Nodes State
===========================================
53 chad 00:00:05 1 running
54 hereld 00:00:10 1 done
55 hereld 00:01:40 128 done
56 riley 00:01:00 32 done
57 riley 00:01:00 32 done
58 riley 00:01:00 32 done
59 riley 00:01:00 32 done
60 hereld 00:01:40 33 queued
61 riley 00:01:00 32 done
::THEN JOB COMPLETES::
%qstat
JobID User WallTime Nodes State
===========================================
60 hereld 00:01:40 33 running
This was an artifact of a job-killing bug that I have fixed. It caused the removal of finished jobs to not occur. This is now fixed -- Narayan Desai
- qstat -xxx returns usage for 'qstat.py' not 'qstat'. This is confusing. -- Pete Beckman
Here is the output and path, showing the qstat.py thing. This is on Login1:
Mon Mar 14 16:55:36 CST 2005
login1 /home/beckman> qstat -xxx
Usage: qstat.py [-d] [-f jobid]
option -x not recognized
login1 /home/beckman> which qstat
/soft/apps/rm-0.90/bin/qstat
I can't replicate this one. Fixed. -- Narayan Desai
[leftover from previous discussion] There were two issues in play here. The first is that no partitions could satisfy Mark's job requirements (due to size). This is only a transient error, as a partition may appear that has the appropriate size. Moreover, users may wish to queue jobs for a reservation prior to its start. The second issue is that components hang. There isn't an easy place to put a watchdog timer at the moment, but I will try to find a place for one. -- Narayan Desai
Good point, if there are no partitions to satisfy the request, what should happen? Below is an example. The only reason to not abort the job with an error if there is no matching partition would be if we wanted to Q jobs for partitions that would be made later. Probably not a useful case to optimize for at this point, since the more common and useful behaviour is to tell the user if the job can't be satisfied. The example below shows two problems. 1) that I can submit jobs with 987M nodes, and 2) that I can submit jobs for negative one node, and have it run. -- Pete Beckman
login2 Selfish_suite/detour> qstat
JobID User WallTime Nodes State
================================================
143 beckman 00:00:01 987654321 queued
144 beckman 00:00:01 -1 running
145 rloy 00:00:20 32 queued
Good catch. There were two underlying problems. The first was that jobs with unrealistic node counts could be submitted, and the other was that negative node count jobs could be scheduled. Both have been fixed. -- Narayan Desai
- It looks like the fix affected the Usage: reported when called with no args (see below). Let me take this opportunity to suggest a --version to help track things :-) -- Pete Beckman
%qsub
node count out of realistic range
Fixed. I will add the version strings once I have some things set up in version control. -- Narayan Desai
- It appears that output is not put in my home directory during the execution of my job. I'm guessing some form of temp file is used, and then copied to my directory. This makes it impossible for me to 'tail -f' the output and see if the job is executing correctly, or if something bad happened, and I need to kill it. At this point, if I submit an 8 hour job, there seems to be no way to know how it is doing until after the 8 hours. -- Pete Beckman
Output is directly written to the job cwd. File names are not currently selectable, but output can be tailed during job execution -- Narayan Desai
Just to head you off at the pass, negative time specifications have been blocked as well. -- Narayan Desai
- Heh. Heh. Too late... See below. -- Pete Beckman who is only pretending to be a QA guy.
login2 Selfish_suite/detour> qstat
JobID User WallTime Nodes State
============================================
149 rloy 00:00:20 16 running
150 beckman -1:23:59 32 running
fixed I assume....
That's right. -- Narayan Desai
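The range checks discussed in the items above (unrealistic or negative node counts, and negative time requests) amount to validating the request at submission time. A minimal sketch of such a check follows. It is not the actual Cobalt fix: `validate_request` is a hypothetical name, the `max_nodes` default is a placeholder rather than the real machine size, and only the "node count out of realistic range" message is taken from the output quoted on this page.

```python
def validate_request(nodes, minutes, max_nodes=1024):
    """Return a list of error messages; an empty list means the request is acceptable.

    max_nodes is a placeholder value; a real check would use the size
    of the largest partition that could ever be available.
    """
    errors = []
    if not 1 <= nodes <= max_nodes:
        # Rejects both 987654321 nodes and -1 nodes.
        errors.append("node count out of realistic range")
    if minutes <= 0:
        # Hypothetical message; rejects negative or zero wall time.
        errors.append("time must be a positive number of minutes")
    return errors
```

Rejecting the job at submission gives the user the immediate feedback requested above. Jobs that are valid but cannot be placed right now (for example, waiting on a reservation) would still be queued.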
- qstat, qsub, and qdel need a "--version" option linked to a CVS string somehow. We are now reporting things in real time, and the code is getting fixed in real time, inside live environments, and we can't tell which version the bug is reported against. Bugs reported here should be accompanied by a version number, and that number must be available from the command line so users can understand when things change. -- Pete Beckman
All utilities now take a --version option. -- Narayan Desai
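For reference, a --version option of the kind requested above can be added to a Python command-line tool in a few lines. The sketch below uses the standard library's argparse module and is not how the Cobalt utilities implement it. `build_parser` is a hypothetical helper and the `VERSION` string is a placeholder, where a real tool would substitute its release or version-control identifier.

```python
import argparse

# Placeholder only; a real tool would fill this in from its release
# process or version-control system.
VERSION = "0.0.0-example"


def build_parser(prog):
    """Build a parser whose --version flag prints '<prog> <VERSION>' and exits."""
    parser = argparse.ArgumentParser(prog=prog)
    parser.add_argument(
        "--version",
        action="version",
        version="%(prog)s " + VERSION,
    )
    return parser
```

Calling `build_parser("cqstat").parse_args(["--version"])` prints the program name and version string and exits with status 0, so a bug report can quote the exact version it was filed against.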
