Sunday, August 4, 2019

Spectrum LSF MultiCluster Job Forwarding Model - Configurations

IBM® Spectrum LSF (formerly IBM® Platform™ LSF®) is a complete workload management solution for demanding HPC environments. Featuring intelligent, policy-driven scheduling and easy-to-use interfaces for job and workflow management, it helps organizations improve competitiveness by accelerating research and design while controlling costs through superior resource utilization. There are two Spectrum LSF MultiCluster models: the job forwarding model and the resource leasing model. Configuring a cluster for the lease model is covered in a separate blog post. Let's learn about the job forwarding model for a Spectrum LSF cluster in the sections below.

Job forwarding model overview
In this model, the cluster that is starving for resources sends jobs over to the cluster that has resources to spare. Job status, pending reasons, and resource usage are returned to the submission cluster. When the job is done, the exit code returns to the submission cluster.

By default, clusters do not share resources, even if MultiCluster has been installed. To enable job forwarding, enable MultiCluster queues in both the submission and execution clusters.

How it works:
With this model, scheduling of MultiCluster jobs is a process with two scheduling phases:
- The submission cluster selects a suitable remote receive-jobs queue and forwards the job to it.
- The execution cluster selects a suitable host and dispatches the job to it. If a suitable host is not found immediately, the job remains pending in the execution cluster and is evaluated again in the next scheduling cycle. This method automatically favors local hosts; a MultiCluster send-jobs queue always attempts to find a suitable local host before considering a receive-jobs queue in another cluster.
 
Send-jobs queue
A send-jobs queue can forward jobs to a specified remote queue. By default, LSF attempts to run jobs in the local cluster first. LSF only attempts to place a job remotely if it cannot place the job locally.
Receive-jobs queue
A receive-jobs queue accepts jobs from queues in a specified remote cluster. Although send-jobs queues only forward jobs to specific queues in the remote cluster, receive-jobs queues can accept work from any and all queues in the remote cluster.
Multiple queue pairs
  • You can configure multiple send-jobs and receive-jobs queues in one cluster.
  • A queue can forward jobs to as many queues in as many clusters as you want, and can also receive jobs from as many other clusters as you want (a configuration sketch follows this list).
  • A receive-jobs queue can also borrow resources using the resource leasing method, but a send-jobs queue that uses the job forwarding method cannot also share resources using the resource leasing method.
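
For instance, a single send-jobs queue can fan out to several clusters. A minimal sketch in lsb.queues syntax, with hypothetical queue and cluster names (the real definitions for this setup appear later in this post):

Begin Queue
QUEUE_NAME = multi_send_q
SNDJOBS_TO = recv_q@clusterA recv_q@clusterB
HOSTS      = none
PRIORITY   = 30
End Queue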

In LSF multicluster capability job forwarding mode, the bjobs -fwd option filters output to display information about forwarded jobs, including the forward time and the name of the cluster to which each job was forwarded. -fwd can be used with other options to further filter the results; for example, bjobs -fwd -r displays only forwarded running jobs. In job forwarding mode, you can also use the local job ID and cluster name to retrieve the job details from the remote cluster. The query syntax is:

bjobs submission_job_id@submission_cluster_name 
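For example, after job 25378 from the walkthrough below is forwarded, its remote copy can be looked up on the execution cluster by its submission-side ID (illustrative; the job ID and cluster name match the examples later in this post):

bjobs 25378@ppc_cluster1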

Additional output fields for bjobs:
  
       +------------------+-------+-------------+------------+--------------+
       | Field name       | Width | Aliases     | Unit       | Category     |
       +------------------+-------+-------------+------------+--------------+
       | forward_cluster  | 15    | fwd_cluster |            | MultiCluster |
       | forward_time     | 15    | fwd_time    | time stamp | MultiCluster |
       | srcjobid         | 8     |             |            | MultiCluster |
       | dstjobid         | 8     |             |            | MultiCluster |
       | source_cluster   | 15    | srcluster   |            | MultiCluster |
       +------------------+-------+-------------+------------+--------------+
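
These field names can also be passed to the bjobs -o option to build a custom view of forwarded jobs. A minimal sketch, assuming an LSF version (10.1 or later) with customized output support:

bjobs -fwd -o "jobid user stat queue forward_cluster forward_time delimiter=','"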

Cluster Configurations:

List the clusters with basic information:
[sachinpb@powerNode06 ~]$ lsclusters
CLUSTER_NAME     STATUS   MASTER_HOST           ADMIN     HOSTS  SERVERS
ppc_cluster1     ok       powerNode06           lsfadmin      5        5
x86-64_cluster2  ok       RemoteClusterHost07   lsfadmin      8        8
[sachinpb@powerNode06 ~]$
List hosts on each cluster:
[sachinpb@powerNode06 ~]$ bhosts -w
HOST_NAME      STATUS  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
powerNode01    ok      -     80   0      0    0      0      0
powerNode02    ok      -     80   0      0    0      0      0
powerNode03    ok      -     80   0      0    0      0      0
powerNode04    ok      -     80   0      0    0      0      0
powerNode05    ok      -     80   0      0    0      0      0
[sachinpb@powerNode06 ~]$
-------------------------- Other cluster --------------------------
[sachinpb@RemoteClusterHost7 ~]$ bhosts -w 
HOST_NAME            STATUS  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
RemoteClusterHost01  ok      -     40   0      0    0      0      0
RemoteClusterHost02  ok      -     40   0      0    0      0      0
RemoteClusterHost03  ok      -     40   0      0    0      0      0
RemoteClusterHost04  ok      -     40   0      0    0      0      0
RemoteClusterHost05  ok      -     40   0      0    0      0      0
RemoteClusterHost06  ok      -     40   0      0    0      0      0
RemoteClusterHost07  ok      -     40   0      0    0      0      0
RemoteClusterHost08  ok      -     40   0      0    0      0      0
[sachinpb@RemoteClusterHost7 ~]$

Display information about the IBM Spectrum LSF multicluster queue connections:
[sachinpb@powerNode06 ~]$ bclusters -w
LOCAL_QUEUE   JOB_FLOW  REMOTE         CLUSTER          STATUS
send_queue    send      receive_queue  x86-64_cluster2  ok
x86_perf_q    send      x86_perf_q     x86-64_cluster2  ok
x86_ibmgpu_q  send      x86_ibmgpu_q   x86-64_cluster2  ok

[Resource Lease Information]
No resources have been exported or borrowed
 
[sachinpb@powerNode06 ~]$

Configuration files:
 

To make a queue that only runs jobs in remote clusters, take the following steps:

Procedure

  1. Edit the lsb.queues queue definition for the send-jobs queue.
    1. Define SNDJOBS_TO. This specifies the remote execution queues to which the queue can forward jobs.
    2. Set HOSTS to none. This specifies that the queue uses no local hosts.
    3. Set MAX_RSCHED_TIME=infinit to maintain FCFS job order (see the note after the configuration examples).
  2. Edit the lsb.queues queue definition for each receive-jobs queue.
    1. Define RCVJOBS_FROM. This specifies the submission-cluster queues from which the receive-jobs queue accepts jobs.
    2. Set HOSTS to the list of execution hosts.

Update $LSF_HOME/conf/lsbatch/ppc_cluster1/configdir/lsb.queues
--------------------------------------------------
Begin Queue
QUEUE_NAME     = send_queue
SNDJOBS_TO     = receive_queue@x86-64_cluster2
HOSTS          = none
PRIORITY       = 30
NICE           = 20
End Queue

--- on Other cluster ---
Begin Queue
QUEUE_NAME      = receive_queue
RCVJOBS_FROM    = send_queue@ppc_cluster1
HOSTS           = RemoteClusterHost01 RemoteClusterHost02 RemoteClusterHost03 RemoteClusterHost04 RemoteClusterHost05 RemoteClusterHost06 RemoteClusterHost07 RemoteClusterHost08
PRIORITY        = 55
NICE            = 10
EXCLUSIVE       = Y
DESCRIPTION     = Multicluster Queue
End Queue
---------------------------------------------------------------------------------

Begin Queue
QUEUE_NAME   = x86_ibmgpu_q
SNDJOBS_TO   = x86_ibmgpu_q@x86-64_cluster2
PRIORITY     = 90
INTERACTIVE  = NO
FAIRSHARE    = USER_SHARES[[default,1]]
HOSTS        = none
EXCLUSIVE    = Y
MAX_RSCHED_TIME = infinit
DESCRIPTION  = For x86jobs, Multicluster Queue - Job forward Mode
End Queue

--- on Other cluster ---
Begin Queue
QUEUE_NAME      = x86_ibmgpu_q
RCVJOBS_FROM    = x86_ibmgpu_q@ppc_cluster1
HOSTS           = RemoteClusterHost01 RemoteClusterHost02 RemoteClusterHost03 RemoteClusterHost04 RemoteClusterHost05 RemoteClusterHost06 RemoteClusterHost07 RemoteClusterHost08
PRIORITY        = 55
NICE            = 10
EXCLUSIVE       = Y
DESCRIPTION     = Multicluster Queue - Job forward Mode
End Queue
---------------------------------------------------------------------------------------

Begin Queue
QUEUE_NAME   = x86_perf_q
SNDJOBS_TO   = x86_perf_q@x86-64_cluster2
PRIORITY     = 40
INTERACTIVE  = NO
FAIRSHARE    = USER_SHARES[[default,1]]
HOSTS        = none
EXCLUSIVE    = Y
MAX_RSCHED_TIME = infinit
DESCRIPTION  = For P8 performance jobs, running only if hosts are lightly loaded.
End Queue


--- on Other cluster ---
Begin Queue
QUEUE_NAME      = x86_perf_q
RCVJOBS_FROM    = x86_perf_q@ppc_cluster1
HOSTS           = RemoteClusterHost01 RemoteClusterHost02 RemoteClusterHost03 RemoteClusterHost04 RemoteClusterHost05 RemoteClusterHost06 RemoteClusterHost07 RemoteClusterHost08
PRIORITY        = 55
NICE            = 10
DESCRIPTION     = Multicluster Queue - Job forward Mode
End Queue
--------------------------------------------------------------
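
After editing lsb.queues on either cluster, reload the batch configuration so the new queue pairs take effect. The standard LSF administration commands for this (run as the LSF administrator on each cluster's master host) are:

badmin ckconfig
badmin reconfig

You can then confirm the send/receive pairing with bclusters -w, as shown earlier.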


NOTE:

In LSF multicluster job forwarding mode, a job is recalled to the submission cluster when it has stayed in pending state in the execution cluster for MAX_RSCHED_TIME. Set MAX_RSCHED_TIME=infinit to maintain the FCFS job order of MultiCluster jobs in the execution queue. Otherwise, jobs that time out are rescheduled to the same execution queue, but they lose priority and position because they are treated as new job submissions.
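
If you do want remote-pending jobs recalled, set a finite value instead. An illustrative lsb.queues setting (per the IBM documentation referenced below, the effective timeout in seconds is MAX_RSCHED_TIME multiplied by MBD_SLEEP_TIME, which defaults to 20 seconds):

MAX_RSCHED_TIME = 180    # recalled after roughly 180 * 20s = 1 hour of remote pending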
How to submit jobs - an example showing job forwarding from "ppc_cluster1" to "x86-64_cluster2"

Submit job1 - job1.script
[sachinpb@powerNode06 ~]$  bsub -n 8 -R "span[ptile=4]" -q x86_ibmgpu_q -R "select[type==X86_64]" job1.script
Job <25378> is submitted to queue <x86_ibmgpu_q>.
[sachinpb@powerNode06 ~]$

Submit job2 - job2.script
[sachinpb@powerNode06 ~]$  bsub -n 4 -q x86_ibmgpu_q -R "select[type==X86_64]" job2.script
Job <25383> is submitted to queue <x86_ibmgpu_q>.
[sachinpb@powerNode06 ~]$
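
Here job1.script and job2.script stand for ordinary executable job scripts; their contents are not shown in this post (the bjobs -l output below suggests they drive an MPI test suite). A trivial hypothetical stand-in would be:

#!/bin/sh
# hypothetical placeholder payload; any command works for demonstrating forwarding
hostname
sleep 100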

List all forwarded jobs with the -fwd option of bjobs:

[sachinpb@powerNode06 ~]$ bjobs -fwd
JOBID  USER      STAT  QUEUE         EXEC_HOST                            JOB_NAME     CLUSTER          FORWARD_TIME
25378  sachinpb  RUN   x86_ibmgpu_q  RemoteClusterHost02@x86-64_cluster2  job1.script  x86-64_cluster2  Aug  6 10:25
                                     RemoteClusterHost02@x86-64_cluster2
                                     RemoteClusterHost02@x86-64_cluster2
                                     RemoteClusterHost02@x86-64_cluster2
                                     RemoteClusterHost08@x86-64_cluster2
                                     RemoteClusterHost08@x86-64_cluster2
                                     RemoteClusterHost08@x86-64_cluster2
                                     RemoteClusterHost08@x86-64_cluster2
25383  sachinpb  RUN   x86_ibmgpu_q  RemoteClusterHost04@x86-64_cluster2  job2.script  x86-64_cluster2  Aug  6 10:39
                                     RemoteClusterHost04@x86-64_cluster2
                                     RemoteClusterHost04@x86-64_cluster2
                                     RemoteClusterHost04@x86-64_cluster2
                                     RemoteClusterHost04@x86-64_cluster2
[sachinpb@powerNode06 ~]$

--------------------
Observe the job description and details for a forwarded job:
[sachinpb@powerNode06 ~]$ bjobs -l 25378
Job <25378>, Job Name <sachinpb-TEST_ibm-smpi_1127>, User <sachinpb>, Project <defa
                     ult>, Status <RUN>, Queue <x86_ibmgpu_q>
Tue Aug  6 10:25:51: Submitted from host <powerNode06>,  Exclusive Execution, 8 Task(s), Requ
                     ested Resources < select[type == X86_64] span[ptile=4]>;
Tue Aug  6 10:25:51: Job <25378> forwarded to cluster <x86-64_cluster2> as Job<24347>;
Tue Aug  6 10:25:51: Started 8 Task(s) on Host(s) <RemoteClusterHost02@x86-64_cluster2>
                     <RemoteClusterHost02@x86-64_cluster2> <RemoteClusterHost02@x86-64_cluster2>
                     <RemoteClusterHost02@x86-64_cluster2> <RemoteClusterHost08@x86-64_cluster2>
                     <RemoteClusterHost08@x86-64_cluster2> <RemoteClusterHost08@x86-64_cluster2>
                     <RemoteClusterHost08@x86-64_cluster2>,
                     Allocated 32 Slot(s) on Host(s)
                     <RemoteClusterHost02@x86-64_cluster2> <RemoteClusterHost02@x86-64_cluster2>
                     <RemoteClusterHost02@x86-64_cluster2> <RemoteClusterHost02@x86-64_cluster2>
                     <RemoteClusterHost02@x86-64_cluster2> <RemoteClusterHost02@x86-64_cluster2>
                     <RemoteClusterHost02@x86-64_cluster2> <RemoteClusterHost02@x86-64_cluster2>
                     <RemoteClusterHost02@x86-64_cluster2> <RemoteClusterHost02@x86-64_cluster2>
                     <RemoteClusterHost02@x86-64_cluster2> <RemoteClusterHost02@x86-64_cluster2>
                     <RemoteClusterHost02@x86-64_cluster2> <RemoteClusterHost02@x86-64_cluster2>
                     <RemoteClusterHost02@x86-64_cluster2> <RemoteClusterHost02@x86-64_cluster2>
                     <RemoteClusterHost08@x86-64_cluster2> <RemoteClusterHost08@x86-64_cluster2>
                     <RemoteClusterHost08@x86-64_cluster2> <RemoteClusterHost08@x86-64_cluster2>
                     <RemoteClusterHost08@x86-64_cluster2> <RemoteClusterHost08@x86-64_cluster2>
                     <RemoteClusterHost08@x86-64_cluster2> <RemoteClusterHost08@x86-64_cluster2>
                     <RemoteClusterHost08@x86-64_cluster2> <RemoteClusterHost08@x86-64_cluster2>
                     <RemoteClusterHost08@x86-64_cluster2> <RemoteClusterHost08@x86-64_cluster2>
                     <RemoteClusterHost08@x86-64_cluster2> <RemoteClusterHost08@x86-64_cluster2>
                     <RemoteClusterHost08@x86-64_cluster2> <RemoteClusterHost08@x86-64_cluster2>,
                     Execution Home </home1/sachinpb/>, Execution CWD </tmp>;
Tue Aug  6 11:41:25: Resource usage collected.
                     The CPU time used is 28681 seconds.
                     MEM: 82 Mbytes;  SWAP: 0 Mbytes;  NTHREAD: 22
                     HOST: RemoteClusterHost02
                     MEM: 82 Mbytes;  SWAP: 0 Mbytes; CPU_TIME: 15662 seconds
                     PGIDs:  7309 29438 29439 29440 29441 29635 29636 29637 296
                     38
                     PIDs:  7309 7322 7324 7377 29261 29418 29438 29439 29440 2
                     9441 29616 29635 29636 29637 29638
                     HOST: RemoteClusterHost08
                     MEM: 0 Mbytes;  SWAP: 0 Mbytes; CPU_TIME: 13019 seconds
                     PGIDs: -
                     PIDs: -
 RUNLIMIT
 480.0 min
 MEMORY USAGE:
 MAX MEM: 135.4 Gbytes;  AVG MEM: 11 Gbytes
 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -
 RESOURCE REQUIREMENT DETAILS:
 Combined: select[type == X86_64 ] order[r15s:pg] span[ptile=4]
 Effective: select[type == X86_64 ] order[r15s:pg] span[ptile=4]
[sachinpb@powerNode06 ~]$

--------------------
Job description and details on the remote cluster for the same forwarded job, under a different job ID:

-bash-4.2$ bjobs -l 24347
Job <24347>, Job Name <sachinpb-TEST_ibm-smpi_1127>, User <sachinpb>, Project <defa
                     ult>, Status <RUN>, Queue <x86_ibmgpu_q>, Command <sh /nfs
                     _smpi_ci/ibm-tests/smpi-ci/bin/smpi_test.sh 1127 pr x86_64
                      ibm-smpi "  ">
Tue Aug  6 10:30:55: Submitted from host <powerNode06@ppc_cluster1:25378>, Exclusive
                     Execution, 8 Task(s), Requested Resources < select[type == X86_64]
                     span[ptile=4]>;
Tue Aug  6 10:30:55: Job <25378> of cluster <ppc_cluster1> accepted as Job <24347>;
Tue Aug  6 10:30:55: Started 8 Task(s) on Host(s) <RemoteClusterHost02> <RemoteClusterHost02>
                     <RemoteClusterHost02> <RemoteClusterHost02> <RemoteClusterHost08>
                     <RemoteClusterHost08> <RemoteClusterHost08> <RemoteClusterHost08>,
                     Allocated 32 Slot(s) on Host(s)
                     <RemoteClusterHost02> <RemoteClusterHost02> <RemoteClusterHost02> <RemoteClusterHost02>
                     <RemoteClusterHost02> <RemoteClusterHost02> <RemoteClusterHost02> <RemoteClusterHost02>
                     <RemoteClusterHost02> <RemoteClusterHost02> <RemoteClusterHost02> <RemoteClusterHost02>
                     <RemoteClusterHost02> <RemoteClusterHost02> <RemoteClusterHost02> <RemoteClusterHost02>
                     <RemoteClusterHost08> <RemoteClusterHost08> <RemoteClusterHost08> <RemoteClusterHost08>
                     <RemoteClusterHost08> <RemoteClusterHost08> <RemoteClusterHost08> <RemoteClusterHost08>
                     <RemoteClusterHost08> <RemoteClusterHost08> <RemoteClusterHost08> <RemoteClusterHost08>
                     <RemoteClusterHost08> <RemoteClusterHost08> <RemoteClusterHost08> <RemoteClusterHost08>,
                     Execution Home </home1/sachinpb/>, Execution CWD </tmp>;
Tue Aug  6 12:13:45: Resource usage collected.
                     The CPU time used is 41628 seconds.
                     MEM: 88 Mbytes;  SWAP: 0 Mbytes;  NTHREAD: 13
                     HOST: RemoteClusterHost02
                     MEM: 88 Mbytes;  SWAP: 0 Mbytes; CPU_TIME: 22646 seconds
                     PGIDs:  7309 11162 11163 11164 11165 11719 11720 11721
                     PIDs:  7309 7322 7324 7377 11144 11162 11163 11164 11165 1
                     1379 11541 11700 11719 11720 11721

                     HOST: RemoteClusterHost08
                     MEM: 0 Mbytes;  SWAP: 0 Mbytes; CPU_TIME: 18982 seconds
                     PGIDs: -
                     PIDs: -
 RUNLIMIT
 480.0 min
 MEMORY USAGE:
 MAX MEM: 135.4 Gbytes;  AVG MEM: 9.2 Gbytes
 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -
 RESOURCE REQUIREMENT DETAILS:
 Combined: select[type == X86_64 ] order[r15s:pg] span[ptile=4]
 Effective: select[type == X86_64 ] order[r15s:pg] span[ptile=4]
-bash-4.2$

---------------------------------------------------------------------------------
You can also check details of completed jobs using the bhist command.

@Submission cluster:
[sachinpb@powerNode06 configdir]$ bhist -l 25378
Job <25378>, Job Name <sachinpb-TEST_ibm-smpi_1127>, User <sachinpb>, Project <defa
                     ult>, Command <sh /nfs_smpi_ci/ibm-tests/smpi-ci/bin/smpi_
                     test.sh 1127 pr x86_64 ibm-smpi "  ">
Tue Aug  6 10:25:51: Submitted from host <powerNode06>, to Queue <x86_ibmgpu_q>, Exclusive
                     Execution, 8 Task(s), Requested Resources < select[type == X86_64]
                     span[ptile=4]>;
Tue Aug  6 10:25:51: Forwarded job to cluster x86-64_cluster2;
Tue Aug  6 10:25:51: Job 25378 forwarded to cluster x86-64_cluster2 as remote j
                     ob 24347;
Tue Aug  6 10:25:51: Dispatched 8 Task(s) on Host(s) <RemoteClusterHost02@x86-64_cluster2>
                     <RemoteClusterHost02@x86-64_cluster2> <RemoteClusterHost02@x86-64_cluster2>
                     <RemoteClusterHost02@x86-64_cluster2> <RemoteClusterHost08@x86-64_cluster2>
                     <RemoteClusterHost08@x86-64_cluster2> <RemoteClusterHost08@x86-64_cluster2>
                     <RemoteClusterHost08@x86-64_cluster2>,
                     Allocated 32 Slot(s) on Host(s)
                     <RemoteClusterHost02@x86-64_cluster2> <RemoteClusterHost02@x86-64_cluster2>
                     <RemoteClusterHost02@x86-64_cluster2> <RemoteClusterHost02@x86-64_cluster2>
                     <RemoteClusterHost02@x86-64_cluster2> <RemoteClusterHost02@x86-64_cluster2>
                     <RemoteClusterHost02@x86-64_cluster2> <RemoteClusterHost02@x86-64_cluster2>
                     <RemoteClusterHost02@x86-64_cluster2> <RemoteClusterHost02@x86-64_cluster2>
                     <RemoteClusterHost02@x86-64_cluster2> <RemoteClusterHost02@x86-64_cluster2>
                     <RemoteClusterHost02@x86-64_cluster2> <RemoteClusterHost02@x86-64_cluster2>
                     <RemoteClusterHost02@x86-64_cluster2> <RemoteClusterHost02@x86-64_cluster2>
                     <RemoteClusterHost08@x86-64_cluster2> <RemoteClusterHost08@x86-64_cluster2>
                     <RemoteClusterHost08@x86-64_cluster2> <RemoteClusterHost08@x86-64_cluster2>
                     <RemoteClusterHost08@x86-64_cluster2> <RemoteClusterHost08@x86-64_cluster2>
                     <RemoteClusterHost08@x86-64_cluster2> <RemoteClusterHost08@x86-64_cluster2>
                     <RemoteClusterHost08@x86-64_cluster2> <RemoteClusterHost08@x86-64_cluster2>
                     <RemoteClusterHost08@x86-64_cluster2> <RemoteClusterHost08@x86-64_cluster2>
                     <RemoteClusterHost08@x86-64_cluster2> <RemoteClusterHost08@x86-64_cluster2>
                     <RemoteClusterHost08@x86-64_cluster2> <RemoteClusterHost08@x86-64_cluster2>,
                     Effective RES_REQ <select[type == any ] order[r15s:pg] span[ptile=4] >;
Tue Aug  6 10:25:51: Starting (Pid 7309);
Tue Aug  6 10:25:51: Running with execution home </home1/sachinpb/>, Execution CW
                     D </tmp>, Execution Pid <7309>;
Tue Aug  6 12:17:50: Done successfully. The CPU time used is 46995.0 seconds;
                     HOST: RemoteClusterHost02; CPU_TIME: 23908 seconds
                     HOST: RemoteClusterHost08; CPU_TIME: 23087 seconds
 RUNLIMIT
 480.0 min of powerNode
MEMORY USAGE:
MAX MEM: 135.4 Gbytes;  AVG MEM: 8.5 Gbytes
Summary of time in seconds spent in various states by  Tue Aug  6 12:17:50
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  0        0        6719     0        0        0        6719

-----------------------
@Execution cluster:

-bash-4.2$ bhist -l 24347

Job <24347>, Job Name <sachinpb-TEST_ibm-smpi_1127>, User <sachinpb>, Project <defa
                     ult>, Command <sh /nfs_smpi_ci/ibm-tests/smpi-ci/bin/smpi_
                     test.sh 1127 pr x86_64 ibm-smpi "  ">
Tue Aug  6 10:30:55: Submitted from host <powerNode06@ppc_cluster1:25378>, to Queue <x86_ibmgpu_q>,
                     Output File <...r-ibm-smpi-1127/logs/smpi_test_lsf_out_25378>, Exclusive
                     Execution, 8 Task(s), Requested Resources < select[type == X86_64]
                     span[ptile=4]>;
Tue Aug  6 10:30:55: Job 25378 of cluster ppc_cluster1 accepted as job 24347;
Tue Aug  6 10:30:55: Dispatched 8 Task(s) on Host(s) <RemoteClusterHost02> <RemoteClusterHost02>
                     <RemoteClusterHost02> <RemoteClusterHost02> <RemoteClusterHost08>
                     <RemoteClusterHost08> <RemoteClusterHost08> <RemoteClusterHost08>,
                     Allocated 32 Slot(s) on Host(s)
                     <RemoteClusterHost02> <RemoteClusterHost02> <RemoteClusterHost02> <RemoteClusterHost02>
                     <RemoteClusterHost02> <RemoteClusterHost02> <RemoteClusterHost02> <RemoteClusterHost02>
                     <RemoteClusterHost02> <RemoteClusterHost02> <RemoteClusterHost02> <RemoteClusterHost02>
                     <RemoteClusterHost02> <RemoteClusterHost02> <RemoteClusterHost02> <RemoteClusterHost02>
                     <RemoteClusterHost08> <RemoteClusterHost08> <RemoteClusterHost08> <RemoteClusterHost08>
                     <RemoteClusterHost08> <RemoteClusterHost08> <RemoteClusterHost08> <RemoteClusterHost08>
                     <RemoteClusterHost08> <RemoteClusterHost08> <RemoteClusterHost08> <RemoteClusterHost08>
                     <RemoteClusterHost08> <RemoteClusterHost08> <RemoteClusterHost08> <RemoteClusterHost08>,
                     Effective RES_REQ <select[type == any ] order[r15s:pg] span[ptile=4] >;
Tue Aug  6 10:30:55: Starting (Pid 7309);
Tue Aug  6 10:30:56: Running with execution home </home1/sachinpb/>, Execution CW
                     D </tmp>, Execution Pid <7309>;
Tue Aug  6 12:22:54: Done successfully. The CPU time used is 46995.0 seconds;
                     HOST: RemoteClusterHost02; CPU_TIME: 23908 seconds
                     HOST: RemoteClusterHost08; CPU_TIME: 23087 seconds
Tue Aug  6 12:23:07: Post job process done successfully;
 RUNLIMIT
 480.0 min of POWER8
MEMORY USAGE:
MAX MEM: 135.4 Gbytes;  AVG MEM: 8.5 Gbytes
Summary of time in seconds spent in various states by  Tue Aug  6 12:23:07
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  0        0        6719     0        0        0        6719
-bash-4.2$

-------------------------------------
Example 2:
[]$ bsub -n 8 -q x86_ibmgpu_q -R "select[type==X86_64] span[ptile=1]"
bsub> sleep 100
bsub> Job <71403> is submitted to queue <x86_ibmgpu_q>.
[]$

[]$ bjobs 71403
JOBID  USER    STAT  QUEUE       FROM_HOST        EXEC_HOST               JOB_NAME   SUBMIT_TIME
71403  smpici  RUN   x86_ibmgpu  MYhost_cluster1  Host01@x86-64_cluster2  sleep 100  Jun 20 22:54
                                                  Host02@x86-64_cluster2
                                                  Host03@x86-64_cluster2
                                                  Host04@x86-64_cluster2
                                                  Host05@x86-64_cluster2
                                                  Host06@x86-64_cluster2
                                                  Host07@x86-64_cluster2
                                                  Host08@x86-64_cluster2
[]$
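
Forwarded jobs remain under the control of the submission cluster, so the usual job-control commands work from there. For example, terminating the forwarded job from the submission side (standard LSF behavior):

[]$ bkill 71403
Job <71403> is being terminated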
---------------------------------------------------------------------------------

I hope this blog helped in understanding how to set up Spectrum LSF's job forwarding mode across clusters, followed by job submission and monitoring.

References:
https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/lsf_multicluster/job_scheduling_job_forward_mc_lsf.html 
https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_multicluster/queue_configure_remote_mc_lsf.html
https://www.ibm.com/support/knowledgecenter/en/SSETD4_9.1.2/lsf_multicluster/remote_timeout_limit_mc_lsf.html









