IBM® Spectrum LSF (formerly IBM® Platform™ LSF®) is a
complete workload management solution for demanding HPC environments. Featuring intelligent,
policy-driven scheduling and easy-to-use interfaces for job and workflow management, it helps
organizations to improve competitiveness by accelerating research and design while controlling
costs through superior resource utilization.
Without a scheduler, an HPC cluster would just be a bunch of servers with different jobs interfering with each other. On a large cluster with multiple users, no user knows which compute nodes and CPU cores to use, nor how many resources are available on each node. To solve this, cluster batch control systems manage jobs on the system through HPC schedulers. They are essential for sequentially queuing jobs, assigning priorities, and distributing, parallelizing, suspending, killing, or otherwise controlling jobs cluster-wide. Spectrum LSF is a powerful workload management platform and job scheduler for distributed high-performance computing.
Computational multi-clusters are an important emerging class of supercomputing architectures. As multi-cluster systems become more prevalent, techniques for efficiently exploiting these resources become increasingly significant. A critical aspect of exploiting these resources is the challenge of scheduling. To maximize job throughput, a multi-cluster scheduler must simultaneously leverage the collective computational resources of all its participating clusters. By doing so, jobs that would otherwise wait for nodes to become available on a single cluster can potentially run earlier by aggregating disjoint resources throughout the multi-cluster. This procedure can result in dramatic reductions in queue waiting times.
Organizations might have multiple LSF clusters managed by different business units. In this scenario, it is good to share resources across the clusters to reap the benefits of global load sharing. Running separate clusters can address concerns such as:
- Ease of administration
- Different geographic locations
- Scalability
There are two Spectrum LSF multi-cluster models:
Job forwarding Model:
In this model, the cluster that is starving
for resources sends jobs over to the cluster that has resources to
spare. To work together, two clusters must set up compatible send-jobs
and receive-jobs queues.
With this
model, scheduling of MultiCluster jobs is a process with two scheduling
phases: the submission cluster selects a suitable remote receive-jobs
queue, and forwards the job to it; then the execution cluster selects a
suitable host and dispatches the job to it. This method automatically
favors local hosts; a MultiCluster send-jobs queue always attempts to
find a suitable local host before considering a receive-jobs queue in
another cluster.
For configuring a cluster in job forwarding mode, you can also refer to another blog post.
Resource leasing Model:
In
this model, the cluster that is starving for resources takes resources
away from the cluster that has resources to spare. To work together, the
provider cluster must “export” resources to the consumer, and the
consumer cluster must configure a queue to use those resources. In this model, each cluster schedules work on a single system image, which includes both borrowed hosts and local hosts.
Two clusters agree that one cluster will borrow
resources from the other, taking control of the resources. Both clusters
must change their configuration to make this possible, and the arrangement,
called a “lease”, does not expire, although it might change due to
changes in the cluster configuration.
With this
model, scheduling of jobs is always done by a single cluster. When
a queue is configured to run jobs on borrowed hosts, LSF schedules
jobs as if the borrowed hosts actually belonged to the cluster.
Selection of Model:
- The job forwarding model can make resources available to jobs from multiple clusters. This flexibility allows maximum throughput when each cluster’s resource usage fluctuates. The resource leasing model can give one cluster exclusive control of a dedicated resource, which can be more efficient when there is a steady amount of work.
This blog walks through both the lease and job forwarding configurations for a Spectrum LSF cluster.
[sachin@host1 ~]$ lsid
IBM Spectrum LSF Standard 10.1.0.3
My cluster name is cluster1_p8
My master name is host1
[sachin@host1 ~]$
lsclusters: Displays configuration information about LSF clusters
bhosts: Displays hosts and their static and dynamic resources in the cluster
-----------------------------------------------------------
Configuration Files:
/nfs_shared_dir/LSF_HOME/conf/lsf.shared
Begin Cluster
ClusterName Servers
cluster1_p8 (host1)
cluster2_p9 (host6)
cluster3_x86 (host11)
End Cluster
------------------------------------------------------------
/nfs_shared_dir/LSF_HOME/conf/lsbatch/ppc_cluster1/configdir/lsb.resources
Begin HostExport
PER_HOST = host1 # export host list
SLOTS = 20 # for each host, export 20 job slots
DISTRIBUTION = ([ cluster2_p9 , 1] [cluster3_x86, 1]) # share distribution for remote clusters
MEM = 100 # export 100M mem of each host [optional parameter]
SWP = 100 # export 100M swp of each host [optional parameter]
End HostExport
In this example, resources are leased to two clusters in an even 1:1 ratio, so each cluster gets half of the resources. NOTE: This configuration is required only for lease mode.
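The share arithmetic can be sketched in a few lines of shell. This is only an illustration, assuming a straightforward proportional split of the exported slots; the helper name share_of is hypothetical, and how LSF itself rounds uneven shares is not covered here.

```shell
#!/bin/sh
# Sketch: split the exported slot count across consumer clusters in
# proportion to their DISTRIBUTION shares (values copied from the example).
slots=20

# share_of SLOTS SHARES TOTAL_SHARES -> integer slot count, rounded down.
# Hypothetical helper for illustration; not an LSF command.
share_of() {
  echo $(( $1 * $2 / $3 ))
}

total_shares=$(( 1 + 1 ))   # [cluster2_p9, 1] [cluster3_x86, 1]
echo "cluster2_p9:  $(share_of "$slots" 1 "$total_shares") slots"
echo "cluster3_x86: $(share_of "$slots" 1 "$total_shares") slots"
```

With 20 exported slots and shares of 1 and 1, each consumer cluster is entitled to 10 slots. Changing the shares to 3 and 1 would give a 15 to 5 split under the same assumption.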
------------------------------------------------------
/nfs_shared_dir/LSF_HOME/conf/lsbatch/CI_cluster_ppc/configdir/lsb.queues
Begin Queue
QUEUE_NAME = send_queue
SNDJOBS_TO = receive_queue@cluster3_x86
HOSTS = none
PRIORITY = 30
NICE = 20
End Queue
Begin Queue
QUEUE_NAME = leaseq
PRIORITY = 20
HOSTS = all allremote
End Queue
Begin Queue
QUEUE_NAME = cluster1_p8
PRIORITY = 30
INTERACTIVE = NO
HOSTS = host1 host2 host3 host4 host5 # hosts on which jobs in this queue can run
DESCRIPTION = For submission of jobs to P8 machines
End Queue
Begin Queue
QUEUE_NAME = cluster2_p9
PRIORITY = 30
INTERACTIVE = NO
HOSTS = host6 host7 host8 host9 host10 # hosts on which jobs in this queue can run
DESCRIPTION = For submission of jobs to P9 machines
End Queue
Begin Queue
QUEUE_NAME = cluster3_x86
PRIORITY = 30
INTERACTIVE = NO
HOSTS = host11 host12 host13 host14 host15 # hosts on which jobs in this queue can run
DESCRIPTION = For submission of jobs to x86 machines
End Queue
-------------------------------------------------------------------------------
For the job forwarding model, the remote cluster needs the following configuration:
/nfs_shared_dir/LSF_HOME/conf/lsbatch/CI_cluster_ppc/configdir/lsb.queues
Begin Queue
QUEUE_NAME = receive_queue
RCVJOBS_FROM = send_queue@cluster1_p8
HOSTS = host11 host12 host13 host14 host15
PRIORITY = 55
NICE = 10
DESCRIPTION = Multicluster Queue
End Queue
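After editing these files, the changes still have to be validated and loaded by the LSF daemons on each cluster. The following is a minimal sketch using the standard lsadmin and badmin administration commands. Whether a reconfig is enough or a daemon restart is needed depends on what was changed, so treat the sequence as an assumption rather than a fixed procedure. The run_lsf wrapper is a hypothetical helper that only keeps the script harmless on machines without LSF.

```shell
#!/bin/sh
# run_lsf is a hypothetical helper: it runs an LSF command when the binary
# is on PATH and otherwise prints what it would have run.
run_lsf() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "skipped (not installed): $*"
  fi
}

run_lsf lsadmin ckconfig -v   # check the LIM configuration files such as lsf.shared
run_lsf badmin ckconfig -v    # check the batch configuration files such as lsb.queues
run_lsf lsadmin reconfig      # reload the LIM configuration (asks for confirmation)
run_lsf badmin mbdrestart     # restart mbatchd so queue and export changes are read
```

Run the commands as the LSF administrator on the management host of each cluster whose files were changed.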
-------------------------------------------------------------------------------------------------------
Use the bclusters command to check job forwarding and resource lease information.
Submit LSF job - forwarding mechanism
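A forwarded job could be submitted along these lines. This is a sketch that assumes it runs on cluster1_p8, where send_queue is defined; the sleep 60 workload is a placeholder and not part of the original setup.

```shell
#!/bin/sh
queue=send_queue   # send-jobs queue from lsb.queues on the submission cluster

if command -v bsub >/dev/null 2>&1; then
  bsub -q "$queue" sleep 60   # placeholder job; replace with the real workload
  bjobs                       # list your unfinished jobs and where they are running
else
  echo "bsub not found; would run: bsub -q $queue sleep 60"
fi
```

Because a send-jobs queue tries local hosts first and this queue has HOSTS = none, the job should be forwarded to receive_queue on cluster3_x86.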
Submit LSF job - Resource Leasing mechanism
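For the lease model, submission goes through the queue that is allowed to use borrowed hosts. The sketch below assumes it runs on a consumer cluster that defines leaseq with HOSTS = all allremote; the sleep 60 workload is again a placeholder.

```shell
#!/bin/sh
queue=leaseq   # queue configured with HOSTS = all allremote

if command -v bsub >/dev/null 2>&1; then
  bsub -q "$queue" sleep 60   # placeholder job; replace with the real workload
  bjobs                       # list your unfinished jobs and their execution hosts
  bhosts                      # list the hosts the cluster can schedule on
else
  echo "bsub not found; would run: bsub -q $queue sleep 60"
fi
```

Since the consumer cluster schedules borrowed hosts as if they were its own, no forwarding step is involved here.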
In this article, I wanted to illustrate how to get started with an LSF multi-cluster setup for applications that need more computational resources.
Reference:
- http://www.slac.stanford.edu/comp/unix/package/lsf/LSF8.1_doc/8.0/multicluster/index.htm?multicluster_benefits_mc_lsf.html~main
- http://www-01.ibm.com/support/docview.wss?uid=isg3T1016097
- https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_welcome/lsf_kc_mc.html
- https://tin6150.github.io/psg/3rdParty/lsf4_userGuide/13-multicluster.html