The IBM Spectrum LSF Suites portfolio redefines cluster virtualization and workload management by providing a tightly integrated solution for demanding, mission-critical HPC environments that can increase both user productivity and hardware utilization while decreasing system management costs. The heterogeneous, highly scalable, and highly available architecture supports traditional high-performance computing and high-throughput workloads, as well as big data, cognitive, GPU machine-learning, and containerized workloads. Clients worldwide use technical computing environments supported by LSF to run hundreds of genomic workloads, including Burrows-Wheeler Aligner (BWA), SAMtools, Picard, GATK, Isaac, CASAVA, and other frequently used pipelines for genomic analysis.
IBM Spectrum LSF provides support for heterogeneous computing environments, including NVIDIA GPUs. With the ability to detect, monitor, and schedule GPU-enabled workloads to the appropriate resources, IBM Spectrum LSF enables users to easily take advantage of the benefits provided by GPUs. Key GPU capabilities include:
- Enforcement of GPU allocations via cgroups
- Exclusive allocation and round-robin shared mode allocation (illustrated in the example after these lists)
- CPU-GPU affinity
- Boost control
- Power management
- Multi-Process Server (MPS) support
- NVIDIA Pascal and DCGM support
When allocating GPUs to a job, LSF takes the following factors into account:
- The largest GPU compute capability (gpu_factor value).
- GPUs with direct NVLink connections.
- GPUs with the same model, including the GPU total memory size.
- The largest available GPU memory.
- The number of concurrent jobs on the same GPU.
- The current GPU mode.
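For instance, the shared and exclusive allocation modes listed above map directly to the mode and j_exclusive options of the GPU requirement string. A minimal sketch is shown below; ./gpu_app is a placeholder, and the -gpu syntax requires the configuration described in the next section:
# Request 2 GPUs in exclusive-process mode (one job per GPU)
bsub -gpu "num=2:mode=exclusive_process" ./gpu_app
# Request 1 GPU in shared mode, allowing other jobs to share the same GPU
bsub -gpu "num=1:mode=shared:j_exclusive=no" ./gpu_app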
Configurations:
1) GPU auto-configuration: GPU detection for LSF is now available with automatic configuration. To enable it, set LSF_GPU_AUTOCONFIG=Y in the lsf.conf file. LSF_GPU_AUTOCONFIG controls whether LSF enables the use of GPU resources automatically. If set to Y, LSF automatically detects GPUs and configures the built-in GPU resources. If set to N, manual configuration of GPU resources is required to use GPU features in LSF. Regardless of whether LSF_GPU_AUTOCONFIG is set to Y or N, LSF always collects GPU metrics from hosts.
When enabled, the lsload -gpu, lsload -gpuload, and lshosts -gpu commands will show host-based or GPU-based resource metrics for monitoring.
2) The LSB_GPU_NEW_SYNTAX=extend parameter must be defined in the lsf.conf file to enable the -gpu option and GPU_REQ parameter syntax.
3) Other configurations (see the queue example after this list):
- To configure GPU resource requirements for an application profile, specify the GPU_REQ parameter in the lsb.applications file, for example GPU_REQ="gpu_req".
- To configure GPU resource requirements for a queue, specify the GPU_REQ parameter in the lsb.queues file, for example GPU_REQ="gpu_req".
- To configure default GPU resource requirements for the cluster, specify the LSB_GPU_REQ parameter in the lsf.conf file, for example LSB_GPU_REQ="gpu_req".
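For illustration, a queue-level default can be expressed as a stanza in lsb.queues; the queue name and requirement string below are assumptions for this sketch, not taken from the cluster used later in this post:
Begin Queue
QUEUE_NAME   = gpu_q
DESCRIPTION  = Queue with a default GPU requirement
GPU_REQ      = "num=2:mode=shared:j_exclusive=yes"
End Queue
Jobs submitted to this queue pick up the GPU requirement automatically; a job-level bsub -gpu string overrides the queue-level and cluster-level settings.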
---------------------------------------------------------------------------------------------
Configuration change required on the clusters: LSF_HOME/conf/lsf.conf
#To enable "-gpu"
LSF_GPU_AUTOCONFIG=Y
LSB_GPU_NEW_SYNTAX=extend
LSB_GPU_REQ="num=4:mode=shared:j_exclusive=yes"
--------------------------------------------------------------------------------------------
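After editing lsf.conf, the daemons must re-read the configuration before the GPU resources become visible. A typical sequence, run as the LSF administrator, is sketched below; restart procedures can differ by site:
lsadmin reconfig     # reconfigure LIM so the built-in GPU resources are detected
badmin mbdrestart    # restart mbatchd so scheduling uses the new parameters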
Specify additional GPU resource requirements
LSF now allows you to specify additional GPU resource requirements so that you can further refine the GPU resources allocated to your jobs. The existing bsub -gpu command option, the LSB_GPU_REQ parameter in the lsf.conf file, and the GPU_REQ parameter in the lsb.queues and lsb.applications files now accept additional GPU options to make the following requests (see the example after this list):
- The gmodel option requests GPUs with a specific brand name, model number, or total GPU memory.
- The gtile option specifies the number of GPUs to use per socket.
- The gmem option reserves the specified amount of memory on each GPU that the job requires.
- The nvlink option requests GPUs with NVLink connections.
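A hedged example combining these options on a submission is shown below; the model name, memory size, and application are illustrative assumptions and should be matched against what lshosts -gpu reports on your hosts:
bsub -q ibm_q -gpu "num=2:gmodel=TeslaV100_SXM2_16GB:gmem=8G:gtile=1:nvlink=yes" ./gpu_app
The same requirement string can also be placed in LSB_GPU_REQ or GPU_REQ; when several levels define it, the job-level string takes precedence over the application, queue, and cluster settings.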
Monitor GPU resources with the lsload, lshosts, and bhosts commands
Options on these commands show host-based and GPU-based GPU information for a cluster. The lsload -l command does not show GPU metrics; GPU metrics can be viewed with the lsload -gpu, lsload -gpuload, lshosts -gpu, and bhosts -gpu commands.
lsload -gpu
[root@powerNode2 ~]# lsload -gpu
HOST_NAME status ngpus gpu_shared_avg_mut gpu_shared_avg_ut ngpus_physical
powerNode1 ok 4 0% 0% 4
powerNode2 ok 4 0% 0% 4
powerNode3 ok 4 0% 0% 4
powerNode4 ok 4 0% 0% 4
powerNode5 ok 4 0% 0% 4
[root@powerNode2 ~]#
lsload -gpuload
[root@powerNode2 ~]# lsload -gpuload
HOST_NAME gpuid gpu_model gpu_mode gpu_temp gpu_ecc gpu_ut gpu_mut gpu_mtotal gpu_mused gpu_pstate gpu_status gpu_error
powerNode1 0 TeslaV100_S 0.0 33C 0.0 0% 0% 15.7G 0M 0 ok -
1 TeslaV100_S 0.0 36C 0.0 0% 0% 15.7G 0M 0 ok -
2 TeslaV100_S 0.0 33C 0.0 0% 0% 15.7G 0M 0 ok -
3 TeslaV100_S 0.0 36C 0.0 0% 0% 15.7G 0M 0 ok -
powerNode2 0 TeslaP100_S 0.0 37C 0.0 0% 0% 15.8G 0M 0 ok -
1 TeslaP100_S 0.0 32C 0.0 0% 0% 15.8G 0M 0 ok -
2 TeslaP100_S 0.0 36C 0.0 0% 0% 15.8G 0M 0 ok -
3 TeslaP100_S 0.0 31C 0.0 0% 0% 15.8G 0M 0 ok -
powerNode3 0 TeslaP100_S 0.0 33C 0.0 0% 0% 15.8G 0M 0 ok -
1 TeslaP100_S 0.0 32C 0.0 0% 0% 15.8G 0M 0 ok -
2 TeslaP100_S 0.0 35C 0.0 0% 0% 15.8G 0M 0 ok -
3 TeslaP100_S 0.0 37C 0.0 0% 0% 15.8G 0M 0 ok -
powerNode4 0 TeslaV100_S 0.0 35C 0.0 0% 0% 15.7G 0M 0 ok -
1 TeslaV100_S 0.0 35C 0.0 0% 0% 15.7G 0M 0 ok -
2 TeslaV100_S 0.0 32C 0.0 0% 0% 15.7G 0M 0 ok -
3 TeslaV100_S 0.0 36C 0.0 0% 0% 15.7G 0M 0 ok -
powerNode5 0 TeslaP100_S 0.0 31C 0.0 0% 0% 15.8G 0M 0 ok -
1 TeslaP100_S 0.0 32C 0.0 0% 0% 15.8G 0M 0 ok -
2 TeslaP100_S 0.0 34C 0.0 0% 0% 15.8G 0M 0 ok -
3 TeslaP100_S 0.0 36C 0.0 0% 0% 15.8G 0M 0 ok -
[root@powerNode2 ~]#
bhosts -gpu
The -gpu option for bhosts shows per-GPU memory use and reservation along with the number of jobs allocated to each GPU.
[root@powerNode2 ~]# bhosts -gpu
HOST_NAME ID MODEL MUSED MRSV NJOBS RUN SUSP RSV
powerNode1 0 TeslaP100_SXM2_ 0M 0M 0 0 0 0
1 TeslaP100_SXM2_ 0M 0M 0 0 0 0
2 TeslaP100_SXM2_ 0M 0M 0 0 0 0
3 TeslaP100_SXM2_ 0M 0M 0 0 0 0
powerNode2 0 TeslaP100_SXM2_ 0M 0M 0 0 0 0
1 TeslaP100_SXM2_ 0M 0M 0 0 0 0
2 TeslaP100_SXM2_ 0M 0M 0 0 0 0
3 TeslaP100_SXM2_ 0M 0M 0 0 0 0
powerNode3 0 TeslaP100_SXM2_ 0M 0M 0 0 0 0
1 TeslaP100_SXM2_ 0M 0M 0 0 0 0
2 TeslaP100_SXM2_ 0M 0M 0 0 0 0
3 TeslaP100_SXM2_ 0M 0M 0 0 0 0
powerNode4 0 TeslaV100_SXM2_ 0M 0M 0 0 0 0
1 TeslaV100_SXM2_ 0M 0M 0 0 0 0
2 TeslaV100_SXM2_ 0M 0M 0 0 0 0
3 TeslaV100_SXM2_ 0M 0M 0 0 0 0
powerNode5 0 TeslaV100_SXM2_ 0M 0M 0 0 0 0
1 TeslaV100_SXM2_ 0M 0M 0 0 0 0
2 TeslaV100_SXM2_ 0M 0M 0 0 0 0
3 TeslaV100_SXM2_ 0M 0M 0 0 0 0
[root@powerNode2 ~]#
lshosts -gpu
The -gpu option for lshosts shows the GPU topology information for a cluster.
[root@powerNode2 ~]# lshosts -gpu
HOST_NAME gpu_id gpu_model gpu_driver gpu_factor numa_id
powerNode1 0 TeslaP100_SXM2_ 418.67 6.0 0
1 TeslaP100_SXM2_ 418.67 6.0 0
2 TeslaP100_SXM2_ 418.67 6.0 1
3 TeslaP100_SXM2_ 418.67 6.0 1
powerNode2 0 TeslaP100_SXM2_ 418.67 6.0 0
1 TeslaP100_SXM2_ 418.67 6.0 0
2 TeslaP100_SXM2_ 418.67 6.0 1
3 TeslaP100_SXM2_ 418.67 6.0 1
powerNode3 0 TeslaP100_SXM2_ 418.67 6.0 0
1 TeslaP100_SXM2_ 418.67 6.0 0
2 TeslaP100_SXM2_ 418.67 6.0 1
3 TeslaP100_SXM2_ 418.67 6.0 1
powerNode4 0 TeslaV100_SXM2_ 418.67 7.0 0
1 TeslaV100_SXM2_ 418.67 7.0 0
2 TeslaV100_SXM2_ 418.67 7.0 8
3 TeslaV100_SXM2_ 418.67 7.0 8
powerNode5 0 TeslaV100_SXM2_ 418.67 7.0 0
1 TeslaV100_SXM2_ 418.67 7.0 0
2 TeslaV100_SXM2_ 418.67 7.0 8
3 TeslaV100_SXM2_ 418.67 7.0 8
[root@powerNode2 ~]#
Job Submission:
1) Submit a normal job
[sachinpb@powerNode2 ~]$ bsub -q ibm_q -R "select[type==ppc]" sleep 200
Job <24807> is submitted to queue <ibm_q>.
[sachinpb@powerNode2 ~]$
2) Submit a job with GPU requirements:
[sachinpb@powerNode2 ~]$ bsub -q ibm_q -gpu "num=1" -R "select[type==ppc]" sleep 200
Job <24808> is submitted to queue <ibm_q>.
[sachinpb@powerNode2 ~]$
3) List jobs
[sachinpb@powerNode2 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
24807 sachinpb RUN ibm_q powerNode2 powerNode6 sleep 200 Aug 1 05:34
24808 sachinpb RUN ibm_q powerNode2 powerNode2 sleep 200 Aug 1 05:34
[sachinpb@powerNode2 ~]$
We can see that job <24807> was submitted without the "-gpu" option, so it was dispatched to a non-GPU node (powerNode6). Job <24808>, which requested a GPU, ran on powerNode2, which has 4 GPUs as listed in the lshosts -gpu output above.
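To check the GPU allocation of the running GPU job, the detailed job view can be used; a sketch is shown below, and the exact fields displayed depend on the LSF version and fix pack:
[sachinpb@powerNode2 ~]$ bjobs -l 24808
The long output includes the submitted GPU requirement (num=1 here) and the execution host, which makes it easy to confirm that the job landed on a GPU-capable node.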
4) Submit a job with GPU requirements to another cluster (x86-cluster2), where the clusters are configured in job forwarding mode:
[sachinpb@powerNode2 ~]$ lsclusters
CLUSTER_NAME STATUS MASTER_HOST ADMIN HOSTS SERVERS
power_cluster1 ok powerNode2 lsfadmin 5 5
x86-64_cluster2 ok x86-masterNode lsfadmin 8 8
[sachinpb@powerNode2 ~]$
[sachinpb@powerNode2 ~]$ bsub -q x86_q -gpu "num=1" -R "select[type==X86_64]" sleep 200
Job <46447> is submitted to queue <x86_ibmgpu_q>.
[sachinpb@powerNode2 ~]$
[sachinpb@powerNode2 ~]$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
46447 sachinpb RUN x86_q powerNode2 x86_intelbox@x86-cluster2 sleep 200 Feb 9 00:55
I hope this blog helped you understand how to enable GPU support in IBM Spectrum LSF and how to submit GPU jobs.
NOTE: GPU-enabled workloads are supported from IBM Spectrum LSF Version 10.1 Fix Pack 6 onwards. LSF hosts running RHEL version 7 or higher are required to support LSF_GPU_AUTOCONFIG.
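To confirm the version of a running cluster before enabling these parameters, the lsid command can be used; the first line of its output reports the LSF edition and version (for example, 10.1.0.x):
[sachinpb@powerNode2 ~]$ lsid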
References:
https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/lsf_gpu/chap_submit_monitor_gpu_jobs.html