CapacityScheduler not being elastic

I am trying to set up YARN CapacityScheduler queues so that individual MapReduce jobs can't suck up all the resources on the cluster, while still keeping the cluster 100% utilised whenever there are jobs ready to run. However, this elastic use is not working for me on HDP 2.2.0 (on CentOS 6). Can you suggest things I should do to fix it, or to diagnose it better?

To test this I added a second queue alongside the default one, giving default a capacity of 80 and the new ingestion queue a capacity of 20.

This seems to work fine: with some big MR jobs (teragen from the terasort benchmark) running in both queues, the default queue uses four times as many maps as the ingestion queue.
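For reference, this is roughly how I kick off the test jobs (the examples jar path is the usual HDP location on my nodes, and the row count and output paths are just what I happened to use; adjust as needed):

# Submit one teragen to each queue; mapreduce.job.queuename picks the target queue.
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
    teragen -Dmapreduce.job.queuename=default 10000000000 /tmp/teragen-default
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar \
    teragen -Dmapreduce.job.queuename=ingestion 10000000000 /tmp/teragen-ingestion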

I can even change the config XML file (swapping the 20 and 80), tell YARN to refresh the queues (without restarting), and I see that the new ingestion queue starts using more maps and the default one fewer.
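In case it matters, the refresh step is just the standard rmadmin call after the new capacity settings are pushed out (no ResourceManager restart):

# Re-read the capacity-scheduler settings on the ResourceManager and reload the queue definitions.
yarn rmadmin -refreshQueues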

Result!

HOWEVER

If one of the jobs finishes and its queue goes empty, I would expect the other queue to stop limiting itself to its configured capacity and instead grow all the way up to its maximum-capacity (which is currently 100).
It does not do this. The cluster carries on with the smaller-capacity queue trundling along and most of the cluster idle.

Is there anything I need to do to enable elastic use of the resources? Here is my Ambari Capacity Scheduler config:

yarn.scheduler.capacity.default.minimum-user-limit-percent=100
yarn.scheduler.capacity.maximum-am-resource-percent=0.2
yarn.scheduler.capacity.maximum-applications=10000
yarn.scheduler.capacity.node-locality-delay=40
yarn.scheduler.capacity.resource-calculator=org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator
yarn.scheduler.capacity.root.accessible-node-labels=*
yarn.scheduler.capacity.root.accessible-node-labels.default.capacity=-1
yarn.scheduler.capacity.root.accessible-node-labels.default.maximum-capacity=-1
yarn.scheduler.capacity.root.acl_administer_queue=*
yarn.scheduler.capacity.root.capacity=100
yarn.scheduler.capacity.root.default-node-label-expression=
yarn.scheduler.capacity.root.default.acl_administer_jobs=*
yarn.scheduler.capacity.root.default.acl_submit_applications=*
yarn.scheduler.capacity.root.default.capacity=80
yarn.scheduler.capacity.root.default.maximum-capacity=100
yarn.scheduler.capacity.root.default.state=RUNNING
yarn.scheduler.capacity.root.default.user-limit-factor=1
yarn.scheduler.capacity.root.ingestion.acl_administer_jobs=*
yarn.scheduler.capacity.root.ingestion.acl_submit_applications=*
yarn.scheduler.capacity.root.ingestion.capacity=20
yarn.scheduler.capacity.root.ingestion.maximum-capacity=100
yarn.scheduler.capacity.root.ingestion.state=RUNNING
yarn.scheduler.capacity.root.ingestion.user-limit-factor=1
yarn.scheduler.capacity.root.queues=default,ingestion
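These are set through Ambari; the effective values end up in capacity-scheduler.xml on the ResourceManager host (assuming the usual HDP config path), which is what I check after a refresh, e.g.:

# Confirm the maximum-capacity values the RM is actually reading (path is the typical HDP 2.x location).
grep -A1 'maximum-capacity' /etc/hadoop/conf/capacity-scheduler.xml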

Now, interestingly, here is an example of what I am talking about…

$ yarn queue -status default ; yarn queue -status ingestion ; yarn queue -status root

15/04/28 17:35:39 INFO impl.TimelineClientImpl: Timeline service address: http://rmmachine:8188/ws/v1/timeline/
15/04/28 17:35:39 INFO client.RMProxy: Connecting to ResourceManager at rmmachine/10.34.37.2:8050
Queue Information :
Queue Name : default
State : RUNNING
Capacity : 80.0%
Current Capacity : 0.0%
Maximum Capacity : 100.0%
Default Node Label expression :
Accessible Node Labels : *
15/04/28 17:35:41 INFO impl.TimelineClientImpl: Timeline service address: http://rmmachine:8188/ws/v1/timeline/
15/04/28 17:35:42 INFO client.RMProxy: Connecting to ResourceManager at rmmachine/10.34.37.2:8050
Queue Information :
Queue Name : ingestion
State : RUNNING
Capacity : 20.0%
Current Capacity : 108.7%
Maximum Capacity : 100.0%
Default Node Label expression :
Accessible Node Labels : *
15/04/28 17:35:44 INFO impl.TimelineClientImpl: Timeline service address: http://rmmachine:8188/ws/v1/timeline/
15/04/28 17:35:45 INFO client.RMProxy: Connecting to ResourceManager at rmmachine/10.34.37.2:8050
Queue Information :
Queue Name : root
State : RUNNING
Capacity : 100.0%
Current Capacity : 21.7%
Maximum Capacity : 100.0%
Default Node Label expression :
Accessible Node Labels : *

So in this example I have nothing running in the default queue (which has 80% capacity set) while the ingestion queue (20% capacity) has work, yet either queue should be allowed to elastically use 100% of its parent queue (root).

This seems to tell me that the ingestion queue considers itself full (its Current Capacity of 108.7% is relative to its own 20% share, and 0.20 × 108.7% ≈ 21.7%, which matches root's Current Capacity), yet it is only using about a fifth of the full cluster.
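For what it's worth, the ResourceManager's scheduler REST endpoint tells the same story; it reports the absolute used capacity per queue rather than the usage relative to each queue's configured share (assuming the default RM web UI port 8088):

# Dump per-queue capacities (configured, used, absolute) from the RM scheduler API.
curl -s http://rmmachine:8088/ws/v1/cluster/scheduler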

