Energy Aware Scheduling Microconference Notes

Welcome to Linux Plumbers Conference 2014. Please use this etherpad to take notes. Microconf leaders will be giving a summary of their microconference during the Friday afternoon closing session. Please remember there is no video this year, so your notes are the only record of your microconference.

Speaker 1: Tools to analyse scheduling and energy efficiency - Amit Kucheria (Linaro)

1. Goal is to help evaluate the power efficiency of patches.
2. idlestat: has been used by Linaro
   - Workload: rt-app
   - Captures traces using ftrace
   - Information captured: C-state, P-state, mispredictions, max/min P-state
   - Captures wakeup sources
   - Available in a git tree; how it gets merged is yet to be decided
   - Non-interactive tool
3. Upcoming features
   - If an energy model for the architecture is available, a single metric showing power efficiency can be produced (see the sketch at the end of this section)
   - diff mode: difference in system behaviour between patch sets - being developed
   - idlestat currently reflects kernel decisions; this can be extended to reflect hardware states as well
   - Histograms to view the captured trace?
   - Topology information not included yet
4. We can have man pages for idlestat as well.
5. There is nothing architecture specific about the tool - just ftrace.
6. Daniel: What kind of scheduling statistics can we potentially track in addition to what we have today?
7. Morten: Can we track the latency of tasks to know the performance impact of the patches?
8. rt-app already tracks performance; you can link ftrace with rt-app.
9. Morten: What kind of topology information is required? Is sysfs information enough?
   Amit: CPU topology and power domain information.
10. Morten: What about the topology of coupled C-states?
    Amit: It should not be too hard to expose this information in sysfs.
    Daniel: Few platforms have coupled C-states - OMAP, Tegra 3.
11. Can we track the number of instructions executed so that we can evaluate P-state changes? We could then better track the CPU intensity of the tasks. Perf may have this information; we can perhaps use it.
12. Can we integrate this into perf? Not at this point, since we would need to recompile perf for each kernel.
13. What is rt-app?
    - Not many maintainers run mobile workloads.
    - Mobile workloads are analysed and recreated using rt-app.
    - There are lots of patches on top of rt-app that need to be cleaned up.
    - Essentially a library of mobile use cases.
14. idlestat does not depend on using rt-app; any workload can be tracked with idlestat.
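A minimal sketch of the single power-efficiency metric mentioned in point 3, assuming an energy model that provides per-C-state and per-P-state power numbers and residencies taken from a trace. The structure names and all values are hypothetical illustrations, not idlestat's actual interface:

/* Hypothetical illustration: combine idle/run residencies (as a trace
 * tool such as idlestat could report them) with per-state power numbers
 * from an energy model into a single energy estimate for a workload run.
 * All names and numbers are made up. */
#include <stdio.h>

struct state_residency {
	double power_mw;	/* average power drawn in this state */
	double time_s;		/* residency measured from the trace  */
};

static double estimate_energy_mj(const struct state_residency *states, int n)
{
	double energy = 0.0;

	for (int i = 0; i < n; i++)
		energy += states[i].power_mw * states[i].time_s;

	return energy;	/* mW * s == mJ */
}

int main(void)
{
	/* Example numbers only: two P-states and two C-states. */
	struct state_residency run[] = {
		{ .power_mw = 250.0, .time_s = 1.2 },	/* P-state 0 */
		{ .power_mw = 120.0, .time_s = 3.4 },	/* P-state 1 */
	};
	struct state_residency idle[] = {
		{ .power_mw = 30.0, .time_s = 2.0 },	/* C1 */
		{ .power_mw = 5.0,  .time_s = 4.0 },	/* C2 */
	};
	double total = estimate_energy_mj(run, 2) + estimate_energy_mj(idle, 2);

	printf("estimated energy: %.1f mJ\n", total);
	return 0;
}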
Speaker 2: The next event with IO latency tracking - Daniel Lezcano (Linaro Power Management Team)
(a repeat of Daniel's presentation at LinuxCon 2014)

1. Trace the next event more accurately, integrate this information into the scheduler, and let the scheduler decide the next C-state.
2. Describes what we have currently in the cpuidle subsystem.
3. The menu governor is the primary governor. The statistics captured are per-CPU. The more pending IOs there are, the shallower the idle state chosen. Looking at the wakeup sources, we have timers, IOs and rescheduling IPIs.
4. The menu governor does not distinguish between the different wakeup sources. The reschedule IPIs can be tracked better if cpuidle and the scheduler are integrated. If we move tasks between CPUs, we lose statistics that are used by the cpuidle subsystem.
5. Instead of focusing on all wakeup sources, why don't we track only the timer and IO wakeup sources? We could allow the scheduler to handle wakeups correctly.
6. We can track IO by looking at the blocked load, and it follows the CPU on which the task is scheduled.
7. PeterZ: Does IO involve both block and network? Daniel: Just block IO for now.
8. Group latencies into buckets. A bucket is an interval: a 200 bucket holds latencies between 0 and 199us.
9. The smaller the bucket size, the larger the number of buckets.
10. Rafael: Can we dynamically change the bucket size depending on the workload? Daniel: That's a good idea, we need to do that. For now we observe that 200us is the best bucket size.
11. We can use these predictions to build the next-IO model.
12. We begin measuring IO when the task blocks on IO; when the task wakes up, the time difference is passed to the IO latency tracker, which will predict the next IO the next time we block on IO.
13. Peter: Does it make sense to integrate information from the backing-store drivers about the behaviour of the IO? Daniel: For the proof of concept we are currently using this model; we will need to implement this suggestion.
14. Peter: We already have a lot of tracking in the BDI layer that we can use in the IO latency tracker.
15. IO latency infrastructure (a sketch follows this section):
    - Group latencies into buckets and keep an index for each bucket.
    - When we add a new latency we find a suitable bucket. We track the history of the bucket, see how many guesses we have got right, and take the next guess.
    - How do we guess the next blocking time? The buckets move around in the list depending on how frequently they occur.
    - Compute a score for each bucket; the bucket with the biggest score is chosen.
    - What happens when there are several tasks blocked on IO?
    - The next timer and the next IO are tracked to decide the next idle state. The duration of idle = min(io_dur, timer_dur).
16. There are tools to evaluate the infrastructure.
17. The runs to evaluate the IO latency tracker are conducted under constrained conditions, with all noise off.
18. Results show the right and wrong predictions. The predictions with the IO latency infrastructure are more accurate and stable than those of the menu governor.
19. The test program does random reads and writes with different block sizes.
20. Amit: The workload that can break this scheme is one that reads from a slow device and writes to a fast device. Daniel agrees.
21. When we integrate scheduling with cpuidle we can better track IO latency.
22. Next steps:
    - Improve the algorithm.
    - Investigate pointless IPIs with IO completions.

Peter:
    - The information from the BDI layer can help change this.
    - You want to track the IO per device and not per CPU. To which CPUs do the devices route their interrupts?
    - Multiqueue can pose a problem.
    - We can use the per-device interrupt routing to better schedule tasks on CPUs. The CPU on which the task slept need not be the one on which it will wake up.
    - Tying the affinity during the experiments is perhaps masking some loopholes in the algorithm.
    - It makes sense to build statistical models for device IO latency, but it does not make sense to associate this with a task and therefore with a CPU. This is related to the "IO wait nonsense" (IO wait time being associated with a CPU).
Kevin: The IO may not be going through the IO scheduler all the time.
Morten: Can we use the target residency while tracking IO latency?
Peter: If we don't have idle states, we don't need to worry about predicting idle times. But most CPUs have idle states.
Rik: How do we apply the per-device interrupt routing information to wake up tasks? How do we know what CPU the task will wake up on?
Peter: The blocked load may help.
Audience: The borders of the buckets should be given by the CPU/system idle states.
Daniel: We can perhaps set aside the bucket idea and use the device interrupt routing information.
Morten: We can use the idle state information for deciding the bucket size.
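A minimal sketch of the bucket idea from point 15, under the stated assumptions: fixed 200us buckets, a simple per-bucket hit count used as the score, and the idle duration taken as min(io_dur, timer_dur). All names are hypothetical, not the actual proof-of-concept code:

/* Hypothetical sketch of the bucket-based IO latency predictor:
 * measured latencies are grouped into fixed-size buckets, each bucket
 * keeps a score (how often it was hit), and the highest-scoring bucket
 * gives the predicted duration of the next IO block. */
#include <stdio.h>

#define BUCKET_US	200	/* bucket width observed to work best */
#define NR_BUCKETS	64

static unsigned int score[NR_BUCKETS];	/* hits per latency interval */

/* Record one measured IO latency (task blocked -> task woken, in us). */
static void io_latency_add(unsigned int latency_us)
{
	unsigned int idx = latency_us / BUCKET_US;

	if (idx >= NR_BUCKETS)
		idx = NR_BUCKETS - 1;
	score[idx]++;
}

/* Guess the next blocking time: middle of the best-scoring bucket. */
static unsigned int io_latency_guess_us(void)
{
	unsigned int best = 0;

	for (unsigned int i = 1; i < NR_BUCKETS; i++)
		if (score[i] > score[best])
			best = i;
	return best * BUCKET_US + BUCKET_US / 2;
}

/* Idle duration used to pick the C-state: min(next IO, next timer). */
static unsigned int next_idle_duration_us(unsigned int next_timer_us)
{
	unsigned int io_us = io_latency_guess_us();

	return io_us < next_timer_us ? io_us : next_timer_us;
}

int main(void)
{
	/* Example: a few measured block-IO latencies, then a prediction. */
	io_latency_add(150);
	io_latency_add(180);
	io_latency_add(420);

	printf("predicted idle: %u us\n", next_idle_duration_us(1000));
	return 0;
}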
Speaker 3: Integrating scheduler and CPU power management frameworks - Preeti U Murthy (IBM)
(speaker had no slides, simply wanted to start discussions)

1. If the CPU goes idle and a task wakes up, the frequency is not increased for 10ms(?).
2. Experiment: use the load from before going idle and reuse it to choose the right P-state when exiting idle (see the sketch at the end of this section).
3. Per-task load for idle wakeups.
4. PowerPC does not have sophisticated hardware that chooses P-states out of sight of the OS; on PowerPC it is all software controlled.
5. Peter: We want to pull all the cpufreq control into the scheduler. On "sane architectures" it is possible to control P-states directly from scheduler context - no async accesses over busses that require blocking.
6. Q: How to clean up the locking in the cpufreq drivers?
7. The IBM platform has "turbo" (non-power-efficient/overdrive) vs "non-turbo" (power-efficient) P-states - similar to ARM big.LITTLE.
8. Want to increase the P-state when a certain threshold is reached, but keep the P-state power-efficient as much as possible.
9. Preeti: How does the Linaro big.LITTLE code decide when to migrate tasks from big to little cores, or vice versa?
   Morten: Use per-entity load tracking (PELT) history data.
   Preeti: Does PELT give a good idea of a task's compute requirements ("large" or "small")?
   Morten: For mobile workloads there is bursty behaviour, and one wishes to run the bursty workloads on big cores.
   Morten: We also look at how loaded the CPU is - otherwise there is a risk of overloading the big cores and leaving the little cores unused.
   MikeT: Instead of using the ondemand governor, consider the Android interactive governor.
   Preeti: "We are looking for ideas."
   PeterZ: There are a lot of variants of the interactive governor.
   Amit: And a lot of knobs.
   MikeT: Would like to get away from cpufreq governors.
   MikeT: Perhaps add a hook into load_balance()?

Discussion:
* Mike: How do you get the PELT data for the cpufreq governor modifications?
* Preeti: We export them through sched.h, but it is not too elegant.
* Rik: What makes a task "large" or "small"?
* Rik: What if tasks are exchanging messages with each other and going to sleep?
* Preeti: We use a metric that treats long-running tasks as big tasks(?)
* PeterZ: But one needs to monitor utilization as well.
* Preeti: Just relying on the per-entity task load is not helping us - when we wake up, the load has decayed so significantly.
* Morten: Need to account for the sleeping time when the task next wakes up on a CPU.
* MikeT: Add some ability to experiment with different algorithms.
* MikeT: Try to tune the rate of decay? PeterZ: Difficult. Audience: Pre-computed tables make this difficult.
* Morten: QCOM does some magic tricks for big.LITTLE - no per-entity load tracking; it uses something window-based, wants to know how long the task ran the last time it ran, and assumes it will run the same amount the next time it wakes up.
* Morten: Might be able to revert a few patches to convert back from lookup tables to multiplication and division.
* PeterZ: Complaining about the cycle cost of division instructions in the fast path.
* MikeT: Are your modifications public?
* Preeti: We can do that.
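A minimal sketch of the experiment in point 2, assuming a per-CPU snapshot of the utilization seen just before entering idle is reused at idle exit to seed the frequency choice, instead of waiting for the governor's next sampling period. The helper names, utilization scale and frequency table are hypothetical, not the actual patches:

/* Hypothetical sketch: remember the CPU's utilization just before it
 * goes idle, and reuse it on idle exit to pick a P-state immediately.
 * All names and numbers are made up. */
#include <stdio.h>

#define NR_CPUS		4
#define UTIL_MAX	1024	/* utilization scale, 1024 == fully busy */

static unsigned int pre_idle_util[NR_CPUS];	/* snapshot taken at idle entry */
static const unsigned int freq_table_khz[] = { 800000, 1200000, 1800000 };

/* Called when a CPU enters idle: snapshot its current utilization. */
static void cpu_enter_idle(int cpu, unsigned int util)
{
	pre_idle_util[cpu] = util;
}

/* Called on idle exit: pick the lowest frequency that still covers the
 * utilization seen before idle, with ~25% headroom. */
static unsigned int cpu_exit_idle_pick_freq(int cpu)
{
	unsigned int util = pre_idle_util[cpu];
	unsigned int fmax = freq_table_khz[2];
	unsigned long long needed = (unsigned long long)util * fmax * 5 / (4 * UTIL_MAX);

	for (int i = 0; i < 3; i++)
		if (freq_table_khz[i] >= needed)
			return freq_table_khz[i];
	return fmax;
}

int main(void)
{
	cpu_enter_idle(0, 700);	/* CPU0 was ~70% busy before idling */
	printf("CPU0 wakes up at %u kHz\n", cpu_exit_idle_pick_freq(0));
	return 0;
}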
Speaker 4: CPUFreq and scheduler integration - Mike Turquette (Linaro Power Management Team)

* Going to describe code that isn't yet public but will be posted to the list "later today" as an RFC series.

CPUFreq as a function of CPU utilization from the scheduler.
Locking problems make scaling difficult from the scheduler.
Would also like to combine this with control of task placement on CPUs.
CONFIG_FAIR_GROUP_SCHED & cgroups make life extremely difficult.
PeterZ: can't remove it
PeterZ: would love to remove cgroups
MikeT: due to cgroups, some tasks can represent large numbers of other tasks
MikeT: doing all his testing on a Chromebook 2 (Exynos 5800, 2 clusters of 4 CPUs each)
Locking and serialization problems.
PeterZ: can't we just ignore the effects on other CPUs and just compute a new P-state for the CPU we placed the task on?
MikeT: is that too naïve?
Consensus was to do this and deal with the consequences for CONFIG_FAIR_GROUP_SCHED later.
PeterZ: let the people who use cgroups propose a solution for the cgroup cases
Kevin: people using big.LITTLE also like to use cpusets
MikeT: Android is using cgroups a lot
PeterZ: but what about scheduler cgroups?
Morten: they may not be using task group cgroups
PeterZ: they all interact in fairly unpredictable ways
Another issue is the pluggable policy backends that CPUFreq has - how do we do this in the scheduler? One may want to try different policies dynamically at runtime?
PeterZ: we try very hard to do one policy thing
PeterZ: LinusT believes that pluggable IO schedulers were a mistake
MikeT: use weak functions?
Morten: could use platform power management drivers to take different actions on a per-architecture basis when the scheduler requests a frequency (Mike's capacity_ops patches)
MikeT: seems like the trend is that for every question I have, the response is "do it the simple way"
When a task is enqueued or dequeued, select a frequency (a sketch follows this section). What I'd really like to see is combining this with load balancing.
MikeT: I'm using the per-CPU utilization, but should I be using the per-entity load tracking data? Using CPU utilization alone seems a bit naïve.
Audience Q: Why not get information from userspace? We could tell you in advance what the task will do.
PeterZ: need to be very careful with that - cannot ever trust userspace
Audience 2 Q: Old apps might run on new hardware, and the app's directives may no longer apply.
KevinH: voltage scaling timescales can be quite a bit longer than scheduler decisions - how to account for this?
MikeT: pay attention to the platform power drivers to determine when a frequency/voltage change has completed
Another historical-data problem: is per-entity load tracking good for filtering frequency changes?
MikeT: not saying that windows or buckets are needed
Preeti: there was another issue that we discovered
    - on Power, the core P-state decides the socket P-state
    - since we are running a timer, when we are going idle, we carry forward the previous evaluation
    - tried different approaches: just before going idle, tried going to a lower P-state
    - hard to do P-state changes in non-sleepable context
Preeti: Power just did it in the hardware
r/w semaphores seem to be the problem
MikeT: no problem removing the r/w semaphores
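A minimal sketch of the enqueue/dequeue idea above, assuming a per-CPU utilization signal on a 0..1024 scale and a stand-in for the platform power driver call, and deliberately ignoring effects on other CPUs ("the simple way"). The hook and helper names are hypothetical, not the actual RFC series:

/* Hypothetical sketch: on every enqueue/dequeue, recompute the CPU's
 * utilization and request the frequency that leaves some headroom.
 * All names are made up. */
#include <stdio.h>

#define UTIL_MAX	1024	/* fully busy */
#define HEADROOM_PCT	80	/* target at most 80% busy */

struct cpu_rq {
	int cpu;
	unsigned int util;		/* current utilization, 0..1024 */
	unsigned int cur_freq_khz;
	unsigned int max_freq_khz;
};

/* Stand-in for the platform power driver that actually programs the
 * P-state; here it just records the request. */
static void platform_set_freq(struct cpu_rq *rq, unsigned int khz)
{
	rq->cur_freq_khz = khz;
	printf("cpu%d -> %u kHz\n", rq->cpu, khz);
}

/* Hook called from enqueue/dequeue: scale frequency with utilization. */
static void update_cpu_freq(struct cpu_rq *rq)
{
	unsigned long long f;

	/* frequency such that util fits under HEADROOM_PCT of capacity */
	f = (unsigned long long)rq->util * rq->max_freq_khz * 100;
	f /= (unsigned long long)UTIL_MAX * HEADROOM_PCT;

	if (f > rq->max_freq_khz)
		f = rq->max_freq_khz;
	platform_set_freq(rq, (unsigned int)f);
}

int main(void)
{
	struct cpu_rq rq = { .cpu = 0, .util = 512, .max_freq_khz = 2000000 };

	update_cpu_freq(&rq);	/* task enqueued: ~50% busy */
	rq.util = 900;
	update_cpu_freq(&rq);	/* more load: ask for a higher frequency */
	return 0;
}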
Speaker 5: Energy model guided scheduler decisions - Morten Rasmussen (ARM)

We're all talking about ways to optimize scheduling and its interaction with other PM frameworks to try to save energy. But depending on the platform, the tricks needed to save power are fairly different (task packing vs spreading). Right now the scheduler and PM frameworks have no clue about the platform. Is it possible to integrate energy data to make these decisions better?

Currently load tracking in the scheduler does not give us utilization - it's just weighted load.
PeterZ: we can add it
Morten: do we agree on what utilization is? It is needed for what Mike and Preeti are working on.
When we migrate a task, the load of the task is updated instantaneously.
Preeti: would this help us in idle wakeups?
Morten: I think it would. Compare the load and the blocked_load.
PeterZ: if it's on the rq and it's runnable at the particular moment...
Morten: when tasks are idle, should we pull tasks from other CPUs to fill in the small gaps? Some people might care about latency, others might not.
There are patches on the list from Vincent Guittot to clean some of this up.
Try to make the load tracking frequency-aware. The way we measure load today (busy time vs. wall clock time) is all based on wall clock time. But if we double the frequency, the busy time is reduced by a factor of two. So if one CPU is running twice as fast as another, one has no clue how much load one is moving.
Problem on the lists today: we have 4 or 5 different patch sets and none are aligned. All of us working on them agree that they should go in. What can we do to make PeterZ's life easier?
PeterZ: I did merge some of it
PeterZ: it would be nice if we could agree on at least _some_ of each others' patch sets... Can those patches be moved to the beginning of the patch sets?
Rik: when two cores are siblings, it might not be necessary to take the blocked_load into account
PeterZ: try not to make things complicated at first
Morten: it's only used for task groups - and not even for the load, just for the distribution of the task groups
Amit: are we agreed that we should get the stuff from Vincent that we agree on into a first, single patch set?
PeterZ: yes
Morten: the running geometric series was there, but nothing was using it
PeterZ: let's just try to make something that's not entirely crap
PeterZ: one thing that is not done is to break out the blocked vs. non-blocked; pjt kept them separate to avoid some races
Morten: lots of variable naming problems, running_blocked_avg, etc.
Amit: Vincent's patches don't regress anything, right?
Rik: we're probably fine with blocked_load
Amit: Vincent found some regressions with runnable_blocked_load added in
Rik: I suspect it leaves some cores more idle
Morten: we will need to go through the load_balance() code to fix that up, then add the running_blocked_load
Morten: for running tasks, we don't want to scale by priority
We really want a utilization metric that is frequency-independent for packing. As soon as all of the CPUs are busy, we can't do much.
What about idle time injection, for thermal or power reasons? Morten doesn't like it.
PeterZ: the Intel idle-time injector tries to synchronize core sleep times to hit package C-states, for power capping
The first set of patches is just infrastructure - the interesting stuff comes later:
1. patches that introduce the running signal
2. need to introduce frequency invariance
The next set would introduce:
3. microarchitecture invariance
4. running blocked load
#2 and #3 are done by adding scaling factors to the tracked load via per-arch hooks (see the sketch after this section) - need to make sure we don't break things in the load_balance code. Did Vincent break any of this stuff?
Morten: the easiest way to avoid the overflow problem is to scale by a factor between 0 and 1; there are problems with scaling factors that are > 1
PeterZ: keep the basic idea of pooling tasks
If there's a task running everywhere, we need to switch to utilization to avoid queuing. Once we're saturated, we need to switch to load balancing.
Morten: try to preserve the current behavior
PeterZ: but you did change the behavior
PeterZ: the SMP crap is based on nr_running, but Vincent wants to switch to utilization
Spreading vs. packing:
PeterZ: have heard rumors about ARM NUMA systems that don't share last-level cache
Morten: big.LITTLE systems don't share LLC
Need to change the balancing metric?
Morten: we should still go up and down the hierarchies, do polling
avg_load makes less sense with different frequencies and microarchitectures.
Vincent's code tries not to break anything, but we will eventually need to go through the patches again.
Q: Can the frequency-invariant metric be exported to userspace?
PeterZ: cpu_load is a random number, so please no, let's not export it to userspace, because then we can't change it. Same problem with IO wait statistics: it's a random number generator - let's just write 0 to it.
How to fix userspace tools to take frequency into account?
Rik: this is the wrong time to export things to userspace - too many things are changing
Q: what's the state of the per-arch energy model that we've discussed in the past? Would like to account for:
- power consumption per P-state
- power consumption per C-state
- topology that shares P- and C-states
Morten: would like to add this into the sched_domain hierarchy, perhaps via ACPI or DT
Microarchitecture invariance will need to account for non-ARM-manufactured cores in the ARM architecture, e.g., Apple, NVIDIA, X-Gene.
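A minimal sketch of the scaling-factor idea for #2 and #3 above, assuming per-arch hooks that report the current-vs-max frequency ratio and a per-core-type capacity, both on a 0..1024 scale so the factors stay <= 1 (avoiding the overflow problem mentioned). The hook names and values are hypothetical, not Vincent's or Morten's actual patches:

/* Hypothetical sketch of frequency and microarchitecture invariance:
 * the raw tracked busy time is scaled by (curr_freq / max_freq) and by
 * a per-core-type capacity, both expressed against SCHED_SCALE. */
#include <stdio.h>

#define SCHED_SCALE	1024

/* Per-arch hook: current frequency relative to max, 0..1024. */
static unsigned int arch_scale_freq(int cpu)
{
	return cpu == 0 ? 1024 : 512;	/* e.g. one CPU at fmax, one at fmax/2 */
}

/* Per-arch hook: relative compute capacity of this core type, 0..1024. */
static unsigned int arch_scale_cpu_capacity(int cpu)
{
	return cpu == 0 ? 1024 : 430;	/* e.g. big vs little microarchitecture */
}

/* Scale a raw busy-time contribution so it is comparable across CPUs
 * running at different frequencies with different microarchitectures. */
static unsigned long scale_contrib(unsigned long raw_busy, int cpu)
{
	unsigned long c = raw_busy;

	c = c * arch_scale_freq(cpu) / SCHED_SCALE;		/* #2: freq invariance  */
	c = c * arch_scale_cpu_capacity(cpu) / SCHED_SCALE;	/* #3: uarch invariance */
	return c;
}

int main(void)
{
	/* The same 1000us of busy time means very different amounts of
	 * work on a big core at fmax vs a little core at half speed. */
	printf("cpu0 contrib: %lu\n", scale_contrib(1000, 0));
	printf("cpu1 contrib: %lu\n", scale_contrib(1000, 1));
	return 0;
}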
Mark Brown: the problem with putting cpu_power in the DT data is that it may not be directly derivable from the kernel
Morten: ideally this would be related to the number of instructions per cycle for that core type
Runtime-loadable microcode, as found on Intel CPUs and possibly some ARM CPUs, may affect CPU performance - so it may not make sense to hardcode cpu_power values.
MarkB: re cpu_power in DT, would like ARM64 to look like ARM32
Morten: we should probably have removed it; it might change soon
MarkB: we should probably just drop the cpu_power scaling
PaulW: it would be ideal to compute those at runtime
PeterZ: before 2.6.26, we used to measure cache repopulation time at boot; the problem with this is that it caused benchmark variations
PaulW: same problem with bogomips
MikeT: can always pass this in via sysfs at runtime (pwsan)