Diffstat (limited to 'Documentation')
-rw-r--r-- Documentation/ABI/testing/sysfs-devices-cache_disable 18
-rw-r--r-- Documentation/ABI/testing/sysfs-devices-system-cpu 156
-rw-r--r-- Documentation/cputopology.txt 47
-rw-r--r-- Documentation/hwmon/sysfs-interface 57
-rw-r--r-- Documentation/vm/hwpoison.txt 136
5 files changed, 378 insertions, 36 deletions
diff --git a/Documentation/ABI/testing/sysfs-devices-cache_disable b/Documentation/ABI/testing/sysfs-devices-cache_disable
deleted file mode 100644
index 175bb4f7051..00000000000
--- a/Documentation/ABI/testing/sysfs-devices-cache_disable
+++ /dev/null
@@ -1,18 +0,0 @@
-What: /sys/devices/system/cpu/cpu*/cache/index*/cache_disable_X
-Date: August 2008
-KernelVersion: 2.6.27
-Contact: mark.langsdorf@amd.com
-Description: These files exist in every cpu's cache index directories.
- There are currently 2 cache_disable_# files in each
- directory. Reading from these files on a supported
- processor will return that cache disable index value
- for that processor and node. Writing to one of these
- files will cause the specificed cache index to be disabled.
-
- Currently, only AMD Family 10h Processors support cache index
- disable, and only for their L3 caches. See the BIOS and
- Kernel Developer's Guide at
- http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/31116-Public-GH-BKDG_3.20_2-4-09.pdf
- for formatting information and other details on the
- cache index disable.
-Users: joachim.deguara@amd.com
diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu b/Documentation/ABI/testing/sysfs-devices-system-cpu
new file mode 100644
index 00000000000..a703b9e9aeb
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
@@ -0,0 +1,156 @@
+What: /sys/devices/system/cpu/
+Date: pre-git history
+Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description:
+ A collection of both global and individual CPU attributes
+
+ Individual CPU attributes are contained in subdirectories
+ named by the kernel's logical CPU number, e.g.:
+
+ /sys/devices/system/cpu/cpu#/
+
+What: /sys/devices/system/cpu/sched_mc_power_savings
+ /sys/devices/system/cpu/sched_smt_power_savings
+Date: June 2006
+Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description: Discover and adjust the kernel's multi-core scheduler support.
+
+ Possible values are:
+
+ 0 - No power saving load balance (default value)
+ 1 - Fill one thread/core/package first for long running threads
+ 2 - Also bias task wakeups to semi-idle cpu package for power
+ savings
+
+ sched_mc_power_savings is dependent upon SCHED_MC, which is
+ itself architecture dependent.
+
+ sched_smt_power_savings is dependent upon SCHED_SMT, which
+ is itself architecture dependent.
+
+ The two files are independent of each other. It is possible
+ that one file may be present without the other.
+
+ Introduced by git commit 5c45bf27.
+
+
+What: /sys/devices/system/cpu/kernel_max
+ /sys/devices/system/cpu/offline
+ /sys/devices/system/cpu/online
+ /sys/devices/system/cpu/possible
+ /sys/devices/system/cpu/present
+Date: December 2008
+Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description: CPU topology files that describe kernel limits related to
+ hotplug. Briefly:
+
+ kernel_max: the maximum cpu index allowed by the kernel
+ configuration.
+
+ offline: cpus that are not online because they have been
+ HOTPLUGGED off or exceed the limit of cpus allowed by the
+ kernel configuration (kernel_max above).
+
+ online: cpus that are online and being scheduled.
+
+ possible: cpus that have been allocated resources and can be
+ brought online if they are present.
+
+ present: cpus that have been identified as being present in
+ the system.
+
+ See Documentation/cputopology.txt for more information.
+
+
+
+What: /sys/devices/system/cpu/cpu#/node
+Date: October 2009
+Contact: Linux memory management mailing list <linux-mm@kvack.org>
+Description: Discover the NUMA node a CPU belongs to
+
+ When CONFIG_NUMA is enabled, a symbolic link that points
+ to the corresponding NUMA node directory.
+
+ For example, the following symlink is created for cpu42
+ in NUMA node 2:
+
+ /sys/devices/system/cpu/cpu42/node2 -> ../../node/node2
+
+
+What: /sys/devices/system/cpu/cpu#/topology/core_id
+ /sys/devices/system/cpu/cpu#/topology/core_siblings
+ /sys/devices/system/cpu/cpu#/topology/core_siblings_list
+ /sys/devices/system/cpu/cpu#/topology/physical_package_id
+ /sys/devices/system/cpu/cpu#/topology/thread_siblings
+ /sys/devices/system/cpu/cpu#/topology/thread_siblings_list
+Date: December 2008
+Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description: CPU topology files that describe a logical CPU's relationship
+ to other cores and threads in the same physical package.
+
+ One cpu# directory is created per logical CPU in the system,
+ e.g. /sys/devices/system/cpu/cpu42/.
+
+ Briefly, the files above are:
+
+ core_id: the CPU core ID of cpu#. Typically it is the
+ hardware platform's identifier (rather than the kernel's).
+ The actual value is architecture and platform dependent.
+
+ core_siblings: internal kernel map of cpu#'s hardware threads
+ within the same physical_package_id.
+
+ core_siblings_list: human-readable list of the logical CPU
+ numbers within the same physical_package_id as cpu#.
+
+ physical_package_id: physical package id of cpu#. Typically
+ corresponds to a physical socket number, but the actual value
+ is architecture and platform dependent.
+
+ thread_siblings: internal kernel map of cpu#'s hardware
+ threads within the same core as cpu#
+
+ thread_siblings_list: human-readable list of cpu#'s hardware
+ threads within the same core as cpu#
+
+ See Documentation/cputopology.txt for more information.
+
+
+What: /sys/devices/system/cpu/cpuidle/current_driver
+ /sys/devices/system/cpu/cpuidle/current_governor_ro
+Date: September 2007
+Contact: Linux kernel mailing list <linux-kernel@vger.kernel.org>
+Description: Discover cpuidle policy and mechanism
+
+ Various CPUs today support multiple idle levels that are
+ differentiated by varying exit latencies and power
+ consumption during idle.
+
+ Idle policy (governor) is differentiated from idle mechanism
+ (driver)
+
+ current_driver: displays current idle mechanism
+
+ current_governor_ro: displays current idle policy
+
+ See files in Documentation/cpuidle/ for more information.
+
+
+What: /sys/devices/system/cpu/cpu*/cache/index*/cache_disable_X
+Date: August 2008
+KernelVersion: 2.6.27
+Contact: mark.langsdorf@amd.com
+Description: These files exist in every cpu's cache index directories.
+ There are currently 2 cache_disable_# files in each
+ directory. Reading from these files on a supported
+ processor will return that cache disable index value
+ for that processor and node. Writing to one of these
+ files will cause the specified cache index to be disabled.
+
+ Currently, only AMD Family 10h Processors support cache index
+ disable, and only for their L3 caches. See the BIOS and
+ Kernel Developer's Guide at
+ http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/31116-Public-GH-BKDG_3.20_2-4-09.pdf
+ for formatting information and other details on the
+ cache index disable.
+Users: joachim.deguara@amd.com
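
The thread_siblings and core_siblings entries in this ABI file are described
as internal kernel maps. As a rough illustration for readers of this patch
(not part of it), the sketch below decodes such a map into logical CPU
numbers. It assumes the map is printed as comma-separated, zero-padded 32-bit
hexadecimal words with the most significant word first; that format and the
function name parse_cpumask are assumptions, not something this patch states.

```python
def parse_cpumask(text):
    """Decode a hex CPU map such as '00000000,0000000f' into CPU numbers."""
    # Assumption: comma-separated hex words, most significant word first,
    # every word after the first padded to 8 digits. Dropping the commas
    # then yields one large hexadecimal number.
    value = int(text.strip().replace(",", ""), 16)
    cpus = []
    bit = 0
    while value:
        if value & 1:
            cpus.append(bit)  # bit N set means logical CPU N is in the map
        value >>= 1
        bit += 1
    return cpus
```

A caller would pass the contents of a file such as
/sys/devices/system/cpu/cpu0/topology/thread_siblings; an empty string raises
ValueError, which seems preferable to silently returning an empty map.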
diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt
index b41f3e58aef..f1c5c4bccd3 100644
--- a/Documentation/cputopology.txt
+++ b/Documentation/cputopology.txt
@@ -1,15 +1,28 @@
-Export cpu topology info via sysfs. Items (attributes) are similar
+Export CPU topology info via sysfs. Items (attributes) are similar
to /proc/cpuinfo.
1) /sys/devices/system/cpu/cpuX/topology/physical_package_id:
-represent the physical package id of cpu X;
+
+ physical package id of cpuX. Typically corresponds to a physical
+ socket number, but the actual value is architecture and platform
+ dependent.
+
2) /sys/devices/system/cpu/cpuX/topology/core_id:
-represent the cpu core id to cpu X;
+
+ the CPU core ID of cpuX. Typically it is the hardware platform's
+ identifier (rather than the kernel's). The actual value is
+ architecture and platform dependent.
+
3) /sys/devices/system/cpu/cpuX/topology/thread_siblings:
-represent the thread siblings to cpu X in the same core;
+
+ internal kernel map of cpuX's hardware threads within the same
+ core as cpuX
+
4) /sys/devices/system/cpu/cpuX/topology/core_siblings:
-represent the thread siblings to cpu X in the same physical package;
+
+ internal kernel map of cpuX's hardware threads within the same
+ physical_package_id.
To implement it in an architecture-neutral way, a new source file,
drivers/base/topology.c, is to export the 4 attributes.
@@ -32,32 +45,32 @@ not defined by include/asm-XXX/topology.h:
3) thread_siblings: just the given CPU
4) core_siblings: just the given CPU
-Additionally, cpu topology information is provided under
+Additionally, CPU topology information is provided under
/sys/devices/system/cpu and includes these files. The internal
source for the output is in brackets ("[]").
- kernel_max: the maximum cpu index allowed by the kernel configuration.
+ kernel_max: the maximum CPU index allowed by the kernel configuration.
[NR_CPUS-1]
- offline: cpus that are not online because they have been
+ offline: CPUs that are not online because they have been
HOTPLUGGED off (see cpu-hotplug.txt) or exceed the limit
- of cpus allowed by the kernel configuration (kernel_max
+ of CPUs allowed by the kernel configuration (kernel_max
above). [~cpu_online_mask + cpus >= NR_CPUS]
- online: cpus that are online and being scheduled [cpu_online_mask]
+ online: CPUs that are online and being scheduled [cpu_online_mask]
- possible: cpus that have been allocated resources and can be
+ possible: CPUs that have been allocated resources and can be
brought online if they are present. [cpu_possible_mask]
- present: cpus that have been identified as being present in the
+ present: CPUs that have been identified as being present in the
system. [cpu_present_mask]
The format for the above output is compatible with cpulist_parse()
[see <linux/cpumask.h>]. Some examples follow.
-In this example, there are 64 cpus in the system but cpus 32-63 exceed
+In this example, there are 64 CPUs in the system but CPUs 32-63 exceed
the kernel max which is limited to 0..31 by the NR_CPUS config option
-being 32. Note also that cpus 2 and 4-31 are not online but could be
+being 32. Note also that CPUs 2 and 4-31 are not online but could be
brought online as they are both present and possible.
kernel_max: 31
@@ -67,8 +80,8 @@ brought online as they are both present and possible.
present: 0-31
In this example, the NR_CPUS config option is 128, but the kernel was
-started with possible_cpus=144. There are 4 cpus in the system and cpu2
-was manually taken offline (and is the only cpu that can be brought
+started with possible_cpus=144. There are 4 CPUs in the system and cpu2
+was manually taken offline (and is the only CPU that can be brought
online.)
kernel_max: 127
@@ -78,4 +91,4 @@ online.)
present: 0-3
See cpu-hotplug.txt for the possible_cpus=NUM kernel start parameter
-as well as more information on the various cpumask's.
+as well as more information on the various cpumasks.
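
The online, offline, possible and present files use the list format that
cputopology.txt describes as compatible with cpulist_parse(). A minimal
user-space sketch of parsing that format follows, assuming only
comma-separated single numbers and inclusive "first-last" ranges. The name
parse_cpulist is hypothetical and this is not the kernel's implementation.

```python
def parse_cpulist(text):
    """Expand a CPU list such as '2,4-31' into a list of CPU numbers."""
    cpus = []
    text = text.strip()
    if not text:
        # An empty file (for example, no offline CPUs) means an empty list.
        return cpus
    for part in text.split(","):
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.extend(range(int(first), int(last) + 1))  # ranges are inclusive
        else:
            cpus.append(int(part))
    return cpus
```

For the first example in the text, the CPUs that are not online but could be
brought online would be written as "2,4-31", which this sketch expands to CPU
2 followed by CPUs 4 through 31.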
diff --git a/Documentation/hwmon/sysfs-interface b/Documentation/hwmon/sysfs-interface
index dcbd502c879..82def883361 100644
--- a/Documentation/hwmon/sysfs-interface
+++ b/Documentation/hwmon/sysfs-interface
@@ -353,10 +353,20 @@ power[1-*]_average Average power use
Unit: microWatt
RO
-power[1-*]_average_interval Power use averaging interval
+power[1-*]_average_interval Power use averaging interval. A poll
+ notification is sent to this file if the
+ hardware changes the averaging interval.
Unit: milliseconds
RW
+power[1-*]_average_interval_max Maximum power use averaging interval
+ Unit: milliseconds
+ RO
+
+power[1-*]_average_interval_min Minimum power use averaging interval
+ Unit: milliseconds
+ RO
+
power[1-*]_average_highest Historical average maximum power use
Unit: microWatt
RO
@@ -365,6 +375,18 @@ power[1-*]_average_lowest Historical average minimum power use
Unit: microWatt
RO
+power[1-*]_average_max A poll notification is sent to
+ power[1-*]_average when power use
+ rises above this value.
+ Unit: microWatt
+ RW
+
+power[1-*]_average_min A poll notification is sent to
+ power[1-*]_average when power use
+ sinks below this value.
+ Unit: microWatt
+ RW
+
power[1-*]_input Instantaneous power use
Unit: microWatt
RO
@@ -381,6 +403,39 @@ power[1-*]_reset_history Reset input_highest, input_lowest,
average_highest and average_lowest.
WO
+power[1-*]_accuracy Accuracy of the power meter.
+ Unit: Percent
+ RO
+
+power[1-*]_alarm 1 if the system is drawing more power than the
+ cap allows; 0 otherwise. A poll notification is
+ sent to this file when the power use exceeds the
+ cap. This file only appears if the cap is known
+ to be enforced by hardware.
+ RO
+
+power[1-*]_cap If power use rises above this limit, the
+ system should take action to reduce power use.
+ A poll notification is sent to this file if the
+ cap is changed by the hardware. The *_cap
+ files only appear if the cap is known to be
+ enforced by hardware.
+ Unit: microWatt
+ RW
+
+power[1-*]_cap_hyst Margin of hysteresis built around capping and
+ notification.
+ Unit: microWatt
+ RW
+
+power[1-*]_cap_max Maximum cap that can be set.
+ Unit: microWatt
+ RO
+
+power[1-*]_cap_min Minimum cap that can be set.
+ Unit: microWatt
+ RO
+
**********
* Energy *
**********
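
To illustrate how the new power attributes might be consumed from user space,
here is a small sketch that compares the averaged power reading with the cap.
It assumes the caller already knows the hwmon device directory and that
channel 1 exists; the helper names are hypothetical. Values are in microWatt
as documented above, and the cap file may be absent because it only appears
when the cap is known to be enforced by hardware.

```python
import os

MICROWATTS_PER_WATT = 1_000_000


def read_microwatts(hwmon_dir, attribute):
    """Read one integer power attribute (unit: microWatt) from hwmon_dir."""
    with open(os.path.join(hwmon_dir, attribute)) as f:
        return int(f.read().strip())


def cap_headroom_watts(hwmon_dir, channel=1):
    """Return (cap - average) in watts, or None if no cap file is exposed."""
    cap_name = "power%d_cap" % channel
    if not os.path.exists(os.path.join(hwmon_dir, cap_name)):
        # No cap file: the cap is not known to be enforced by hardware.
        return None
    cap = read_microwatts(hwmon_dir, cap_name)
    average = read_microwatts(hwmon_dir, "power%d_average" % channel)
    return (cap - average) / MICROWATTS_PER_WATT
```

A negative result would mean the averaged power use is above the cap, which is
the condition power[1-*]_alarm reports when that file is present.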
diff --git a/Documentation/vm/hwpoison.txt b/Documentation/vm/hwpoison.txt
new file mode 100644
index 00000000000..3ffadf8da61
--- /dev/null
+++ b/Documentation/vm/hwpoison.txt
@@ -0,0 +1,136 @@
+What is hwpoison?
+
+Upcoming Intel CPUs have support for recovering from some memory errors
+(``MCA recovery''). This requires the OS to declare a page "poisoned",
+kill the processes associated with it and avoid using it in the future.
+
+This patchkit implements the necessary infrastructure in the VM.
+
+To quote the overview comment:
+
+ * High level machine check handler. Handles pages reported by the
+ * hardware as being corrupted usually due to a 2bit ECC memory or cache
+ * failure.
+ *
+ * This focusses on pages detected as corrupted in the background.
+ * When the current CPU tries to consume corruption the currently
+ * running process can just be killed directly instead. This implies
+ * that if the error cannot be handled for some reason it's safe to
+ * just ignore it because no corruption has been consumed yet. Instead
+ * when that happens another machine check will happen.
+ *
+ * Handles page cache pages in various states. The tricky part
+ * here is that we can access any page asynchronous to other VM
+ * users, because memory failures could happen anytime and anywhere,
+ * possibly violating some of their assumptions. This is why this code
+ * has to be extremely careful. Generally it tries to use normal locking
+ * rules, as in get the standard locks, even if that means the
+ * error handling takes potentially a long time.
+ *
+ * Some of the operations here are somewhat inefficient and have non
+ * linear algorithmic complexity, because the data structures have not
+ * been optimized for this case. This is in particular the case
+ * for the mapping from a vma to a process. Since this case is expected
+ * to be rare we hope we can get away with this.
+
+The code consists of the high level handler in mm/memory-failure.c,
+a new page poison bit and various checks in the VM to handle poisoned
+pages.
+
+The main target right now is KVM guests, but it works for all kinds
+of applications. KVM support requires a recent qemu-kvm release.
+
+For the KVM use there was need for a new signal type so that
+KVM can inject the machine check into the guest with the proper
+address. This in theory allows other applications to handle
+memory failures too. The expectation is that nearly all applications
+won't do that, but some very specialized ones might.
+
+---
+
+There are two (actually three) modes memory failure recovery can be in:
+
+vm.memory_failure_recovery sysctl set to zero:
+ All memory failures cause a panic. Do not attempt recovery.
+ (on x86 this can be also affected by the tolerant level of the
+ MCE subsystem)
+
+early kill
+ (can be controlled globally and per process)
+ Send SIGBUS to the application as soon as the error is detected.
+ This allows applications that can process memory errors in a gentle
+ way (e.g. drop the affected object) to do so.
+ This is the mode used by KVM qemu.
+
+late kill
+ Send SIGBUS when the application runs into the corrupted page.
+ This is best for applications unaware of memory errors and is the
+ default. Note that some pages are always handled as late kill.
+
+---
+
+User control:
+
+vm.memory_failure_recovery
+ See sysctl.txt
+
+vm.memory_failure_early_kill
+ Enable early kill mode globally
+
+PR_MCE_KILL
+ Set early/late kill mode/revert to system default
+ arg1: PR_MCE_KILL_CLEAR: Revert to system default
+ arg1: PR_MCE_KILL_SET: arg2 defines thread specific mode
+ PR_MCE_KILL_EARLY: Early kill
+ PR_MCE_KILL_LATE: Late kill
+ PR_MCE_KILL_DEFAULT: Use system global default
+PR_MCE_KILL_GET
+ return current mode
+
+
+---
+
+Testing:
+
+madvise(MADV_POISON, ....)
+ (as root)
+ Poison a page in the process for testing
+
+
+hwpoison-inject module through debugfs
+ /sys/debug/hwpoison/corrupt-pfn
+
+Inject hwpoison fault at PFN echoed into this file
+
+
+Architecture specific MCE injector
+
+x86 has mce-inject, mce-test
+
+Some portable hwpoison test programs in mce-test, see below.
+
+---
+
+References:
+
+http://halobates.de/mce-lc09-2.pdf
+ Overview presentation from LinuxCon 09
+
+git://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git
+ Test suite (hwpoison specific portable tests in tsrc)
+
+git://git.kernel.org/pub/scm/utils/cpu/mce/mce-inject.git
+ x86 specific injector
+
+
+---
+
+Limitations:
+
+- Not all page types are supported and never will be. Most kernel internal
+objects cannot be recovered, only LRU pages for now.
+- Right now hugepage support is missing.
+
+---
+Andi Kleen, Oct 2009
+
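
As a rough illustration of the PR_MCE_KILL interface described under "User
control" in hwpoison.txt, the sketch below selects early kill for the calling
thread through prctl(2) and reads the mode back. It is a sketch under stated
assumptions rather than part of this patch: the numeric constants are taken
from <linux/prctl.h>, the wrapper names are hypothetical, and the call fails
with OSError on kernels or sandboxes that do not implement these options.

```python
import ctypes
import os

# Assumed values, as defined in <linux/prctl.h>.
PR_MCE_KILL = 33
PR_MCE_KILL_GET = 34
PR_MCE_KILL_CLEAR = 0
PR_MCE_KILL_SET = 1
PR_MCE_KILL_LATE = 0
PR_MCE_KILL_EARLY = 1
PR_MCE_KILL_DEFAULT = 2

_libc = ctypes.CDLL(None, use_errno=True)


def _prctl(option, arg2=0, arg3=0):
    """Call prctl(option, arg2, arg3, 0, 0) and raise OSError on failure."""
    ul = ctypes.c_ulong
    result = _libc.prctl(ctypes.c_int(option), ul(arg2), ul(arg3), ul(0), ul(0))
    if result < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return result


def set_mce_kill_mode(mode):
    """Set the thread's mode: PR_MCE_KILL_EARLY, _LATE or _DEFAULT."""
    _prctl(PR_MCE_KILL, PR_MCE_KILL_SET, mode)


def clear_mce_kill_mode():
    """Revert the thread to the system default."""
    _prctl(PR_MCE_KILL, PR_MCE_KILL_CLEAR)


def get_mce_kill_mode():
    """Return the current mode as one of the PR_MCE_KILL_* mode values."""
    return _prctl(PR_MCE_KILL_GET)
```

An application that wants SIGBUS as soon as an error is detected would call
set_mce_kill_mode(PR_MCE_KILL_EARLY) at startup; the global default remains
governed by the vm.memory_failure_early_kill sysctl.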