.. SPDX-License-Identifier: GPL-2.0

:Author: Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010

KVM makes use of some custom MSRs to service some requests.

Custom MSRs have a range reserved for them, that goes from
0x4b564d00 to 0x4b564dff. There are MSRs outside this area,
but they are deprecated and their use is discouraged.
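
As a small illustration of the reserved range, a guest could classify an
MSR index as follows. This is only a sketch: the macro and function names
are hypothetical and do not come from any KVM header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bounds of the reserved range quoted in the text (names are hypothetical). */
#define KVM_CUSTOM_MSR_FIRST	0x4b564d00u
#define KVM_CUSTOM_MSR_LAST	0x4b564dffu

/* Returns true if 'index' lies inside the range reserved for custom MSRs. */
static bool kvm_msr_in_reserved_range(uint32_t index)
{
	return index >= KVM_CUSTOM_MSR_FIRST && index <= KVM_CUSTOM_MSR_LAST;
}
```

An index for which this returns false is, by the rule above, either not
a KVM custom MSR or one of the deprecated ones.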

The currently supported custom MSRs are:

MSR_KVM_WALL_CLOCK_NEW:

4-byte aligned physical address of a memory area which must be
in guest RAM. This memory is expected to hold a copy of the following
structure::

	struct pvclock_wall_clock {
	} __attribute__((__packed__));

whose data will be filled in by the hypervisor. The hypervisor is only
guaranteed to update this data at the moment of MSR write.
Users that want to reliably query this information more than once have
to write more than once to this MSR. Fields have the following meanings:

guest has to check version before and after grabbing
time information and check that they are both equal and even.
An odd version indicates an in-progress update.

number of seconds for wallclock at time of boot.

number of nanoseconds for wallclock at time of boot.

In order to get the current wallclock time, the system_time from
MSR_KVM_SYSTEM_TIME_NEW needs to be added.

Note that although MSRs are per-CPU entities, the effect of this
particular MSR is global.

Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
leaf.
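
The version check described above can be sketched as a single-attempt
snapshot. The structure layout used here (three 32-bit fields called
``version``, ``sec`` and ``nsec``) is an assumption made for the
example, as is the function name; the caller is expected to retry when
the function reports an inconsistent read.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: three 32-bit fields matching the descriptions above. */
struct pvclock_wall_clock {
	uint32_t version;
	uint32_t sec;
	uint32_t nsec;
} __attribute__((__packed__));

/*
 * Copy one snapshot out of the shared area.
 * Returns false if an update was in progress (odd version) or if the
 * version changed while the fields were being read.
 */
static bool wall_clock_snapshot(const volatile struct pvclock_wall_clock *wc,
				uint32_t *sec, uint32_t *nsec)
{
	uint32_t before = wc->version;

	if (before & 1)
		return false;			/* odd: update in progress */

	__atomic_thread_fence(__ATOMIC_ACQUIRE);
	*sec = wc->sec;
	*nsec = wc->nsec;
	__atomic_thread_fence(__ATOMIC_ACQUIRE);

	return wc->version == before;		/* must be unchanged */
}
```

Since the hypervisor is only guaranteed to refresh the data on an MSR
write, a guest would typically write the MSR first and then call such a
helper in a loop until it succeeds.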

MSR_KVM_SYSTEM_TIME_NEW:

4-byte aligned physical address of a memory area which must be in
guest RAM, plus an enable bit in bit 0. This memory is expected to hold
a copy of the following structure::

	struct pvclock_vcpu_time_info {
		u32   tsc_to_system_mul;
	} __attribute__((__packed__)); /* 32 bytes */

whose data will be filled in by the hypervisor periodically. Only one
write, or registration, is needed for each VCPU. The interval between
updates of this structure is arbitrary and implementation-dependent.
The hypervisor may update this structure at any time it sees fit until
a value with bit 0 clear is written to the MSR.

Fields have the following meanings:

guest has to check version before and after grabbing
time information and check that they are both equal and even.
An odd version indicates an in-progress update.

the tsc value at the current VCPU at the time
of the update of this structure. Guests can subtract this value
from current tsc to derive a notion of elapsed time since the
structure update.

a host notion of monotonic time, including sleep
time at the time this structure was last updated. Unit is
nanoseconds.

multiplier to be used when converting
tsc-related quantity to nanoseconds

shift to be used when converting tsc-related
quantity to nanoseconds. This shift will ensure that
multiplication with tsc_to_system_mul does not overflow.
A positive value denotes a left shift, a negative value
a right shift.

The conversion from tsc to nanoseconds involves an additional
right shift by 32 bits. With this information, guests can
derive per-CPU time by doing::

	time = (current_tsc - tsc_timestamp)
	/* apply the shift described above: left if positive, right if negative */
	time = (time * tsc_to_system_mul) >> 32
	time = time + system_time

bits in this field indicate extended capabilities
coordinated between the guest and the hypervisor. Availability
of specific flags has to be checked in 0x40000001 cpuid leaf.

+-----------+--------------+----------------------------------+
| flag bit  | cpuid bit    | meaning                          |
+-----------+--------------+----------------------------------+
|           |              | time measures taken across       |
|     0     |      24      | multiple cpus are guaranteed to  |
+-----------+--------------+----------------------------------+
|           |              | guest vcpu has been paused by    |
|     1     |     N/A      | the host                         |
|           |              | See 4.70 in api.txt              |
+-----------+--------------+----------------------------------+

Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid
leaf.
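
The per-CPU time derivation described for MSR_KVM_SYSTEM_TIME_NEW can be
written out in C roughly as below. This is a sketch under stated
assumptions: the function name and the ``shift`` parameter name are
hypothetical, the caller is assumed to have read all inputs under the
version check, and the multiplication is split into two halves so the
intermediate product fits in 64 bits.

```c
#include <stdint.h>

/*
 * Scale a TSC delta to nanoseconds and add the base time.
 * 'shift' stands for the shift field described above (name assumed);
 * its magnitude is assumed to be smaller than 64.
 */
static uint64_t pvclock_ns(uint64_t current_tsc, uint64_t tsc_timestamp,
			   uint64_t system_time, uint32_t tsc_to_system_mul,
			   int8_t shift)
{
	uint64_t delta = current_tsc - tsc_timestamp;
	uint64_t hi, lo;

	/* Positive shift: left shift. Negative shift: right shift. */
	if (shift >= 0)
		delta <<= shift;
	else
		delta >>= -shift;

	/*
	 * (delta * mul) >> 32, computed as hi * mul + ((lo * mul) >> 32)
	 * so that no intermediate value needs more than 64 bits.
	 */
	hi = delta >> 32;
	lo = delta & 0xffffffffu;
	delta = hi * tsc_to_system_mul + ((lo * tsc_to_system_mul) >> 32);

	return system_time + delta;
}
```

For example, a multiplier of 0x80000000 together with a shift of 1
converts one TSC tick into one nanosecond.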

data and functioning:
same as MSR_KVM_WALL_CLOCK_NEW. Use that instead.

This MSR falls outside the reserved KVM range and may be removed in the
future. Its usage is deprecated.

Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid
leaf.

data and functioning:
same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead.

This MSR falls outside the reserved KVM range and may be removed in the
future. Its usage is deprecated.

Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid
leaf.

The suggested algorithm for detecting kvmclock presence is then::

	if (!kvm_para_available())    /* refer to cpuid.txt */
		return NO;

	flags = cpuid_eax(0x40000001);
	if (flags & (1 << 3)) {           /* bit 3: the _NEW MSRs */
		msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
		msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
		return YES;
	} else if (flags & (1 << 0)) {    /* bit 0: the deprecated MSRs */
		msr_kvm_system_time = MSR_KVM_SYSTEM_TIME;
		msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK;
		return YES;
	} else
		return NO;

Asynchronous page fault (APF) control MSR.

Bits 63-6 hold 64-byte aligned physical address of a 64 byte memory area
which must be in guest RAM and must be zeroed. This memory is expected
to hold a copy of the following structure::

	struct kvm_vcpu_pv_apf_data {
		/* Used for 'page not present' events delivered via #PF */

		/* Used for 'page ready' events delivered via interrupt notification */
	};

Bits 5-4 of the MSR are reserved and should be zero. Bit 0 is set to 1
when asynchronous page faults are enabled on the vcpu, 0 when disabled.
Bit 1 is 1 if asynchronous page faults can be injected when vcpu is in
cpl == 0. Bit 2 is 1 if asynchronous page faults are delivered to L1 as
#PF vmexits. Bit 2 can be set only if KVM_FEATURE_ASYNC_PF_VMEXIT is
present in CPUID. Bit 3 enables interrupt based delivery of 'page ready'
events. Bit 3 can only be set if KVM_FEATURE_ASYNC_PF_INT is present in
CPUID.

'Page not present' events are currently always delivered as a synthetic
#PF exception. During delivery of these events APF CR2 register contains
a token that will be used to notify the guest when the missing page
becomes available. Also, to make it possible to distinguish between a
real #PF and APF, the first 4 bytes of the 64 byte memory location
('flags') will be written to by the hypervisor at the time of injection.
Only the first bit of 'flags' is currently supported; when set, it
indicates that the guest is dealing with an asynchronous 'page not
present' event. If during a page fault APF 'flags' is '0' it means that
this is a regular page fault. The guest is supposed to clear 'flags'
when it is done handling the #PF exception so the next event can be
delivered.

Note, since APF 'page not present' events use the same exception vector
as a regular page fault, the guest must reset 'flags' to '0' before it
does something that can generate a normal page fault.

Bytes 4-7 of the 64 byte memory location ('token') will be written to by
the hypervisor at the time of APF 'page ready' event injection. The
content of these bytes is a token which was previously delivered as a
'page not present' event. The event indicates the page is now available.
The guest is supposed to write '0' to 'token' when it is done handling
the 'page ready' event and to write '1' to MSR_KVM_ASYNC_PF_ACK after
clearing the location; writing to the MSR forces KVM to re-scan its
queue and deliver the next pending notification.
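
The 'page ready' handling order described above (consume the token,
clear it, then acknowledge) can be sketched as follows. Everything here
is illustrative: the field types and padding of the structure, the
``apf_ops`` hooks, and the ``demo_*`` stand-ins are assumptions made so
that the example is self-contained; a real guest would wake the waiting
task and issue a wrmsr to MSR_KVM_ASYNC_PF_ACK instead.

```c
#include <stdint.h>

/* Assumed layout: 'flags' in bytes 0-3, 'token' in bytes 4-7, then padding. */
struct kvm_vcpu_pv_apf_data {
	uint32_t flags;
	uint32_t token;
	uint8_t pad[56];
};

/* Hypothetical guest hooks. */
struct apf_ops {
	void (*wake_waiter)(uint32_t token);	/* resume whoever waits on token */
	void (*write_ack_msr)(uint64_t val);	/* targets MSR_KVM_ASYNC_PF_ACK */
};

/*
 * Body of a 'page ready' interrupt handler.
 * Returns the token that was handled, or 0 if the slot was empty
 * (treating 0 as "empty" is an assumption of this sketch).
 */
static uint32_t apf_handle_page_ready(volatile struct kvm_vcpu_pv_apf_data *apf,
				      const struct apf_ops *ops)
{
	uint32_t token = apf->token;

	if (token == 0)
		return 0;

	ops->wake_waiter(token);	/* the page for this token is available */
	apf->token = 0;			/* clear the slot first ... */
	ops->write_ack_msr(1);		/* ... then let KVM re-scan its queue */
	return token;
}

/* Recording stand-ins so the sketch can be exercised outside a guest. */
static uint32_t demo_woken;
static uint64_t demo_ack;
static void demo_wake(uint32_t token) { demo_woken = token; }
static void demo_ack_write(uint64_t val) { demo_ack = val; }
static const struct apf_ops demo_ops = { demo_wake, demo_ack_write };
```

The acknowledgment is written only after 'token' has been cleared, since
the write is what allows the next pending notification to be delivered
into the same location.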

Note, the MSR_KVM_ASYNC_PF_INT MSR, which specifies the interrupt vector
for 'page ready' APF delivery, needs to be written to before enabling
the APF mechanism in MSR_KVM_ASYNC_PF_EN, or interrupt #0 can get
injected. The MSR is available if KVM_FEATURE_ASYNC_PF_INT is present in
CPUID.
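
The ordering requirement and the bit layout of the control MSR can be
illustrated by computing the two values a guest would write. This is a
sketch only: the macro names, the ``apf_msr`` enumerators (stand-ins for
the real MSR indices, which are not repeated here) and the helper
functions are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bit names are hypothetical; positions follow the description above. */
#define APF_EN_ENABLED		(1ull << 0)
#define APF_EN_SEND_ALWAYS	(1ull << 1)	/* inject even at cpl == 0 */
#define APF_EN_DELIVERY_VMEXIT	(1ull << 2)
#define APF_EN_DELIVERY_INT	(1ull << 3)

/* Stand-ins for the two MSRs, not real MSR indices. */
enum apf_msr { APF_MSR_INT, APF_MSR_EN };

struct msr_write {
	enum apf_msr msr;
	uint64_t val;
};

/* Combine the 64-byte aligned area address with the control bits. */
static uint64_t apf_en_value(uint64_t area_gpa, uint64_t flags)
{
	assert((area_gpa & 0x3f) == 0);		/* bits 63-6 hold the address */
	assert((flags & ~0xfull) == 0);		/* bits 5-4 must stay zero */
	return area_gpa | flags;
}

/* Fill 'seq' with the writes in the order they must be issued. */
static void apf_enable_sequence(uint64_t area_gpa, uint8_t vector,
				struct msr_write seq[2])
{
	/* The vector (bits 0-7) goes in first ... */
	seq[0].msr = APF_MSR_INT;
	seq[0].val = vector;

	/* ... and only then is the mechanism enabled. */
	seq[1].msr = APF_MSR_EN;
	seq[1].val = apf_en_value(area_gpa,
				  APF_EN_ENABLED | APF_EN_DELIVERY_INT);
}
```

A guest would then issue the two writes in array order, which keeps the
vector valid by the time the first 'page ready' interrupt can arrive.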

Note, previously, 'page ready' events were delivered via the same #PF
exception as 'page not present' events but this is now deprecated. If
bit 3 (interrupt based delivery) is not set, APF events are not
delivered.

If APF is disabled while there are outstanding APFs, they will

Currently 'page ready' APF events will always be delivered on the
same vcpu as the 'page not present' event was, but the guest should not
rely on it.

64-byte aligned physical address of a memory area which must be
in guest RAM, plus an enable bit in bit 0. This memory is expected to
hold a copy of the following structure::

	struct kvm_steal_time {
	};

whose data will be filled in by the hypervisor periodically. Only one
write, or registration, is needed for each VCPU. The interval between
updates of this structure is arbitrary and implementation-dependent.
The hypervisor may update this structure at any time it sees fit until
a value with bit 0 clear is written to the MSR. The guest is required to
make sure this structure is initialized to zero.

Fields have the following meanings:

a sequence counter. In other words, guest has to check
this field before and after grabbing time information and make
sure they are both equal and even. An odd version indicates an
in-progress update.

At this point, always zero. May be used to indicate
changes in this structure in the future.

the amount of time in which this vCPU did not run, in
nanoseconds. Time during which the vcpu is idle will not be
reported as steal time.

indicates whether the vCPU that owns this struct is running or
not. Non-zero values mean the vCPU has been preempted. Zero
means the vCPU is not preempted. NOTE, it is always zero if
the hypervisor doesn't support this field.
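
Reading the steal time under the sequence counter can be sketched as a
retry loop. To avoid assuming the structure layout, this hypothetical
helper takes pointers to the two fields it needs (the sequence counter
and the steal time value) rather than the structure itself.

```c
#include <stdint.h>

/*
 * Read the steal time (in nanoseconds) consistently.
 * 'version' and 'steal' point at the corresponding fields of the shared
 * structure; the loop retries while an update is in progress (odd
 * version) or when the version changed during the read.
 */
static uint64_t steal_time_read(const volatile uint32_t *version,
				const volatile uint64_t *steal)
{
	uint32_t v;
	uint64_t ns;

	do {
		v = *version;
		__atomic_thread_fence(__ATOMIC_ACQUIRE);
		ns = *steal;
		__atomic_thread_fence(__ATOMIC_ACQUIRE);
	} while ((v & 1) || v != *version);

	return ns;
}
```

Unlike a single-attempt read, this form only returns once a stable
snapshot has been observed.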

Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
when disabled. Bit 1 is reserved and must be zero. When PV end of
interrupt is enabled (bit 0 set), bits 63-2 hold a 4-byte aligned
physical address of a 4 byte memory area which must be in guest RAM and

The first, least significant bit of the 4 byte memory location will be
written to by the hypervisor, typically at the time of interrupt
injection. Value of 1 means that the guest can skip writing EOI to the
apic (using MSR or MMIO write); instead, it is sufficient to signal
EOI by clearing the bit in guest memory - this location will
later be polled by the hypervisor.
Value of 0 means that the EOI write is required.

It is always safe for the guest to ignore the optimization and perform
the APIC EOI write anyway.

Hypervisor is guaranteed to only modify this least
significant bit while in the current VCPU context. This means that the
guest does not need to use either lock prefix or memory ordering
primitives to synchronise with the hypervisor.

However, the hypervisor can set and clear this memory bit at any time.
It could therefore interrupt the guest and clear the least significant
bit in the window between the guest testing the bit (to detect whether
it can skip the EOI apic write) and the guest clearing it (to signal EOI
to the hypervisor). To close this window, the guest must both read the
least significant bit in the memory area and clear it using a single CPU
instruction, such as test and clear, or compare and exchange.
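
The read-and-clear requirement can be illustrated with the compare and
exchange variant, written here with C11 atomics so that it is portable.
This is a sketch, not the way any particular guest implements it: the
function name is hypothetical, the remaining 31 bits of the word are
assumed to be zero, and on x86 the compiler will typically emit a locked
instruction even though, as noted above, the lock prefix is not
required.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Try to signal EOI through guest memory.
 * Returns true if the bit was set and has been cleared, in which case
 * the APIC EOI write can be skipped. Returns false if the caller still
 * has to perform the regular APIC EOI write.
 */
static bool pv_eoi_try_skip(_Atomic uint32_t *pv_eoi)
{
	uint32_t expected = 1;	/* bit 0 set: EOI write may be skipped */

	/* Test and clear in one compare-and-exchange operation. */
	return atomic_compare_exchange_strong_explicit(pv_eoi, &expected, 0,
						       memory_order_relaxed,
						       memory_order_relaxed);
}
```

If the word does not hold exactly 1, the helper falls back to the APIC
EOI write, which the text above describes as always safe.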

MSR_KVM_POLL_CONTROL:

Control host-side polling.

Bit 0 enables (1) or disables (0) host-side HLT polling logic.

KVM guests can request the host not to poll on HLT, for example if
they are performing polling themselves.

MSR_KVM_ASYNC_PF_INT:

Second asynchronous page fault (APF) control MSR.

Bits 0-7: APIC vector for delivery of 'page ready' APF events.

Interrupt vector for asynchronous 'page ready' notifications delivery.
The vector has to be set up before the asynchronous page fault mechanism
is enabled in MSR_KVM_ASYNC_PF_EN. The MSR is only available if
KVM_FEATURE_ASYNC_PF_INT is present in CPUID.

MSR_KVM_ASYNC_PF_ACK:

Asynchronous page fault (APF) acknowledgment.

When the guest is done processing a 'page ready' APF event and the
'token' field in 'struct kvm_vcpu_pv_apf_data' is cleared, it is
supposed to write '1' to bit 0 of the MSR; this causes the host to
re-scan its queue and check if there are more notifications pending. The
MSR is available if KVM_FEATURE_ASYNC_PF_INT is present in CPUID.

MSR_KVM_MIGRATION_CONTROL:

This MSR is available if KVM_FEATURE_MIGRATION_CONTROL is present in
CPUID. Bit 0 represents whether live migration of the guest is allowed.

When a guest is started, bit 0 will be 0 if the guest has encrypted
memory and 1 if the guest does not have encrypted memory. If the
guest is communicating page encryption status to the host using the
``KVM_HC_MAP_GPA_RANGE`` hypercall, it can set bit 0 in this MSR to
allow live migration of the guest.