aboutsummaryrefslogtreecommitdiffstats
path: root/drivers/md
Commit message (Collapse)AuthorAgeFilesLines
* dm mpath: change to be request basedKiyoshi Ueda2009-06-221-65/+128
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch converts dm-multipath target to request-based from bio-based. Basically, the patch just converts the I/O unit from struct bio to struct request. In the course of the conversion, it also changes the I/O queueing mechanism. The change in the I/O queueing is described in details as follows. I/O queueing mechanism change ----------------------------- In I/O submission, map_io(), there is no mechanism change from bio-based, since the clone request is ready for retry as it is. However, in I/O complition, do_end_io(), there is a mechanism change from bio-based, since the clone request is not ready for retry. In do_end_io() of bio-based, the clone bio has all needed memory for resubmission. So the target driver can queue it and resubmit it later without memory allocations. The mechanism has almost no overhead. On the other hand, in do_end_io() of request-based, the clone request doesn't have clone bios, so the target driver can't resubmit it as it is. To resubmit the clone request, memory allocation for clone bios is needed, and it takes some overheads. To avoid the overheads just for queueing, the target driver doesn't queue the clone request inside itself. Instead, the target driver asks dm core for queueing and remapping the original request of the clone request, since the overhead for queueing is just a freeing memory for the clone request. As a result, the target driver doesn't need to record/restore the information of the original request for resubmitting the clone request. So dm_bio_details in dm_mpath_io is removed. multipath_busy() --------------------- The target driver returns "busy", only when the following case: o The target driver will map I/Os, if map() function is called and o The mapped I/Os will wait on underlying device's queue due to their congestions, if map() function is called now. In other cases, the target driver doesn't return "busy". Otherwise, dm core will keep the I/Os and the target driver can't do what it wants. (e.g. the target driver can't map I/Os now, so wants to kill I/Os.) Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: disable interrupt when taking map_lockKiyoshi Ueda2009-06-221-6/+9
| | | | | | | | | | | | | | | | | | | This patch disables interrupt when taking map_lock to avoid lockdep warnings in request-based dm. request-based dm takes map_lock after taking queue_lock with disabling interrupt: spin_lock_irqsave(queue_lock) q->request_fn() == dm_request_fn() => dm_get_table() => read_lock(map_lock) while queue_lock could be (but isn't) taken in interrupt context. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Christof Schmitt <christof.schmitt@de.ibm.com> Acked-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: do not set QUEUE_ORDERED_DRAIN if request basedKiyoshi Ueda2009-06-223-1/+16
| | | | | | | | | | | | Request-based dm doesn't have barrier support yet. So we need to set QUEUE_ORDERED_DRAIN only for bio-based dm. Since the device type is decided at the first table loading time, the flag set is deferred until then. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Acked-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: enable request based optionKiyoshi Ueda2009-06-224-26/+285
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch enables request-based dm. o Request-based dm and bio-based dm coexist, since there are some target drivers which are more fitting to bio-based dm. Also, there are other bio-based devices in the kernel (e.g. md, loop). Since bio-based device can't receive struct request, there are some limitations on device stacking between bio-based and request-based. type of underlying device bio-based request-based ---------------------------------------------- bio-based OK OK request-based -- OK The device type is recognized by the queue flag in the kernel, so dm follows that. o The type of a dm device is decided at the first table binding time. Once the type of a dm device is decided, the type can't be changed. o Mempool allocations are deferred to at the table loading time, since mempools for request-based dm are different from those for bio-based dm and needed mempool type is fixed by the type of table. o Currently, request-based dm supports only tables that have a single target. To support multiple targets, we need to support request splitting or prevent bio/request from spanning multiple targets. The former needs lots of changes in the block layer, and the latter needs that all target drivers support merge() function. Both will take a time. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: prepare for request based optionKiyoshi Ueda2009-06-223-4/+716
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds core functions for request-based dm. When struct mapped device (md) is initialized, md->queue has an I/O scheduler and the following functions are used for request-based dm as the queue functions: make_request_fn: dm_make_request() pref_fn: dm_prep_fn() request_fn: dm_request_fn() softirq_done_fn: dm_softirq_done() lld_busy_fn: dm_lld_busy() Actual initializations are done in another patch (PATCH 2). Below is a brief summary of how request-based dm behaves, including: - making request from bio - cloning, mapping and dispatching request - completing request and bio - suspending md - resuming md bio to request ============== md->queue->make_request_fn() (dm_make_request()) calls __make_request() for a bio submitted to the md. Then, the bio is kept in the queue as a new request or merged into another request in the queue if possible. Cloning and Mapping =================== Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()), when requests are dispatched after they are sorted by the I/O scheduler. dm_request_fn() checks busy state of underlying devices using target's busy() function and stops dispatching requests to keep them on the dm device's queue if busy. It helps better I/O merging, since no merge is done for a request once it is dispatched to underlying devices. Actual cloning and mapping are done in dm_prep_fn() and map_request() called from dm_request_fn(). dm_prep_fn() clones not only request but also bios of the request so that dm can hold bio completion in error cases and prevent the bio submitter from noticing the error. (See the "Completion" section below for details.) After the cloning, the clone is mapped by target's map_rq() function and inserted to underlying device's queue using blk_insert_cloned_request(). Completion ========== Request completion can be hooked by rq->end_io(), but then, all bios in the request will have been completed even error cases, and the bio submitter will have noticed the error. To prevent the bio completion in error cases, request-based dm clones both bio and request and hooks both bio->bi_end_io() and rq->end_io(): bio->bi_end_io(): end_clone_bio() rq->end_io(): end_clone_request() Summary of the request completion flow is below: blk_end_request() for a clone request => blk_update_request() => bio->bi_end_io() == end_clone_bio() for each clone bio => Free the clone bio => Success: Complete the original bio (blk_update_request()) Error: Don't complete the original bio => blk_finish_request() => rq->end_io() == end_clone_request() => blk_complete_request() => dm_softirq_done() => Free the clone request => Success: Complete the original request (blk_end_request()) Error: Requeue the original request end_clone_bio() completes the original request on the size of the original bio in successful cases. Even if all bios in the original request are completed by that completion, the original request must not be completed yet to keep the ordering of request completion for the stacking. So end_clone_bio() uses blk_update_request() instead of blk_end_request(). In error cases, end_clone_bio() doesn't complete the original bio. It just frees the cloned bio and gives over the error handling to end_clone_request(). end_clone_request(), which is called with queue lock held, completes the clone request and the original request in a softirq context (dm_softirq_done()), which has no queue lock, to avoid a deadlock issue on submission of another request during the completion: - The submitted request may be mapped to the same device - Request submission requires queue lock, but the queue lock has been held by itself and it doesn't know that The clone request has no clone bio when dm_softirq_done() is called. So target drivers can't resubmit it again even error cases. Instead, they can ask dm core for requeueing and remapping the original request in that cases. suspend ======= Request-based dm uses stopping md->queue as suspend of the md. For noflush suspend, just stops md->queue. For flush suspend, inserts a marker request to the tail of md->queue. And dispatches all requests in md->queue until the marker comes to the front of md->queue. Then, stops dispatching request and waits for the all dispatched requests to complete. After that, completes the marker request, stops md->queue and wake up the waiter on the suspend queue, md->wait. resume ====== Starts md->queue. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm raid1: add userspace logJonthan Brassow2009-06-225-0/+1004
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch contains a device-mapper mirror log module that forwards requests to userspace for processing. The structures used for communication between kernel and userspace are located in include/linux/dm-log-userspace.h. Due to the frequency, diversity, and 2-way communication nature of the exchanges between kernel and userspace, 'connector' was chosen as the interface for communication. The first log implementations written in userspace - "clustered-disk" and "clustered-core" - support clustered shared storage. A userspace daemon (in the LVM2 source code repository) uses openAIS/corosync to process requests in an ordered fashion with the rest of the nodes in the cluster so as to prevent log state corruption. Other implementations with no association to LVM or openAIS/corosync, are certainly possible. (Imagine if two machines are writing to the same region of a mirror. They would both mark the region dirty, but you need a cluster-aware entity that can handle properly marking the region clean when they are done. Otherwise, you might clear the region when the first machine is done, not the second.) Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: calculate queue limits during resume not loadMike Snitzer2009-06-223-87/+115
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, device-mapper maintains a separate instance of 'struct queue_limits' for each table of each device. When the configuration of a device is to be changed, first its table is loaded and this structure is populated, then the device is 'resumed' and the calculated queue_limits are applied. This places restrictions on how userspace may process related devices, where it is often advantageous to 'load' tables for several devices at once before 'resuming' them together. As the new queue_limits only take effect after the 'resume', if they are changing and one device uses another, the latter must be 'resumed' before the former may be 'loaded'. This patch moves the calculation of these queue_limits out of the 'load' operation into 'resume'. Since we are no longer pre-calculating this struct, we no longer need to maintain copies within our dm structs. dm_set_device_limits() now passes the 'start' of the device's data area (aka pe_start) as the 'offset' to blk_stack_limits(). init_valid_queue_limits() is replaced by blk_set_default_limits(). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: martin.petersen@oracle.com Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm log: fix create_log_context to use logical_block_size of log deviceMike Snitzer2009-06-221-3/+4
| | | | | | | | create_log_context() must use the logical_block_size from the log disk, where the I/O happens, not the target's logical_block_size. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm target:s introduce iterate devices fnMike Snitzer2009-06-226-6/+94
| | | | | | | | | | | | | | Add .iterate_devices to 'struct target_type' to allow a function to be called for all devices in a DM target. Implemented it for all targets except those in dm-snap.c (origin and snapshot). (The raid1 version number jumps to 1.12 because we originally reserved 1.1 to 1.11 for 'block_on_error' but ended up using 'handle_errors' instead.) Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: martin.petersen@oracle.com
* dm table: establish queue limits by copying table limitsMike Snitzer2009-06-221-10/+2
| | | | | | | | | | | | | | | | | Copy the table's queue_limits to the DM device's request_queue. This properly initializes the queue's topology limits and also avoids having to track the evolution of 'struct queue_limits' in dm_table_set_restrictions() Also fixes a bug that was introduced in dm_table_set_restrictions() via commit ae03bf639a5027d27270123f5f6e3ee6a412781d. In addition to establishing 'bounce_pfn' in the queue's limits blk_queue_bounce_limit() also performs an allocation to setup the ISA DMA pool. This allocation resulted in "sleeping function called from invalid context" when called from dm_table_set_restrictions(). Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm table: replace struct io_restrictions with struct queue_limitsMike Snitzer2009-06-221-95/+43
| | | | | | | | Use blk_stack_limits() to stack block limits (including topology) rather than duplicate the equivalent within Device Mapper. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm table: validate device logical_block_sizeMike Snitzer2009-06-221-0/+69
| | | | | | | | | Impose necessary and sufficient conditions on a devices's table such that any incoming bio which respects its logical_block_size can be processed successfully. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm table: ensure targets are aligned to logical_block_sizeMike Snitzer2009-06-221-14/+44
| | | | | | | | | | | Ensure I/O is aligned to the logical block size of target devices. Rename check_device_area() to device_area_is_valid() for clarity and establish the device limits including the logical block size prior to calling it. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm ioctl: support cookies for udevMilan Broz2009-06-223-11/+31
| | | | | | | | | | | | | | | | | | | | Add support for passing a 32 bit "cookie" into the kernel with the DM_SUSPEND, DM_DEV_RENAME and DM_DEV_REMOVE ioctls. The (unsigned) value of this cookie is returned to userspace alongside the uevents issued by these ioctls in the variable DM_COOKIE. This means the userspace process issuing these ioctls can be notified by udev after udev has completed any actions triggered. To minimise the interface extension, we pass the cookie into the kernel in the event_nr field which is otherwise unused when calling these ioctls. Incrementing the version number allows userspace to determine in advance whether or not the kernel supports the cookie. If the kernel does support this but userspace does not, there should be no impact as the new variable will just get ignored. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: sysfs add suspended attributePeter Rajnoha2009-06-221-0/+9
| | | | | | | | | Add a file named 'suspended' to each device-mapper device directory in sysfs. It holds the value 1 while the device is suspended. Otherwise it holds 0. Signed-off-by: Peter Rajnoha <prajnoha@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm table: improve warning message when devices not freed before destructionJonthan Brassow2009-06-221-5/+3
| | | | | | | Report any devices forgotten to be freed before a table is destroyed. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm mpath: add service time load balancerKiyoshi Ueda2009-06-223-0/+350
| | | | | | | | | | | | | | | | This patch adds a service time oriented dynamic load balancer, dm-service-time, which selects the path with the shortest estimated service time for the incoming I/O. The service time is estimated by dividing the in-flight I/O size by a performance value of each path. The performance value can be given as a table argument at the table loading time. If no performance value is given, all paths are considered equal. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm mpath: add queue length load balancerKiyoshi Ueda2009-06-223-0/+273
| | | | | | | | | | | | | This patch adds a dynamic load balancer, dm-queue-length, which balances the number of in-flight I/Os across the paths. The code is based on the patch posted by Stefan Bader: https://www.redhat.com/archives/dm-devel/2005-October/msg00050.html Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm mpath: add start_io and nr_bytes to path selectorsKiyoshi Ueda2009-06-223-13/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | This patch makes two additions to the dm path selector interface for dynamic load balancers: o a new hook, start_io() o a new parameter 'nr_bytes' to select_path()/start_io()/end_io() to pass the size of the I/O start_io() is called when a target driver actually submits I/O to the selected path. Path selectors can use it to start accounting of the I/O. (e.g. counting the number of in-flight I/Os.) The start_io hook is based on the patch posted by Stefan Bader: https://www.redhat.com/archives/dm-devel/2005-October/msg00050.html nr_bytes, the size of the I/O, is so path selectors can take the size of the I/O into account when deciding which path to use. dm-service-time uses it to estimate service time, for example. (Added the nr_bytes member to dm_mpath_io instead of using existing details.bi_size, since request-based dm patch deletes it.) Signed-off-by: Stefan Bader <stefan.bader@canonical.com> Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm snapshot: use barrier when writing exception storeMikulas Patocka2009-06-221-1/+1
| | | | | | | | | | Send barrier requests when updating the exception area. Exception area updates need to be ordered w.r.t. data writes, so that the writes are not reordered in hardware disk cache. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm io: retry after barrier errorMikulas Patocka2009-06-221-0/+6
| | | | | | | | | | | | If -EOPNOTSUPP was returned and the request was a barrier request, retry it without barrier. Retry all regions for now. Barriers are submitted only for one-region requests, so it doesn't matter. (In the future, retries can be limited to the actual regions that failed.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm io: record eopnotsuppMikulas Patocka2009-06-221-1/+7
| | | | | | | | | | | Add another field, eopnotsupp_bits. It is subset of error_bits, representing regions that returned -EOPNOTSUPP. (The bit is set in both error_bits and eopnotsupp_bits). This value will be used in further patches. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm snapshot: support barriersMikulas Patocka2009-06-221-0/+11
| | | | | | | | | | Flush support for dm-snapshot target. This patch just forwards the flush request to either the origin or the snapshot device. (It doesn't flush exception store metadata.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm mpath: support barriersMikulas Patocka2009-06-221-0/+2
| | | | | | | Flush support for dm-multipath target. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm delay: support barriersMikulas Patocka2009-06-221-2/+4
| | | | | | | Flush support for dm-delay target. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm crypt: support flushMikulas Patocka2009-06-221-0/+8
| | | | | | | Flush support for dm-crypt target. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: stripe support flushMikulas Patocka2009-06-221-3/+12
| | | | | | | | | | Flush support for the stripe target. This sets ti->num_flush_requests to the number of stripes and remaps individual flush requests to the appropriate stripe devices. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: linear support flushMikulas Patocka2009-06-221-1/+3
| | | | | | | Flush support for the linear target. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: send empty barriers to targets in dm_flushMikulas Patocka2009-06-221-0/+10
| | | | | | | Pass empty barrier flushes to the targets in dm_flush(). Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: initialise tio in alloc_tioAlasdair G Kergon2009-06-221-18/+15
| | | | | | Move repeated dm_target_io initialisation inside alloc_tio(). Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: introduce num_flush_requestsMikulas Patocka2009-06-221-0/+39
| | | | | | | | | | | | Introduce num_flush_requests for a target to set to say how many flush instructions (empty barriers) it wants to receive. These are sent by __clone_and_map_empty_barrier with map_info->flush_request going from 0 to (num_flush_requests - 1). Old targets without flush support won't receive any flush requests. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: remove check that prevents mapping empty biosMikulas Patocka2009-06-221-5/+0
| | | | | | | | Remove the check that the size of the cloned bio is not zero because a subsequent patch needs to send zero-sized barriers down this path. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: remove EOPNOTSUPP for barriersMikulas Patocka2009-06-221-1/+1
| | | | | | | | | | | | | If the underlying device doesn't support barriers and dm receives a barrier, it waits until all requests on that device drain so it no longer needs to report -EOPNOTSUPP to the caller. This patch deals with the confusing situation when moving a volume from one physical device to another triggers an EOPNOTSUPP on a volume that didn't report it before. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: store only first barrier errorMikulas Patocka2009-06-221-8/+9
| | | | | | | | | With the following patches, more than one error can occur during processing. Change md->barrier_error so that only the first one is recorded and returned to the caller. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: process requeue in dm_wq_workMikulas Patocka2009-06-221-3/+10
| | | | | | | | | | | If barrier request was returned with DM_ENDIO_REQUEUE, requeue it in dm_wq_work instead of dec_pending. This allows us to correctly handle a situation when some targets are asking for a requeue and other targets signal an error. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: make dm_flush return voidMikulas Patocka2009-06-221-13/+4
| | | | | | | | | Make dm_flush return void. The first error during flush is stored in md->barrier_error instead. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: always hold bdev referenceMikulas Patocka2009-06-221-41/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a potential deadlock when creating multiple snapshots by holding a reference to struct block_device for the whole lifecycle of every dm device instead of obtaining it independently at each point it is needed. bdget_disk() was called while the device was being suspended, in dm_suspend(). However there could be other devices already suspended, for example when creating additional snapshots of a device. bdget_disk() can wait for IO and allocate memory resulting in waiting for the already-suspended device - deadlock. This patch changes the code so that it gets the reference to struct block_device when struct mapped_device is allocated and initialized in alloc_dev() where it is always OK to allocate memory or wait for I/O. It drops the reference when it is destroyed in free_dev(). Thus there is no call to bdget_disk() while any device is suspended. Previously unlock_fs() was called only if bdev was held. Now it is called unconditionally, but the superfluous calls are harmless because it returns immediately if the filesystem was not previously frozen. This patch also now allows the device size to be changed in a noflush suspend because the bdev is held. This has no adverse effect. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: rename suspended_bdev to bdevMikulas Patocka2009-06-221-18/+18
| | | | | | | | | | | | Rename suspended_bdev to bdev. This patch doesn't change any functionality, just renames the variable. In the next patch, the variable will be used even for non-suspended device. (Pre-requisite for the per-target barrier support patches.) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm exception store: fix exstore lookup to be case insensitiveJonathan Brassow2009-06-221-1/+1
| | | | | | | | | | | | When snapshots are created using 'p' instead of 'P' as the exception store type, the device-mapper table loading fails. This patch makes the code case insensitive as intended and fixes some regressions reported with device-mapper snapshots. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: use i_size_readMikulas Patocka2009-06-223-3/+4
| | | | | | | | | | | | | | | Use i_size_read() instead of reading i_size. If someone changes the size of the device simultaneously, i_size_read is guaranteed to return a valid value (either the old one or the new one). i_size can return some intermediate invalid value (on 32-bit computers with 64-bit i_size, the reads to both halves of i_size can be interleaved with updates to i_size, resulting in garbage being returned). Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: avoid unsupported spanning of md stripe boundariesMikulas Patocka2009-06-221-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | A bio that has two or more vector entries, size less than or equal to page size, that crosses a stripe boundary of an underlying md device is accepted by device mapper (it conforms to all its limits) but not by the underlying device. The fix is: If device mapper selects the one-page maximum request size, it also needs to set its own q->merge_bvec_fn to reject any bios with multiple vector entries that span more pages. The problem was discovered in the following scenario: * MD - RAID-0 * LV on the top of it (raid1, snapshot or striped with chunk size/stripe larger than RAID-0 stripe) * one of the logical volumes is exported to xen domU * inside xen domU it is partitioned, the key point is that the partition must be unaligned on page boundary (fdisk normally aligns the partition to 63 sectors which will trigger it) * install the system on the partitioned disk in domU This causes I/O failures in dom0. Reference: https://bugzilla.redhat.com/show_bug.cgi?id=223947 Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm mpath: flush keventd queue in destructorMikulas Patocka2009-06-221-0/+1
| | | | | | | | | | | | | | | | | | | | | The commit fe9cf30eb8186ef267d1868dc9f12f2d0f40835a moves dm table event submission from kmultipath queue to kernel kevent queue to avoid a deadlock. There is a possibility of race condition because kevent queue is not flushed in the multipath destructor. The scenario is: - some event happens and is queued to keventd - keventd thread is delayed due to scheuling latency or some other work - multipath device is destroyed - keventd now attempts to process work_struct that is residing in already released memory. The patch flushes the keventd queue in multipath constructor. I've already fixed similar bug in dm-raid1. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org
* dm raid1: keep retrying alloc if mempool_alloc failedMikulas Patocka2009-06-221-1/+1
| | | | | | | | | | | | | | If the code can't handle allocation failures, use __GFP_NOFAIL so that in case of memory pressure the allocator will retry indefinitely and won't return NULL which would cause a crash in the function. This is still not a correct fix, it may cause a classic deadlock when memory manager waits for I/O being done and I/O waits for some free memory. I/O code shouldn't allocate any memory. But in this case it probably doesn't matter much in practice, people usually do not swap on RAID. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm mpath: call activate fn for each path in pg_initChandra Seetharaman2009-06-221-37/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixed a problem affecting reinstatement of passive paths. Before we moved the hardware handler from dm to SCSI, it performed a pg_init for a path group and didn't maintain any state about each path in hardware handler code. But in SCSI dh, such state is now maintained, as we want to fail I/O early on a path if it is not the active path. All the hardware handlers have a state now and set to active or some form of inactive. They have prep_fn() which uses this state to fail the I/O without it ever being sent to the device. So in effect when dm-multipath calls scsi_dh_activate(), activate is sent to only one path and the "state" of that path is changed appropriately to "active" while other paths in the same path group are never changed as they never got an "activate". In order make sure all the paths in a path group gets their state set properly when a pg_init happens, we need to call scsi_dh_activate() on all paths in a path group. Doing this at the hardware handler layer is not a good option as we want the multipath layer to define the relationship between path and path groups and not the hardware handler. Attached patch sends an "activate" on each path in a path group when a path group is switched. It also sends an activate when a path is reinstated. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm mpath: change attached scsi_dhHannes Reinecke2009-06-221-2/+13
| | | | | | | | | | | | | | | | | | | | When specifying a different hardware handler via multipath features we should be able to override the built-in defaults. The problem here is the hardware table from scsi_dh is compiled in and cannot be changed from userland. The multipath.conf OTOH is purely user-defined and, what's more, the user might have a valid reason for modifying it. (EG EMC Clariion can well be run in PNR mode even though ALUA is active, or the user might want to try ALUA on any as-of-yet unknown devices) So _not_ allowing multipath to override the device handler setting will just add to the confusion and makes error tracking even more difficult. Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm: sysfs skip output when device is being destroyedMilan Broz2009-06-221-0/+4
| | | | | | | | | | | | Do not process sysfs attributes when device is being destroyed. Otherwise code can cause BUG_ON(test_bit(DMF_FREEING, &md->flags)); in dm_put() call. Cc: stable@kernel.org Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm mpath: validate hw_handler argument countMikulas Patocka2009-06-221-0/+5
| | | | | | | | Fix arg count parsing error in hw handlers. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* dm mpath: validate table argument countMikulas Patocka2009-06-221-0/+6
| | | | | | | | | | | | | | The parser reads the argument count as a number but doesn't check that sufficient arguments are supplied. This command triggers the bug: dmsetup create mpath --table "0 `blockdev --getsize /dev/mapper/cr0` multipath 0 0 2 1 round-robin 1000 0 1 1 /dev/mapper/cr0 round-robin 0 1 1 /dev/mapper/cr1 1000" kernel BUG at drivers/md/dm-mpath.c:530! Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
* Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds2009-06-191-1/+1
|\ | | | | | | | | | | | | | | * 'for-linus' of git://git.kernel.dk/linux-2.6-block: Fix kernel-doc parameter name typo in blk-settings.c: block: rename CONFIG_LBD to CONFIG_LBDAF block: Fix bounce_pfn setting hd: stop defining MAJOR_NR
| * block: rename CONFIG_LBD to CONFIG_LBDAFBartlomiej Zolnierkiewicz2009-06-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Follow-up to "block: enable by default support for large devices and files on 32-bit archs". Rename CONFIG_LBD to CONFIG_LBDAF to: - allow update of existing [def]configs for "default y" change - reflect that it is used also for large files support nowadays Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>