7 Stored Data Management #
The CRUSH algorithm determines how to store and retrieve data by computing data storage locations. CRUSH empowers Ceph clients to communicate with OSDs directly rather than through a centralized server or broker. With an algorithmically determined method of storing and retrieving data, Ceph avoids a single point of failure, a performance bottleneck, and a physical limit to its scalability.
CRUSH requires a map of your cluster, and uses the CRUSH Map to pseudo-randomly store and retrieve data in OSDs with a uniform distribution of data across the cluster.
CRUSH maps contain a list of OSDs, a list of 'buckets' for aggregating the devices into physical locations, and a list of rules that tell CRUSH how it should replicate data in a Ceph cluster’s pools. By reflecting the underlying physical organization of the installation, CRUSH can model—and thereby address—potential sources of correlated device failures. Typical sources include physical proximity, a shared power source, and a shared network. By encoding this information into the cluster map, CRUSH placement policies can separate object replicas across different failure domains while still maintaining the desired distribution. For example, to address the possibility of concurrent failures, it may be desirable to ensure that data replicas are on devices using different shelves, racks, power supplies, controllers, and/or physical locations.
After you deploy a Ceph cluster, a default CRUSH Map is generated. It is fine for your Ceph sandbox environment. However, when you deploy a large-scale data cluster, you should give significant consideration to developing a custom CRUSH Map, because it will help you manage your Ceph cluster, improve performance and ensure data safety.
For example, if an OSD goes down, a CRUSH Map can help you locate the physical data center, room, row and rack of the host with the failed OSD in the event you need to use on-site support or replace hardware.
Similarly, CRUSH may help you identify faults more quickly. For example, if all OSDs in a particular rack go down simultaneously, the fault may lie with a network switch or power to the rack or the network switch rather than the OSDs themselves.
A custom CRUSH Map can also help you identify the physical locations where Ceph stores redundant copies of data when the placement group(s) associated with a failed host are in a degraded state.
There are three main sections to a CRUSH Map.
7.1 Devices #
To map placement groups to OSDs, a CRUSH Map requires a list of OSD devices (the name of the OSD daemon). The list of devices appears first in the CRUSH Map.
#devices
device NUM osd.OSD_NAME class CLASS_NAME
For example:
#devices
device 0 osd.0 class hdd
device 1 osd.1 class ssd
device 2 osd.2 class nvme
device 3 osd.3 class ssd
As a general rule, an OSD daemon maps to a single disk.
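As an illustration, the device list of a decompiled CRUSH Map can be read with a few lines of Python. This is a minimal sketch that assumes the `device NUM osd.OSD_NAME class CLASS_NAME` layout shown above; the function name `parse_devices` is hypothetical and not part of any Ceph tool, and treating the class suffix as optional is an assumption of this sketch.

```python
import re

# Matches lines such as "device 0 osd.0 class hdd".
# Assumption of this sketch: the "class ..." suffix is treated as optional.
DEVICE_RE = re.compile(r"^device\s+(\d+)\s+(osd\.\d+)(?:\s+class\s+(\S+))?\s*$")

def parse_devices(crush_text):
    """Return a list of (device number, OSD name, device class or None)."""
    devices = []
    for line in crush_text.splitlines():
        match = DEVICE_RE.match(line.strip())
        if match:
            num, name, dev_class = match.groups()
            devices.append((int(num), name, dev_class))
    return devices

SAMPLE = """\
#devices
device 0 osd.0 class hdd
device 1 osd.1 class ssd
device 2 osd.2 class nvme
device 3 osd.3 class ssd
"""

for num, name, dev_class in parse_devices(SAMPLE):
    print(num, name, dev_class)
```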
7.1.1 Device Classes #
The flexibility of the CRUSH Map in controlling data placement is one of Ceph's strengths. It is also one of the most difficult parts of the cluster to manage. Device classes automate one of the most common reasons why CRUSH Maps are manually edited.
7.1.1.1 The CRUSH Management Problem #
Ceph clusters are frequently built with multiple types of storage devices: HDDs, SSDs, NVMe devices, or even mixed classes of the above. We call these different types of storage devices device classes to avoid confusion with the type property of CRUSH buckets (for example, host, rack, or row; see Section 7.2, “Buckets” for more details). Ceph OSDs backed by SSDs are much faster than those backed by spinning disks, making them better suited for certain workloads. Ceph makes it easy to create RADOS pools for different data sets or workloads and to assign different CRUSH rules to control data placement for those pools.
Figure 7.1: OSDs with Mixed Device Classes #
However, setting up the CRUSH rules to place data only on a certain class of device is tedious. Rules work in terms of the CRUSH hierarchy, but if the devices are mixed into the same hosts or racks (as in the sample hierarchy above), they will (by default) be mixed together and appear in the same subtrees of the hierarchy. In previous versions of SUSE Enterprise Storage, separating them manually into separate trees involved creating multiple versions of each intermediate node, one for each device class.
7.1.1.2 Device Classes #
     An elegant solution that Ceph offers is to add a property called
     device class to each OSD. By default, OSDs will
     automatically set their device classes to either 'hdd', 'ssd', or 'nvme'
     based on the hardware properties exposed by the Linux kernel. These device
     classes are reported in a new column of the ceph osd
     tree command output:
    
cephadm > ceph osd tree
 ID CLASS WEIGHT   TYPE NAME      STATUS REWEIGHT PRI-AFF
 -1       83.17899 root default
 -4       23.86200     host cpach
 2   hdd  1.81898         osd.2      up  1.00000 1.00000
 3   hdd  1.81898         osd.3      up  1.00000 1.00000
 4   hdd  1.81898         osd.4      up  1.00000 1.00000
 5   hdd  1.81898         osd.5      up  1.00000 1.00000
 6   hdd  1.81898         osd.6      up  1.00000 1.00000
 7   hdd  1.81898         osd.7      up  1.00000 1.00000
 8   hdd  1.81898         osd.8      up  1.00000 1.00000
 15  hdd  1.81898         osd.15     up  1.00000 1.00000
 10  nvme 0.93100         osd.10     up  1.00000 1.00000
 0   ssd  0.93100         osd.0      up  1.00000 1.00000
 9   ssd  0.93100         osd.9      up  1.00000 1.00000
     If the automatic device class detection fails, for example because the
     device driver is not properly exposing information about the device via
     /sys/block, you can adjust device classes from the
     command line:
    
cephadm > ceph osd crush rm-device-class osd.2 osd.3
done removing class of osd(s): 2,3
cephadm > ceph osd crush set-device-class ssd osd.2 osd.3
set osd(s) 2,3 to class 'ssd'
7.1.1.3 CRUSH Placement Rules #
CRUSH rules can restrict placement to a specific device class. For example, you can create a 'fast' replicated pool that distributes data only over SSD disks by running the following command:
cephadm > ceph osd crush rule create-replicated RULE_NAME ROOT FAILURE_DOMAIN_TYPE DEVICE_CLASS
For example:
cephadm > ceph osd crush rule create-replicated fast default host ssd
Create a pool named 'fast_pool' and assign it to the 'fast' rule:
cephadm > ceph osd pool create fast_pool 128 128 replicated fast
The process for creating erasure code rules is slightly different. First, you create an erasure code profile that includes a property for your desired device class. Then use that profile when creating the erasure coded pool:
cephadm > ceph osd erasure-code-profile set myprofile \
  k=4 m=2 crush-device-class=ssd crush-failure-domain=host
cephadm > ceph osd pool create mypool 64 erasure myprofile
If you need to manually edit the CRUSH Map to customize your rule, the syntax has been extended to allow the device class to be specified. For example, the CRUSH rule generated by the above commands looks as follows:
rule ecpool {
  id 2
  type erasure
  min_size 3
  max_size 6
  step set_chooseleaf_tries 5
  step set_choose_tries 100
  step take default class ssd
  step chooseleaf indep 0 type host
  step emit
}
The important difference there is that the 'take' command includes the additional 'class CLASS_NAME' suffix.
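To check which device class a rule is restricted to, the 'take' steps of a decompiled rule can be inspected with a short script. The following is a minimal sketch that assumes the `step take ROOT class CLASS_NAME` syntax shown above; the helper name `take_steps` is hypothetical.

```python
import re

# "step take ROOT", optionally followed by "class CLASS_NAME", on one line.
TAKE_RE = re.compile(
    r"^[ \t]*step[ \t]+take[ \t]+(\S+)(?:[ \t]+class[ \t]+(\S+))?[ \t]*$",
    re.MULTILINE,
)

def take_steps(rule_text):
    """Return (root, device class) for every take step; class is None if absent."""
    return [(m.group(1), m.group(2)) for m in TAKE_RE.finditer(rule_text)]

RULE = """\
rule ecpool {
  id 2
  type erasure
  step take default class ssd
  step chooseleaf indep 0 type host
  step emit
}
"""

print(take_steps(RULE))
```

A rule without the class suffix yields `None` as the class, which indicates that the rule is not limited to one device class.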
7.1.1.4 Additional Commands #
To list device classes used in a CRUSH Map, run:
cephadm > ceph osd crush class ls
[
  "hdd",
  "ssd"
]
To list existing CRUSH rules, run:
cephadm > ceph osd crush rule ls
replicated_rule
fast
To view details of the CRUSH rule named 'fast', run:
cephadm > ceph osd crush rule dump fast
{
  "rule_id": 1,
  "rule_name": "fast",
  "ruleset": 1,
  "type": 1,
  "min_size": 1,
  "max_size": 10,
  "steps": [
    {
      "op": "take",
      "item": -21,
      "item_name": "default~ssd"
    },
    {
      "op": "chooseleaf_firstn",
      "num": 0,
      "type": "host"
    },
    {
      "op": "emit"
    }
  ]
}
To list OSDs that belong to an 'ssd' class, run:
cephadm > ceph osd crush class ls-osd ssd
0
1
7.1.1.5 Migrating from a Legacy SSD Rule to Device Classes #
In SUSE Enterprise Storage prior to version 5, you needed to manually edit the CRUSH Map and maintain a parallel hierarchy for each specialized device type (such as SSD) in order to write rules that apply to these devices. Since SUSE Enterprise Storage 5, the device class feature has enabled this transparently.
     You can transform a legacy rule and hierarchy to the new class-based rules
     by using the crushtool command. There are several types
     of transformation possible:
    
- crushtool --reclassify-root ROOT_NAME DEVICE_CLASS
- This command takes everything in the hierarchy beneath ROOT_NAME and adjusts any rules that reference that root via
take ROOT_NAME
to instead
take ROOT_NAME class DEVICE_CLASS
It renumbers the buckets so that the old IDs are used for the specified class’s 'shadow tree'. As a consequence, no data movement occurs.
Example 7.1: crushtool --reclassify-root #
Consider the following existing rule:
rule replicated_ruleset {
  id 0
  type replicated
  min_size 1
  max_size 10
  step take default
  step chooseleaf firstn 0 type rack
  step emit
}
If you reclassify the root 'default' as class 'hdd', the rule will become
rule replicated_ruleset {
  id 0
  type replicated
  min_size 1
  max_size 10
  step take default class hdd
  step chooseleaf firstn 0 type rack
  step emit
}
- crushtool --set-subtree-class BUCKET_NAME DEVICE_CLASS
- This method marks every device in the subtree rooted at BUCKET_NAME with the specified device class.
--set-subtree-class is normally used in conjunction with the --reclassify-root option to ensure that all devices in that root are labeled with the correct class. However, some of those devices may intentionally have a different class, and therefore you do not want to relabel them. In such cases, exclude the --set-subtree-class option. Keep in mind that such remapping will not be perfect, because the previous rule is distributed across devices of multiple classes but the adjusted rules will only map to devices of the specified device class.
- crushtool --reclassify-bucket MATCH_PATTERN DEVICE_CLASS DEFAULT_PATTERN
- This method allows merging a parallel type-specific hierarchy with the normal hierarchy. For example, many users have CRUSH Maps similar to the following one:
Example 7.2: crushtool --reclassify-bucket #
host node1 {
  id -2           # do not change unnecessarily
  # weight 109.152
  alg straw
  hash 0  # rjenkins1
  item osd.0 weight 9.096
  item osd.1 weight 9.096
  item osd.2 weight 9.096
  item osd.3 weight 9.096
  item osd.4 weight 9.096
  item osd.5 weight 9.096
  [...]
}
host node1-ssd {
  id -10          # do not change unnecessarily
  # weight 2.000
  alg straw
  hash 0  # rjenkins1
  item osd.80 weight 2.000
  [...]
}
root default {
  id -1           # do not change unnecessarily
  alg straw
  hash 0  # rjenkins1
  item node1 weight 110.967
  [...]
}
root ssd {
  id -18          # do not change unnecessarily
  # weight 16.000
  alg straw
  hash 0  # rjenkins1
  item node1-ssd weight 2.000
  [...]
}
This function reclassifies each bucket that matches a given pattern. The pattern can look like %suffix or prefix%. In the above example, you would use the pattern %-ssd. For each matched bucket, the remaining portion of the name that matches the '%' wild card specifies the base bucket. All devices in the matched bucket are labeled with the specified device class and then moved to the base bucket. If the base bucket does not exist (for example if 'node12-ssd' exists but 'node12' does not), then it is created and linked underneath the specified default parent bucket. The old bucket IDs are preserved for the new shadow buckets to prevent data movement. Rules with take steps that reference the old buckets are adjusted.
- crushtool --reclassify-bucket BUCKET_NAME DEVICE_CLASS BASE_BUCKET
- You can use the --reclassify-bucket option without a wild card to map a single bucket. For example, in the previous example, we want the 'ssd' bucket to be mapped to the default bucket.
The final command to convert the map composed of the above fragments would be as follows:
cephadm > ceph osd getcrushmap -o original
cephadm > crushtool -i original --reclassify \
  --set-subtree-class default hdd \
  --reclassify-root default hdd \
  --reclassify-bucket %-ssd ssd default \
  --reclassify-bucket ssd ssd default \
  -o adjusted
To verify that the conversion is correct, there is a --compare option that tests a large sample of inputs to the CRUSH Map and checks whether the same result comes back out. These inputs are controlled by the same options that apply to the --test option. For the above example, the command would be as follows:
cephadm > crushtool -i original --compare adjusted
rule 0 had 0/10240 mismatched mappings (0)
rule 1 had 0/10240 mismatched mappings (0)
maps appear equivalent
Tip
If there were differences, you would see what ratio of inputs are remapped in the parentheses.
If you are satisfied with the adjusted CRUSH Map, you can apply it to the cluster:
cephadm > ceph osd setcrushmap -i adjusted
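If the comparison is run from a script, the --compare output shown above can be checked before the adjusted map is applied. The following is a minimal sketch that assumes the output keeps the `rule N had X/Y mismatched mappings` line format from the example; the helper name `mismatch_ratios` is hypothetical.

```python
import re

# Line format taken from the sample output in this section:
# "rule 0 had 0/10240 mismatched mappings (0)"
MISMATCH_RE = re.compile(r"rule (\d+) had (\d+)/(\d+) mismatched mappings")

def mismatch_ratios(compare_output):
    """Map each rule ID to the fraction of sampled inputs that were remapped."""
    ratios = {}
    for rule_id, mismatched, total in MISMATCH_RE.findall(compare_output):
        ratios[int(rule_id)] = int(mismatched) / int(total)
    return ratios

OUTPUT = """\
rule 0 had 0/10240 mismatched mappings (0)
rule 1 had 0/10240 mismatched mappings (0)
maps appear equivalent
"""

ratios = mismatch_ratios(OUTPUT)
print(ratios)
# True only if no rule remaps any of the sampled inputs.
print(all(ratio == 0 for ratio in ratios.values()))
```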
7.1.1.6 For More Information #
Find more details on CRUSH Maps in Section 7.4, “CRUSH Map Manipulation”.
Find more details on Ceph pools in general in Chapter 8, Managing Storage Pools.
Find more details about erasure coded pools in Chapter 10, Erasure Coded Pools.
7.2 Buckets #
CRUSH maps contain a list of OSDs, which can be organized into 'buckets' for aggregating the devices into physical locations.
| ID | Type | Description |
| 0 | osd | An OSD daemon (osd.1, osd.2, etc.). |
| 1 | host | A host name containing one or more OSDs. |
| 2 | chassis | Chassis of which the rack is composed. |
| 3 | rack | A computer rack. The default is  |
| 4 | row | A row in a series of racks. |
| 5 | pdu | Power distribution unit. |
| 6 | pod | |
| 7 | room | A room containing racks and rows of hosts. |
| 8 | datacenter | A physical data center containing rooms. |
| 9 | region | |
| 10 | root | |
Tip
You can modify the existing types and create your own bucket types.
   Ceph’s deployment tools generate a CRUSH Map that contains a bucket for
   each host, and a root named 'default', which is useful for the default
   rbd pool. The remaining bucket types provide a means for
   storing information about the physical location of nodes/buckets, which
   makes cluster administration much easier when OSDs, hosts, or network
   hardware malfunction and the administrator needs access to physical
   hardware.
  
   A bucket has a type, a unique name (string), a unique ID expressed as a
   negative integer, a weight relative to the total capacity/capability of its
   item(s), the bucket algorithm (straw2 by default), and
   the hash (0 by default, reflecting CRUSH Hash
   rjenkins1). A bucket may have one or more items. The
   items may consist of other buckets or OSDs. Items may have a weight that
   reflects the relative weight of the item.
  
[bucket-type] [bucket-name] {
  id [a unique negative numeric ID]
  weight [the relative capacity/capability of the item(s)]
  alg [the bucket type: uniform | list | tree | straw2 | straw ]
  hash [the hash type: 0 by default]
  item [item-name] weight [weight]
}
The following example illustrates how you can use buckets to aggregate a pool and physical locations like a data center, a room, a rack and a row.
host ceph-osd-server-1 {
        id -17
        alg straw2
        hash 0
        item osd.0 weight 0.546
        item osd.1 weight 0.546
}
row rack-1-row-1 {
        id -16
        alg straw2
        hash 0
        item ceph-osd-server-1 weight 2.00
}
rack rack-3 {
        id -15
        alg straw2
        hash 0
        item rack-3-row-1 weight 2.00
        item rack-3-row-2 weight 2.00
        item rack-3-row-3 weight 2.00
        item rack-3-row-4 weight 2.00
        item rack-3-row-5 weight 2.00
}
rack rack-2 {
        id -14
        alg straw2
        hash 0
        item rack-2-row-1 weight 2.00
        item rack-2-row-2 weight 2.00
        item rack-2-row-3 weight 2.00
        item rack-2-row-4 weight 2.00
        item rack-2-row-5 weight 2.00
}
rack rack-1 {
        id -13
        alg straw2
        hash 0
        item rack-1-row-1 weight 2.00
        item rack-1-row-2 weight 2.00
        item rack-1-row-3 weight 2.00
        item rack-1-row-4 weight 2.00
        item rack-1-row-5 weight 2.00
}
room server-room-1 {
        id -12
        alg straw2
        hash 0
        item rack-1 weight 10.00
        item rack-2 weight 10.00
        item rack-3 weight 10.00
}
datacenter dc-1 {
        id -11
        alg straw2
        hash 0
        item server-room-1 weight 30.00
        item server-room-2 weight 30.00
}
root data {
        id -10
        alg straw2
        hash 0
        item dc-1 weight 60.00
        item dc-2 weight 60.00
}
7.3 Rule Sets #
CRUSH maps support the notion of 'CRUSH rules', which are the rules that determine data placement for a pool. For large clusters, you will likely create many pools where each pool may have its own CRUSH ruleset and rules. The default CRUSH Map has a rule for the default root. If you want more roots and more rules, you need to create them later or they will be created automatically when new pools are created.
Note
In most cases, you will not need to modify the default rules. When you create a new pool, its default ruleset is 0.
A rule takes the following form:
rule rulename {
        ruleset ruleset
        type type
        min_size min-size
        max_size max-size
        step step
}
- ruleset
- An integer. Classifies a rule as belonging to a set of rules. Activated by setting the ruleset in a pool. This option is required. Default is 0.
- type
- A string. Describes a rule for either a 'replicated' or an 'erasure' coded pool. This option is required. Default is replicated.
- min_size
- An integer. If a pool makes fewer replicas than this number, CRUSH will NOT select this rule. This option is required. Default is 2.
- max_size
- An integer. If a pool makes more replicas than this number, CRUSH will NOT select this rule. This option is required. Default is 10.
- step take bucket
- Takes a bucket specified by a name, and begins iterating down the tree. This option is required. For an explanation about iterating through the tree, see Section 7.3.1, “Iterating Through the Node Tree”.
- step target mode num type bucket-type
- target can either be choose or chooseleaf. When set to choose, a number of buckets is selected. chooseleaf directly selects the OSDs (leaf nodes) from the sub-tree of each bucket in the set of buckets.
mode can either be firstn or indep. See Section 7.3.2, “firstn and indep”.
Selects the number of buckets of the given type. Where N is the number of options available, if num > 0 && < N, choose that many buckets; if num < 0, choose N - |num| buckets; and, if num == 0, choose N buckets (all available). Follows step take or step choose.
- step emit
- Outputs the current value and empties the stack. Typically used at the end of a rule, but may also be used to form different trees in the same rule. Follows step choose.
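The handling of num in a choose step can be summarized in a few lines of Python. This is a minimal sketch, not the CRUSH implementation: the function name `resolve_num` is hypothetical, a negative num is interpreted as N minus its absolute value, and capping a num larger than N at N is an assumption of this sketch.

```python
def resolve_num(num, available):
    """Return how many buckets a choose step selects.

    `available` plays the role of N in the description of `num`.
    """
    if num == 0:
        return available                # choose all N buckets
    if num > 0:
        return min(num, available)      # assumption: never more than N
    return max(available + num, 0)      # negative: N minus |num|

for num in (0, 2, -1):
    print(num, resolve_num(num, 3))
```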
7.3.1 Iterating Through the Node Tree #
The structure defined with the buckets can be viewed as a node tree. Buckets are nodes and OSDs are leaves in this tree.
Rules in the CRUSH Map define how OSDs are selected from this tree. A rule starts with a node and then iterates down the tree to return a set of OSDs. It is not possible to define which branch needs to be selected. Instead, the CRUSH algorithm ensures that the set of OSDs fulfills the replication requirements and evenly distributes the data.
    With step take bucket the
    iteration through the node tree begins at the given bucket (not bucket
    type). If OSDs from all branches in the tree are to be returned, the bucket
    must be the root bucket. Otherwise the following steps are only iterating
    through a sub-tree.
   
    After step take one or more step
    choose entries follow in the rule definition. Each step
    choose chooses a defined number of nodes (or branches) from the
    previously selected upper node.
   
    In the end the selected OSDs are returned with step
    emit.
   
    step chooseleaf is a convenience function that directly
    selects OSDs from branches of the given bucket.
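The take, choose, and emit sequence described above can be modeled with a toy tree. The following is a simplified sketch, not the CRUSH algorithm itself: the hierarchy, the function names, and the use of SHA-256 as a stand-in for the CRUSH hash are all assumptions, and a num of 0 simply means "all candidates" here.

```python
import hashlib

# Hypothetical toy hierarchy: each bucket maps to its children.
CHILDREN = {
    "rack1": ["host1", "host2"],
    "host1": ["osd.0", "osd.1"],
    "host2": ["osd.2", "osd.3"],
}
BUCKET_TYPE = {"rack1": "rack", "host1": "host", "host2": "host"}

def node_type(name):
    # Anything that is not a known bucket is treated as an OSD (leaf).
    return BUCKET_TYPE.get(name, "osd")

def rank(key, name):
    # Deterministic pseudo-random score; SHA-256 stands in for the CRUSH hash.
    return hashlib.sha256(f"{key}/{name}".encode()).hexdigest()

def nodes_of_type(start, wanted):
    """Collect all nodes of type `wanted` at or below `start`."""
    if node_type(start) == wanted:
        return [start]
    found = []
    for child in CHILDREN.get(start, []):
        found.extend(nodes_of_type(child, wanted))
    return found

def choose(working_set, num, wanted, key):
    """Model of step choose: pick nodes of type `wanted` below each input node."""
    chosen = []
    for node in working_set:
        candidates = sorted(nodes_of_type(node, wanted), key=lambda n: rank(key, n))
        chosen.extend(candidates if num == 0 else candidates[:num])
    return chosen

def chooseleaf(working_set, num, wanted, key):
    """Model of step chooseleaf: choose buckets, then one OSD inside each."""
    buckets = choose(working_set, num, wanted, key)
    return choose(buckets, 1, "osd", key)

# take rack1, choose all hosts, choose one OSD per host, then emit.
working = ["rack1"]
working = choose(working, 0, "host", key="object-a")
working = choose(working, 1, "osd", key="object-a")
print(working)

# The same selection expressed as a single chooseleaf step.
print(chooseleaf(["rack1"], 0, "host", key="object-a"))
```

Because the ranking depends only on the key and the node name, the same key always yields the same OSDs, which mirrors how placement can be computed instead of looked up.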
   
    Figure 7.2, “Example Tree” provides an example of
    how step is used to iterate through a tree. The orange
    arrows and numbers correspond to example1a and
    example1b, while blue corresponds to
    example2 in the following rule definitions.
   
Figure 7.2: Example Tree #
# orange arrows
rule example1a {
        ruleset 0
        type replicated
        min_size 2
        max_size 10
        # orange (1)
        step take rack1
        # orange (2)
        step choose firstn 0 type host
        # orange (3)
        step choose firstn 1 type osd
        step emit
}
rule example1b {
        ruleset 0
        type replicated
        min_size 2
        max_size 10
        # orange (1)
        step take rack1
        # orange (2) + (3)
        step chooseleaf firstn 0 type host
        step emit
}
# blue arrows
rule example2 {
        ruleset 0
        type replicated
        min_size 2
        max_size 10
        # blue (1)
        step take room1
        # blue (2)
        step chooseleaf firstn 0 type rack
        step emit
}
7.3.2 firstn and indep #
    A CRUSH rule defines replacements for failed nodes or OSDs (see
    Section 7.3, “Rule Sets”). The choose and chooseleaf steps
    require either firstn or indep as a
    parameter. Figure 7.3, “Node Replacement Methods”
    provides an example.
   
    firstn adds replacement nodes to the end of the list of
    active nodes. In case of a failed node, the following healthy nodes are
    shifted to the left to fill the gap of the failed node. This is the default
    and desired method for replicated pools, because a
    secondary node already has all data and therefore can take over the duties
    of the primary node immediately.
   
    indep selects fixed replacement nodes for each active
    node. The replacement of a failed node does not change the order of the
    remaining nodes. This is desired for erasure coded
    pools. In erasure coded pools the data stored on a node depends
    on its position in the node selection. When the order of nodes changes, all
    data on affected nodes needs to be relocated.
   
Figure 7.3: Node Replacement Methods #
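The difference between the two modes can be illustrated with plain lists. This is a minimal sketch of the resulting order only, not of how CRUSH picks the replacement; the function names and OSD names are hypothetical.

```python
def replace_firstn(active, failed, spare):
    """firstn: drop the failed node, shift the rest left, append the spare."""
    survivors = [node for node in active if node != failed]
    return survivors + [spare]

def replace_indep(active, failed, spare):
    """indep: put the spare into the failed node's slot; other positions stay."""
    return [spare if node == failed else node for node in active]

active = ["osd.1", "osd.2", "osd.3", "osd.4", "osd.5"]

print(replace_firstn(active, "osd.2", "osd.6"))
# ['osd.1', 'osd.3', 'osd.4', 'osd.5', 'osd.6']

print(replace_indep(active, "osd.2", "osd.6"))
# ['osd.1', 'osd.6', 'osd.3', 'osd.4', 'osd.5']
```

With firstn, every node after the failed one changes position, which is harmless for replicas that all hold the same data. With indep, only the failed slot changes, so the position-dependent data of an erasure coded pool stays where it is.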
7.4 CRUSH Map Manipulation #
This section introduces basic ways to manipulate a CRUSH Map, such as editing a CRUSH Map, changing CRUSH Map parameters, and adding, moving, or removing an OSD.
7.4.1 Editing a CRUSH Map #
To edit an existing CRUSH map, do the following:
- Get a CRUSH Map. To get the CRUSH Map for your cluster, execute the following:
cephadm > ceph osd getcrushmap -o compiled-crushmap-filename
Ceph will output (-o) a compiled CRUSH Map to the file name you specified. Since the CRUSH Map is in a compiled form, you must decompile it before you can edit it.
- Decompile a CRUSH Map. To decompile a CRUSH Map, execute the following:
cephadm > crushtool -d compiled-crushmap-filename \
  -o decompiled-crushmap-filename
Ceph will decompile (-d) the compiled CRUSH Map and output (-o) it to the file name you specified.
- Edit at least one of the Devices, Buckets, and Rules parameters.
- Compile a CRUSH Map. To compile a CRUSH Map, execute the following:
cephadm > crushtool -c decompiled-crush-map-filename \
  -o compiled-crush-map-filename
Ceph will store a compiled CRUSH Map to the file name you specified.
- Set a CRUSH Map. To set the CRUSH Map for your cluster, execute the following:
cephadm > ceph osd setcrushmap -i compiled-crushmap-filename
Ceph will input the compiled CRUSH Map of the file name you specified as the CRUSH Map for the cluster.
Tip: Use Versioning System
Use a versioning system—such as git or svn—for the exported and modified CRUSH Map files. It makes a possible rollback simple.
Tip: Test the New CRUSH Map
     Test the new adjusted CRUSH Map using the crushtool
     --test command, and compare to the state before applying the new
     CRUSH Map. You may find the following command switches useful:
     --show-statistics, --show-mappings,
     --show-bad-mappings, --show-utilization,
     --show-utilization-all,
     --show-choose-tries
    
7.4.2 Add/Move an OSD #
To add or move an OSD in the CRUSH Map of a running cluster, execute the following:
cephadm > ceph osd crush set id_or_name weight root=pool-name \
bucket-type=bucket-name ...
- id
- An integer. The numeric ID of the OSD. This option is required. 
- name
- A string. The full name of the OSD. This option is required. 
- weight
- A double. The CRUSH weight for the OSD. This option is required. 
- root
- A key/value pair. By default, the CRUSH hierarchy contains the pool default as its root. This option is required. 
- bucket-type
- Key/value pairs. You may specify the OSD’s location in the CRUSH hierarchy. 
    The following example adds osd.0 to the hierarchy, or
    moves the OSD from a previous location.
   
cephadm > ceph osd crush set osd.0 1.0 root=data datacenter=dc1 room=room1 \
row=foo rack=bar host=foo-bar-1
7.4.3 Difference between ceph osd reweight and ceph osd crush reweight #
There are two similar commands that change the 'weight' of a Ceph OSD. They are used in different contexts, which may cause confusion.
7.4.3.1 ceph osd reweight #
Usage:
cephadm > ceph osd reweight OSD_NAME NEW_WEIGHT
     ceph osd reweight sets an override weight on the Ceph OSD.
     This value is in the range 0 to 1, and forces CRUSH to re-place
     (1 - weight) of the data that would otherwise live on this drive. It does
     not change the weights assigned to the
     buckets above the OSD, and is a corrective measure in case the normal
     CRUSH distribution is not working out quite right. For example, if one of
     your OSDs is at 90% utilization and the others are at 40%, you could
     reduce this weight to try and compensate for it.
    
Note: OSD Weight is Temporary
      Note that ceph osd reweight is not a persistent
      setting. When an OSD gets marked out, its weight will be set to 0 and
      when it gets marked in again, the weight will be changed to 1.
     
7.4.3.2 ceph osd crush reweight #
Usage:
cephadm > ceph osd crush reweight OSD_NAME NEW_WEIGHT
     ceph osd crush reweight sets the
     CRUSH weight of the OSD. This weight is
     an arbitrary value—generally the size of the disk in TB—and
     controls how much data the system tries to allocate to the OSD.
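How the two weights interact can be approximated with a small model. This is a deliberately simplified sketch, not the way Ceph computes placement: it assumes that the override weight acts as a plain multiplier on the CRUSH weight and that shares are proportional to the product. The function name and the sample weights are hypothetical.

```python
def expected_shares(crush_weights, overrides=None):
    """Approximate each OSD's share of data as CRUSH weight times override weight.

    Simplified model (assumption): the override is treated as a plain
    multiplier, and shares are normalized so that they sum to 1.
    """
    overrides = overrides or {}
    effective = {
        osd: weight * overrides.get(osd, 1.0)
        for osd, weight in crush_weights.items()
    }
    total = sum(effective.values())
    if total == 0:
        raise ValueError("at least one OSD needs a non-zero effective weight")
    return {osd: value / total for osd, value in effective.items()}

crush_weights = {"osd.0": 1.819, "osd.1": 1.819, "osd.2": 1.819}

# Equal CRUSH weights: each OSD is expected to hold about a third of the data.
print(expected_shares(crush_weights))

# An override weight of 0.5 on osd.0 moves part of its data to the other OSDs.
print(expected_shares(crush_weights, {"osd.0": 0.5}))
```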
    
7.4.4 Remove an OSD #
To remove an OSD from the CRUSH Map of a running cluster, execute the following:
cephadm > ceph osd crush remove OSD_NAME
7.4.5 Add a Bucket #
    To add a bucket in the CRUSH Map of a running cluster, execute the
    ceph osd crush add-bucket command:
   
cephadm > ceph osd crush add-bucket BUCKET_NAME BUCKET_TYPE
7.4.6 Move a Bucket #
To move a bucket to a different location or position in the CRUSH Map hierarchy, execute the following:
cephadm > ceph osd crush move BUCKET_NAME BUCKET_TYPE=BUCKET_NAME [...]
For example:
cephadm > ceph osd crush move bucket1 datacenter=dc1 room=room1 row=foo rack=bar host=foo-bar-1
7.4.7 Remove a Bucket #
To remove a bucket from the CRUSH Map hierarchy, execute the following:
cephadm > ceph osd crush remove BUCKET_NAME
Note: Empty Bucket Only
A bucket must be empty before removing it from the CRUSH hierarchy.
7.5 Scrubbing #
   In addition to making multiple copies of objects, Ceph ensures data
   integrity by scrubbing placement groups (find more
   information about placement groups in
   Book “Deployment Guide”, Chapter 1 “SUSE Enterprise Storage 5.5 and Ceph”, Section 1.4.2 “Placement Group”). Ceph scrubbing is analogous
   to running fsck on the object storage layer. For each
   placement group, Ceph generates a catalog of all objects and compares each
   primary object and its replicas to ensure that no objects are missing or
   mismatched. Daily light scrubbing checks the object size and attributes,
   while weekly deep scrubbing reads the data and uses checksums to ensure data
   integrity.
  
Scrubbing is important for maintaining data integrity, but it can reduce performance. You can adjust the following settings to increase or decrease scrubbing operations:
- osd max scrubs
- The maximum number of simultaneous scrub operations for a Ceph OSD. Default is 1. 
- osd scrub begin hour, osd scrub end hour
- The hours of day (0 to 24) that define a time window when the scrubbing can happen. By default, the window begins at 0 and ends at 24.
Important
If the placement group’s scrub interval exceeds the osd scrub max interval setting, the scrub will happen no matter what time window you define for scrubbing.
- osd scrub during recovery
- Allows scrubs during recovery. Setting this to 'false' will disable scheduling new scrubs while there is an active recovery. Already running scrubs will continue. This option is useful for reducing load on busy clusters. Default is 'true'. 
- osd scrub thread timeout
- The maximum time in seconds before a scrub thread times out. Default is 60. 
- osd scrub finalize thread timeout
- The maximum time in seconds before a scrub finalize thread times out. Default is 60*10.
- osd scrub load threshold
- The normalized maximum load. Ceph will not scrub when the system load (as defined by the ratio of getloadavg() / number of online cpus) is higher than this number. Default is 0.5.
- osd scrub min interval
- The minimal interval in seconds for scrubbing a Ceph OSD when the Ceph cluster load is low. Default is 60*60*24 (once a day).
- osd scrub max interval
- The maximum interval in seconds for scrubbing a Ceph OSD irrespective of cluster load. Default is 7*60*60*24 (once a week).
- osd scrub chunk min
- The minimal number of object store chunks to scrub during a single operation. Ceph blocks writes to a single chunk during scrub. Default is 5.
- osd scrub chunk max
- The maximum number of object store chunks to scrub during a single operation. Default is 25.
- osd scrub sleep
- Time to sleep before scrubbing next group of chunks. Increasing this value slows down the whole scrub operation while client operations are less impacted. Default is 0. 
- osd deep scrub interval
- The interval for 'deep' scrubbing (fully reading all data). The osd scrub load threshold option does not affect this setting. Default is 60*60*24*7 (once a week).
- osd scrub interval randomize ratio
- Add a random delay to the osd scrub min interval value when scheduling the next scrub job for a placement group. The delay is a random value smaller than the result of osd scrub min interval * osd scrub interval randomize ratio. Therefore, the default setting practically randomly spreads the scrubs out in the allowed time window of [1, 1.5] * osd scrub min interval. Default is 0.5.
- osd deep scrub stride
- Read size when doing a deep scrub. Default is 524288 (512 kB). 
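Two of the settings above involve a small calculation: the randomized delay added to osd scrub min interval, and the normalized load compared against osd scrub load threshold. The following is a minimal sketch of both calculations as described in the list, not the scheduler code used by the OSD; the function names are hypothetical.

```python
import random

def next_scrub_delay(min_interval, randomize_ratio, rng=random):
    """Seconds until the next scrub: the minimal interval plus a random
    delay smaller than min_interval * randomize_ratio."""
    return min_interval + rng.random() * min_interval * randomize_ratio

def load_allows_scrub(load_average, online_cpus, threshold=0.5):
    """True if the normalized load is not higher than the threshold."""
    return load_average / online_cpus <= threshold

DAY = 60 * 60 * 24

# With the default ratio of 0.5, the delay falls between 1 and 1.5 days.
delay = next_scrub_delay(DAY, 0.5)
print(delay / DAY)

# A load average of 3.0 on 8 online CPUs gives a normalized load of 0.375.
print(load_allows_scrub(load_average=3.0, online_cpus=8))
```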

