cgroups let us distribute resources among tasks or groups of tasks. A cgroup uses subsystems (resource controllers such as cpu, memory and blkio) to apply per-cgroup limits on those resources. See [1] and [2].
The following steps create a cgroup that limits only the blkio subsystem.
Create blkio cgroup :
mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/blkio
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
mkdir -p /sys/fs/cgroup/blkio/test1/    # creation of cgroup test1
mkdir -p /sys/fs/cgroup/blkio/test2/    # creation of cgroup test2
echo 1000 > /sys/fs/cgroup/blkio/test1/blkio.weight    # set weight of cgroup test1
echo 500 > /sys/fs/cgroup/blkio/test2/blkio.weight    # set weight of cgroup test2
sync
echo 3 > /proc/sys/vm/drop_caches
dd if=/dev/sdbv of=file_1 bs=1M count=512 &
echo $! > /sys/fs/cgroup/blkio/test1/tasks    # attach dd process to test1 cgroup
cat /sys/fs/cgroup/blkio/test1/tasks
dd if=/dev/sdbv of=file_2 bs=1M count=512 &
echo $! > /sys/fs/cgroup/blkio/test2/tasks    # attach dd process to test2 cgroup
cat /sys/fs/cgroup/blkio/test2/tasks
Here we mount the blkio subsystem, create two cgroups, assign weights and attach a “dd” process to each cgroup. While both processes compete for the same device, the “test1” cgroup should complete its I/O faster than the “test2” cgroup, because “test2” has the lower weight.
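To get a feel for what the two weights mean, here is a minimal sketch of the ideal proportional split. It assumes both cgroups keep the same device continuously busy; the function name blkio_ideal_share_percent is ours, not a kernel symbol, and real throughput will deviate from this ideal.

```c
#include <stddef.h>

/*
 * Ideal share (in percent) of the group at index idx among n competing
 * groups, given their blkio weights. Illustrative arithmetic only; the
 * kernel's scheduler does not expose such a function.
 */
double blkio_ideal_share_percent(const unsigned int *weights, size_t n, size_t idx)
{
    unsigned long total = 0;
    size_t i;

    if (weights == NULL || idx >= n)
        return 0.0;
    for (i = 0; i < n; i++)
        total += weights[i];
    if (total == 0)
        return 0.0;
    return 100.0 * (double)weights[idx] / (double)total;
}
```

With the weights used above (1000 and 500), test1 would ideally get about two thirds of the disk time and test2 about one third.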
Peek into the changes done in task_struct of the dd process:
We added a jprobe on the generic_make_request function to print the cgroup and the subsystem that the “dd” process is attached to.
Here is the probe function code:
/* jprobe handler: takes the same arguments as generic_make_request() */
void my_handler (struct bio *bio)
{
	struct task_struct *task = current;
	char *str = "dd";
	int i = 0;

	/* only report tasks whose command name starts with "dd" */
	if (strncmp(str, task->comm, 2) == 0)
	{
		printk("assignment: current process: %s, PID: %d\n", task->comm, task->pid);
		/* walk the per-subsystem state attached to the task's css_set */
		for (i = 0; i < CGROUP_SUBSYS_COUNT; i++)
		{
			printk("cgroup subsys count = %d\n", i);
			if (task->cgroups->subsys != NULL)
			{
				if (task->cgroups->subsys[i] != NULL)
				{
					if (task->cgroups->subsys[i]->cgroup != NULL)
					{
						if (task->cgroups->subsys[i]->cgroup->name != NULL)
							printk("cgroup->name = %s\n", task->cgroups->subsys[i]->cgroup->name->name);
						if (task->cgroups->subsys[i]->ss != NULL)
							if (task->cgroups->subsys[i]->ss->name != NULL)
								printk("cgroup->subsys name = %s\n", task->cgroups->subsys[i]->ss->name);
					}
				}
			}
			else
			{
				printk("NULL\n");
			}
		}
	}
	/* a jprobe handler must end with jprobe_return() */
	jprobe_return();
}
The following is the output we get:
2014-12-02T13:29:32.843643+05:30 lnx kernel: [508988.896860] assignment: current process: dd, PID: 29713
2014-12-02T13:29:32.843644+05:30 lnx kernel: [508988.896861] cgroup subsys count = 0
2014-12-02T13:29:32.843645+05:30 lnx kernel: [508988.896863] cgroup->name = /
2014-12-02T13:29:32.843653+05:30 lnx kernel: [508988.896865] cgroup->subsys name = cpuset
2014-12-02T13:29:32.843654+05:30 lnx kernel: [508988.896866] cgroup subsys count = 1
2014-12-02T13:29:32.843656+05:30 lnx kernel: [508988.896868] cgroup->name = /
2014-12-02T13:29:32.843657+05:30 lnx kernel: [508988.896870] cgroup->subsys name = cpu
2014-12-02T13:29:32.843658+05:30 lnx kernel: [508988.896871] cgroup subsys count = 2
2014-12-02T13:29:32.843659+05:30 lnx kernel: [508988.896873] cgroup->name = /
2014-12-02T13:29:32.843660+05:30 lnx kernel: [508988.896874] cgroup->subsys name = cpuacct
2014-12-02T13:29:32.843662+05:30 lnx kernel: [508988.896876] cgroup subsys count = 3
2014-12-02T13:29:32.843663+05:30 lnx kernel: [508988.896878] cgroup->name = /
2014-12-02T13:29:32.843665+05:30 lnx kernel: [508988.896879] cgroup->subsys name = memory
2014-12-02T13:29:32.843666+05:30 lnx kernel: [508988.896881] cgroup subsys count = 4
2014-12-02T13:29:32.843667+05:30 lnx kernel: [508988.896882] cgroup->name = /
2014-12-02T13:29:32.843668+05:30 lnx kernel: [508988.896884] cgroup->subsys name = devices
2014-12-02T13:29:32.843669+05:30 lnx kernel: [508988.896886] cgroup subsys count = 5
2014-12-02T13:29:32.843671+05:30 lnx kernel: [508988.896887] cgroup->name = /
2014-12-02T13:29:32.843672+05:30 lnx kernel: [508988.896888] cgroup->subsys name = freezer
2014-12-02T13:29:32.843681+05:30 lnx kernel: [508988.896890] cgroup subsys count = 6
2014-12-02T13:29:32.843682+05:30 lnx kernel: [508988.896891] cgroup->name = test1
2014-12-02T13:29:32.843683+05:30 lnx kernel: [508988.896893] cgroup->subsys name = blkio
2014-12-02T13:29:32.843685+05:30 lnx kernel: [508988.896894] cgroup subsys count = 7
2014-12-02T13:29:32.843686+05:30 lnx kernel: [508988.896896] cgroup->name = /
2014-12-02T13:29:32.843688+05:30 lnx kernel: [508988.896897] cgroup->subsys name = perf_event
2014-12-02T13:29:32.843689+05:30 lnx kernel: [508988.896898] cgroup subsys count = 8
2014-12-02T13:29:32.843690+05:30 lnx kernel: [508988.896901] cgroup->name = /
2014-12-02T13:29:32.843692+05:30 lnx kernel: [508988.896902] cgroup->subsys name = hugetlb
2014-12-02T13:29:32.843693+05:30 lnx kernel: [508988.896902] cgroup subsys count = 9
2014-12-02T13:29:32.843694+05:30 lnx kernel: [508988.896903] cgroup subsys count = 10
From this example we see that, for the dd process, every subsystem (resource) except blkio uses the default root (“/”) cgroup. The blkio subsystem uses the test1 cgroup.
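The same membership can be checked from user space without a probe: per [1], each task has a /proc/<pid>/cgroup file that lists, for every mounted hierarchy, the subsystem names and the cgroup path. The sketch below parses one such line of the form hierarchy-ID:subsystems:path. The function name parse_proc_cgroup_line and the sample hierarchy ID are our own assumptions for illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * Parse one line of /proc/<pid>/cgroup, formatted as
 *   hierarchy-ID:subsystem-list:cgroup-path
 * Returns 0 on success, -1 if the line is malformed or a buffer is too small.
 */
int parse_proc_cgroup_line(const char *line, int *hier_id,
                           char *subsys, size_t subsys_len,
                           char *path, size_t path_len)
{
    const char *first = strchr(line, ':');
    const char *second;
    size_t n;

    if (first == NULL)
        return -1;
    second = strchr(first + 1, ':');
    if (second == NULL)
        return -1;
    if (sscanf(line, "%d", hier_id) != 1)
        return -1;

    /* text between the two colons is the subsystem list */
    n = (size_t)(second - first - 1);
    if (n >= subsys_len)
        return -1;
    memcpy(subsys, first + 1, n);
    subsys[n] = '\0';

    /* everything after the second colon, minus the newline, is the path */
    n = strcspn(second + 1, "\n");
    if (n >= path_len)
        return -1;
    memcpy(path, second + 1, n);
    path[n] = '\0';
    return 0;
}
```

For the dd process above we would expect the blkio line to carry the path /test1 and the lines of the other hierarchies to carry /.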
Next we look at how cgroup initialization is done and at the code behind each step.
Linux cgroups initialization at boot up:
A new file system of type "cgroup" is registered with the VFS when Linux starts. The call sequence is :
start_kernel -> cgroup_init_early -> cgroup_init_subsys -> cgroup_init
In cgroup_init_subsys the top cgroup state is created :
/* Create the top cgroup state for this subsystem */
list_add(&ss->sibling, &cgroup_dummy_root.subsys_list);
cgroupfs_root is created.
Filesystem registration :
cgroup_init() registers the filesystem type, which supplies the mount and unmount operations :
err = register_filesystem(&cgroup_fs_type);
static struct file_system_type cgroup_fs_type = {
	.name = "cgroup",
	.mount = cgroup_mount,
	.kill_sb = cgroup_kill_sb,
};
CGROUP ACTIONS :
All cgroup actions are performed via filesystem actions (creating/removing directories, reading/writing the files in them, mounting with mount options).
The mount operations were shown above. Read, write, create and remove are handled by the following operation tables in kernel/cgroup.c :
static const struct file_operations cgroup_file_operations = {
	.read = cgroup_file_read,
	.write = cgroup_file_write,
	.llseek = generic_file_llseek,
	.open = cgroup_file_open,
	.release = cgroup_file_release,
};
static const struct inode_operations cgroup_file_inode_operations = {
	.setxattr = cgroup_setxattr,
	.getxattr = cgroup_getxattr,
	.listxattr = cgroup_listxattr,
	.removexattr = cgroup_removexattr,
};
static const struct inode_operations cgroup_dir_inode_operations = {
	.lookup = simple_lookup,
	.mkdir = cgroup_mkdir,
	.rmdir = cgroup_rmdir,
	.rename = cgroup_rename,
	.setxattr = cgroup_setxattr,
	.getxattr = cgroup_getxattr,
	.listxattr = cgroup_listxattr,
	.removexattr = cgroup_removexattr,
};
A cgroup hierarchy can be mounted anywhere on the filesystem. Systemd uses /sys/fs/cgroup. When mounting, the mount options (-o) specify which subsystems we want to use.
For example, to make a cgroup with the blkio subsystem the commands are :
mount -t tmpfs cgroup_root /sys/fs/cgroup
mkdir /sys/fs/cgroup/blkio
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
mkdir -p /sys/fs/cgroup/blkio/test1/
Take the mount step first :
mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
This command calls cgroup_mount and creates the following files in the mounted directory :
lnx:/sys/fs/cgroup/blkio # ls
blkio.io_merged                   blkio.io_service_time_recursive  blkio.reset_stats                blkio.throttle.write_bps_device   cgroup.event_control
blkio.io_merged_recursive         blkio.io_serviced                blkio.sectors                    blkio.throttle.write_iops_device  cgroup.procs
blkio.io_queued                   blkio.io_serviced_recursive      blkio.sectors_recursive          blkio.time                        cgroup.sane_behavior
blkio.io_queued_recursive         blkio.io_wait_time               blkio.throttle.io_service_bytes  blkio.time_recursive              notify_on_release
blkio.io_service_bytes            blkio.io_wait_time_recursive     blkio.throttle.io_serviced       blkio.weight                      release_agent
blkio.io_service_bytes_recursive  blkio.leaf_weight                blkio.throttle.read_bps_device   blkio.weight_device               tasks
blkio.io_service_time             blkio.leaf_weight_device         blkio.throttle.read_iops_device  cgroup.clone_children
Now make a directory in this newly mounted hierarchy :
mkdir -p /sys/fs/cgroup/blkio/test1
cgroup_create is called. Here the new cgroup is created and the blkio subsystem is initialised :
/* allocate the cgroup and its ID, 0 is reserved for the root */
cgrp = kzalloc(sizeof(*cgrp), GFP_KERNEL);
if (!cgrp)
	return -ENOMEM;
name = cgroup_alloc_name(dentry);
The subsystem state of the cgroup for blkio is allocated :
css = ss->css_alloc(cgroup_css(parent, ss));
For the blkio controller (blkcg) the function called is blkcg_css_alloc.
In this function the blkcg is initialised :
blkcg = kzalloc(sizeof(*blkcg), GFP_KERNEL);
if (!blkcg)
	return ERR_PTR(-ENOMEM);
blkcg->cfq_weight = CFQ_WEIGHT_DEFAULT;
blkcg->cfq_leaf_weight = CFQ_WEIGHT_DEFAULT;
blkcg->id = atomic64_inc_return(&id_seq); /* root is 0, start from 1 */
spin_lock_init(&blkcg->lock);
INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_ATOMIC);
INIT_HLIST_HEAD(&blkcg->blkg_list);
struct blkcg {
	struct cgroup_subsys_state css;
	spinlock_t lock;
	struct radix_tree_root blkg_tree;
	struct blkcg_gq *blkg_hint;
	struct hlist_head blkg_list;
	/* for policies to test whether associated blkcg has changed */
	uint64_t id;
	/* TODO: per-policy storage in blkcg */
	unsigned int cfq_weight; /* belongs to cfq */
	unsigned int cfq_leaf_weight;
};
init_css is called to initialise the cgroup_subsys_state for the blkio subsystem and the new cgroup :
init_css(css, ss, cgrp);
2014-12-08T10:28:48.196121+05:30 lnx kernel: [243681.125109] //init_css Handler hit
2014-12-08T10:28:48.196140+05:30 lnx kernel: [243681.125115] cgrp name = test3
2014-12-08T10:28:48.196144+05:30 lnx kernel: [243681.125117] ss name = blkio
Stack trace from the jprobe on init_css :
2014-12-08T10:28:48.196175+05:30 lnx kernel: [243681.125235] [<ffffffff810d7d69>] cgroup_mkdir+0x299/0x670
2014-12-08T10:28:48.196177+05:30 lnx kernel: [243681.125246] [<ffffffff811a9d50>] vfs_mkdir+0xb0/0x160
2014-12-08T10:28:48.196179+05:30 lnx kernel: [243681.125254] [<ffffffff811af28b>] SyS_mkdirat+0xab/0xe0
2014-12-08T10:28:48.196181+05:30 lnx kernel: [243681.125265] [<ffffffff81519329>] system_call_fastpath+0x16/0x1b
This also populates the new cgroup directory with the files of the subsystem, through the functions cgroup_addrm_files and cgroup_populate_dir.
dump_stack example via jprobe :
2014-12-08T09:59:57.364352+05:30 lnx kernel: [241951.139904] [<ffffffff810d6909>] cgroup_populate_dir+0x69/0x110
2014-12-08T09:59:57.364354+05:30 lnx kernel: [241951.139909] [<ffffffff810d80ad>] cgroup_mkdir+0x5dd/0x670
2014-12-08T09:59:57.364356+05:30 lnx kernel: [241951.139914] [<ffffffff811a9d50>] vfs_mkdir+0xb0/0x160
2014-12-08T09:59:57.364357+05:30 lnx kernel: [241951.139919] [<ffffffff811af28b>] SyS_mkdirat+0xab/0xe0
Changing the cgroup policies/properties:
The cgroup properties can be changed by writing to the files in /sys/fs/cgroup/blkio/<cgroup_name>/<property>.
Example :
echo 1000 > /sys/fs/cgroup/blkio/test1/blkio.weight
This calls the write function of the cgroup file, cgroup_file_write, which dispatches to the handler registered in the cftype of the file :
if (cft->write)
	return cft->write(css, cft, file, buf, nbytes, ppos);
For blkio.weight the handler is the write_u64 function of the weight cftype, cfq_set_weight :
{
	.name = "weight",
	.flags = CFTYPE_NOT_ON_ROOT,
	.read_seq_string = cfq_print_weight,
	.write_u64 = cfq_set_weight,
},
In function __cfq_set_weight the value is then stored in the blkcg (in cfq_weight for blkio.weight, in cfq_leaf_weight for blkio.leaf_weight) :
blkcg->cfq_weight = val;
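The dispatch from a file write to the per-file handler can be modelled in user space. The following is a simplified sketch, not kernel code: the model_* names are ours, the table stands in for the array of cftype entries, and the 10 to 1000 range is the weight range documented in [2].

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_WEIGHT_MIN 10    /* allowed weight range per [2] */
#define MODEL_WEIGHT_MAX 1000

/* Stand-in for struct blkcg, reduced to the field we care about. */
struct model_blkcg {
    unsigned int cfq_weight;
};

/* Stand-in for struct cftype: a file name plus its u64 write handler. */
struct model_cftype {
    const char *name;
    int (*write_u64)(struct model_blkcg *blkcg, unsigned long long val);
};

/* Plays the role of the weight handler: validate, then store. */
static int model_set_weight(struct model_blkcg *blkcg, unsigned long long val)
{
    if (val < MODEL_WEIGHT_MIN || val > MODEL_WEIGHT_MAX)
        return -EINVAL;
    blkcg->cfq_weight = (unsigned int)val;
    return 0;
}

static const struct model_cftype model_files[] = {
    { "weight", model_set_weight },
};

/* Model of a write to a control file: find the file, parse the text, call the handler. */
int model_file_write(struct model_blkcg *blkcg, const char *name, const char *buf)
{
    size_t i;
    char *end;
    unsigned long long val;

    for (i = 0; i < sizeof(model_files) / sizeof(model_files[0]); i++) {
        if (strcmp(model_files[i].name, name) != 0)
            continue;
        errno = 0;
        val = strtoull(buf, &end, 0);
        if (errno != 0 || end == buf)
            return -EINVAL;
        return model_files[i].write_u64(blkcg, val);
    }
    return -ENOENT;
}
```

Calling model_file_write(&cg, "weight", "1000\n") mirrors the echo command above: the text is converted to a number and handed to the handler, which rejects values outside the allowed range.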
Attaching a task to the cgroup:
echo <PID> > /sys/fs/cgroup/blkio/test1/tasks
The write handler comes from the "tasks" cftype :
{
	.name = "tasks",
	.flags = CFTYPE_INSANE, /* use "procs" instead */
	.open = cgroup_tasks_open,
	.write_u64 = cgroup_tasks_write,
	.release = cgroup_pidlist_release,
	.mode = S_IRUGO | S_IWUSR,
},
cgroup_tasks_write calls attach_task_by_pid (in file cgroup.c), which calls :
ret = cgroup_attach_task(cgrp, tsk, threadgroup);
A new css_set is created and attached to the task_struct of this process :
cgroup_task_migrate(tc->cgrp, tc->task, tc->cset);
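The effect of the migration on the task can be pictured with a small user-space model. This is a sketch under our own assumptions: the model_* types are hypothetical stand-ins for css_set and task_struct, only three subsystems are modelled, and a new set is always built, which simplifies what the kernel does.

```c
#define MODEL_SUBSYS_COUNT 3

enum { MODEL_CPU, MODEL_MEMORY, MODEL_BLKIO };

/* Stand-in for struct css_set: one cgroup name per subsystem. */
struct model_css_set {
    const char *cgroup[MODEL_SUBSYS_COUNT];
};

/* Stand-in for task_struct: the task only holds a pointer to its css_set. */
struct model_task {
    const struct model_css_set *cgroups;
};

/*
 * Fill new_set with the state the task will have after moving to
 * cgroup_name in subsystem subsys, then switch the task's pointer to it.
 */
void model_attach(struct model_task *task, struct model_css_set *new_set,
                  int subsys, const char *cgroup_name)
{
    *new_set = *task->cgroups;             /* inherit every other subsystem */
    new_set->cgroup[subsys] = cgroup_name; /* replace only the migrated one */
    task->cgroups = new_set;
}
```

Attaching a task that starts in the root set to "test1" for the blkio slot leaves the other slots at "/", which matches the jprobe output shown earlier for the dd process.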
Association of request_queue and the block cgroup:
Whenever I/O reaches the block layer, an association is created between the request queue of the device and the block cgroup.
This association, “struct blkcg_gq”, is created when I/O first comes to a device. It is created in the function “blkg_create”. A sample dump_stack from the creation of an association :
2014-12-15T14:02:07.036066+05:30 lnx kernel: [860978.128274] ////blkg_create Handler hit
2014-12-15T14:02:07.036078+05:30 lnx kernel: [860978.128281] CPU: 6 PID: 17627 Comm: dd Tainted: P OENX 3.12.28-4-default #1
2014-12-15T14:02:07.036083+05:30 lnx kernel: [860978.128286] ffff8810568495c0 ffffffff8150b1db ffffffff81acd8c0 ffffffffa039f018
2014-12-15T14:02:07.036084+05:30 lnx kernel: [860978.128291] ffffffff8128a2f5 ffff88103e6a62c0 ffff880855f48078 ffff88103e712880
2014-12-15T14:02:07.036086+05:30 lnx kernel: [860978.128296] ffff880855f48078 ffffffff812719b8 0000000000000001 ffff881055749808
2014-12-15T14:02:07.036092+05:30 lnx kernel: [860978.128301] Call Trace:
2014-12-15T14:02:07.036094+05:30 lnx kernel: [860978.128314] [<ffffffff8100467d>] dump_trace+0x7d/0x2d0
2014-12-15T14:02:07.036095+05:30 lnx kernel: [860978.128321] [<ffffffff81004964>] show_stack_log_lvl+0x94/0x170
2014-12-15T14:02:07.036096+05:30 lnx kernel: [860978.128326] [<ffffffff81005d91>] show_stack+0x21/0x50
2014-12-15T14:02:07.036098+05:30 lnx kernel: [860978.128332] [<ffffffff8150b1db>] dump_stack+0x41/0x51
2014-12-15T14:02:07.036099+05:30 lnx kernel: [860978.128337] [<ffffffffa039f018>] my_handler+0x18/0x20 [probe]
2014-12-15T14:02:07.036100+05:30 lnx kernel: [860978.128347] [<ffffffff8128a2f5>] blkg_lookup_create+0x45/0xc0
2014-12-15T14:02:07.036102+05:30 lnx kernel: [860978.128352] [<ffffffff812719b8>] get_request+0x88/0x6f0
2014-12-15T14:02:07.036110+05:30 lnx kernel: [860978.128507] [<ffffffff8127028f>] __blk_run_queue+0x2f/0x40
2014-12-15T14:02:07.036111+05:30 lnx kernel: [860978.128512] [<ffffffff81273860>] blk_flush_plug_list+0x1e0/0x240
2014-12-15T14:02:07.036125+05:30 lnx kernel: [860978.128517] [<ffffffff81273c20>] blk_finish_plug+0x10/0x40
2014-12-15T14:02:07.036127+05:30 lnx kernel: [860978.128522] [<ffffffff81140f9f>] __do_page_cache_readahead+0x17f/0x1f0
2014-12-15T14:02:07.036128+05:30 lnx kernel: [860978.128528] [<ffffffff8114115a>] ondemand_readahead+0x14a/0x280
2014-12-15T14:02:07.036130+05:30 lnx kernel: [860978.128534] [<ffffffff81137129>] generic_file_aio_read+0x459/0x6f0
2014-12-15T14:02:07.036131+05:30 lnx kernel: [860978.128542] [<ffffffff8119e2cc>] do_sync_read+0x5c/0x90
2014-12-15T14:02:07.036133+05:30 lnx kernel: [860978.128547] [<ffffffff8119e879>] vfs_read+0x99/0x160
2014-12-15T14:02:07.036134+05:30 lnx kernel: [860978.128552] [<ffffffff8119f378>] SyS_read+0x48/0xa0
2014-12-15T14:02:07.036136+05:30 lnx kernel: [860978.128557] [<ffffffff81519329>] system_call_fastpath+0x16/0x1b
2014-12-15T14:02:07.036137+05:30 lnx kernel: [860978.128567] [<00007fc484734480>] 0x7fc48473447f
In this example the blkcg_gq association is created from the get_request function. The newly created association between the block cgroup and the request queue is added to request_queue -> blkg_list.
The association is also kept in “struct blkcg” -> blkg_list.
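The lookup-or-create pattern visible in the trace (blkg_lookup_create) can be sketched in user space as follows. This is only a model: the model_* names, the integer IDs and the fixed-size table are our assumptions, whereas the kernel keeps the associations on the lists named above.

```c
#include <stddef.h>

#define MODEL_MAX_BLKG 16

/* Stand-in for struct blkcg_gq: links one block cgroup to one request queue. */
struct model_blkg {
    int blkcg_id;
    int queue_id;
};

static struct model_blkg model_blkgs[MODEL_MAX_BLKG];
static size_t model_blkg_count;

/* Return the existing association for (blkcg, queue), or create it on first use. */
struct model_blkg *model_blkg_lookup_create(int blkcg_id, int queue_id)
{
    size_t i;

    for (i = 0; i < model_blkg_count; i++)
        if (model_blkgs[i].blkcg_id == blkcg_id &&
            model_blkgs[i].queue_id == queue_id)
            return &model_blkgs[i];

    if (model_blkg_count == MODEL_MAX_BLKG)
        return NULL; /* table full */

    model_blkgs[model_blkg_count].blkcg_id = blkcg_id;
    model_blkgs[model_blkg_count].queue_id = queue_id;
    return &model_blkgs[model_blkg_count++];
}
```

The first I/O from a cgroup to a device creates the entry; later I/O from the same cgroup to the same device finds it again, while a different cgroup on the same device gets its own entry.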
References:
[1] https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt
[2] https://www.kernel.org/doc/Documentation/cgroups/blkio-controller.txt