Monday, 18 January 2016

Linux PCI bus enumeration, PCI config reads and writes

In this post we will walk through the Linux code flow for PCI bus enumeration and see which functions fill in the config data of the pci_dev structure for each device.
Later I also describe the function calls involved when PCI config space reads are done.
I have used Linux kernel 3.15 for this illustration.
On ACPI-supported platforms, PCI enumeration starts from acpi_init. The BIOS has already initialised the config space of the devices; this config data is fetched during PCI bus enumeration.

The function acpi_init calls acpi_scan_init.
acpi_scan_init calls acpi_pci_root_init(), which registers pci_root_handler as an ACPI scan handler.

static struct acpi_scan_handler pci_root_handler = {
    .ids = root_device_ids,
    .attach = acpi_pci_root_add,
    .detach = acpi_pci_root_remove,
    .hotplug = {
        .enabled = true,
        .scan_dependent = acpi_pci_root_scan_dependent,
    },
};

After this, acpi_scan_init calls acpi_bus_scan.
acpi_bus_scan looks for a matching scan handler for each device in acpi_scan_attach_handler and calls that handler's attach function. For the PCI root bridge this turns out to be acpi_pci_root_add.

acpi_pci_root_add fills in the bus number in the root->secondary resource.
The ACPI method METHOD_NAME__CRS is evaluated to obtain the bus number.
Similarly, root->segment is filled from the ACPI method METHOD_NAME__SEG,
and root->mcfg_addr is filled from METHOD_NAME__CBA.

After filling in this information it calls pci_acpi_scan_root. Since the bus has not been added yet, pci_acpi_scan_root calls pci_create_root_bus to allocate a root bus and the corresponding host bridge.

It then calls pci_scan_child_bus to proceed further. On this first call the function is scanning the root bus itself, and it calls pci_scan_slot for every possible device on the bus.
devfn is a combination of the device number (5 bits) and the function number (3 bits), so stepping devfn by 8 makes this loop iterate over all 32 device numbers:
    /* Go find them, Rover! */
    for (devfn = 0; devfn < 0x100; devfn += 8)
        pci_scan_slot(bus, devfn);

Afterwards, if a device turns out to be a bridge, this function also calls pci_scan_bridge for it.

Let's look at pci_scan_slot. This function calls pci_scan_single_device, which in turn calls pci_scan_device.

pci_scan_device reads the vendor ID of the device through pci_bus_read_dev_vendor_id. If the read fails or the data returned is 0xffffffff, the device is not present.

pci_scan_device allocates a pci_dev structure and calls pci_setup_device.

pci_setup_device reads various fields from the PCI config space and fills them into the pci_dev structure.

pci_scan_single_device also calls pci_device_add. pci_device_add initializes the struct device embedded in pci_dev and then scans the various PCI capabilities in pci_init_capabilities.

Sample call stack for pci_dev initialisation:
[    0.154205]  [<ffffffff816872a5>] pci_mmcfg_read+0x125/0x130
[    0.154208]  [<ffffffff8168b5f3>] raw_pci_read+0x23/0x40
[    0.154211]  [<ffffffff8168b63c>] pci_read+0x2c/0x30
[    0.154214]  [<ffffffff813e78c6>] pci_bus_read_config_dword+0x66/0x90
[    0.154217]  [<ffffffff813e9b5d>] pci_cfg_space_size_ext+0x6d/0xb0
[    0.154220]  [<ffffffff813ea968>] pci_cfg_space_size+0x68/0x70
[    0.154223]  [<ffffffff813eab31>] pci_setup_device+0x1c1/0x530
[    0.154226]  [<ffffffff814f10f7>] ? get_device+0x17/0x30
[    0.154229]  [<ffffffff813eb082>] pci_scan_single_device+0x82/0xc0
[    0.154232]  [<ffffffff813eb10e>] pci_scan_slot+0x4e/0x140
[    0.154235]  [<ffffffff813ec38d>] pci_scan_child_bus+0x3d/0x160
[    0.154238]  [<ffffffff81689f30>] pci_acpi_scan_root+0x360/0x550
[    0.154242]  [<ffffffff8143cf2c>] acpi_pci_root_add+0x3b7/0x49b
[    0.154245]  [<ffffffff8143ed7d>] ? acpi_pnp_match+0x31/0xa8
[    0.154248]  [<ffffffff81438fef>] acpi_bus_attach+0x109/0x1fc
[    0.154251]  [<ffffffff814f5a9e>] ? device_attach+0x6e/0xd0
[    0.154254]  [<ffffffff8143906a>] acpi_bus_attach+0x184/0x1fc
[    0.154256]  [<ffffffff814f5a9e>] ? device_attach+0x6e/0xd0
[    0.154259]  [<ffffffff8143906a>] acpi_bus_attach+0x184/0x1fc
[    0.154262]  [<ffffffff81d85be8>] ? acpi_sleep_proc_init+0x2a/0x2a
[    0.154265]  [<ffffffff814391d5>] acpi_bus_scan+0x5b/0x6d
[    0.154268]  [<ffffffff81d86039>] acpi_scan_init+0x6d/0x1b3
[    0.154271]  [<ffffffff81d9bfa9>] ? __pci_mmcfg_init+0x60/0xa7
[    0.154274]  [<ffffffff81d85e37>] acpi_init+0x24f/0x267
[    0.154277]  [<ffffffff81002144>] do_one_initcall+0xd4/0x210
[    0.154280]  [<ffffffff81091a00>] ? parse_args+0x70/0x480
[    0.154283]  [<ffffffff810b32f8>] ? __wake_up+0x48/0x60
[    0.154286]  [<ffffffff81d3e27a>] kernel_init_freeable+0x16c/0x1f9
[    0.154289]  [<ffffffff81d3d9a7>] ? initcall_blacklist+0xc0/0xc0
[    0.154292]  [<ffffffff817a00c0>] ? rest_init+0x80/0x80
[    0.154295]  [<ffffffff817a00ce>] kernel_init+0xe/0xf0
[    0.154298]  [<ffffffff817b5d98>] ret_from_fork+0x58/0x90
[    0.154301]  [<ffffffff817a00c0>] ? rest_init+0x80/0x80


How do the PCI config reads/writes occur?
Several functions such as pci_read_config_byte, pci_read_config_word etc. are called to read the PCI config space. These functions are defined in include/linux/pci.h as:
static inline int pci_read_config_word(const struct pci_dev *dev, int where, u16 *val)
{
    return pci_bus_read_config_word(dev->bus, dev->devfn, where, val);
}

The pci_bus_read_config_word family of functions is generated by a macro in drivers/pci/access.c:
#define PCI_OP_READ(size,type,len) \
int pci_bus_read_config_##size \
    (struct pci_bus *bus, unsigned int devfn, int pos, type *value) \
{ \
    int res; \
    unsigned long flags; \
    u32 data = 0; \
    if (PCI_##size##_BAD) return PCIBIOS_BAD_REGISTER_NUMBER; \
    raw_spin_lock_irqsave(&pci_lock, flags); \
    res = bus->ops->read(bus, devfn, pos, len, &data); \
    *value = (type)data; \
    raw_spin_unlock_irqrestore(&pci_lock, flags); \
    return res; \
}

Here bus->ops->read does the actual read. As the call stack above shows, on this platform it is pci_read, which calls raw_pci_read. raw_pci_read in turn uses the raw_pci_ops and raw_pci_ext_ops pointers, which are initialised at init time.
raw_pci_ext_ops is set to pci_mmcfg in pci_mmcfg_arch_init:
    raw_pci_ext_ops = &pci_mmcfg;

[    0.140712] kundan pci_mmcfg_arch_init 132
[    0.140715] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 3.19.8-ckt9 #4
[    0.140717] Hardware name: LENOVO 28427ZQ/INVALID, BIOS 6JET58WW (1.16 ) 09/17/2009
[    0.140719]  0000000000000000 ffff88013afd3dd8 ffffffff817ae263 0000000000001970
[    0.140723]  ffffffff81cd7740 ffff88013afd3df8 ffffffff81d9bb9b 0000000000000aae
[    0.140726]  ffffffff81cd7740 ffff88013afd3e18 ffffffff81d9bfa9 ffffffff81c1d060
[    0.140729] Call Trace:
[    0.140733]  [<ffffffff817ae263>] dump_stack+0x45/0x57
[    0.140736]  [<ffffffff81d9bb9b>] pci_mmcfg_arch_init+0x5a/0x63
[    0.140740]  [<ffffffff81d9bfa9>] __pci_mmcfg_init+0x60/0xa7
[    0.140743]  [<ffffffff81d9c637>] pci_mmcfg_late_init+0x27/0x29
[    0.140746]  [<ffffffff81d85e32>] acpi_init+0x24a/0x267
[    0.140749]  [<ffffffff81002144>] do_one_initcall+0xd4/0x210
[    0.140752]  [<ffffffff81091a00>] ? parse_args+0x70/0x480
[    0.140755]  [<ffffffff810b32f8>] ? __wake_up+0x48/0x60
[    0.140758]  [<ffffffff81d3e27a>] kernel_init_freeable+0x16c/0x1f9
[    0.140761]  [<ffffffff81d3d9a7>] ? initcall_blacklist+0xc0/0xc0
[    0.140764]  [<ffffffff817a00c0>] ? rest_init+0x80/0x80
[    0.140767]  [<ffffffff817a00ce>] kernel_init+0xe/0xf0
[    0.140770]  [<ffffffff817b5d98>] ret_from_fork+0x58/0x90
[    0.140773]  [<ffffffff817a00c0>] ? rest_init+0x80/0x80

pci_mmcfg is defined as:
const struct pci_raw_ops pci_mmcfg = {
    .read = pci_mmcfg_read,
    .write = pci_mmcfg_write,
};

static int pci_mmcfg_read(unsigned int seg, unsigned int bus,
                          unsigned int devfn, int reg, int len, u32 *value)
{
    char __iomem *addr;

    /* Why do we have this when nobody checks it. How about a BUG()!? -AK */
    if (unlikely((bus > 255) || (devfn > 255) || (reg > 4095))) {
err:    *value = -1;
        return -EINVAL;
    }

    rcu_read_lock();
    addr = pci_dev_base(seg, bus, devfn);
    if (!addr) {
        rcu_read_unlock();
        goto err;
    }

    switch (len) {
    case 1:
        *value = mmio_config_readb(addr + reg);
        break;
    case 2:
        *value = mmio_config_readw(addr + reg);
        break;
    case 4:
        *value = mmio_config_readl(addr + reg);
        break;
    }
    rcu_read_unlock();

    return 0;
}

So pci_dev_base is the function that actually returns the memory-mapped address used for PCI config reads and writes.

static char __iomem *pci_dev_base(unsigned int seg, unsigned int bus, unsigned int devfn)
{
    struct pci_mmcfg_region *cfg = pci_mmconfig_lookup(seg, bus);

    if (cfg && cfg->virt)
        return cfg->virt + (PCI_MMCFG_BUS_OFFSET(bus) | (devfn << 12));
    return NULL;
}

struct pci_mmcfg_region *pci_mmconfig_lookup(int segment, int bus)
{
    struct pci_mmcfg_region *cfg;

    list_for_each_entry_rcu(cfg, &pci_mmcfg_list, list)
        if (cfg->segment == segment &&
            cfg->start_bus <= bus && bus <= cfg->end_bus)
            return cfg;

    return NULL;
}

pci_mmcfg_list must already have been populated for this bus. This is done in pci_mmconfig_add, with values taken from the ACPI MCFG table, which pci_parse_mcfg parses when called through the acpi_sfi_table_parse helper. Here is a call stack of it getting initialized:

[    0.112385]  [<ffffffff81d9c0ab>] pci_mmconfig_add+0x3a/0xa0
[    0.112388]  [<ffffffff81d9c39d>] pci_parse_mcfg+0x8e/0x13b
[    0.112391]  [<ffffffff81d9c30f>] ? pci_mmcfg_e7520+0x61/0x61
[    0.112394]  [<ffffffff81d9bab0>] ? pcibios_resource_survey+0x72/0x72
[    0.112397]  [<ffffffff81d85030>] acpi_table_parse+0x6c/0x89
[    0.112400]  [<ffffffff81d9c054>] acpi_sfi_table_parse.constprop.8+0x17/0x34
[    0.112403]  [<ffffffff81d9c5ff>] pci_mmcfg_early_init+0xea/0xfb
[    0.112406]  [<ffffffff81d9bacb>] pci_arch_init+0x1b/0x6a
[    0.112409]  [<ffffffff81002144>] do_one_initcall+0xd4/0x210
[    0.112411]  [<ffffffff81091a00>] ? parse_args+0x70/0x480
[    0.112414]  [<ffffffff810b32f8>] ? __wake_up+0x48/0x60
[    0.112417]  [<ffffffff81d3e27a>] kernel_init_freeable+0x16c/0x1f9
[    0.112420]  [<ffffffff81d3d9a7>] ? initcall_blacklist+0xc0/0xc0
[    0.112423]  [<ffffffff817a00c0>] ? rest_init+0x80/0x80
[    0.112426]  [<ffffffff817a00ce>] kernel_init+0xe/0xf0
[    0.112429]  [<ffffffff817b5d98>] ret_from_fork+0x58/0x90
[    0.112432]  [<ffffffff817a00c0>] ? rest_init+0x80/0x80


There is another, older way to access the PCI config space, through I/O ports. It is still supported for backward compatibility. See http://wiki.osdev.org/PCI for the meaning of the 0xCF8 and 0xCFC ports.

static int pci_conf1_read(unsigned int seg, unsigned int bus,
                          unsigned int devfn, int reg, int len, u32 *value)
{
    unsigned long flags;

    if (seg || (bus > 255) || (devfn > 255) || (reg > 4095)) {
        *value = -1;
        return -EINVAL;
    }

    raw_spin_lock_irqsave(&pci_config_lock, flags);

    outl(PCI_CONF1_ADDRESS(bus, devfn, reg), 0xCF8);

    switch (len) {
    case 1:
        *value = inb(0xCFC + (reg & 3));
        break;
    case 2:
        *value = inw(0xCFC + (reg & 2));
        break;
    case 4:
        *value = inl(0xCFC);
        break;
    }

    raw_spin_unlock_irqrestore(&pci_config_lock, flags);

    return 0;
}


Info about driver loading:
https://www.linux.com/news/hardware/peripherals/180950-udev
https://bobcares.com/blog/udev-introduction-to-device-management-in-modern-linux-system/


How is the probe method of a PCI device driver called?
As we have seen, PCI enumeration walks the attached devices in a tree-like fashion, reading the vendor ID of each one. For every device whose vendor ID turns out to be valid, pci_device_add is called.

Adding the device sends a uevent to user space carrying the vendor ID, device ID etc. udevd sees that this device has been enumerated/added and looks in /lib/modules/<kernel>/modules.alias for a driver for this PCI device. udevd then loads the driver module. Loading the driver calls pci_register_driver, which goes through the following calls and ends up calling the driver's probe function.

Jan 10 19:57:42 localhost kernel: [   11.547671] CPU: 1 PID: 345 Comm: udevd Not tainted 4.0.0 #1
Jan 10 19:57:42 localhost kernel: [   11.547672] Hardware name: Hewlett-Packard HP ZBook 840 G2/2216, BIOS M71 Ver. 00.55 05/14/2014
Jan 10 19:57:42 localhost kernel: [   11.547673]  ffff88015707f400 ffff8801568af948 ffffffff816cd469 0000000000000080
Jan 10 19:57:42 localhost kernel: [   11.547676]  0000000000000000 ffff8801568af978 ffffffff8144a2b2 0000000000000000
Jan 10 19:57:42 localhost kernel: [   11.547678]  ffff88015707f400 ffffffffa01f3d20 0000000000000003 ffff8801568afa18
Jan 10 19:57:42 localhost kernel: [   11.547681] Call Trace:
Jan 10 19:57:42 localhost kernel: [   11.547684]  [<ffffffff816cd469>] dump_stack+0x45/0x57
Jan 10 19:57:42 localhost kernel: [   11.547687]  [<ffffffff8144a2b2>] platform_device_add+0x32/0x300
Jan 10 19:57:42 localhost kernel: [   11.547690]  [<ffffffffa01df44f>] mfd_add_device+0x30f/0x3a0 [mfd_core]
Jan 10 19:57:42 localhost kernel: [   11.547693]  [<ffffffff811b9466>] ? __kmalloc+0x166/0x1a0
Jan 10 19:57:42 localhost kernel: [   11.547696]  [<ffffffffa01df6d3>] ? mfd_add_devices+0x53/0x980 [mfd_core]
Jan 10 19:57:42 localhost kernel: [   11.547699]  [<ffffffffa01df6d3>] ? mfd_add_devices+0x53/0x980 [mfd_core]
Jan 10 19:57:42 localhost kernel: [   11.547702]  [<ffffffffa01df735>] mfd_add_devices+0xb5/0x980 [mfd_core]
Jan 10 19:57:42 localhost kernel: [   11.547705]  [<ffffffff815a54d3>] ? raw_pci_read+0x23/0x40
Jan 10 19:57:42 localhost kernel: [   11.547709]  [<ffffffffa01f0515>] lpc_ich_probe+0x395/0x5c4 [lpc_ich]
Jan 10 19:57:42 localhost kernel: [   11.547712]  [<ffffffff8136f70e>] local_pci_probe+0x4e/0xa0
Jan 10 19:57:42 localhost kernel: [   11.547716]  [<ffffffff814432a7>] ? get_device+0x17/0x30
Jan 10 19:57:42 localhost kernel: [   11.547720]  [<ffffffff8136f989>] pci_device_probe+0xd9/0x120
Jan 10 19:57:42 localhost kernel: [   11.547724]  [<ffffffff814480bd>] driver_probe_device+0x9d/0x3c0
Jan 10 19:57:42 localhost kernel: [   11.547728]  [<ffffffff8144848b>] __driver_attach+0xab/0xb0
Jan 10 19:57:42 localhost kernel: [   11.547732]  [<ffffffff814483e0>] ? driver_probe_device+0x3c0/0x3c0
Jan 10 19:57:42 localhost kernel: [   11.547737]  [<ffffffff8144617d>] bus_for_each_dev+0x5d/0xa0
Jan 10 19:57:42 localhost kernel: [   11.547740]  [<ffffffff814479fe>] driver_attach+0x1e/0x20
Jan 10 19:57:42 localhost kernel: [   11.547744]  [<ffffffff814476a4>] bus_add_driver+0x124/0x250
Jan 10 19:57:42 localhost kernel: [   11.547748]  [<ffffffffa01e4000>] ? 0xffffffffa01e4000
Jan 10 19:57:42 localhost kernel: [   11.547752]  [<ffffffff81448ca4>] driver_register+0x64/0xf0
Jan 10 19:57:42 localhost kernel: [   11.547756]  [<ffffffff8136ebbb>] __pci_register_driver+0x4b/0x50
Jan 10 19:57:42 localhost kernel: [   11.547760]  [<ffffffffa01e401e>] lpc_ich_driver_init+0x1e/0x1000 [lpc_ich]
Jan 10 19:57:42 localhost kernel: [   11.547763]  [<ffffffff81000310>] do_one_initcall+0xc0/0x1e0
Jan 10 19:57:42 localhost kernel: [   11.547767]  [<ffffffff811b97a5>] ? kmem_cache_alloc_trace+0x35/0x140
Jan 10 19:57:42 localhost kernel: [   11.547769]  [<ffffffff816c9e86>] do_init_module+0x61/0x1ce
Jan 10 19:57:42 localhost kernel: [   11.547772]  [<ffffffff810f2ad6>] load_module+0x1d16/0x2580
Jan 10 19:57:42 localhost kernel: [   11.547775]  [<ffffffff810ee7e0>] ? unset_module_core_ro_nx+0x80/0x80
Jan 10 19:57:42 localhost kernel: [   11.547779]  [<ffffffff816d6222>] ? page_fault+0x22/0x30
Jan 10 19:57:42 localhost kernel: [   11.547782]  [<ffffffff810f3443>] SyS_init_module+0x103/0x160
Jan 10 19:57:42 localhost kernel: [   11.547785]  [<ffffffff816d47b2>] system_call_fastpath+0x12/0x17
