<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>NekoDaemon&#39;s Blog</title>
  
  
  <link href="https://blog.mylab.cc/atom.xml" rel="self"/>
  
  <link href="https://blog.mylab.cc/"/>
  <updated>2024-04-01T14:59:47.916Z</updated>
  <id>https://blog.mylab.cc/</id>
  
  <author>
    <name>NekoDaemon</name>
    
  </author>
  
  <generator uri="https://hexo.io/">Hexo</generator>
  
  <entry>
    <title>Enable L3 PFC + DCQCN for RoCE on Edgecore SONiC</title>
    <link href="https://blog.mylab.cc/2024/04/01/Enable-L3-PFC-DCQCN-for-RoCE-on-Edgecore-SONiC/"/>
    <id>https://blog.mylab.cc/2024/04/01/Enable-L3-PFC-DCQCN-for-RoCE-on-Edgecore-SONiC/</id>
    <published>2024-04-01T14:59:46.000Z</published>
    <updated>2024-04-01T14:59:47.916Z</updated>
    
    <content type="html"><![CDATA[<p>This article is for reference only, as Edgecore SONiC, a customized variant with many proprietary commands, is quite different from community SONiC. Also, some commands and configurations are specific to certain switch ASICs, such as the Intel Tofino I use right now. Thus, I still suggest that you keep the official guidebook at hand and read it carefully; it will save your life.</p><h2 id="concepts">Concepts</h2><ul><li>DSCP / dot1p Tag: A value embedded in the packet header.</li><li>Buffer: Receive buffer (SRAM) on switches, partitioned into a lossy pool and a lossless pool.</li><li>Traffic Class (TC): An intermediate value that the DSCP tag is mapped to and that is in turn mapped to a Queue ID and a PG ID.</li><li>Priority Group (PG): Receive queue on switches.</li><li>Queue: Send queue on switches.</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line">                  Map             =               </span><br><span class="line">                 ┌────► Queue ID ───► PFC Priority</span><br><span class="line">          Map    │      (Egress)      of RX PAUSE </span><br><span class="line">DSCP Tag ─────► TC                                </span><br><span class="line">                 │                =               </span><br><span class="line">                 └────► PG ID ──────► PFC Priority</span><br><span class="line">                  Map   (Ingress)     of TX PAUSE </span><br><span class="line">          
                 │                      </span><br><span class="line">                           │ Bind                 </span><br><span class="line">                           ├──────► Lossless      </span><br><span class="line">                           │        Buffer        </span><br><span class="line">                           │ Bind                 </span><br><span class="line">                           └──────► Lossy         </span><br><span class="line">                                    Buffer        </span><br><span class="line">                                                  </span><br><span class="line">Fig. Numerical Relationship between terminologies </span><br><span class="line"><span class="keyword">for</span> Intel Tofino 2 on EdgeCore SONiC OS           </span><br></pre></td></tr></table></figure><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line">              ┌───────────────────────────────────────────────────────────────────────┐                          </span><br><span class="line">              │ Switch                                                                │                          </span><br><span class="line">┌─────────┐   │   ┌─────────┐   ┌──────────┐   ┌────────┐   ┌─────────┐   ┌───────┐   │                          </span><br><span class="line">│ Packet  │   │   │   PG0   │   │  Lossy   │   │        │   │ Queue0  │   │       │   │                          </span><br><span class="line">│ 
DSCP=0  ├───┼──►│ of Eth0 ├──►│  Buffer  ├──►│        ├──►│ of Eth8 ├──►│       │   │   ┌─────────┐ ┌─────────┐</span><br><span class="line">└─────────┘   │   └─────────┘   └──────────┘   │  XBAR  │   └─────────┘   │ Sche  │   │   │ Packet  │ │ Packet  │</span><br><span class="line">              │       ...                      │ Switch │       ...       │ duler ├───┼──►│ DSCP=0  │ │ DSCP=26 │</span><br><span class="line">┌─────────┐   │   ┌─────────┐   ┌──────────┐   │        │   ┌─────────┐   │       │   │   └─────────┘ └─────────┘</span><br><span class="line">│ Packet  ├───┼──►│   PG3   ├──►│ Lossless ├──►│        ├──►│ Queue3  ├──►│       │   │                          </span><br><span class="line">│ DSCP=26 │   │   │ of Eth0 │   │  Buffer  │   │        │   │ of Eth8 │   │       │   │                          </span><br><span class="line">└─────────┘   │   └─────────┘   └──────────┘   └────────┘   └─────────┘   └───────┘   │                          </span><br><span class="line">              │                                                                       │                          </span><br><span class="line">              └───────────────────────────────────────────────────────────────────────┘                          </span><br><span class="line">                                                                                                                 </span><br><span class="line">                 Fig. 
Ingress/Egress procedure of Intel Tofino 2 on EdgeCore SONiC OS                            </span><br></pre></td></tr></table></figure><h2 id="optional-factory-reset-configuration">(Optional) Factory Reset Configuration</h2><p>If you would like to drop your modification to the system, try these commands.</p><h3 id="normal">Normal</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">sudo rm /etc/sonic/config_db.json</span><br><span class="line">sudo config-setup factory</span><br><span class="line">sudo config reload -y</span><br></pre></td></tr></table></figure><h3 id="force">Force</h3><p>If the system has already been trapped in a strange state, these commands might be able to do a force reset.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">sudo rm /etc/sonic/config_db.json</span><br><span class="line">sudo config-setup factory</span><br><span class="line">sudo config reload -y -f</span><br><span class="line">sudo service swss restart</span><br></pre></td></tr></table></figure><h2 id="optional-set-up-ports">(Optional) Set up Ports</h2><p>If you have trouble enabling L2 forwarding, maybe there is a mismatch in the port speed or FEC configuration.</p><h3 id="check-status-of-nics">Check Status of NICs</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sudo mlxlink -d mlx5_0 -m -c -e</span><br></pre></td></tr></table></figure><p>If <code>State</code> is not <code>Active</code>, there might be something wrong with the port configuration or physical connection.</p><h3 id="enable-auto-negotiation-on-nics">Enable 
Auto-Negotiation on NICs</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sudo ethtool -s enp216s0f0np0 autoneg on</span><br></pre></td></tr></table></figure><h3 id="disable-auto-negotiation-on-switch">Disable Auto-Negotiation on Switch</h3><p>Disable auto-negotiation on the switch only when it does not work as you expect. If it works well, just leave it alone.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">sudo config int autoneg Ethernet0 disabled</span><br><span class="line">sudo config int autoneg Ethernet8 disabled</span><br></pre></td></tr></table></figure><h3 id="force-100gb-port-speed">Force 100Gb Port Speed</h3><p>Change the port speed as you wish. The supported port speeds vary depending on the switches and NICs.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">sudo config interface breakout Ethernet0 <span class="string">&#x27;1x100G[40G](4)&#x27;</span></span><br><span class="line">sudo config interface breakout Ethernet8 <span class="string">&#x27;1x100G[40G](4)&#x27;</span></span><br></pre></td></tr></table></figure><h3 id="optional-configure-fec-forward-error-correction-to-rs-mode"><strong>(Optional) Configure FEC (Forward Error Correction) to RS mode</strong></h3><p>Setting FEC to <code>none</code> is also fine. 
Just make sure switches and NICs use the same mode.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">sudo config interface fec Ethernet0 rs</span><br><span class="line">sudo config interface fec Ethernet8 rs</span><br></pre></td></tr></table></figure><h3 id="bring-up-ports">Bring up Ports</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sudo config interface startup Ethernet0-8</span><br></pre></td></tr></table></figure><h2 id="optional-enable-l2-forwarding">(Optional) Enable L2 Forwarding</h2><p>By default, L2 switching is disabled on SONiC OS, unlike other switches.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">sudo config vlan add 1000</span><br><span class="line">sudo config vlan member add -u 1000 Ethernet0</span><br><span class="line">sudo config vlan member add -u 1000 Ethernet8</span><br><span class="line">sudo config inter ip add Vlan1000 10.200.0.1/24</span><br></pre></td></tr></table></figure><h2 id="enable-l3-pfc">Enable L3 PFC</h2><p>Since Intel Tofino 2 only supports 5 PGs, we decided to use only 5 TCs/PGs/Queues and then do 1:1 mapping between PGs/Queues and TCs.</p><h3 id="load-buffer-profile">Load Buffer Profile</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">sudo config qos reload</span><br><span class="line">sudo config save -y</span><br><span class="line">sudo reboot</span><br></pre></td></tr></table></figure><h3 
id="set-and-apply-dscp-to-tc-table-for-ports">Set and Apply DSCP-to-TC Table for Ports</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">sudo config qos dscp-tc add dscp-tc-prof --dscp 0-15 --tc 1</span><br><span class="line">sudo config qos dscp-tc update dscp-tc-prof --dscp 16-23 --tc 2</span><br><span class="line">sudo config qos dscp-tc update dscp-tc-prof --dscp 24-31 --tc 3</span><br><span class="line">sudo config qos dscp-tc update dscp-tc-prof --dscp 32-39 --tc 4</span><br><span class="line">sudo config qos dscp-tc update dscp-tc-prof --dscp 40-63 --tc 5</span><br><span class="line"></span><br><span class="line">sudo config interface qos dscp-tc <span class="built_in">bind</span> Ethernet0 dscp-tc-prof</span><br><span class="line">sudo config interface qos dscp-tc <span class="built_in">bind</span> Ethernet8 dscp-tc-prof</span><br></pre></td></tr></table></figure><h3 id="set-and-apply-tc-to-queue-table-for-ports">Set and Apply TC-to-Queue Table for Ports</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">sudo config qos tc-queue add tc-queue-prof --tc 1 --queue 1</span><br><span class="line">sudo config qos tc-queue update tc-queue-prof --tc 2 --queue 2</span><br><span class="line">sudo config qos tc-queue update tc-queue-prof --tc 3 --queue 3</span><br><span class="line">sudo config qos tc-queue update tc-queue-prof --tc 4 --queue 4</span><br><span class="line">sudo config qos tc-queue update tc-queue-prof --tc 5 --queue 
5</span><br></pre></td></tr></table></figure><h3 id="set-and-apply-tc-to-pg-priority-group-table">Set and Apply TC-to-PG (Priority Group) Table</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">sudo config qos tc-pg add tc-pg-prof --tc 1 --pg 1</span><br><span class="line">sudo config qos tc-pg update tc-pg-prof --tc 2 --pg 2</span><br><span class="line">sudo config qos tc-pg update tc-pg-prof --tc 3 --pg 3</span><br><span class="line">sudo config qos tc-pg update tc-pg-prof --tc 4 --pg 4</span><br><span class="line">sudo config qos tc-pg update tc-pg-prof --tc 5 --pg 5</span><br></pre></td></tr></table></figure><h3 id="specify-queue-scheduler-ets">Specify Queue Scheduler (ETS)</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">sudo config scheduler add sched-dwrr-100 --sched_type DWRR --weight 100</span><br><span class="line">sudo config scheduler add sched-strict --sched_type STRICT</span><br><span class="line"></span><br><span class="line">sudo config interface scheduler <span class="built_in">bind</span> queue Ethernet0 3 sched-dwrr-100</span><br><span class="line">sudo config interface scheduler <span class="built_in">bind</span> queue Ethernet8 3 sched-dwrr-100</span><br></pre></td></tr></table></figure><h3 id="bind-lossless-buffer-profile">Bind Lossless Buffer Profile</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td 
class="code"><pre><span class="line">sudo config interface buffer <span class="built_in">bind</span> priority-group Ethernet0 3 ingress_lossless_profile</span><br><span class="line">sudo config interface buffer <span class="built_in">bind</span> priority-group Ethernet8 3 ingress_lossless_profile</span><br><span class="line"></span><br><span class="line">sudo config interface buffer <span class="built_in">bind</span> queue Ethernet0 3 egress_lossless_profile</span><br><span class="line">sudo config interface buffer <span class="built_in">bind</span> queue Ethernet8 3 egress_lossless_profile</span><br></pre></td></tr></table></figure><h3 id="enable-pfc-for-ports">Enable PFC for Ports</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">sudo config interface pfc priority Ethernet0 3 on</span><br><span class="line">sudo config interface pfc priority Ethernet8 3 on</span><br></pre></td></tr></table></figure><h2 id="enable-ecn">Enable ECN</h2><p>Weighted random early detection (WRED) was originally designed to randomly drop packets to signal the sender’s congestion control algorithm to slow down its sending rate. 
When WRED works in ECN mode, it randomly sets ECN marks on forwarded packets instead of simply dropping them, and the marking probability depends on the buffer usage.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">sudo config wred add wred-prof --mode ecn --gmin 1048576 --gmax 2097152 --gdrop 5</span><br><span class="line"></span><br><span class="line">sudo config interface wred <span class="built_in">bind</span> queue Ethernet0 3 wred-prof</span><br><span class="line">sudo config interface wred <span class="built_in">bind</span> queue Ethernet8 3 wred-prof</span><br></pre></td></tr></table></figure><h2 id="references">References</h2><ul><li>Edgecore SONiC User Guide (Authentication is required)</li><li><a href="https://support.edge-core.com/hc/en-us/articles/900000210426--Enterprise-SONiC-VLAN-Inter-VLAN-Routing">https://support.edge-core.com/hc/en-us/articles/900000210426--Enterprise-SONiC-VLAN-Inter-VLAN-Routing</a></li><li><a href="https://support.edge-core.com/hc/en-us/articles/900006399883--Enterprise-SONiC-Reset-default-configuration">https://support.edge-core.com/hc/en-us/articles/900006399883--Enterprise-SONiC-Reset-default-configuration</a></li><li><a href="https://support.edge-core.com/hc/en-us/articles/900000200743--Enterprise-SONiC-Switch-Port-Attributes">https://support.edge-core.com/hc/en-us/articles/900000200743--Enterprise-SONiC-Switch-Port-Attributes</a></li><li><a href="https://github.com/sonic-net/SONiC/issues/339">https://github.com/sonic-net/SONiC/issues/339</a></li><li><a href="https://github.com/sonic-net/SONiC/issues/181">https://github.com/sonic-net/SONiC/issues/181</a></li><li><a href="https://github.com/sonic-net/SONiC/issues/28">https://github.com/sonic-net/SONiC/issues/28</a></li><li><a 
href="https://github.com/sonic-net/SONiC/wiki/technical_faq">https://github.com/sonic-net/SONiC/wiki/technical_faq</a></li><li><a href="https://github.com/sonic-net/SONiC/wiki/L2-Switch-mode">https://github.com/sonic-net/SONiC/wiki/L2-Switch-mode</a></li><li><a href="https://github.com/sonic-net/SONiC/wiki/Configuration-with-Minigraph-(~Sep-2017)">https://github.com/sonic-net/SONiC/wiki/Configuration-with-Minigraph-(~Sep-2017)</a></li><li><a href="https://github.com/sonic-net/sonic-buildimage/issues/9470">https://github.com/sonic-net/sonic-buildimage/issues/9470</a></li><li><a href="https://github.com/sonic-net/sonic-buildimage/issues/5947">https://github.com/sonic-net/sonic-buildimage/issues/5947</a></li><li><a href="https://lfnetworking.org/wp-content/uploads/sites/7/2022/06/Tech-Talk5-How_does_SONiC_Operate_on_a_Programmable_Switch_Edgecore_Howard-Hsu.pdf">https://lfnetworking.org/wp-content/uploads/sites/7/2022/06/Tech-Talk5-How_does_SONiC_Operate_on_a_Programmable_Switch_Edgecore_Howard-Hsu.pdf</a></li><li><a href="https://www.supermicro.com/manuals/network/Supermicro_Datacenter_SONiC_Configuration_Guide.pdf">https://www.supermicro.com/manuals/network/Supermicro_Datacenter_SONiC_Configuration_Guide.pdf</a></li><li><a href="https://www.juniper.net/documentation/en_US/release-independent/nce/topics/concept/sonic-feature-config.html#wred-conf">https://www.juniper.net/documentation/en_US/release-independent/nce/topics/concept/sonic-feature-config.html</a></li><li><a href="https://netbergtw.com/top-support/netberg-sonic/switching-silicon-shell/">https://netbergtw.com/top-support/netberg-sonic/switching-silicon-shell/</a></li><li><a href="https://gitlab.tongyuejun.cn/zhangjx/p4_doc/-/blob/0c64dcd7e21a0815ca9da2d10d9659002cbb4590/Wedge100BF_User_Manual.org">https://gitlab.tongyuejun.cn/zhangjx/p4_doc/-/blob/0c64dcd7e21a0815ca9da2d10d9659002cbb4590/Wedge100BF_User_Manual.org</a></li><li><a 
href="https://www.infoq.cn/article/kgucyp5s8m7hecl6qiel">https://www.infoq.cn/article/kgucyp5s8m7hecl6qiel</a></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;This article is for reference only, as Edgecore SONiC, a customized variant with many proprietary commands, is quite different from community SONiC. Also, some commands and configurations are specific to certain switch ASICs, such as the Intel Tofino I use right now. Thus, I still suggest that you keep the official guidebook at hand and read it carefully; it will save your life.&lt;/p&gt;</summary>
    
    
    
    
  </entry>
  
  <entry>
    <title>Fastest way to Install WireGuard-Go on Ubuntu 22.04</title>
    <link href="https://blog.mylab.cc/2024/02/10/Fastest-way-to-Install-WireGuard-Go-on-Ubuntu-22-04/"/>
    <id>https://blog.mylab.cc/2024/02/10/Fastest-way-to-Install-WireGuard-Go-on-Ubuntu-22-04/</id>
    <published>2024-02-10T15:35:04.000Z</published>
    <updated>2024-02-10T15:35:39.241Z</updated>
    
    <content type="html"><![CDATA[<p>I am surprised that there is no documentation on how to install WireGuard-Go from the official Ubuntu repo via APT, and the installation process itself is a little bit tricky.</p><h2 id="prerequisite">Prerequisite</h2><p>This method only applies to <strong>Ubuntu 22.04</strong> (and possibly newer versions), and you must be able to create TUN/TAP devices. For Ubuntu 20.04 or older, the official repo doesn't provide WireGuard-Go.</p><p>Note that if you are allowed to load kernel modules (for example, you are not setting up a VPN inside OpenVZ or Docker containers), you may consider kernel-space WireGuard instead of user-space WireGuard (e.g., WireGuard-Go, introduced in this post).</p><h2 id="step-1.-install-dependencies">Step 1. Install Dependencies</h2><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">sudo apt update</span><br><span class="line">sudo apt install wireguard-go wireguard-tools</span><br></pre></td></tr></table></figure><p>Note: The package <code>wireguard</code> should <strong>NOT</strong> be installed. If you have installed it, just remove it.</p><h2 id="step-2.-create-soft-link-to-binary-file">Step 2. Create Soft Link to Binary File</h2><p>Surprisingly, the binary file installed by <code>apt</code> is named <code>wireguard</code> instead of <code>wireguard-go</code>, so <code>wg-quick</code> does not recognize it. The workaround is simply to create a soft link named <code>wireguard-go</code> that points to the <code>wireguard</code> file.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">cd</span> /bin</span><br><span class="line">sudo ln -s wireguard wireguard-go</span><br></pre></td></tr></table></figure><h2 id="step-3.-launch-wireguard-go">Step 3. 
Launch WireGuard-Go</h2><p>The tutorial about writing the configuration file of <code>wg-quick</code> is omitted here since we can use the same configuration for both Kernel Space and User Space WireGuard, and we can launch WireGuard-Go in the same way as Kernel Space WireGuard using <code>wg-quick</code>.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Configuration file is written to /etc/wireguard/wg0.conf</span></span><br><span class="line">sudo wg-quick up wg0</span><br></pre></td></tr></table></figure><h2 id="optional-step-4.-run-wireguard-go-on-startup">(Optional) Step 4. Run WireGuard-Go on Startup</h2><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sudo systemctl <span class="built_in">enable</span> wg-quick@wg0</span><br></pre></td></tr></table></figure>]]></content>
    
    
    <summary type="html">&lt;p&gt;I am surprised that there is no documentation about how to install WireGuard-Go from the Ubuntu official repo via APT, and even the installation process is a little bit tricky.&lt;/p&gt;</summary>
    
    
    
    
  </entry>
  
  <entry>
    <title>Fix corrupted NVIDIA Driver after Upgrading / Downgrading Ubuntu Kernel</title>
    <link href="https://blog.mylab.cc/2023/09/19/Fix-corrupted-NVIDIA-Driver-after-Upgrading-Downgrading-Ubuntu-Kernel/"/>
    <id>https://blog.mylab.cc/2023/09/19/Fix-corrupted-NVIDIA-Driver-after-Upgrading-Downgrading-Ubuntu-Kernel/</id>
    <published>2023-09-19T15:28:54.000Z</published>
    <updated>2023-09-19T17:19:20.463Z</updated>
    
    <content type="html"><![CDATA[<p>One day, you rebooted your server and suddenly found your cute GPUs had all disappeared. Then you executed <code>nvidia-smi</code> to see what was going on, but you only got this error message.</p><blockquote><p>NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.</p></blockquote><p>I know what you're gonna say: Nvidia F*** You!</p><h2 id="causes">Causes</h2><p>Actually, I have encountered this problem many times. It is most likely caused by upgrading or downgrading the Linux kernel without properly rebuilding the kernel modules, which are essential parts of the GPU drivers.</p><h2 id="steps">Steps</h2><h3 id="check-system-status">Check System Status</h3><p>At this point, the <code>nvidia</code> module is probably not loaded (you can check with <code>lsmod | grep nvidia</code>). We can try to load the kernel module manually.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sudo modprobe nvidia</span><br></pre></td></tr></table></figure><p>You should get an error message like this.</p><blockquote><p>modprobe: FATAL: Module nvidia not found in directory /lib/modules/5.15.0-84-generic</p></blockquote><p>Meanwhile, check whether <code>/usr/lib/modules/5.15.0-84-generic/updates/dkms/nvidia.ko</code> is missing.</p><p>If you don’t see the error message above and that kernel module file does exist, you might have other issues, such as hardware failure. In that case, read the kernel logs through <code>dmesg</code> and check whether the GPUs are present through <code>lspci -vvv</code>, which should give you some clues.</p><h3 id="reinstall-dkms-and-nvidia-drivers">Reinstall DKMS and NVIDIA Drivers</h3><p>DKMS, a utility that manages out-of-tree kernel modules, as well as the NVIDIA drivers, might be broken. 
We could fix them by removing them first and installing them back later.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">sudo rm -r /var/lib/dkms/nvidia</span><br><span class="line">sudo apt install --reinstall dkms</span><br><span class="line">sudo apt autoremove --purge nvidia* cuda*</span><br><span class="line"></span><br><span class="line"><span class="comment"># Please refer to https://developer.nvidia.com/cuda-downloads to install the latest CUDA toolkits</span></span><br><span class="line"><span class="comment"># Commands below work only on Ubuntu 22.04</span></span><br><span class="line">wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb</span><br><span class="line">sudo dpkg -i cuda-keyring_1.1-1_all.deb</span><br><span class="line">sudo apt update</span><br><span class="line">sudo apt install cuda</span><br></pre></td></tr></table></figure><p>Note: Installing the full CUDA Toolkits is the <strong>only</strong> way I recommend to install drivers. 
Using the drivers provided by the Ubuntu official repo is <strong>NOT</strong> recommended.</p><h3 id="optional-reinstall-nvidia-docker-runtime">(Optional) Reinstall NVIDIA Docker Runtime</h3><p>The previous step will also remove NVIDIA Docker Runtime, which may lead to this error if you use Docker.</p><blockquote><p>Error response from daemon: Cannot restart container ...: could not select device driver "" with capabilities: [[gpu]]</p></blockquote><p>Thus, we need to install it back too.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sudo apt install -y nvidia-container-toolkit</span><br></pre></td></tr></table></figure><h2 id="references">References</h2><ul><li><a href="https://askubuntu.com/questions/1122512/nvidia-modules-are-missing-after-each-kernel-upgrade-in-18-04">https://askubuntu.com/questions/1122512/nvidia-modules-are-missing-after-each-kernel-upgrade-in-18-04</a></li><li><a href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html">https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html</a></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;One day, you rebooted your server and suddenly found your cute GPUs had all disappeared. Then you executed &lt;code&gt;nvidia-smi&lt;/code&gt; to see what was going on, but you only got this error message.&lt;/p&gt;</summary>
    
    
    
    
  </entry>
  
  <entry>
    <title>Enable L3 PFC + DCQCN for RoCE on Mellanox ConnectX NICs</title>
    <link href="https://blog.mylab.cc/2023/07/24/Enable-L3-PFC-DCQCN-for-RoCE-on-Mellanox-ConnectX-NICs/"/>
    <id>https://blog.mylab.cc/2023/07/24/Enable-L3-PFC-DCQCN-for-RoCE-on-Mellanox-ConnectX-NICs/</id>
    <published>2023-07-23T18:02:17.000Z</published>
    <updated>2024-03-31T17:04:39.362Z</updated>
    
    <content type="html"><![CDATA[<p>RoCE networks, a high-performance implementation of RDMA networks, offload flow control and congestion control algorithms to hardware to achieve high performance. However, these algorithms target lossless networks so that they can be simple enough to implement in hardware. Thus, we should mitigate packet loss and keep the network as lossless as we can. DCQCN (Congestion Control) + PFC (Flow Control) is a common option for many data centers. We observed that our system suffered severe performance fluctuations when they were disabled.</p><p>(Updated on Mar 31, 2024) You may <strong>not</strong> need PFC for modern NICs (ConnectX-6 or newer), which support NVIDIA RTTCC congestion control that doesn't rely on PFC and ECN.</p><h2 id="discussion">Discussion</h2><h3 id="l2-pfc-v.s.-l3-pfc">L2 PFC vs. L3 PFC</h3><p>It is recommended to set up L3 PFC rather than L2 PFC nowadays. This is probably because L3 PFC is easier to set up.</p><p>Note: <code>802.1p</code> and <code>pcp</code> usually refer to L2 PFC; <code>dscp</code> usually refers to L3 PFC.</p><h3 id="terminologies">Terminologies</h3><ul><li>Type of Service (ToS): A value to distinguish applications. RoCE is one of the applications.</li><li>DSCP / dot1p Tag: A value embedded in the packet header.</li><li>Buffer: Receive buffer (SRAM) on NICs. 
Can be partitioned into multiple regions.</li><li>Traffic Class (TC): An intermediate value between PFC Priority and Queue ID.</li><li>Queue: Send queue on NICs.</li><li>Note: Service Level is a concept in Infiniband networks, not related to RoCE networks.</li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">     &gt;&gt;2             Map                 Map       </span><br><span class="line">TOS ─────► DSCP Tag ─────► PFC Priority ─────► TC  </span><br><span class="line">                                 │             │   </span><br><span class="line">                                 │ Map         │ = </span><br><span class="line">                                 │             │   </span><br><span class="line">                                 ▼             ▼   </span><br><span class="line">                              BufferID      QueueID</span><br><span class="line">                                                   </span><br><span class="line"> Fig. Numerical Relationship between terminologies </span><br><span class="line"> <span class="keyword">for</span> Mellanox ConnectX NICs.                       
</span><br></pre></td></tr></table></figure><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">               ┌─────────────────────────────┐                          </span><br><span class="line">┌──────────┐   │ NIC                         │                          </span><br><span class="line">│ TCP App  │   │   ┌────────┐   ┌───────┐    │                          </span><br><span class="line">│ ToS=0    ├───┼──►│ Queue0 ├──►│       │    │   ┌─────────┐ ┌─────────┐</span><br><span class="line">└──────────┘   │   └────────┘   │ Sche  │    │   │ Packet  │ │ Packet  │</span><br><span class="line">               │       ...      │ duler ├────┼──►│ DSCP=0  │ │ DSCP=26 │</span><br><span class="line">┌──────────┐   │   ┌────────┐   │       │    │   └─────────┘ └─────────┘</span><br><span class="line">│ RDMA App ├───┼──►│ Queue3 ├──►│       │    │                          </span><br><span class="line">│ ToS=106  │   │   └────────┘   └───────┘    │                          </span><br><span class="line">└──────────┘   │                             │                          </span><br><span class="line">               └─────────────────────────────┘                          </span><br><span class="line">                                                                        </span><br><span class="line">  Fig. Egress procedure of Mellanox ConnectX NICs.                      
</span><br></pre></td></tr></table></figure><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">              ┌──────────────────────────────┐   ┌────────┐</span><br><span class="line">┌─────────┐   │ NIC                          │   │ System │</span><br><span class="line">│ Packet  │   │   ┌─────────┐   ┌────────┐   │   │ Memory │</span><br><span class="line">│ DSCP=0  ├───┼──►│ Buffer0 ├──►│        │   │   │        │</span><br><span class="line">└─────────┘   │   └─────────┘   │  DMA   │   │   │        │</span><br><span class="line">              │                 │ Engine ├───┼──►│        │</span><br><span class="line">┌─────────┐   │   ┌─────────┐   │        │   │   │        │</span><br><span class="line">│ Packet  ├───┼──►│ Buffer1 ├──►│        │   │   │        │</span><br><span class="line">│ DSCP=26 │   │   └─────────┘   └────────┘   │   │        │</span><br><span class="line">└─────────┘   │                              │   │        │</span><br><span class="line">              └──────────────────────────────┘   └────────┘</span><br><span class="line">                                                           </span><br><span class="line">   Fig. Ingress procedure of Mellanox ConnectX NICs.       </span><br></pre></td></tr></table></figure><h2 id="prerequisite">Prerequisite</h2><p>If network traffic goes through network switches, ensure <strong>L3 PFC</strong> and <strong>ECN</strong> are enabled on switches. 
We omit the detailed steps to configure switches here as they depend on the particular switch SKU and OS.</p><p>Note: we could set up a direct connection to test whether our configuration on NICs works or not.</p><h2 id="steps">Steps</h2><h3 id="enable-dcqcn">Enable DCQCN</h3><p>DCQCN is enabled by default on Mellanox ConnectX NICs. However, we still need to ensure ECN marking is enabled on all network switches.</p><h3 id="identify-interface-name-and-device-name">Identify Interface Name and Device Name</h3><p>We could check the interface and device name with <code>show_gids</code>, which should output something like below.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">DEV     PORT  INDEX  GID                                      IPv4          VER  DEV</span><br><span class="line">---     ----  -----  ---                                      ------------  ---  ---</span><br><span class="line">mlx5_1  1     3      0000:0000:0000:0000:0000:ffff:0ac8:000e  10.200.0.14   v2   enp216s0f1np1</span><br></pre></td></tr></table></figure><p>Here <code>mlx5_1</code> is the device name that is usually used to describe RDMA devices, and <code>enp216s0f1np1</code> is the interface name that can be managed by Linux as an Ethernet interface. We could save them as environment variables.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">export</span> IF_NAME=enp216s0f1np1</span><br><span class="line"><span class="built_in">export</span> DEV_NAME=mlx5_1</span><br></pre></td></tr></table></figure><h3 id="tune-pfc-headroom-size">Tune PFC Headroom Size</h3><p>We should reserve enough buffer space to store in-flight packets since senders (or requestors) need some time (latency) to respond to PFC pauses. This latency depends on the cable length (i.e., the propagation delay). 
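</p><p>A back-of-the-envelope sketch of why the cable length matters, not the formula the NICs actually use: assuming a propagation delay of about 5 ns per meter, the data still in flight on the cable during one round trip can be estimated as below. The function name <code>inflight_bytes</code> and the 5 ns/m figure are assumptions for illustration; a real headroom must also cover the response time of both ends and the frames that are already being transmitted.</p><figure class="highlight bash"><table><tr><td class="code"><pre><span class="line"><span class="comment"># Rough estimate of the bytes in flight on the cable during one round trip.</span></span><br><span class="line"><span class="comment"># Assumption: signals propagate at roughly 5 ns per meter.</span></span><br><span class="line"><span class="comment"># This is NOT the formula used by the NIC firmware.</span></span><br><span class="line">inflight_bytes() {</span><br><span class="line">    <span class="built_in">local</span> cable_len_m=$1   <span class="comment"># cable length in meters</span></span><br><span class="line">    <span class="built_in">local</span> link_gbps=$2     <span class="comment"># link speed in Gbit/s (1 Gbit/s = 1 bit per ns)</span></span><br><span class="line">    <span class="built_in">local</span> ns_per_m=5       <span class="comment"># assumed propagation delay</span></span><br><span class="line">    <span class="comment"># round-trip time in ns * bits per ns / 8 bits per byte</span></span><br><span class="line">    <span class="built_in">echo</span> $(( 2 * cable_len_m * ns_per_m * link_gbps / 8 ))</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line">inflight_bytes 50 100   <span class="comment"># 50 m cable on a 100 Gbit/s link</span></span><br></pre></td></tr></table></figure><p>For a 50 m cable on a 100 Gbps link this prints 6250, so the cable itself accounts for only a few kilobytes of the headroom.</p><p>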
Mellanox requires us to set the cable length manually, and then the NICs will automatically calculate the correct headroom size.</p><p>Fortunately, the cable length is recorded in the transceiver’s EEPROM.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sudo mlxlink -d <span class="variable">$DEV_NAME</span> -m -c -e</span><br></pre></td></tr></table></figure><p>From the outputs, we could find the cable length.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">Module Info</span><br><span class="line">-----------</span><br><span class="line">Identifier                      : QSFP28</span><br><span class="line">Compliance                      : 100GBASE-SR4 or 25GBASE-SR</span><br><span class="line">Cable Technology                : 850 nm VCSEL</span><br><span class="line">Cable Type                      : Optical Module (separated)</span><br><span class="line">OUI                             : Other</span><br><span class="line">Vendor Name                     : ...</span><br><span class="line">Vendor Part Number              : ...</span><br><span class="line">Vendor Serial Number            : ...</span><br><span class="line">Rev                             : A0</span><br><span class="line">Wavelength [nm]                 : 850</span><br><span class="line">Transfer Distance [m]           : 50 <span class="comment"># here</span></span><br></pre></td></tr></table></figure><p>Then, apply this 
parameter to our QoS setting.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sudo mlnx_qos -i <span class="variable">$IF_NAME</span> --cable_len=50</span><br></pre></td></tr></table></figure><h3 id="enable-l3-pfc">Enable L3 PFC</h3><p>Execute the following commands to activate PFC and apply the PFC setting to RoCE traffic. Note: the configuration is <strong>NOT</strong> persistent. You may need to re-run all the commands above to enable PFC each time the machine reboots. Also, ensure PFC is enabled for DSCP 26 traffic on all network switches.</p><p>(Updated on Mar 31, 2024) You may need to use a different PFC priority or DSCP value depending on the configuration/restriction of network switches. Conventionally, we use Priority 3 for lossless networks.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># use L3 PFC, default=pcp (L2 PFC)</span></span><br><span class="line">sudo mlnx_qos -i <span class="variable">$IF_NAME</span> --trust dscp</span><br><span class="line"></span><br><span class="line"><span class="comment"># enable PFC on PFC Priority 3</span></span><br><span class="line">sudo mlnx_qos -i <span class="variable">$IF_NAME</span> --pfc 0,0,0,1,0,0,0,0</span><br><span class="line"></span><br><span class="line"><span class="comment"># clear Traffic Class (TC) settings</span></span><br><span class="line"><span 
class="built_in">echo</span> <span class="string">&quot;tclass=-1&quot;</span> | sudo tee /sys/class/infiniband/<span class="variable">$DEV_NAME</span>/tc/1/traffic_class</span><br><span class="line"></span><br><span class="line"><span class="comment"># set default ToS for RoCE traffic (106 = DSCP 26 * 4 + 2, where 2 is the ECN field set to ECT(0))</span></span><br><span class="line"><span class="built_in">echo</span> 106 | sudo tee /sys/class/infiniband/<span class="variable">$DEV_NAME</span>/tc/1/traffic_class</span><br><span class="line"></span><br><span class="line"><span class="comment"># set default ToS for RoCE traffic established via RDMA-CM</span></span><br><span class="line">sudo cma_roce_tos -d <span class="variable">$DEV_NAME</span> -t 106</span><br></pre></td></tr></table></figure><h2 id="verification">Verification</h2><h3 id="show-pfc-setting">Show PFC Setting</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sudo mlnx_qos -i <span class="variable">$IF_NAME</span></span><br></pre></td></tr></table></figure><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span 
class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br></pre></td><td class="code"><pre><span class="line">DCBX mode: OS controlled</span><br><span class="line">Priority trust state: dscp</span><br><span class="line">dscp2prio mapping:</span><br><span class="line">        prio:0 dscp:07,06,05,04,03,02,01,00,</span><br><span class="line">        prio:1 dscp:15,14,13,12,11,10,09,08,</span><br><span class="line">        prio:2 dscp:23,22,21,20,19,18,17,16,</span><br><span class="line">        prio:3 dscp:31,30,29,28,27,26,25,24,</span><br><span class="line">        prio:4 dscp:39,38,37,36,35,34,33,32,</span><br><span class="line">        prio:5 dscp:47,46,45,44,43,42,41,40,</span><br><span class="line">        prio:6 dscp:55,54,53,52,51,50,49,48,</span><br><span class="line">        prio:7 dscp:63,62,61,60,59,58,57,56,</span><br><span class="line">default priority:</span><br><span class="line">Receive buffer size (bytes): 20016,156096,0,0,0,0,0,0,total_size=1027728</span><br><span class="line">Cable len: 7</span><br><span class="line">PFC configuration:</span><br><span class="line">        priority    0   1   2   3   4   5   6   7</span><br><span class="line">        enabled     0   0   0   1   0   0   0   0</span><br><span class="line">        buffer      0   0   0   1   0   0   0   0</span><br><span class="line">tc: 1 ratelimit: unlimited, tsa: vendor</span><br><span class="line">         priority:  0</span><br><span class="line">tc: 0 ratelimit: unlimited, tsa: vendor</span><br><span class="line">         priority:  1</span><br><span class="line">tc: 2 ratelimit: unlimited, tsa: vendor</span><br><span class="line">         priority:  2</span><br><span class="line">tc: 3 ratelimit: unlimited, tsa: vendor</span><br><span class="line">         priority:  3</span><br><span class="line">tc: 4 ratelimit: 
unlimited, tsa: vendor</span><br><span class="line">         priority:  4</span><br><span class="line">tc: 5 ratelimit: unlimited, tsa: vendor</span><br><span class="line">         priority:  5</span><br><span class="line">tc: 6 ratelimit: unlimited, tsa: vendor</span><br><span class="line">         priority:  6</span><br><span class="line">tc: 7 ratelimit: unlimited, tsa: vendor</span><br><span class="line">         priority:  7</span><br></pre></td></tr></table></figure><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">         &gt;&gt;2            Map               Map       </span><br><span class="line">TOS 106 ─────► DSCP 26 ─────► PFC Prio 3 ─────► TC 3</span><br><span class="line">                                    │           │   </span><br><span class="line">                                    │ Map       │ = </span><br><span class="line">                                    │           │   </span><br><span class="line">                                    ▼           ▼   </span><br><span class="line">                                 Buffer 1    Queue 3</span><br></pre></td></tr></table></figure><h3 id="check-dcqcn-is-functioning">Check DCQCN is functioning</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Check DCQCN is enabled on Prio 3</span></span><br><span class="line">cat /sys/class/net/<span 
class="variable">$IF_NAME</span>/ecn/roce_np/<span class="built_in">enable</span>/3</span><br><span class="line">cat /sys/class/net/<span class="variable">$IF_NAME</span>/ecn/roce_rp/<span class="built_in">enable</span>/3</span><br><span class="line"></span><br><span class="line"><span class="comment"># Check counters related to DCQCN</span></span><br><span class="line">cat /sys/class/infiniband/<span class="variable">$DEV_NAME</span>/ports/1/hw_counters/np_cnp_sent</span><br><span class="line">cat /sys/class/infiniband/<span class="variable">$DEV_NAME</span>/ports/1/hw_counters/np_ecn_marked_roce_packets</span><br><span class="line">cat /sys/class/infiniband/<span class="variable">$DEV_NAME</span>/ports/1/hw_counters/rp_cnp_handled</span><br></pre></td></tr></table></figure><p>Note: Two cases triggering NP to send CNP packets:</p><ul><li>NP’s NIC receives a packet with an ECN mark (marked by the switch indicating the switch’s buffer is about to be out of capacity).</li><li>NP’s NIC receives an out-of-order packet (packet loss occurred).</li></ul><h3 id="check-pfc-is-functioning">Check PFC is functioning</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">ethtool -S <span class="variable">$IF_NAME</span> | grep prio3</span><br></pre></td></tr></table></figure><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">rx_prio3_bytes: 462536457742</span><br><span class="line">rx_prio3_packets: 500098087</span><br><span class="line">rx_prio3_discards: 0</span><br><span class="line">tx_prio3_bytes: 
1180912512618</span><br><span class="line">tx_prio3_packets: 1155032777</span><br><span class="line">rx_prio3_pause: 214479</span><br><span class="line">rx_prio3_pause_duration: 213496</span><br><span class="line">tx_prio3_pause: 12</span><br><span class="line">tx_prio3_pause_duration: 13</span><br><span class="line">rx_prio3_pause_transition: 107222</span><br></pre></td></tr></table></figure><p>Note: <code>tx_prio3_pause</code> refers to the number of PFC pauses sent from this NIC as the server cannot absorb the network traffic quickly.</p><h2 id="miscellaneous">Miscellaneous</h2><h3 id="advanced-qos-settings">Advanced QoS Settings</h3><p>The default values of the parameters below should be fine…</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># mapping Priority to TC</span></span><br><span class="line">sudo mlnx_qos -i <span class="variable">$IF_NAME</span> --prio_tc=0,1,2,3,4,5,6,7</span><br><span class="line"></span><br><span class="line"><span class="comment"># mapping Priority to receive buffers</span></span><br><span class="line">sudo mlnx_qos -i <span class="variable">$IF_NAME</span> --prio2buffer=0,0,0,1,0,0,0,0</span><br><span class="line"></span><br><span class="line"><span class="comment"># adjust receive buffer sizes</span></span><br><span class="line">sudo mlnx_qos -i <span 
class="variable">$IF_NAME</span> --buffer_size=20016,156096,0,0,0,0,0,0</span><br><span class="line"></span><br><span class="line"><span class="comment">## Set alternative TSA</span></span><br><span class="line"><span class="comment"># set to vendor (default)</span></span><br><span class="line">sudo mlnx_qos -i <span class="variable">$IF_NAME</span> --tsa=vendor,vendor,vendor,vendor,vendor,vendor,vendor,vendor</span><br><span class="line"><span class="comment"># set to ets</span></span><br><span class="line">sudo mlnx_qos -i <span class="variable">$IF_NAME</span> --tsa=ets,ets,ets,ets,ets,ets,ets,ets --tcbw=0,0,0,100,0,0,0,0</span><br><span class="line"></span><br><span class="line"><span class="comment"># Remap DSCP value to specified PFC Priority</span></span><br><span class="line"><span class="comment"># Map DSCP 3 to Prio 3</span></span><br><span class="line">sudo mlnx_qos -i <span class="variable">$IF_NAME</span> --dscp2prio=<span class="string">&#x27;set,3,3&#x27;</span></span><br></pre></td></tr></table></figure><h3 id="performance-counters-on-nics">Performance Counters on NICs</h3><p>Refer to this: <a href="https://enterprise-support.nvidia.com/s/article/understanding-mlx5-linux-counters-and-status-parameters">https://enterprise-support.nvidia.com/s/article/understanding-mlx5-linux-counters-and-status-parameters</a>.</p><h3 id="failed-to-map-roce-traffic-to-specified-pfc-priority">Failed to map RoCE traffic to specified PFC Priority</h3><ul><li>Consider upgrading OFED drivers.</li><li><code>ibv_qp_attr.ah_attr.grh.traffic_class</code> may override the default ToS value.</li></ul><h3 id="configuration-for-bluefield-dpu">Configuration for BlueField DPU</h3><p>QoS settings could be set on the host using <code>mlnx_qos</code>, but Traffic Class values must be set on the DPU side individually.</p><h2 id="references">References</h2><ul><li><a 
href="https://baiwei0427.github.io/papers/lumina-sigcomm2023.pdf">https://baiwei0427.github.io/papers/lumina-sigcomm2023.pdf</a><ul><li>Many thanks to Wei Bai for helping us configure our networks!</li></ul></li><li><a href="https://docs.nvidia.com/networking/pages/viewpage.action?pageId=39264632">https://docs.nvidia.com/networking/pages/viewpage.action?pageId=39264632</a></li><li><a href="https://enterprise-support.nvidia.com/s/article/lossless-roce-configuration-for-linux-drivers-in-dscp-based-qos-mode">https://enterprise-support.nvidia.com/s/article/lossless-roce-configuration-for-linux-drivers-in-dscp-based-qos-mode</a></li><li><a href="https://enterprise-support.nvidia.com/s/article/understanding-mlx5-linux-counters-and-status-parameters">https://enterprise-support.nvidia.com/s/article/understanding-mlx5-linux-counters-and-status-parameters</a></li><li><a href="https://support.huawei.com/enterprise/zh/doc/EDOC1100197616/3dfff4ec">https://support.huawei.com/enterprise/zh/doc/EDOC1100197616/3dfff4ec</a></li><li><a href="https://www.rdmamojo.com/2013/01/12/ibv_modify_qp/">https://www.rdmamojo.com/2013/01/12/ibv_modify_qp/</a></li><li><a href="https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p523.pdf">https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p523.pdf</a></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;RoCE networks, a high-performance implementation of RDMA networks, offload flow control and congestion control algorithms to hardware to achieve high performance. However, these algorithms assume a lossless network so that they stay simple enough to implement in hardware. Thus, we should avoid packet loss and keep the network as close to lossless as we can. DCQCN (Congestion Control) + PFC (Flow Control) is a common option for many data centers. We observed that our system suffered severe performance fluctuation when they were disabled.&lt;/p&gt;</summary>
    
    
    
    
  </entry>
  
  <entry>
    <title>Quick way to fix wrong font size and location exported by PowerPoint Save as Picture on Mac</title>
    <link href="https://blog.mylab.cc/2023/05/07/Quick-way-to-fix-wrong-font-size-exported-by-PowerPoint-Save-as-Picture-on-Mac/"/>
    <id>https://blog.mylab.cc/2023/05/07/Quick-way-to-fix-wrong-font-size-exported-by-PowerPoint-Save-as-Picture-on-Mac/</id>
    <published>2023-05-06T21:19:04.000Z</published>
    <updated>2023-05-06T21:49:28.587Z</updated>
    
    <content type="html"><![CDATA[<p>Microsoft is always on its way of producing bugs...</p><h2 id="tldr">TL;DR</h2><ul><li>Select and <strong>Copy</strong> the object to export</li><li>Click <strong>Edit</strong> -&gt; <strong>Paste Special</strong> -&gt; <strong>PDF</strong></li><li>Click <strong>Save as Picture</strong> in the right-click menu to export the newly created object</li></ul><p>This method works on PowerPoint 16.72.</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;Microsoft is always on its way of producing bugs...&lt;/p&gt;</summary>
    
    
    
    
  </entry>
  
  <entry>
    <title>How to properly setup GPUDirect RDMA</title>
    <link href="https://blog.mylab.cc/2023/03/29/How-to-properly-setup-GPUDirect-RDMA/"/>
    <id>https://blog.mylab.cc/2023/03/29/How-to-properly-setup-GPUDirect-RDMA/</id>
    <published>2023-03-29T15:05:35.000Z</published>
    <updated>2023-06-09T12:51:45.688Z</updated>
    
    <content type="html"><![CDATA[<p>GPUDirect RDMA (GDR) is an incredible technology that allows remote machines to directly manipulate the local GPU's memory. However, there are not many online resources discussing this technology, so I felt very confused when I encountered issues related to RDMA, especially GDR.</p><h2 id="prerequisite">Prerequisite</h2><h3 id="install-rnic-drivers-and-toolkits">Install RNIC Drivers and Toolkits</h3><p>In this tutorial, I will use a Mellanox ConnectX RDMA NIC (RNIC) as an example to demonstrate the configuration steps.</p><p>Note that some configuration steps are vendor-specific, which means that for another vendor's RNIC you may need to find an alternative solution if my approach does not apply. Also, I didn't test GDR on NICs made by vendors other than Mellanox. (I suspect only Mellanox's RNICs support GDR.)</p><p>For ConnectX RNICs, the corresponding drivers and toolkits are all packed in <a href="https://network.nvidia.com/products/infiniband-drivers/linux/mlnx_ofed/">Mellanox OFED</a>.</p><h3 id="install-cuda-drivers-and-toolkits">Install CUDA Drivers and Toolkits</h3><p>I won't cover how to install these things as there are already many tutorials about this topic on the Internet. 
I recommend checking <a href="https://developer.nvidia.com/cuda-downloads">the official website</a> and installing the packages it provides.</p><p>Note that installing CUDA Drivers through <code>apt</code> and Toolkits through <code>conda</code> separately is <strong>NOT</strong> recommended.</p><h3 id="verify-installation">Verify Installation</h3><p>Once you have properly installed them, just execute the command <code>nvidia-smi topo -m</code> and you should see something like:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line">$ nvidia-smi topo -m</span><br><span class="line">        GPU0    GPU1    GPU2    GPU3    NIC0    NIC1    CPU Affinity    NUMA Affinity</span><br><span class="line">GPU0     X      NV4     NV4     NV4     SYS     SYS     0-127           N/A</span><br><span class="line">GPU1    NV4      X      NV4     NV4     SYS     SYS     0-127           N/A</span><br><span class="line">GPU2    NV4     NV4      X      NV4     PHB     PHB     0-127           N/A</span><br><span class="line">GPU3    NV4     NV4     NV4      X      SYS     SYS     0-127           N/A</span><br><span class="line">NIC0    SYS     SYS     PHB     SYS      X      PIX</span><br><span class="line">NIC1    SYS     SYS     PHB     SYS     PIX      X</span><br><span class="line"></span><br><span class="line">Legend:</span><br><span class="line"></span><br><span class="line">  X    = Self</span><br><span class="line">  SYS  = Connection traversing PCIe as well as 
the SMP interconnect between NUMA nodes (e.g., QPI/UPI)</span><br><span class="line">  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node</span><br><span class="line">  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)</span><br><span class="line">  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)</span><br><span class="line">  PIX  = Connection traversing at most a single PCIe bridge</span><br><span class="line">  NV<span class="comment">#  = Connection traversing a bonded set of # NVLinks</span></span><br><span class="line"></span><br><span class="line">NIC Legend:</span><br><span class="line"></span><br><span class="line">  NIC0: mlx5_0</span><br><span class="line">  NIC1: mlx5_1</span><br></pre></td></tr></table></figure><p>We will discuss what this output represents in the following section. For now, you should be able to identify both your GPUs and NICs from this output.</p><h3 id="check-gdr-hardware-support">Check GDR Hardware Support</h3><p>Not all systems are created equal. Continuing the example above, we can see several types of relationship between an individual GPU and a NIC, such as <code>SYS</code> and <code>PHB</code>. 
In fact, they greatly affect GDR performance.</p><p>From my experience, I believe:</p><ul><li>GDR Performance is good: <code>PIX</code>, <code>PXB</code></li><li>GDR Performance is likely to be good: <code>PHB</code><ul><li>Might be as good as <code>PIX</code> and <code>PXB</code></li></ul></li><li>GDR Performance is bad: <code>SYS</code>, <code>NODE</code></li></ul><blockquote><p>My benchmark results of GDR performance with a 100 Gbps RoCE network:</p><ul><li>Dual Intel Xeon 4112 + NVIDIA Tesla V100<ul><li><code>SYS</code>: ~2 GB/s</li><li><code>PIX</code>: ~10 GB/s</li></ul></li><li>Single AMD EPYC 7763 + NVIDIA Tesla A100<ul><li><code>SYS</code>: ~6 GB/s</li><li><code>PHB</code>: ~10 GB/s</li></ul></li><li>Dual Intel Xeon E5-2630 v4 + NVIDIA Tesla P100<ul><li><code>SYS</code>: ~0.3 GB/s</li></ul></li></ul></blockquote><blockquote><p>Here are a <a href="https://github.com/NVIDIA/nccl/issues/489">discussion about PHB</a> and a <a href="https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html">description of P2P Level</a>.</p></blockquote><blockquote><p>(Updated on Jun 9, 2023) Here is a <a href="https://dshcherb.github.io/2019/02/02/interpreting-pcie-device-to-cpu-locality-information.html">systematic introduction to PCIe Affinity</a>.</p></blockquote><p>If you unfortunately got some <code>SYS</code> or <code>NODE</code>, the relationship can possibly be corrected by plugging your GPU or NIC into the proper PCIe slots.</p><ul><li>For a multi-socket system, certain PCIe slots are physically connected to one CPU socket (package) while the other slots are connected to other sockets.</li><li>For a system with a chiplet-based CPU (e.g., AMD EPYC), certain PCIe slots might be physically connected to one chiplet.<ul><li>That is still the case for the AMD EPYC 7003 Series, which has a separate I/O die.</li></ul></li></ul><h2 id="load-nvidia-peer-memory-kernel-module">Load NVIDIA Peer Memory Kernel Module</h2><p>The <code>nvidia_peermem</code> module is bundled in the CUDA 
Toolkit downloaded from <a href="https://developer.nvidia.com/cuda-downloads">here</a>. By default, this kernel module is not loaded automatically, so we need to load it manually with the command <code>sudo modprobe nvidia_peermem</code>.</p><p>To check that the module is loaded, run <code>lsmod | grep nvidia_peermem</code> and see whether the module name appears in the output.</p><p>If you cannot find this kernel module on the system, consider installing the latest CUDA Toolkit.</p><blockquote><p>There is another version of this module, called <code>nv_peer_mem</code>, in <a href="https://github.com/Mellanox/nv_peer_memory">this repo</a>, but it appears to be no longer maintained.</p></blockquote><h2 id="disable-pcie-acs">Disable PCIe ACS</h2><p>Several reports (<a href="https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html">[1]</a>, <a href="https://github.com/NVIDIA/nccl/issues/631">[2]</a>) mention that PCIe ACS may hurt GDR performance. PCIe ACS is a security feature, but we never care about security when we are hungry for performance. Here is a script to disable it. Note that this script is for reference only. 
You may need to modify the content according to your machine's configuration.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#!/bin/bash</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Author: OmniReduce Team</span></span><br><span class="line"><span class="comment"># This script applies what was mentioned in:</span></span><br><span class="line"><span class="comment"># https://forums.developer.nvidia.com/t/multi-gpu-peer-to-peer-access-failing-on-tesla-k80/39748/9</span></span><br><span class="line"><span class="comment"># To disable PCIe ACS</span></span><br><span class="line"></span><br><span class="line"><span class="built_in">echo</span> <span class="string">&quot;Before===========================&quot;</span></span><br><span class="line">sudo lspci -vvv | grep -i acsctl</span><br><span class="line"><span class="built_in">echo</span> <span class="string">&quot;=================================&quot;</span></span><br><span class="line"></span><br><span class="line">pcis=$(lspci | grep -i plx | cut -d<span class="string">&#x27; &#x27;</span> -f1 | tr <span 
class="string">&#x27;\r\n&#x27;</span> <span class="string">&#x27; &#x27;</span>)</span><br><span class="line"><span class="built_in">echo</span> <span class="string">&quot;Disabling ACS on <span class="variable">$pcis</span>&quot;</span></span><br><span class="line"></span><br><span class="line"><span class="keyword">for</span> pci <span class="keyword">in</span> <span class="variable">$pcis</span></span><br><span class="line"><span class="keyword">do</span></span><br><span class="line">sudo setpci -s <span class="variable">$pci</span> f2a.w=0000</span><br><span class="line"><span class="keyword">done</span></span><br><span class="line"></span><br><span class="line"><span class="built_in">echo</span> <span class="string">&quot;After============================&quot;</span></span><br><span class="line">sudo lspci -vvv | grep -i acsctl</span><br><span class="line"><span class="built_in">echo</span> <span class="string">&quot;=================================&quot;</span></span><br><span class="line"></span><br><span class="line"><span class="built_in">echo</span> <span class="string">&quot;Make sure all ACS features are disabled as ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-&quot;</span></span><br></pre></td></tr></table></figure><h2 id="verification">Verification</h2><p>At this point, GDR should work. To verify that, I recommend using <a href="https://github.com/linux-rdma/perftest.git">OFED PerfTest</a>.</p><blockquote><ul><li><strong>DO NOT</strong> use the PerfTest binaries provided by OFED or APT. 
You should compile PerfTest yourself, because the binary distribution doesn't support GDR.</li><li>Both the client and the server can utilize GDR.</li><li><code>ib_send_bw</code> has some bugs in GDR tests.</li></ul></blockquote><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Compile</span></span><br><span class="line">./autogen.sh &amp;&amp; ./configure CUDA_H_PATH=/usr/<span class="built_in">local</span>/cuda/include/cuda.h &amp;&amp; make -j</span><br><span class="line"></span><br><span class="line"><span class="comment"># Launch as server</span></span><br><span class="line">./ib_write_bw -d ib_dev --use_cuda=&lt;gpu index&gt; -a</span><br><span class="line">./ib_write_bw -d mlx5_0 --use_cuda=0 -a <span class="comment"># Run on GPU0 and MLX5_0</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># Launch as client</span></span><br><span class="line">./ib_write_bw -d ib_dev --use_cuda=&lt;gpu index&gt; -a &lt;server ip addr&gt;</span><br><span class="line">./ib_write_bw -d mlx5_0 --use_cuda=0 -a 10.200.0.10 <span class="comment"># Run on GPU0 and MLX5_0</span></span><br></pre></td></tr></table></figure><blockquote><p>If you would like to test with NCCL, I recommend referring to <a href="https://www.alibabacloud.com/help/en/elastic-compute-service/latest/sccgn-series-instance-instructions">this article</a>.</p></blockquote><h2 id="troubleshooting">Troubleshooting</h2><p>If you still encounter errors like <code>ibv_create_qp failed</code> or <code>ibv_reg_mr failed</code>, this might be caused by Linux user limits 
(ulimit).</p><p>A quick and dirty way to temporarily work around this issue is to run the program as the root user. Once this workaround is confirmed to work, you can refer to <a href="https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html">this article</a> to solve the issue permanently.</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;GPUDirect RDMA (GDR) is an incredible technology that allows remote machines to directly manipulate the local GPU&#39;s memory. However, there are not many online resources discussing this technology, so I felt very confused when I encountered issues related to RDMA, especially GDR.&lt;/p&gt;</summary>
    
    
    
    
  </entry>
  
  <entry>
    <title>A Quick Guide to Parallel Makefiles</title>
    <link href="https://blog.mylab.cc/2023/01/31/%E5%BF%AB%E9%80%9F%E7%90%86%E8%A7%A3%E5%B9%B6%E8%A1%8C-Makefile/"/>
    <id>https://blog.mylab.cc/2023/01/31/%E5%BF%AB%E9%80%9F%E7%90%86%E8%A7%A3%E5%B9%B6%E8%A1%8C-Makefile/</id>
    <published>2023-01-30T21:24:47.000Z</published>
    <updated>2023-01-30T21:27:09.676Z</updated>
    
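    <content type="html"><![CDATA[<p>This tutorial is an extended guide for Problem 2 of the 2022 SUSTech intramural supercomputing competition; see the <a href="https://github.com/Tonny-Gu/HelloHPC">competition repo</a>.</p><p>A Makefile is the input file required by the GNU Make tool, and it can be thought of as a rather special Bash script. Compared with modern build tools, Make looks very crude; using it directly is like rubbing sticks together to make fire in an age when a lighter costs one yuan. The upside is that the tool is not very hard to understand.</p><p>The first thing to be clear about is that Make is meant for building fairly large projects (that is, managing a big pile of source files), so much of its design revolves around compilation. It is also decades old (the original Make dates from the 1970s, GNU Make from the 1980s) and was aimed at programmers who had almost no tooling at the time; compared with having no tools at all, Make was a great help. Overall, Make mainly does two things:</p><ul><li>Decide whether running certain Linux commands can be skipped</li><li>Determine the order in which Linux commands run, and run unrelated commands in parallel</li></ul><h2 id="基本语法">Basic Syntax</h2><p>Setting compilation aside for now, let's look at the basic syntax of a Make rule.</p><figure class="highlight makefile"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="section">target: prerequisites</span></span><br><span class="line">    command</span><br><span class="line">    command</span><br></pre></td></tr></table></figure><p>Note that in a real Makefile each command line must begin with a tab character.</p><p>Make is something of a patchwork, with many things lumped together. Targets, for example, come in two kinds:</p><ul><li>Real targets: each corresponds to an actual file</li><li>Phony targets: have nothing to do with files on disk</li></ul><p>Prerequisites are one or more targets, and they determine whether the commands below them get executed.</p><p>Commands are ordinary Linux commands. Typically they spell out <strong>how to produce a real target</strong> (if the target is a real one).</p><p>Still setting compilation aside, let's look at the simple example in <code>make/makefile1</code>:</p><figure class="highlight makefile"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"><span class="section">all: son.bak.txt</span></span><br><span class="line">@echo I am all <span class="comment"># display I am all on screen</span></span><br><span class="line"></span><br><span class="line"><span class="section">son.bak.txt: son.txt</span></span><br><span class="line">@echo I am son.bak.txt <span class="comment"># display I am son.bak.txt on screen</span></span><br><span class="line">@cp son.txt son.bak.txt <span class="comment"># 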
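copy son.txt to son.bak.txt</span></span><br></pre></td></tr></table></figure><p>Putting an <code>@</code> in front of a command tells Make not to print the command itself.</p><p>Here <code>son.bak.txt</code> is a real target, because a file named <code>son.bak.txt</code> really will exist (after the <code>cp</code> command has run).</p><p><code>son.txt</code> is also a real target, and it exists from the start.</p><p>By convention, <code>all</code> is a phony target that acts as the entry point (or overall goal) of the whole Makefile, and it is written as the first target (as it is here, on the first line). When we run <code>make</code> without specifying a target, Make builds the first target in the Makefile, which here is <code>all</code>.</p><p>In practice we do not need to distinguish carefully between real and phony targets; it is enough to know that some targets are special.</p><p>What does it mean to make a target? It boils down to this:</p><ul><li>If some target in the prerequisites has not been made yet, first make the prerequisites that have not been made</li><li>Once all targets in the prerequisites have been made, run the target's commands once, <strong>when certain conditions hold</strong></li></ul><p>In this example, Make's story goes like this:</p><ul><li>We want to make the <code>all</code> target</li><li>Make first sets out to make <code>all</code></li><li>It quickly finds that <code>son.bak.txt</code> has not been made yet, so it goes off to make <code>son.bak.txt</code></li><li>It then sees that <code>son.bak.txt</code> requires <code>son.txt</code> to be made</li><li>However:<ul><li>There is no rule for <code>son.txt</code></li><li>The file <code>son.txt</code> already exists</li><li>So we treat <code>son.txt</code> as already made</li></ul></li><li>All prerequisite targets are ready, and we find that the file <code>son.bak.txt</code> does not exist</li><li>So the commands (<code>cp</code> and <code>echo</code>) are executed</li><li>Once the commands finish, <code>son.bak.txt</code> counts as made</li><li>All of the prerequisite targets of <code>all</code> are ready, and we find that there is no file named <code>all</code></li><li>So the command (<code>echo</code>) is executed (the commands of a phony target always run, since the corresponding file never exists)</li><li>Once the command finishes, <code>all</code> counts as made</li></ul><p>The program output is as follows:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># make</span></span><br><span class="line">I am son.bak.txt</span><br><span class="line">I am all</span><br></pre></td></tr></table></figure><h2 id="判断是否需要执行命令">Deciding Whether Commands Need to Run</h2><p>One major job of Make is this: when we have a big pile of source files, we do not want to recompile everything just because one file changed. We would much rather <strong>pick out only the code affected by the change</strong> and recompile that, which saves a lot of time. So Make automatically and <strong>selectively executes</strong> the commands in the Makefile. When, then, does a command get executed?</p><h3 id="执行条件的-target-有更新">A prerequisite 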
target has been updated</h3><p>Continuing the previous example: after running <code>make</code> once, if we run <code>make</code> again, the only output is</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># make</span></span><br><span class="line">I am all</span><br></pre></td></tr></table></figure><p>Clearly, only the <code>all</code> target was executed. In this example, Make's story goes like this:</p><ul><li>We want to make the <code>all</code> target</li><li>Make first sets out to make <code>all</code></li><li>It quickly finds that <code>son.bak.txt</code> has not been made yet, so it goes off to make <code>son.bak.txt</code></li><li>It then sees that <code>son.bak.txt</code> requires <code>son.txt</code> to be made</li><li>We treat <code>son.txt</code> as already made</li><li>All prerequisite targets are ready, and this time we find that <code>son.bak.txt</code> exists and is no older than <code>son.txt</code></li><li>So the commands are skipped</li><li>All of the prerequisite targets of <code>all</code> are ready, and we find that there is no file named <code>all</code></li><li>So the command (<code>echo</code>) is executed (the commands of a phony target always run, since the corresponding file never exists)</li><li>Once the command finishes, <code>all</code> counts as made</li></ul><blockquote><p>If we remove the command of <code>all</code> from the Makefile and run <code>make</code> again:</p><figure class="highlight makefile"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="section">all: son.bak.txt</span></span><br><span class="line"></span><br><span class="line"><span class="section">son.bak.txt: son.txt</span></span><br><span class="line">echo I am son.bak.txt</span><br><span class="line">cp son.txt son.bak.txt</span><br></pre></td></tr></table></figure><p>this is all the output that remains. It means that this <code>make</code> run did nothing at all.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># make</span></span><br><span class="line">make: Nothing to be <span 
class="keyword">done</span> <span class="keyword">for</span> <span class="string">&#x27;all&#x27;</span>.</span><br></pre></td></tr></table></figure></blockquote><p>This is easy to understand: we have not modified <code>son.txt</code>, so there is no need to copy it to <code>son.bak.txt</code> again. Make decides whether a target needs to be remade by checking whether any target in its prerequisites has been updated (its commands were executed, or the file has been newly modified).</p><blockquote><p>Specifically, Make compares the modification times of <code>son.bak.txt</code> and <code>son.txt</code> to determine which file is newer.</p></blockquote><p>We can edit <code>son.txt</code>, write anything into it and save it. When we run <code>make</code> again, we will see that Make copies <code>son.bak.txt</code> afresh.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># echo 1 &gt; son.txt # write 1 to file son.txt</span></span><br><span class="line"><span class="comment"># make</span></span><br><span class="line">I am son.bak.txt</span><br><span class="line">I am all</span><br></pre></td></tr></table></figure><p>Our <code>I am son.bak.txt</code> is back, which means <code>son.bak.txt</code> has been copied again.</p><p>Here is a slightly more complex example. In <code>make/makefile2</code>, the Makefile reads as follows:</p><figure class="highlight makefile"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="section">all: son.txt</span></span><br><span class="line"></span><br><span class="line"><span class="section">son.txt: mom.txt dad.txt</span></span><br><span class="line">cat mom.txt &gt; son.txt</span><br><span 
class="line">cat dad.txt &gt;&gt; son.txt <span class="comment"># merge the content of mom.txt and dad.txt</span></span><br><span class="line"></span><br><span class="line"><span class="section">mom.txt: grandma.txt</span></span><br><span class="line">cat grandma.txt &gt; mom.txt <span class="comment"># replace the content with grandma.txt</span></span><br><span class="line"></span><br><span class="line"><span class="section">dad.txt: grandpa.txt</span></span><br><span class="line">cat grandma.txt &gt; dad.txt</span><br><span class="line"></span><br><span class="line"><span class="section">clean:</span></span><br><span class="line">@rm mom.txt dad.txt</span><br></pre></td></tr></table></figure><p>We also add a phony <code>clean</code> target to clean up the intermediate files generated by the Makefile. Running <code>make clean</code> deletes <code>mom.txt</code> and <code>dad.txt</code>.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># make clean</span></span><br><span class="line"><span class="comment"># make</span></span><br><span class="line">cat grandma.txt &gt; mom.txt <span class="comment"># replace the content with grandma.txt</span></span><br><span class="line">cat grandma.txt &gt; dad.txt</span><br><span class="line">cat mom.txt &gt; son.txt</span><br><span class="line">cat dad.txt &gt;&gt; son.txt <span class="comment"># merge the content of mom.txt and dad.txt</span></span><br><span class="line"><span class="comment"># make</span></span><br><span class="line">make: <span class="string">&#x27;all&#x27;</span> is up to 
date.</span><br><span class="line"><span class="comment"># echo 1 &gt; grandma.txt </span></span><br><span class="line"><span class="comment"># make</span></span><br><span class="line">cat grandma.txt &gt; mom.txt <span class="comment"># replace the content with grandma.txt</span></span><br><span class="line">cat mom.txt &gt; son.txt</span><br><span class="line">cat dad.txt &gt;&gt; son.txt <span class="comment"># merge the content of mom.txt and dad.txt</span></span><br></pre></td></tr></table></figure><p>Note that after the first <code>make</code>, if no file has been modified, every target other than <code>all</code> skips its commands. When only <code>grandma.txt</code> is modified, the <code>dad.txt</code> target also skips its commands. Along the way, Make's story goes like this:</p><ul><li>make the <code>all</code> target</li><li>make <code>son.txt</code></li><li>make <code>mom.txt</code></li><li>make <code>grandma.txt</code></li><li>Check whether the file <code>grandma.txt</code> has been newly modified</li><li>It has, so run <code>cat grandma.txt &gt; mom.txt</code></li><li>make <code>dad.txt</code></li><li>make <code>grandpa.txt</code></li><li>Check whether the file <code>grandpa.txt</code> has been newly modified</li><li>It has not, so do nothing</li><li>Check whether <code>mom.txt</code> or <code>dad.txt</code> has run its commands</li><li><code>mom.txt</code> has, so run <code>cat mom.txt &gt; son.txt</code> and <code>cat dad.txt &gt;&gt; son.txt</code></li><li>Check whether <code>son.txt</code> has run its commands</li><li><code>son.txt</code> has, so the (empty) command list of <code>all</code> is run</li></ul><h3 id="存在依赖关系">A dependency relationship exists</h3><p>Obviously, in the example below, <code>aunt.txt</code> and <code>uncle.txt</code> have nothing whatsoever to do with <code>all</code> and are never depended upon, so their commands can never be executed.</p><figure class="highlight makefile"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span 
class="line">17</span><br></pre></td><td class="code"><pre><span class="line"><span class="section">all: son.txt</span></span><br><span class="line"></span><br><span class="line"><span class="section">son.txt: mom.txt dad.txt</span></span><br><span class="line">cat mom.txt &gt; son.txt</span><br><span class="line">cat dad.txt &gt;&gt; son.txt</span><br><span class="line"></span><br><span class="line"><span class="section">mom.txt: grandma.txt</span></span><br><span class="line">cat grandma.txt &gt; mom.txt</span><br><span class="line"></span><br><span class="line"><span class="section">dad.txt: grandpa.txt</span></span><br><span class="line">cat grandma.txt &gt; dad.txt</span><br><span class="line"></span><br><span class="line"><span class="section">aunt.txt: grandma.txt</span></span><br><span class="line">cat grandma.txt &gt; aunt.txt</span><br><span class="line"></span><br><span class="line"><span class="section">uncle.txt: grandpa.txt</span></span><br><span class="line">cat grandma.txt &gt; uncle.txt</span><br></pre></td></tr></table></figure><h2 id="确定命令执行的顺序">Determining the Order of Command Execution</h2><p>A Makefile is essentially a specification of the dependencies between files and commands: it describes which files and which commands produce which files. In the previous example, producing <code>son.txt</code> depends on <code>mom.txt</code> and <code>dad.txt</code>, while producing <code>mom.txt</code> and <code>dad.txt</code> depends on <code>grandma.txt</code> and <code>grandpa.txt</code> respectively. In real-world use, to speed up the build, the commands of several targets that are depended upon but do not depend on one another can be executed at the same time. This in turn requires us to describe the dependencies accurately. Suppose we wrote the Makefile of the previous example as:</p><figure class="highlight makefile"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="section">all: son.txt mom.txt dad.txt</span></span><br><span 
class="line"></span><br><span class="line"><span class="section">son.txt:</span></span><br><span class="line">cat mom.txt &gt; son.txt</span><br><span class="line">cat dad.txt &gt;&gt; son.txt</span><br><span class="line"></span><br><span class="line"><span class="section">mom.txt: grandma.txt</span></span><br><span class="line">cat grandma.txt &gt; mom.txt</span><br><span class="line"></span><br><span class="line"><span class="section">dad.txt: grandpa.txt</span></span><br><span class="line">cat grandma.txt &gt; dad.txt</span><br></pre></td></tr></table></figure><p>Then, in a parallel build, the commands of <code>son.txt</code>, <code>mom.txt</code>, and <code>dad.txt</code> may be executed at the same time, so the commands of <code>son.txt</code> (which need <code>mom.txt</code> as input) may start running before the file <code>mom.txt</code> has been produced.</p><table><thead><tr class="header"><th>Step</th><th><code>son.txt</code></th><th><code>mom.txt</code></th><th><code>dad.txt</code></th></tr></thead><tbody><tr class="odd"><td>1</td><td><code>cat mom.txt &gt; son.txt</code></td><td></td><td></td></tr><tr class="even"><td>2</td><td>?? Where is my file?</td><td></td><td><code>cat &gt; dad.txt</code></td></tr><tr class="odd"><td>3</td><td></td><td><code>cat &gt; mom.txt</code></td><td></td></tr></tbody></table><p>So we need to make sure the execution order looks like the following instead.</p><table><thead><tr class="header"><th>Step</th><th><code>son.txt</code></th><th><code>mom.txt</code></th><th><code>dad.txt</code></th></tr></thead><tbody><tr class="odd"><td>1</td><td></td><td><code>cat &gt; mom.txt</code></td><td><code>cat &gt; dad.txt</code></td></tr><tr class="even"><td>2</td><td><code>cat mom.txt &gt; son.txt</code></td><td></td><td></td></tr></tbody></table><p>Running <code>make -j</code> enables parallel builds. In the example below, we can see that <code>grandpa</code> and <code>grandma</code> start at the same time, and <code>son</code> starts only after both <code>mom</code> and <code>dad</code> have finished. The total time is 12 s.</p><figure class="highlight makefile"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 
class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="section">all: son</span></span><br><span class="line">@echo <span class="string">&quot;finish        @&quot;</span> <span class="string">&quot;$(shell date)&quot;</span></span><br><span class="line"></span><br><span class="line"><span class="section">son: mom dad</span></span><br><span class="line">@echo <span class="string">&quot;son     begin @&quot;</span> <span class="string">&quot;$(shell date)&quot;</span></span><br><span class="line">@sleep 2</span><br><span class="line">@echo <span class="string">&quot;son     end&quot;</span></span><br><span class="line"></span><br><span class="line"><span class="section">mom: grandma</span></span><br><span class="line">@echo <span class="string">&quot;mom     begin @&quot;</span> <span class="string">&quot;$(shell date)&quot;</span></span><br><span class="line">@sleep 3</span><br><span class="line">@echo <span class="string">&quot;mom     end&quot;</span></span><br><span class="line"></span><br><span class="line"><span class="section">dad: grandpa</span></span><br><span class="line">@echo <span class="string">&quot;dad     begin @&quot;</span> <span class="string">&quot;$(shell date)&quot;</span></span><br><span class="line">@sleep 
5</span><br><span class="line">@echo <span class="string">&quot;dad     end&quot;</span></span><br><span class="line"></span><br><span class="line"><span class="section">grandma:</span></span><br><span class="line">@echo <span class="string">&quot;grandma begin @&quot;</span> <span class="string">&quot;$(shell date)&quot;</span></span><br><span class="line">@sleep 3</span><br><span class="line">@echo <span class="string">&quot;grandma end&quot;</span></span><br><span class="line"></span><br><span class="line"><span class="section">grandpa:</span></span><br><span class="line">@echo <span class="string">&quot;grandpa begin @&quot;</span> <span class="string">&quot;$(shell date)&quot;</span></span><br><span class="line">@sleep 5</span><br><span class="line">@echo <span class="string">&quot;grandpa end&quot;</span></span><br></pre></td></tr></table></figure><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># make -j</span></span><br><span class="line">grandma begin @ Fri Dec 23 23:34:14 UTC 2022</span><br><span class="line">grandpa begin @ Fri Dec 23 23:34:14 UTC 2022</span><br><span class="line">grandma end</span><br><span class="line">mom     begin @ Fri Dec 23 23:34:17 UTC 2022</span><br><span class="line">grandpa end</span><br><span class="line">dad     begin @ Fri Dec 23 23:34:19 UTC 2022</span><br><span class="line">mom     end</span><br><span class="line">dad     end</span><br><span class="line">son     begin @ Fri Dec 23 23:34:24 UTC 2022</span><br><span class="line">son     end</span><br><span 
class="line">finish        @ Fri Dec 23 23:34:26 UTC 2022</span><br></pre></td></tr></table></figure><p>The result can be visualized as follows:</p><table><thead><tr class="header"><th>Time (s)</th><th><code>son</code></th><th><code>mom</code></th><th><code>dad</code></th><th><code>grandma</code></th><th><code>grandpa</code></th></tr></thead><tbody><tr class="odd"><td>14</td><td></td><td></td><td></td><td>running</td><td>running</td></tr><tr class="even"><td>15</td><td></td><td></td><td></td><td>running</td><td>running</td></tr><tr class="odd"><td>16</td><td></td><td></td><td></td><td>ends</td><td>running</td></tr><tr class="even"><td>17</td><td></td><td>running</td><td></td><td></td><td>running</td></tr><tr class="odd"><td>18</td><td></td><td>running</td><td></td><td></td><td>ends</td></tr><tr class="even"><td>19</td><td></td><td>ends</td><td>running</td><td></td><td></td></tr><tr class="odd"><td>20</td><td></td><td></td><td>running</td><td></td><td></td></tr><tr class="even"><td>21</td><td></td><td></td><td>running</td><td></td><td></td></tr><tr class="odd"><td>22</td><td></td><td></td><td>running</td><td></td><td></td></tr><tr class="even"><td>23</td><td></td><td></td><td>ends</td><td></td><td></td></tr><tr class="odd"><td>24</td><td>running</td><td></td><td></td><td></td><td></td></tr><tr class="even"><td>25</td><td>ends</td><td></td><td></td><td></td><td></td></tr></tbody></table><p>If we change the code as follows, all targets start at the same time, and the total time is 5 s.</p><figure class="highlight makefile"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span 
class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"><span class="section">all: son mom dad grandma grandpa</span></span><br><span class="line">@echo <span class="string">&quot;finish        @&quot;</span> <span class="string">&quot;$(shell date)&quot;</span></span><br><span class="line"></span><br><span class="line"><span class="section">son: </span></span><br><span class="line">@echo <span class="string">&quot;son     begin @&quot;</span> <span class="string">&quot;$(shell date)&quot;</span></span><br><span class="line">@sleep 2</span><br><span class="line">@echo <span class="string">&quot;son     end&quot;</span></span><br><span class="line"></span><br><span class="line"><span class="section">mom: </span></span><br><span class="line">@echo <span class="string">&quot;mom     begin @&quot;</span> <span class="string">&quot;$(shell date)&quot;</span></span><br><span class="line">@sleep 3</span><br><span class="line">@echo <span class="string">&quot;mom     end&quot;</span></span><br><span class="line"></span><br><span class="line"><span class="section">dad: </span></span><br><span class="line">@echo <span class="string">&quot;dad     begin @&quot;</span> <span class="string">&quot;$(shell date)&quot;</span></span><br><span class="line">@sleep 5</span><br><span class="line">@echo <span class="string">&quot;dad     end&quot;</span></span><br><span class="line"></span><br><span class="line"><span class="section">grandma:</span></span><br><span class="line">@echo <span class="string">&quot;grandma begin @&quot;</span> <span class="string">&quot;$(shell date)&quot;</span></span><br><span class="line">@sleep 3</span><br><span class="line">@echo <span class="string">&quot;grandma 
end&quot;</span></span><br><span class="line"></span><br><span class="line"><span class="section">grandpa:</span></span><br><span class="line">@echo <span class="string">&quot;grandpa begin @&quot;</span> <span class="string">&quot;$(shell date)&quot;</span></span><br><span class="line">@sleep 5</span><br><span class="line">@echo <span class="string">&quot;grandpa end&quot;</span></span><br></pre></td></tr></table></figure><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># make -j</span></span><br><span class="line">son     begin @ Fri Dec 23 23:36:33 UTC 2022</span><br><span class="line">mom     begin @ Fri Dec 23 23:36:33 UTC 2022</span><br><span class="line">dad     begin @ Fri Dec 23 23:36:33 UTC 2022</span><br><span class="line">grandma begin @ Fri Dec 23 23:36:33 UTC 2022</span><br><span class="line">grandpa begin @ Fri Dec 23 23:36:33 UTC 2022</span><br><span class="line">son     end</span><br><span class="line">mom     end</span><br><span class="line">grandma end</span><br><span class="line">dad     end</span><br><span class="line">grandpa end</span><br><span class="line">finish        @ Fri Dec 23 23:36:38 UTC 2022</span><br></pre></td></tr></table></figure><p>The result can be visualized as follows:</p><table><thead><tr class="header"><th>Time (s)</th><th><code>son</code></th><th><code>mom</code></th><th><code>dad</code></th><th><code>grandma</code></th><th><code>grandpa</code></th></tr></thead><tbody><tr class="odd"><td>33</td><td>running</td><td>running</td><td>running</td><td>running</td><td>running</td></tr><tr 
class="even"><td>34</td><td>结束</td><td>执行</td><td>执行</td><td>执行</td><td>执行</td></tr><tr class="odd"><td>35</td><td></td><td>结束</td><td>执行</td><td>结束</td><td>执行</td></tr><tr class="even"><td>36</td><td></td><td></td><td>执行</td><td></td><td>执行</td></tr><tr class="odd"><td>37</td><td></td><td></td><td>结束</td><td></td><td>结束</td></tr></tbody></table><h2 id="references">References</h2><ul><li>https://makefiletutorial.com/</li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;This tutorial is an extended guide to Problem 2 of the 2022 SUSTech intramural supercomputing competition. See &lt;a href=&quot;https://github.com/Tonny-Gu/HelloHPC&quot;&gt;the competition repo&lt;/a&gt;.&lt;/p&gt;</summary>
    
    
    
    
  </entry>
  
  <entry>
    <title>Easy ways to setup Reverse Proxy for NAT-Passthrough</title>
    <link href="https://blog.mylab.cc/2022/10/03/Easy-ways-to-setup-Reverse-Proxy-for-NAT-Passthrough/"/>
    <id>https://blog.mylab.cc/2022/10/03/Easy-ways-to-setup-Reverse-Proxy-for-NAT-Passthrough/</id>
    <published>2022-10-03T11:12:20.000Z</published>
    <updated>2022-10-03T19:20:47.395Z</updated>
    
    <content type="html"><![CDATA[<p>It's time to abandon NPS, Frp, or other solutions that are hard to configure or no longer maintained. Thanks to Docker, it's possible to set up a reliable reverse proxy with a single command.</p><h2 id="prerequisite">Prerequisite</h2><p>Make sure you have a machine with a public IP address (it could be a VPS); otherwise this method does not apply. Let's take exposing the SSH port of a machine behind NAT as an example. There are three roles in total:</p><ul><li>Local Machine: The machine behind a NAT firewall with a local IP address, whose SSH port 22 you want to expose to the public network.</li><li>Public Server: The machine with a public IP address (a.b.c.d). Typically you rent it from a public cloud service provider like AWS or Azure. Its port 22 is also open and used by its own SSH service.</li><li>Your Computer: Another machine you are currently using.</li></ul><h2 id="recommended-persistent-method-via-gost-with-docker">(Recommended) Persistent Method: via Gost with Docker</h2><h3 id="before-getting-started-install-docker">Before getting started: Install Docker</h3><p>The first step is to install Docker on both <strong>the local machine</strong> and <strong>the public server</strong>. For Ubuntu 18.04+, I personally prefer to install Docker through apt.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">local-machine&amp;public-server$ sudo apt update</span><br><span class="line">local-machine&amp;public-server$ sudo apt install docker.io -y</span><br></pre></td></tr></table></figure><h3 id="recommended-using-websocket-tls-gost-relay-protocol">(Recommended) Using Websocket + TLS + Gost Relay Protocol</h3><p>Gost supports numerous proxy protocols. For reliability, a secured protocol is recommended to resist interference from security gateways such as the GFW. 
Luckily, with the help of Docker, it is very easy to set up a secured tunnel. More good news: the Docker daemon monitors the service and automatically restarts Gost at boot or on failure. Let's say goodbye to the annoying Systemd.</p><p><strong>On Public Server</strong></p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">cd</span> ~ <span class="comment"># or any other path you like</span></span><br><span class="line">mkdir gost</span><br><span class="line">docker run --name gost_server --net host -v $(<span class="built_in">pwd</span>)/gost:/root -id --restart always --entrypoint <span class="string">&quot;&quot;</span> -w <span class="string">&quot;/root&quot;</span> gogost/gost gost -L <span class="string">&quot;relay+wss://&lt;user&gt;:&lt;passwd&gt;@:&lt;gost_service_port&gt;?bind=true&quot;</span></span><br></pre></td></tr></table></figure><p><strong>On Local Machine</strong></p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker run --name gost_client --net host -id --restart always --entrypoint <span class="string">&quot;&quot;</span>  gogost/gost gost -L rtcp://:&lt;remote_port&gt;/:&lt;local_port&gt; -F <span class="string">&quot;relay+wss://&lt;user&gt;:&lt;passwd&gt;@&lt;pub_server_ip_or_domain&gt;:&lt;gost_service_port&gt;&quot;</span></span><br></pre></td></tr></table></figure><p>Here is an explanation of the parameters:</p><ul><li><code>user</code> / <code>passwd</code>: The username and the password for Gost. They are unrelated to any other account, such as Linux accounts.</li><li><code>pub_server_ip_or_domain</code>: It can be either the public IP address or the domain name bound to it. 
If you would like to use a valid SSL certificate issued by a CA, use only the domain name here.</li><li><code>gost_service_port</code>: Can be an arbitrary value. It is also fine to set this port number to 443 so that the service looks like an HTTPS server.</li><li><code>remote_port</code>: The port that Gost listens on at the public server. Data received on this port is forwarded to <code>local_port</code> on the local machine. It can be an arbitrary value.</li><li><code>local_port</code>: The port of the service on the local machine that you want to expose. In our case, it is 22, the SSH port.</li></ul><p>Back to our case, the corresponding commands are:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># On Public Server</span></span><br><span class="line"><span class="built_in">cd</span> ~</span><br><span class="line">mkdir gost</span><br><span class="line">docker run --name gost_server --net host -v $(<span class="built_in">pwd</span>)/gost:/root -id --restart always --entrypoint <span class="string">&quot;&quot;</span> -w <span class="string">&quot;/root&quot;</span> gogost/gost gost -L <span class="string">&quot;relay+wss://user:pass@:1234?bind=true&quot;</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># On Local Machine</span></span><br><span class="line">docker run --name gost_client --net host -id --restart always --entrypoint <span class="string">&quot;&quot;</span>  gogost/gost gost -L rtcp://:2022/:22 -F <span class="string">&quot;relay+wss://user:pass@a.b.c.d:1234&quot;</span> <span class="comment"># 
a.b.c.d is the public IP of the public server</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># or</span></span><br><span class="line">docker run --name gost_client --net host -id --restart always --entrypoint <span class="string">&quot;&quot;</span>  gogost/gost gost -L rtcp://:2022/:22 -F <span class="string">&quot;relay+wss://user:pass@mydomain.com:1234&quot;</span> <span class="comment"># mydomain.com resolves to a.b.c.d</span></span><br></pre></td></tr></table></figure><p>After that, SSH port 22 on the local machine should be mapped to port 2022 on the public server.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># user-local is the name of the user on local machine</span></span><br><span class="line">your-computer$ ssh user-local@a.b.c.d -p 2022</span><br><span class="line"><span class="comment"># or</span></span><br><span class="line">your-computer$ ssh user-local@mydomain.com -p 2022</span><br></pre></td></tr></table></figure><blockquote><p><strong>Debugging Tips:</strong> If Gost is not working properly, we can read the log using the command below.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker logs -f gost_server <span class="comment"># or gost_client</span></span><br></pre></td></tr></table></figure></blockquote><h4 id="optional-using-your-ssl-certificate">(Optional) Using your SSL certificate</h4><p>By default, Gost generates a self-signed SSL certificate if the user doesn't specify one. However, this may be considered unsafe because the connection can no longer defend against MITM attacks. 
Moreover, some security gateways may disrupt TLS sessions that use a self-signed certificate.</p><p>The solution is simple. Do you remember the <code>gost</code> directory we created before? We just need to put our certificate there and restart the service. There should be two files in total, <code>cert.pem</code> and <code>key.pem</code>, and their content should start with <code>-----BEGIN CERTIFICATE-----</code> and <code>-----BEGIN RSA PRIVATE KEY-----</code> respectively.</p><p>Lastly, the command to restart the service is:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># On Public Server</span></span><br><span class="line">docker restart gost_server</span><br></pre></td></tr></table></figure><h2 id="temporary-method-via-openssh-client">Temporary Method: via OpenSSH Client</h2><p>If you don't want to install any new software, you can also use SSH to build a tunnel. This method is also simple (sometimes), but it is not very reliable. As you might know, an SSH connection can easily be disrupted for many reasons. 
In our case, we just need to type this command on the local machine to set up a tunnel:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">local-machine$ ssh -R <span class="string">&quot;[::]:2022:localhost:22&quot;</span> user-public@a.b.c.d</span><br><span class="line"><span class="comment"># a.b.c.d is the public IP, user-public is the name of the user on public server</span></span><br></pre></td></tr></table></figure><p>If everything goes well, you can now connect to the local machine through:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">your-computer$ ssh user-local@a.b.c.d -p 2022</span><br><span class="line"><span class="comment"># user-local is the name of the user on local machine</span></span><br></pre></td></tr></table></figure><p>However, if you fail to connect to your local machine,
please check the <code>sshd</code> configuration on the public server.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">public-server$ sudo nano /etc/ssh/sshd_config</span><br></pre></td></tr></table></figure><p>Find the following line (in the nano editor, press Ctrl-W to search), uncomment it, and replace <code>no</code> with <code>yes</code>.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Before</span></span><br><span class="line"><span class="comment">#GatewayPorts no</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># After</span></span><br><span class="line">GatewayPorts yes</span><br></pre></td></tr></table></figure><p>Then restart the SSH server.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">public-server$ sudo service sshd restart</span><br></pre></td></tr></table></figure>]]></content>
    
    
    <summary type="html">&lt;p&gt;It&#39;s time to abandon NPS, Frp, or other solutions that are hard to configure or no longer maintained. Thanks to Docker, it&#39;s possible to set up a reliable reverse proxy with a single command.&lt;/p&gt;</summary>
    
    
    
    
  </entry>
  
  <entry>
    <title>Slurm Quick Installation for Cluster on Ubuntu 20.04</title>
    <link href="https://blog.mylab.cc/2022/09/02/Slurm-Quick-Installation-for-Cluster-on-Ubuntu-20-04/"/>
    <id>https://blog.mylab.cc/2022/09/02/Slurm-Quick-Installation-for-Cluster-on-Ubuntu-20-04/</id>
    <published>2022-09-02T04:17:16.000Z</published>
    <updated>2022-09-04T20:24:41.591Z</updated>
    
    <content type="html"><![CDATA[<p>Slurm makes a bunch of separate machines look much like a cluster, doesn't it?</p><h2 id="naming-convention-of-nodes">Naming Convention of Nodes</h2><p>A typical cluster comprises management nodes and compute nodes. This article takes our cluster as an example to demonstrate the steps to install and configure Slurm. In our case, the management node is called <code>clab-mgt01</code> while the compute nodes are named from <code>clab01</code> to <code>clab20</code> in order.</p><h2 id="install-dependencies">Install Dependencies</h2><p>Execute the following command to install the dependencies <strong>on all machines</strong>. (<code>clab-all</code> refers to all machines, including management and compute nodes.)</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">clab-all$ sudo apt install slurm-wlm slurm-client munge</span><br></pre></td></tr></table></figure><blockquote><p>Tips: There are several tools that may help to manage multiple nodes easily:</p><ul><li>iTerm2 (on Mac) / Terminator (on Linux)</li><li>csshX (on Mac) / cssh (on Linux)</li><li>Parallel SSH (at cluster side)</li></ul></blockquote><h2 id="generate-slurm-configuration">Generate Slurm Configuration</h2><p>There is <a href="https://slurm.schedmd.com/configurator.html">an official online configuration generator</a>. 
We should carefully check the fields below.</p><ul><li><strong>SlurmctldHost</strong>: <code>clab-mgt01</code> in our case.</li><li><strong>NodeName</strong>: <code>clab[01-20]</code> in our case.</li><li><strong>CPUs</strong>: It is recommended to leave it blank.</li><li><strong>Sockets</strong>: For the common dual-socket server, it should be <code>2</code>.</li><li><strong>CoresPerSocket</strong>: Number of physical cores per socket.</li><li><strong>ThreadsPerCore</strong>: For a regular x86 server, if hyperthreading is enabled, it should be <code>2</code>, otherwise <code>1</code>.</li><li><strong>RealMemory</strong>: Optional.</li></ul><p>Click <code>submit</code>, then copy the generated file content to <code>/etc/slurm-llnl/slurm.conf</code> <strong>on all machines</strong>.</p><blockquote><p>Tips: Don't forget the shared storage (e.g. NFS storage) on the cluster. We could utilize it to distribute files.</p></blockquote><h2 id="distribute-munge-key">Distribute Munge Key</h2><p>Once Munge is installed successfully, the key <code>/etc/munge/munge.key</code> will be automatically generated. All machines are required to hold the same key. Therefore, we should distribute the key <strong>from the management node</strong> to <strong>the remaining nodes</strong>, including compute nodes and any backup management nodes.</p><blockquote><p>Tips: Again. 
We could also utilize the shared storage to distribute the key.</p></blockquote><p>Then make sure the permission and the ownership are correctly set.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">clab-all$ sudo chmod 400 /etc/munge/munge.key</span><br><span class="line">clab-all$ sudo chown munge:munge /etc/munge/munge.key</span><br></pre></td></tr></table></figure><h2 id="patch-slurm-cgroup-integration">Patch Slurm Cgroup Integration</h2><p>By default, Slurm does not work well with cgroups. If we start the Slurm service right now, we may receive the error shown below.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">error: cgroup namespace <span class="string">&#x27;freezer&#x27;</span> not mounted. aborting</span><br></pre></td></tr></table></figure><p>This issue can be fixed by pasting the following content into <code>/etc/slurm/cgroup.conf</code> <strong>on compute nodes</strong>.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">CgroupMountpoint=/sys/fs/cgroup</span><br></pre></td></tr></table></figure><p>or by using this command:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">echo</span> CgroupMountpoint=/sys/fs/cgroup | sudo tee -a /etc/slurm/cgroup.conf</span><br></pre></td></tr></table></figure><h2 id="fix-directory-permission">Fix Directory Permission</h2><p>For unknown reasons, the permission of the relevant directory is not set properly, which may lead to this error.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td 
class="code"><pre><span class="line">slurmctld: fatal: mkdir(/var/spool/slurmctld): Permission denied</span><br></pre></td></tr></table></figure><p>The solution is executing the commands below <strong>on management nodes</strong>.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">clab-mgt$ sudo mkdir -p /var/spool/slurmctld</span><br><span class="line">clab-mgt$ sudo chown slurm:slurm /var/spool/slurmctld/</span><br></pre></td></tr></table></figure><h2 id="start-slurm-service">Start Slurm Service</h2><p>So far, we have finished the basic configuration. Let us launch Slurm now.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># On management nodes</span></span><br><span class="line">clab-mgt$ sudo systemctl <span class="built_in">enable</span> munge</span><br><span class="line">clab-mgt$ sudo systemctl start munge</span><br><span class="line">clab-mgt$ sudo systemctl <span class="built_in">enable</span> slurmctld</span><br><span class="line">clab-mgt$ sudo systemctl start slurmctld</span><br><span class="line"></span><br><span class="line"><span class="comment"># On compute nodes</span></span><br><span class="line">clab-comp$ sudo systemctl <span class="built_in">enable</span> munge</span><br><span class="line">clab-comp$ sudo systemctl start munge</span><br><span class="line">clab-comp$ sudo systemctl <span class="built_in">enable</span> slurmd</span><br><span class="line">clab-comp$ sudo systemctl start 
slurmd</span><br></pre></td></tr></table></figure><p>Run <code>sinfo</code>, and we should see that all the compute nodes are ready.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ sinfo</span><br><span class="line">PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST</span><br><span class="line">debug*       up   infinite     20   idle clab[01-20]</span><br></pre></td></tr></table></figure><h2 id="debugging-tips">Debugging Tips</h2><p>If Slurm is not working correctly, try running the daemons in the foreground with <code>-D</code> to see their messages directly.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">clab-mgt$ sudo slurmctld -D</span><br><span class="line">clab-comp$ sudo slurmd -D</span><br></pre></td></tr></table></figure><h2 id="references">References</h2><ul><li><a href="https://www.cnblogs.com/aobaxu/p/16195237.html">https://www.cnblogs.com/aobaxu/p/16195237.html</a></li><li><a href="https://stackoverflow.com/questions/62641323/error-cgroup-namespace-freezer-not-mounted-aborting">https://stackoverflow.com/questions/62641323/error-cgroup-namespace-freezer-not-mounted-aborting</a></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;Slurm makes a bunch of separate machines look much like a cluster, doesn&#39;t it?&lt;/p&gt;</summary>
    
    
    
    
  </entry>
  
  <entry>
    <title>Tips of configuring InfiniBand adapters</title>
    <link href="https://blog.mylab.cc/2022/08/31/Tips-of-configuring-InfiniBand-Adapters/"/>
    <id>https://blog.mylab.cc/2022/08/31/Tips-of-configuring-InfiniBand-Adapters/</id>
    <published>2022-08-30T16:35:51.000Z</published>
    <updated>2022-08-31T11:33:36.107Z</updated>
    
    <content type="html"><![CDATA[<p>After reconfiguring clusters from scratch several times, it seems that I am gradually adapting to this mysterious and strange InfiniBand world...</p><h2 id="relationship-among-infiniband-roce-ipoib-and-ethernet-mode">Relationship among InfiniBand, RoCE, IPoIB, and Ethernet Mode</h2><p>Let us take the Mellanox ConnectX adapter as an example. This adapter can work in either InfiniBand Mode or Ethernet Mode, which is configurable with tools provided by the vendor. As iWARP is not widely adopted, this article does not discuss it.</p><table><colgroup><col style="width: 45%" /><col style="width: 30%" /><col style="width: 23%" /></colgroup><thead><tr class="header"><th></th><th>InfiniBand Mode</th><th>Ethernet Mode</th></tr></thead><tbody><tr class="odd"><td>Supported by ConnectX</td><td>Yes</td><td>Yes</td></tr><tr class="even"><td>RDMA Support</td><td>Yes</td><td>Yes</td></tr><tr class="odd"><td>Programmable with Verbs</td><td>Yes</td><td>Yes</td></tr><tr class="even"><td>TCP/IP Support</td><td>Needs IPoIB</td><td>Yes</td></tr><tr class="odd"><td>Configurable with Netplan (e.g. Assign IP Address)</td><td>Needs IPoIB</td><td>Yes</td></tr><tr class="even"><td>Layout of RDMA Packet</td><td>IB Frame + IB Header</td><td>ETH Frame + RoCE Header</td></tr><tr class="odd"><td>Layout of TCP Packet</td><td>IB Frame + IB/IPoIB/IP/TCP Headers</td><td>ETH Frame + IP/TCP Headers</td></tr></tbody></table><p>Note that "RoCE Header" is a general term here: RoCEv1 and RoCEv2 define this part differently.</p><h2 id="identify-infiniband-ethernet-mode">Identify InfiniBand / Ethernet Mode</h2><p>The easiest way is to look at the interface name and link type with <code>ifconfig</code> or <code>ip</code> under Linux. 
An InfiniBand adapter working in Ethernet mode looks exactly the same as a regular Ethernet adapter.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">$ ip a</span><br><span class="line"><span class="comment"># InfiniBand Mode</span></span><br><span class="line">4: ibp129s0: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 2044 qdisc mq state UP group default qlen 256</span><br><span class="line">    link/infiniband ...</span><br><span class="line">    inet 192.168.7.100/24 brd 192.168.7.255 scope global ibp129s0</span><br><span class="line">       valid_lft forever preferred_lft forever</span><br><span class="line"></span><br><span class="line"><span class="comment"># Ethernet Mode</span></span><br><span class="line">7: ens1f1: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc mq state UP group default qlen 1000</span><br><span class="line">    link/ether ...</span><br><span class="line">    inet 10.200.0.1/24 brd 10.200.0.255 scope global ens1f1</span><br><span class="line">       valid_lft forever preferred_lft forever</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Besides, <code>ibdev2netdev</code> can also help.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">$ ibdev2netdev</span><br><span class="line">mlx4_0 port 1 ==&gt; ibp129s0 (Up)</span><br><span class="line">mlx5_0 port 1 ==&gt; ens1f0 
(Up)</span><br></pre></td></tr></table></figure><p>Another approach is through <code>ibstat</code>. And the field <code>Link layer</code> shows which mode the adapter is working in.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br></pre></td><td class="code"><pre><span class="line">$ ibstat</span><br><span class="line"><span class="comment"># InfiniBand Mode</span></span><br><span class="line">CA <span class="string">&#x27;mlx4_0&#x27;</span></span><br><span class="line">CA <span class="built_in">type</span>: MT4099</span><br><span class="line">Number of ports: 1</span><br><span class="line">Firmware version: 2.42.5000</span><br><span class="line">Hardware version: 1</span><br><span class="line">Node GUID: </span><br><span class="line">System image GUID: 
</span><br><span class="line">Port 1:</span><br><span class="line">State: Active</span><br><span class="line">Physical state: LinkUp</span><br><span class="line">Rate: 56</span><br><span class="line">Base lid: 1</span><br><span class="line">LMC: 0</span><br><span class="line">SM lid: 1</span><br><span class="line">Capability mask: 0x0251486a</span><br><span class="line">Port GUID: </span><br><span class="line">Link layer: InfiniBand</span><br><span class="line"></span><br><span class="line"><span class="comment"># Ethernet Mode</span></span><br><span class="line">CA <span class="string">&#x27;mlx5_0&#x27;</span></span><br><span class="line">CA <span class="built_in">type</span>: MT4119</span><br><span class="line">Number of ports: 1</span><br><span class="line">Firmware version: 16.25.1020</span><br><span class="line">Hardware version: 0</span><br><span class="line">Node GUID: </span><br><span class="line">System image GUID: </span><br><span class="line">Port 1:</span><br><span class="line">State: Active</span><br><span class="line">Physical state: LinkUp</span><br><span class="line">Rate: 100</span><br><span class="line">Base lid: 0</span><br><span class="line">LMC: 0</span><br><span class="line">SM lid: 0</span><br><span class="line">Capability mask: 0x00010000</span><br><span class="line">Port GUID: </span><br><span class="line">Link layer: Ethernet</span><br></pre></td></tr></table></figure><h2 id="change-infiniband-ethernet-mode">Change InfiniBand / Ethernet Mode</h2><p>There is currently no vendor-independent way to change the work mode. For the Mellanox ConnectX adapter, the vendor provides a tool called <code>mlxconfig</code>. 
Here is the usage from <a href="https://docs.nvidia.com/networking/display/MFTv4110/Using+mlxconfig">the official documentation</a>, where you can find more information about it.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">$ sudo mlxconfig -d /dev/mst/mt4103_pci_cr0 <span class="built_in">set</span> LINK_TYPE_P1=1 LINK_TYPE_P2=1</span><br><span class="line"> </span><br><span class="line">Device <span class="comment">#1:</span></span><br><span class="line">----------</span><br><span class="line">Device <span class="built_in">type</span>:   ConnectX3Pro</span><br><span class="line">PCI device:    /dev/mst/mt4103_pci_cr0</span><br><span class="line">Configurations:        Next Boot        New</span><br><span class="line">  LINK_TYPE_P1         ETH(2)           IB(1)</span><br><span class="line">  LINK_TYPE_P2         ETH(2)           IB(1)</span><br><span class="line"> </span><br><span class="line">Apply new Configuration? ? (y/n) [n] : y</span><br><span class="line">Applying... Done!</span><br><span class="line">-I- Please reboot machine to load new configurations.</span><br></pre></td></tr></table></figure><p>Note that P1 and P2 refer to the two separate ports on the adapter. As the output shows, the value <code>1</code> selects InfiniBand and <code>2</code> selects Ethernet. 
<strong>Attention: Please make sure the network switch is capable of handling InfiniBand or Ethernet frames before altering the work mode.</strong> If the switch cannot recognize the data frames sent from the server, you might observe <code>Physical state: Polling</code> reported by <code>ibstat</code>, as the packets are not forwarded by the switch correctly. Certain network switches can only forward one type of data frame at a time, which means you may need to manually reconfigure the switch to let it work with the other type of data frame.</p><h2 id="configure-ipoib">Configure IPoIB</h2><p>By default, IPoIB is configured automatically when an IP address is assigned to the interface. The IP address can be managed by <code>netplan</code> or <code>NetworkManager</code>, depending on your Linux distro. As for the configuration file, there is no difference between InfiniBand and regular Ethernet adapters.</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Assign a static IP address with netplan for an InfiniBand interface</span></span><br><span class="line"><span class="attr">network:</span></span><br><span class="line">  <span class="attr">ethernets:</span></span><br><span class="line">    <span class="attr">ibp129s0:</span></span><br><span class="line">      <span class="attr">addresses:</span></span><br><span class="line">      <span class="bullet">-</span> <span class="number">192.168</span><span class="number">.7</span><span class="number">.100</span><span class="string">/24</span></span><br><span class="line">  <span class="attr">version:</span> <span class="number">2</span></span><br></pre></td></tr></table></figure><p>Once the above 
configuration is applied and the interface is brought up successfully, we can see that the <code>ib_ipoib</code> module is loaded.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">$ lsmod | grep ipoib</span><br><span class="line">ib_ipoib              180224  0</span><br><span class="line">$ ip a</span><br><span class="line">4: ibp129s0: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 2044 qdisc mq state UP group default qlen 256</span><br><span class="line">    link/infiniband ...</span><br><span class="line">    inet 192.168.7.100/24 brd 192.168.7.255 scope global ibp129s0</span><br><span class="line">       valid_lft forever preferred_lft forever</span><br></pre></td></tr></table></figure><p>If the IP address doesn't appear in <code>ip a</code>, check the status of the InfiniBand adapter and make sure its state is <code>Active</code> in <code>ibstat</code>. A common mistake is forgetting to start <code>opensm</code> / <code>opensmd</code>, which leaves the adapter stuck at <code>State: Initializing</code>. 
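</p><p>A quick way to check the port state is to filter the <code>ibstat</code> output for its state lines. The helper below is only a sketch: <code>ports_active</code> is a hypothetical name introduced here, and it assumes the usual <code>State:</code> / <code>Physical state:</code> layout of <code>ibstat</code> output.</p><figure class="highlight bash"><table><tr><td class="code"><pre><span class="line"># Hypothetical helper: print the state lines from `ibstat` output on stdin</span><br><span class="line"># and succeed only if at least one port exists and every port is Active.</span><br><span class="line">ports_active() {</span><br><span class="line">    awk '</span><br><span class="line">        /^[ \t]*State:/          { n++; if ($2 != "Active") bad++; print }</span><br><span class="line">        /^[ \t]*Physical state:/ { print }</span><br><span class="line">        END                      { exit (n == 0 || bad &gt; 0) }</span><br><span class="line">    '</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"># Usage on a machine with an InfiniBand adapter:</span><br><span class="line">#   ibstat | ports_active || echo "a port is not Active; is opensm running?"</span><br></pre></td></tr></table></figure><p>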
Note that <code>opensmd</code> does not start at boot by default.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Start OpenSM</span></span><br><span class="line">$ sudo opensm</span><br><span class="line"></span><br><span class="line"><span class="comment"># Start OpenSM As Daemon</span></span><br><span class="line">$ sudo service opensmd start <span class="comment"># Method 1</span></span><br><span class="line">$ sudo systemctl start opensmd <span class="comment"># Method 2</span></span><br><span class="line">$ sudo /etc/init.d/opensmd start <span class="comment"># Method 3</span></span><br></pre></td></tr></table></figure><h2 id="identify-roce-version">Identify RoCE Version</h2><p>The major difference between RoCEv1 and RoCEv2 is that RoCEv2 is encapsulated in UDP/IP and can therefore be routed across IP networks, while RoCEv1 is an Ethernet link-layer protocol that is forwarded by MAC address only. A fun fact is that RoCEv1 and RoCEv2 may be enabled simultaneously, and we can choose the version at runtime by selecting the corresponding Global Identifier (GID). 
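</p><p>The kernel exposes the RoCE type of each GID table entry in sysfs, which shows which GID belongs to which version. This is a minimal sketch, not a vendor tool: <code>list_gid_types</code> is a hypothetical helper, and the device name <code>mlx5_0</code> and port <code>1</code> in the usage line are assumptions that must be adapted to your adapter.</p><figure class="highlight bash"><table><tr><td class="code"><pre><span class="line"># Hypothetical helper: print "index&lt;TAB&gt;gid&lt;TAB&gt;type" for each readable GID entry</span><br><span class="line"># Usage: list_gid_types DEV PORT [SYSFS_ROOT]</span><br><span class="line">list_gid_types() {</span><br><span class="line">    base="${3:-/sys/class/infiniband}/$1/ports/$2"</span><br><span class="line">    for f in "$base"/gid_attrs/types/*; do</span><br><span class="line">        idx=${f##*/}</span><br><span class="line">        # Entries that cannot be read (e.g. unpopulated ones) are skipped</span><br><span class="line">        ver=$(cat "$f" 2&gt;/dev/null) || continue</span><br><span class="line">        gid=$(cat "$base/gids/$idx" 2&gt;/dev/null)</span><br><span class="line">        printf '%s\t%s\t%s\n' "$idx" "$gid" "$ver"</span><br><span class="line">    done</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line"># Usage (assumed device and port):</span><br><span class="line">#   list_gid_types mlx5_0 1</span><br></pre></td></tr></table></figure><p>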
Mellanox provides a script named <code>show_gids</code> that displays the RoCE version associated with each GID.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">$ show_gids</span><br><span class="line">DEV     PORT  INDEX  GID                      IPv4        VER  DEV</span><br><span class="line">---     ----  -----  ---                      ----------  ---  ---</span><br><span class="line">mlx5_0  1     0      fe80:0000:0000:0000:...              v1   ens1f0</span><br><span class="line">mlx5_0  1     1      fe80:0000:0000:0000:...              v2   ens1f0</span><br><span class="line">mlx5_0  1     2      0000:0000:0000:0000:...  11.0.0.201  v1   ens1f0</span><br><span class="line">mlx5_0  1     3      0000:0000:0000:0000:...  11.0.0.201  v2   ens1f0</span><br></pre></td></tr></table></figure><h2 id="check-adapter-speed">Check Adapter Speed</h2><p><code>ethtool</code> reports the link speed, and it works in both InfiniBand and Ethernet mode.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">$ ethtool ibp129s0</span><br><span class="line">Settings <span class="keyword">for</span> ibp129s0:</span><br><span class="line">...</span><br><span class="line">Speed: 56000Mb/s</span><br><span class="line">Duplex: Full</span><br><span class="line"></span><br><span class="line">$ ethtool ens1f0</span><br><span class="line">Settings <span class="keyword">for</span> ens1f0:</span><br><span class="line">...</span><br><span 
class="line">Speed: 100000Mb/s</span><br><span class="line">Duplex: Full</span><br></pre></td></tr></table></figure><h2 id="references">References</h2><ul><li><a href="https://www.advancedclustering.com/act_kb/infiniband-port-states/">https://www.advancedclustering.com/act_kb/infiniband-port-states/</a></li><li><a href="https://zhuanlan.zhihu.com/p/32105832">https://zhuanlan.zhihu.com/p/32105832</a></li><li><a href="https://wiki.archlinux.org/title/InfiniBand">https://wiki.archlinux.org/title/InfiniBand</a></li><li><a href="https://docs.nvidia.com/networking/display/MLNXOFEDv461000/OpenSM">https://docs.nvidia.com/networking/display/MLNXOFEDv461000/OpenSM</a></li><li><a href="https://www.cnblogs.com/juzib/p/13273380.html">https://www.cnblogs.com/juzib/p/13273380.html</a></li><li><a href="https://blog.51cto.com/liangchaoxi/4044293">https://blog.51cto.com/liangchaoxi/4044293</a></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;After reconfiguring clusters from scratch several times, it seems that I am gradually adapting to this mysterious and strange InfiniBand world...&lt;/p&gt;</summary>
    
    
    
    
  </entry>
  
  <entry>
    <title>Building OpenWrt from Scratch for ARM64 UEFI ACPI VM</title>
    <link href="https://blog.mylab.cc/2022/02/26/Building-OpenWrt-from-Scratch-for-ARM64-UEFI-ACPI-VM/"/>
    <id>https://blog.mylab.cc/2022/02/26/Building-OpenWrt-from-Scratch-for-ARM64-UEFI-ACPI-VM/</id>
    <published>2022-02-26T09:34:48.000Z</published>
    <updated>2022-02-26T18:16:15.842Z</updated>
    
    <content type="html"><![CDATA[<p>OpenWrt doesn't provide a combined disk image for ARM virtual machines as it does for x86 VMs. Meanwhile, the official ARM64 kernel release can't boot in a UEFI environment. But we can still make it work by compiling it from source and building a disk image manually.</p><p>The Arm community has various approaches to booting an Arm machine, such as UEFI + ACPI (widely used by commercial Arm servers as well as modern x86 systems), U-Boot + Device Tree (mostly used by embedded devices with limited resources), and even UEFI + Device Tree (like the Huawei L420 notebook I owned). Since OpenWrt is designed to run on tiny routers, I would not expect it to provide official support for UEFI + ACPI systems any time soon.</p><h2 id="compile-kernel-and-rootfs-from-source">Compile Kernel and Rootfs from Source</h2><p>Don't be scared. With the help of <code>buildroot</code>, which automatically prepares the cross-compilation toolchain we need, this step is much simpler nowadays.</p><blockquote><p>Note: My test environment is Ubuntu 21.10 ARM64 on Apple M1 Pro. 
It doesn't matter if you use a machine with a different system or architecture like AMD64, but you may need to take a few extra steps if so.</p></blockquote><h3 id="install-dependencies">Install Dependencies</h3><p>For Debian / Ubuntu users,</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">sudo apt update</span><br><span class="line">sudo apt install build-essential ccache ecj fastjar file g++ gawk \</span><br><span class="line">gettext git java-propose-classpath libelf-dev libncurses5-dev \</span><br><span class="line">libncursesw5-dev libssl-dev python python2.7-dev python3 unzip wget \</span><br><span class="line">python3-distutils python3-setuptools python3-dev rsync subversion \</span><br><span class="line">swig time xsltproc zlib1g-dev </span><br></pre></td></tr></table></figure><blockquote><p>Note: The content of this sub-section is copied from the <a href="https://openwrt.org/docs/guide-developer/toolchain/install-buildsystem">official guide</a>. 
Take a look at it if this command is not applicable for your system.</p></blockquote><h3 id="download-the-source-code">Download the Source Code</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">git <span class="built_in">clone</span> https://git.openwrt.org/openwrt/openwrt.git</span><br><span class="line"><span class="built_in">cd</span> openwrt</span><br><span class="line">git tag</span><br><span class="line">git checkout v21.02.2</span><br><span class="line">./scripts/feeds update -a</span><br></pre></td></tr></table></figure><h3 id="configure-the-project">Configure the Project</h3><ol type="1"><li>Import the official configuration.</li></ol><p>To save our effort, it is a good idea to modify an existing configuration instead of creating a new one.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">wget https://downloads.openwrt.org/releases/21.02.2/targets/armvirt/64/config.buildinfo</span><br><span class="line">cp config.buildinfo .config</span><br></pre></td></tr></table></figure><ol start="2" type="1"><li>Add UEFI ACPI support.</li></ol><p>Open file <code>target/linux/armvirt/config-5.4</code>, and append the following lines to the end of file.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">CONFIG_EFI_STUB=y</span><br><span class="line">CONFIG_EFI=y</span><br><span class="line">CONFIG_EFI_VARS=y</span><br><span class="line">CONFIG_ARCH_SUPPORTS_ACPI=y</span><br><span 
class="line">CONFIG_ACPI=y</span><br></pre></td></tr></table></figure><ol start="3" type="1"><li>Launch Menuconfig.</li></ol><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">make menuconfig</span><br><span class="line">make kernel_menuconfig</span><br></pre></td></tr></table></figure><p>Tweak the configuration as you like, but make sure you understand the consequences before turning anything on or off. Keeping the default options is also fine.</p><blockquote><p>Note: These commands will build the whole toolchain from source the first time they are executed. The compilation process is very slow.</p></blockquote><h3 id="build-the-kernel-and-rootfs">Build the Kernel and Rootfs</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">make -j $(nproc) defconfig download clean world</span><br></pre></td></tr></table></figure><p>It will compile the kernel and all of the selected pre-installed utilities, then generate an EFI binary of the Linux kernel and an Ext4 / SquashFS partition image of the Rootfs.</p><h3 id="verify-the-firmware-image">Verify the Firmware Image</h3><p>Now comes the exciting moment. Let's test the kernel and rootfs we just built.</p><ol type="1"><li>Install QEMU.</li></ol><p>For Ubuntu users, I would suggest installing virt-manager instead, which offers a helpful GUI wizard for QEMU.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sudo apt install virt-manager</span><br></pre></td></tr></table></figure><ol start="2" type="1"><li>Launch a virtual machine.</li></ol><p>The magical QEMU allows virtual machines to boot a kernel without a bootloader. 
This great feature lets us test the kernel's functionality at an early stage.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">qemu-system-aarch64 -m 512 -nographic -cpu cortex-a72 -smp 1 -M virt -kernel ~/openwrt/bin/targets/armvirt/64/openwrt-21.02.2-armvirt-64-Image-initramfs -bios /usr/share/qemu-efi-aarch64/QEMU_EFI.fd</span><br></pre></td></tr></table></figure><blockquote><p>Note: <code>Image-initramfs</code> is the kernel binary with OpenWrt's Rootfs embedded as an <code>initramfs</code>, so this virtual machine will lose its data each time it reboots.</p></blockquote><blockquote><p>Note: If you encounter the following error,</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">EFI stub: Booting Linux Kernel...</span><br><span class="line">EFI stub: ERROR: Failed to relocate kernel</span><br><span class="line">EFI stub: ERROR: Failed to relocate kernel</span><br></pre></td></tr></table></figure><p>The solution is to increase the memory size of your virtual machine. 
Empirically, it should be at least 256 MB.</p></blockquote><h2 id="build-the-disk-image">Build the Disk Image</h2><p>Since data loss is not acceptable, and not every hypervisor is capable of launching a kernel directly, we should put everything we built onto a disk, or rather a virtual machine's disk image.</p><p>To keep things simple, let's start by building a raw disk image, which is one of the virtual disk formats supported by QEMU.</p><h3 id="create-an-empty-disk-image">Create an Empty Disk Image</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~$ dd <span class="keyword">if</span>=/dev/zero of=disk.img bs=1M count=1024</span><br><span class="line">1024+0 records <span class="keyword">in</span></span><br><span class="line">1024+0 records out</span><br><span class="line">1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.958229 s, 1.1 GB/s</span><br></pre></td></tr></table></figure><p>This command will create an empty disk image. Feel free to change the value of <code>count</code> to change the size of the disk. 
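</p><p>If you would rather not write a gigabyte of zeros, an image of the same size can be created as a sparse file. The sketch below assumes GNU <code>dd</code> and <code>stat</code>; <code>make_image</code> is a hypothetical helper name, not part of any tool.</p><figure class="highlight bash"><table><tr><td class="code"><pre><span class="line"># Hypothetical helper: create a sparse raw image and print its size in bytes</span><br><span class="line"># Usage: make_image FILE SIZE_IN_MIB</span><br><span class="line">make_image() {</span><br><span class="line">    # count=0 copies no data; seek makes dd extend the file to bs * seek bytes</span><br><span class="line">    dd if=/dev/zero of="$1" bs=1M count=0 seek="$2" 2&gt;/dev/null</span><br><span class="line">    stat -c %s "$1"</span><br><span class="line">}</span><br><span class="line"></span><br><span class="line">make_image disk.img 1024   # prints 1073741824</span><br></pre></td></tr></table></figure><p>For the <code>dd</code> command at the top of this section, <code>bs</code> multiplied by <code>count</code> gives the image size 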
(size = 1 MiB * 1024 = 1 GiB)</p><h3 id="partition-mount-and-format-the-disk-image">Partition, Mount, and Format the Disk Image</h3><ol type="1"><li>Partition the disk.</li></ol><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~$ fdisk disk.img</span><br><span class="line"></span><br><span class="line">Welcome to fdisk (util-linux 2.36.1).</span><br><span class="line">Changes will remain <span class="keyword">in</span> memory only, until you decide to 
write them.</span><br><span class="line">Be careful before using the write <span class="built_in">command</span>.</span><br><span class="line"></span><br><span class="line">Device does not contain a recognized partition table.</span><br><span class="line">Created a new DOS disklabel with disk identifier 0xb89701e3.</span><br><span class="line"></span><br><span class="line">Command (m <span class="keyword">for</span> <span class="built_in">help</span>): g</span><br><span class="line">Created a new GPT disklabel (GUID: 43B50BB3-20FD-3D4B-BFE1-50B5016F8059).</span><br><span class="line"></span><br><span class="line">Command (m <span class="keyword">for</span> <span class="built_in">help</span>): n</span><br><span class="line">Partition number (1-128, default 1):</span><br><span class="line">First sector (2048-2097118, default 2048):</span><br><span class="line">Last sector, +/-sectors or +/-size&#123;K,M,G,T,P&#125; (2048-2097118, default 2097118): +100M</span><br><span class="line"></span><br><span class="line">Created a new partition 1 of <span class="built_in">type</span> <span class="string">&#x27;Linux filesystem&#x27;</span> and of size 100 MiB.</span><br><span class="line"></span><br><span class="line">Command (m <span class="keyword">for</span> <span class="built_in">help</span>): t</span><br><span class="line">Selected partition 1</span><br><span class="line">Partition <span class="built_in">type</span> or <span class="built_in">alias</span> (<span class="built_in">type</span> L to list all): uefi</span><br><span class="line">Changed <span class="built_in">type</span> of partition <span class="string">&#x27;Linux filesystem&#x27;</span> to <span class="string">&#x27;EFI System&#x27;</span>.</span><br><span class="line"></span><br><span class="line">Command (m <span class="keyword">for</span> <span class="built_in">help</span>): n</span><br><span class="line">Partition number (2-128, default 2):</span><br><span class="line">First sector (206848-2097118, 
default 206848):</span><br><span class="line">Last sector, +/-sectors or +/-size&#123;K,M,G,T,P&#125; (206848-2097118, default 2097118):</span><br><span class="line"></span><br><span class="line">Created a new partition 2 of <span class="built_in">type</span> <span class="string">&#x27;Linux filesystem&#x27;</span> and of size 923 MiB.</span><br><span class="line"></span><br><span class="line">Command (m <span class="keyword">for</span> <span class="built_in">help</span>): p</span><br><span class="line">Disk disk.img: 1 GiB, 1073741824 bytes, 2097152 sectors</span><br><span class="line">Units: sectors of 1 * 512 = 512 bytes</span><br><span class="line">Sector size (logical/physical): 512 bytes / 512 bytes</span><br><span class="line">I/O size (minimum/optimal): 512 bytes / 512 bytes</span><br><span class="line">Disklabel <span class="built_in">type</span>: gpt</span><br><span class="line">Disk identifier: 43B50BB3-20FD-3D4B-BFE1-50B5016F8059</span><br><span class="line"></span><br><span class="line">Device      Start     End Sectors  Size Type</span><br><span class="line">disk.img1    2048  206847  204800  100M EFI System</span><br><span class="line">disk.img2  206848 2097118 1890271  923M Linux filesystem</span><br><span class="line"></span><br><span class="line">Command (m <span class="keyword">for</span> <span class="built_in">help</span>): w</span><br><span class="line">The partition table has been altered.</span><br><span class="line">Syncing disks.</span><br></pre></td></tr></table></figure><p>A new GPT partition table with two partitions is written to the disk image.</p><ol start="2" type="1"><li>Mount the disk image as a logical disk.</li></ol><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~$ 
sudo losetup -Pf disk.img</span><br><span class="line">tonny@vm:~$ lsblk</span><br><span class="line">NAME                      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT</span><br><span class="line">loop5                       7:5    0    1G  0 loop</span><br><span class="line">├─loop5p1                 259:4    0  100M  0 part</span><br><span class="line">└─loop5p2                 259:5    0  923M  0 part</span><br></pre></td></tr></table></figure><p>The OS has recognized the two partitions, <code>loop5p1</code> and <code>loop5p2</code> (the loop device number may differ on your system).</p><ol start="3" type="1"><li>Format the partitions.</li></ol><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~/mnt$ sudo mkfs.vfat /dev/loop5p1</span><br><span class="line">mkfs.fat 4.2 (2021-01-31)</span><br></pre></td></tr></table></figure><p>We don't need to format the second partition (Rootfs) for now, because we will restore the Rootfs partition image onto it instead, and that image is already formatted with the Ext4 file system.</p><ol start="4" type="1"><li>Mount the ESP partition.</li></ol><p>The ESP partition contains the EFI executables of bootloaders (e.g., GRUB), as well as their configuration files. We can also put the kernel binary here.</p><blockquote><p>Note: Some Linux distributions, like Ubuntu, will put their kernel in a third partition.</p></blockquote><p>Unlike the Rootfs, the OpenWrt Build System won't generate an ESP partition image for the ARM64 platform. 
That means we have to build the ESP partition manually.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~/mnt$ mkdir -p ~/mnt/esp</span><br><span class="line">tonny@vm:~/mnt$ sudo mount /dev/loop5p1 ~/mnt/esp</span><br></pre></td></tr></table></figure><h3 id="restore-rootfs-partition-image">Restore Rootfs Partition Image</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~$ sudo dd <span class="keyword">if</span>=~/openwrt/bin/targets/armvirt/64/openwrt-21.02.2-armvirt-64-rootfs-ext4.img of=/dev/loop5p2 bs=1M</span><br><span class="line">104+0 records <span class="keyword">in</span></span><br><span class="line">104+0 records out</span><br><span class="line">109051904 bytes (109 MB, 104 MiB) copied, 0.613318 s, 178 MB/s</span><br><span class="line">tonny@vm:~$ sudo resize2fs /dev/loop5p2</span><br><span class="line">resize2fs 1.46.3 (27-Jul-2021)</span><br><span class="line">Resizing the filesystem on /dev/loop5p2 to 236283 (4k) blocks.</span><br><span class="line">The filesystem on /dev/loop5p2 is now 236283 (4k) blocks long.</span><br></pre></td></tr></table></figure><p>The Rootfs image is about 104 MiB, as the <code>dd</code> output shows, which means the file system inside assumes a partition of about that size. 
Our Rootfs partition is likely larger than that, so we run <code>resize2fs</code> to grow the file system to the full size of the partition.</p><h3 id="install-grub-to-esp-partition">Install GRUB to ESP Partition</h3><h4 id="install-arm64-grub-to-host">Install ARM64 GRUB to Host</h4><p>For Ubuntu users,</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sudo apt install grub-efi-arm64-bin</span><br></pre></td></tr></table></figure><blockquote><p>Note: If your Host's architecture isn't ARM64, Apt may fail to find this package. Fortunately, thanks to the Multiarch feature, we can easily install packages built for other architectures. Take Ubuntu AMD64 as an example.</p><ol type="1"><li>Enable ARM64 as an additional architecture.</li></ol><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sudo dpkg --add-architecture arm64</span><br></pre></td></tr></table></figure><ol start="2" type="1"><li>Add an Apt Repository for ARM64.</li></ol><p>Modify the file <code>/etc/apt/sources.list</code> and add an ARM64 repository. Note that ARM64 and AMD64 don't share the same repository, so we also need to add an architecture filter to each repository entry. 
Here is an example.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line">deb [ arch=amd64 ] https://mirrors.ustc.edu.cn/ubuntu/ impish main restricted universe multiverse</span><br><span class="line"><span class="comment"># deb-src [ arch=amd64 ] https://mirrors.ustc.edu.cn/ubuntu/ impish main restricted universe multiverse</span></span><br><span class="line"></span><br><span class="line">deb [ arch=amd64 ] https://mirrors.ustc.edu.cn/ubuntu/ impish-security main restricted universe multiverse</span><br><span class="line"><span class="comment"># deb-src [ arch=amd64 ] https://mirrors.ustc.edu.cn/ubuntu/ impish-security main restricted universe multiverse</span></span><br><span class="line"></span><br><span class="line">deb [ arch=amd64 ] https://mirrors.ustc.edu.cn/ubuntu/ impish-updates main restricted universe multiverse</span><br><span class="line"><span class="comment"># deb-src [ arch=amd64 ] https://mirrors.ustc.edu.cn/ubuntu/ impish-updates main restricted universe multiverse</span></span><br><span class="line"></span><br><span class="line">deb [ arch=amd64 ] https://mirrors.ustc.edu.cn/ubuntu/ impish-backports main restricted universe multiverse</span><br><span 
class="line"><span class="comment"># deb-src [ arch=amd64 ] https://mirrors.ustc.edu.cn/ubuntu/ impish-backports main restricted universe multiverse</span></span><br><span class="line"></span><br><span class="line">deb [ arch=arm64 ] https://mirrors.ustc.edu.cn/ubuntu-ports/ impish main restricted universe multiverse</span><br><span class="line"><span class="comment"># deb-src [ arch=arm64 ] https://mirrors.ustc.edu.cn/ubuntu-ports/ impish main restricted universe multiverse</span></span><br><span class="line"></span><br><span class="line">deb [ arch=arm64 ] https://mirrors.ustc.edu.cn/ubuntu-ports/ impish-security main restricted universe multiverse</span><br><span class="line"><span class="comment"># deb-src [ arch=arm64 ] https://mirrors.ustc.edu.cn/ubuntu-ports/ impish-security main restricted universe multiverse</span></span><br><span class="line"></span><br><span class="line">deb [ arch=arm64 ] https://mirrors.ustc.edu.cn/ubuntu-ports/ impish-updates main restricted universe multiverse</span><br><span class="line"><span class="comment"># deb-src [ arch=arm64 ] https://mirrors.ustc.edu.cn/ubuntu-ports/ impish-updates main restricted universe multiverse</span></span><br><span class="line"></span><br><span class="line">deb [ arch=arm64 ] https://mirrors.ustc.edu.cn/ubuntu-ports/ impish-backports main restricted universe multiverse</span><br><span class="line"><span class="comment"># deb-src [ arch=arm64 ] https://mirrors.ustc.edu.cn/ubuntu-ports/ impish-backports main restricted universe multiverse</span></span><br></pre></td></tr></table></figure><ol start="3" type="1"><li>Install ARM64 GRUB</li></ol><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">sudo apt update</span><br><span class="line">sudo apt install grub-efi-arm64-bin</span><br></pre></td></tr></table></figure></blockquote><h4 id="generate-efi-executable">Generate EFI 
Executable</h4><ol type="1"><li>Check the Partitions' UUIDs.</li></ol><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~$ lsblk -o PATH,UUID,PARTUUID /dev/loop5</span><br><span class="line">PATH         UUID                                 PARTUUID</span><br><span class="line">/dev/loop5</span><br><span class="line">/dev/loop5p1 CF95-2044                            3754ccb7-1920-2b41-9962-af81ac6a04b2</span><br><span class="line">/dev/loop5p2 ff313567-e9f1-5a5d-9895-3ba130b4a864 e09a20c3-0ea7-0c48-b653-0482facd93db</span><br></pre></td></tr></table></figure><p>These UUIDs will be referenced by the GRUB configuration files.</p><ol start="2" type="1"><li>Write Early-stage GRUB Configuration.</li></ol><p>Create a new file <code>~/grub-early.cfg</code>, and write the following lines. 
This configuration will be hardcoded into GRUB's EFI binary.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">search.fs_uuid CF95-2044 root</span><br><span class="line"><span class="built_in">set</span> prefix=(<span class="variable">$root</span>)<span class="string">&#x27;/boot&#x27;</span></span><br><span class="line">configfile <span class="variable">$prefix</span>/grub.cfg</span><br></pre></td></tr></table></figure><p>Replace the UUID with your <code>loop5p1</code>'s.</p><ol start="3" type="1"><li>Make GRUB EFI Executable.</li></ol><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># tonny@vm:~$ sudo mount /dev/loop5p1 ~/mnt/esp</span></span><br><span class="line">tonny@vm:~$ sudo mkdir -p ~/mnt/esp/EFI/BOOT/</span><br><span class="line">tonny@vm:~$ <span class="built_in">cd</span> ~/mnt/esp/EFI/BOOT/</span><br><span class="line">tonny@vm:~/mnt/esp/EFI/BOOT$ sudo grub-mkimage -c ~/grub-early.cfg -p /boot -o BOOTAA64.EFI -O arm64-efi boot chain configfile fat linux ls part_gpt reboot serial efi_gop search_fs_uuid</span><br></pre></td></tr></table></figure><blockquote><p>Note: It is not recommended to use <code>grub-install</code> here. 
A typical invocation is:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sudo grub-install --target=arm64-efi --efi-directory ~/mnt/esp --bootloader-id=GRUB --boot-directory ~/mnt/esp/boot/</span><br></pre></td></tr></table></figure><p>The hidden catch is that, if you use the GRUB provided by Ubuntu, this command hardcodes an important GRUB variable, <code>prefix='/EFI/ubuntu'</code>, into the EFI binary, and there is no way to change it.</p></blockquote><h4 id="write-second-stage-grub-configuration">Write Second-stage GRUB Configuration</h4><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~/mnt/esp/EFI/BOOT$ <span class="built_in">cd</span> ../..</span><br><span class="line">tonny@vm:~/mnt/esp$ sudo mkdir boot</span><br><span class="line">tonny@vm:~/mnt/esp$ <span class="built_in">cd</span> boot/</span><br><span class="line">tonny@vm:~/mnt/esp/boot$ sudo nano grub.cfg <span class="comment"># or any other text editor you are comfortable with</span></span><br></pre></td></tr></table></figure><p>The content of <code>grub.cfg</code> is:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line">serial --unit=0 --speed=115200 --word=8 --parity=no --stop=1 --rtscts=off</span><br><span class="line">terminal_input console serial; 
terminal_output console serial</span><br><span class="line"></span><br><span class="line"><span class="built_in">set</span> default=<span class="string">&quot;0&quot;</span></span><br><span class="line"><span class="built_in">set</span> timeout=<span class="string">&quot;5&quot;</span></span><br><span class="line"></span><br><span class="line">menuentry <span class="string">&quot;OpenWrt&quot;</span> &#123;</span><br><span class="line">linux /boot/vmlinuz root=PARTUUID=e09a20c3-0ea7-0c48-b653-0482facd93db rootwait   console=tty0 console=ttyS0,115200n8 noinitrd</span><br><span class="line">&#125;</span><br><span class="line">menuentry <span class="string">&quot;OpenWrt (failsafe)&quot;</span> &#123;</span><br><span class="line">linux /boot/vmlinuz failsafe=<span class="literal">true</span> root=PARTUUID=e09a20c3-0ea7-0c48-b653-0482facd93db rootwait   console=tty0 console=ttyS0,115200n8 noinitrd</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Replace <strong>PARTUUIDs</strong> (not UUIDs) with your <code>loop5p2</code>'s.</p><h3 id="copy-linux-kernel">Copy Linux Kernel</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~/mnt/esp/boot$ sudo cp ~/openwrt/bin/targets/armvirt/64/openwrt-21.02.2-armvirt-64-Image vmlinuz</span><br></pre></td></tr></table></figure><h3 id="verify-the-disk-image">Verify the Disk Image</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~/mnt/esp/boot$ <span class="built_in">cd</span> ~</span><br><span class="line">tonny@vm:~$ qemu-system-aarch64 -m 512 -nographic -cpu cortex-a72 -smp 1 -M virt -bios /usr/share/qemu-efi-aarch64/QEMU_EFI.fd -drive format=raw,file=disk.img</span><br></pre></td></tr></table></figure><p>If everything goes well, you could see your 
kernel is running happily. Enjoy it!</p><blockquote><p>Note: You don't have to unmount the disk before launching the virtual machine. However, you should sync the disk to make sure all the data cached in memory is written back.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~/mnt/esp/boot$ sync</span><br></pre></td></tr></table></figure></blockquote><h3 id="clean-up">Clean up</h3><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~/mnt$ sudo umount ~/mnt/esp</span><br><span class="line">tonny@vm:~/mnt$ sudo losetup -d /dev/loop5</span><br></pre></td></tr></table></figure><h2 id="launch-vm-with-virt-manager">Launch VM with Virt-Manager</h2><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~/mnt$ virt-manager</span><br></pre></td></tr></table></figure><blockquote><p>Note: Sometimes <code>virt-manager</code> requires extra permissions to run.</p></blockquote><p>The recommended configuration:</p><ul><li>Step 1:<ul><li>Architecture: <code>aarch64</code></li><li>Machine Type: <code>virt</code></li><li>Import existing disk image</li></ul></li><li>Step 2:<ul><li>Browse ➡️ Add pool <code>Home</code> ➡️ Choose Volume <code>disk.img</code></li><li>Choose OS: Generic Linux / OS</li></ul></li><li>Step 3:<ul><li>Memory: &gt;= 256 MB</li><li>CPU: Any</li></ul></li><li>Step 4:<ul><li><strong>Customize configuration before install</strong></li><li>Network (LAN Port): Bridge / Macvtap Bridge</li></ul></li><li>Configuration<ul><li><strong>Overview/Firmware: <code>UEFI aarch64</code></strong></li></ul></li></ul><blockquote><p>Note: You can't change the firmware type once you leave the pre-install configuration.</p></blockquote><h2 
id="references">References</h2><ul><li><a href="https://gist.github.com/tstellanova/dea7593a7dfe4f48432a58cb007e7056">https://gist.github.com/tstellanova/dea7593a7dfe4f48432a58cb007e7056</a></li><li><a href="https://forum.openwrt.org/t/arm64-armvirt64-uefi-efi-openwrt-target/82740">https://forum.openwrt.org/t/arm64-armvirt64-uefi-efi-openwrt-target/82740</a></li><li><a href="https://forum.openwrt.org/t/how-to-install-openwrt-as-a-new-os-in-the-grub-menu/97465">https://forum.openwrt.org/t/how-to-install-openwrt-as-a-new-os-in-the-grub-menu/97465</a></li><li><a href="https://wiki.ubuntu.com/ARM64/QEMU">https://wiki.ubuntu.com/ARM64/QEMU</a></li><li><a href="https://wiki.archlinux.org/title/GRUB">https://wiki.archlinux.org/title/GRUB</a></li><li><a href="https://openwrt.org/docs/guide-user/virtualization/qemu">https://openwrt.org/docs/guide-user/virtualization/qemu</a></li><li><a href="https://krinkinmu.github.io/2020/11/21/EFI-aarch64.html">https://krinkinmu.github.io/2020/11/21/EFI-aarch64.html</a></li><li><a href="https://soha.moe/post/make-uefi-compatible-openwrt-disk-image.html">https://soha.moe/post/make-uefi-compatible-openwrt-disk-image.html</a></li><li><a href="https://git.openwrt.org/?p=openwrt/openwrt.git;hb=refs/heads/openwrt-21.02;a=blob;f=package/boot/grub2/Makefile">https://git.openwrt.org/?p=openwrt/openwrt.git;hb=refs/heads/openwrt-21.02;a=blob;f=package/boot/grub2/Makefile</a></li><li><a href="https://www.cxyzjd.com/article/u010875635/74289971">https://www.cxyzjd.com/article/u010875635/74289971</a></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;OpenWrt doesn&#39;t provide a combined disk image for ARM virtual machines, unlike what it does for x86 VMs. Meanwhile, its official ARM64 kernel release can&#39;t boot in a UEFI environment. But we can still make it work by compiling it from source and building a disk image manually.&lt;/p&gt;</summary>
    
    
    
    
  </entry>
  
  <entry>
    <title>How to resize the root LVM partition of Ubuntu</title>
    <link href="https://blog.mylab.cc/2022/02/26/How-to-resize-the-root-LVM-partition-of-Ubuntu/"/>
    <id>https://blog.mylab.cc/2022/02/26/How-to-resize-the-root-LVM-partition-of-Ubuntu/</id>
    <published>2022-02-26T09:32:57.000Z</published>
    <updated>2022-02-26T09:36:05.312Z</updated>
    
    <content type="html"><![CDATA[<p>When we resize the virtual hard disk of a virtual machine or restore a disk image to a larger disk, the free space of the partition detected by Ubuntu does not increase because the partition table is unchanged. In the past, we could easily resize the ext4 root partition with the help of <code>resize2fs</code>. However, things get more complex now that Ubuntu uses an LVM volume as its default root partition.</p><h2 id="quick-intro-to-lvm">Quick Intro to LVM</h2><p>Logical Volume Manager (LVM) is similar to Dynamic Disks on Windows. It takes several GPT / MBR partitions, possibly on different hard disks, and combines them into a storage pool (LVM calls it a Volume Group, VG). It then allocates space from this pool, and Linux recognizes each allocation (LVM calls it a Logical Volume, LV) as a usable partition.</p><figure><img data-src="/images/pasted-99.png" alt="LVM Layout" /><figcaption>LVM Layout</figcaption></figure><p>Thus, we should modify not only <strong>the GPT / MBR partition table</strong>, but also <strong>the LVM configuration</strong>.</p><h2 id="update-gpt-mbr-partition-table">Update GPT / MBR partition table</h2><p><strong>I suggest performing all the operations in a live CD environment to avoid unpredictable problems.</strong> I have not tested online resizing of the root partition so far.</p><ol type="1"><li>The instructions in this section assume <strong>the last partition on your disk is the LVM Physical Volume</strong>. You can verify this with the command <code>lsblk --fs</code>. In the output below, 
<code>nvme0n1p3</code> is the last GPT partition on the disk <code>nvme0n1</code>, and it is easy to identify this partition is a LVM PV, and <code>ubuntu--vg-ubuntu--lv</code> is the corresponding LV.</li></ol><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~$ lsblk --fs</span><br><span class="line">NAME                      FSTYPE      FSVER        </span><br><span class="line">loop3</span><br><span class="line">└─loop3p1                 LVM2_member LVM2 001 </span><br><span class="line">  └─test--vg-test--lv     ext4        1.0      </span><br><span class="line">nvme0n1</span><br><span class="line">├─nvme0n1p1               vfat        FAT32    </span><br><span class="line">├─nvme0n1p2               ext4        1.0      </span><br><span class="line">└─nvme0n1p3               LVM2_member LVM2 001 </span><br><span class="line">  └─ubuntu--vg-ubuntu--lv ext4        1.0      </span><br></pre></td></tr></table></figure><blockquote><p>Note that <code>ubuntu--vg-ubuntu--lv</code> is the root partition of the system here.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~$ df -Th</span><br><span class="line">Filesystem                        Type   Size  Used Avail Use% Mounted on</span><br><span class="line">/dev/mapper/ubuntu--vg-ubuntu--lv ext4    78G  7.7G   67G  11% /</span><br><span class="line">/dev/nvme0n1p2           
         ext4   974M   87M  820M  10% /boot</span><br><span class="line">/dev/nvme0n1p1                    vfat   511M  3.6M  508M   1% /boot/efi</span><br><span class="line">/dev/mapper/test--vg-test--lv     ext4   464M   24K  429M   1% /home/tonny/mnt</span><br></pre></td></tr></table></figure><p>You can also check the LVM Volume Group status with <code>vgs</code>.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~$ sudo vgs</span><br><span class="line">  VG        <span class="comment">#PV #LV #SN Attr   VSize   VFree</span></span><br><span class="line">  test-vg     1   1   0 wz--n- 496.00m    0</span><br><span class="line">  ubuntu-vg   1   1   0 wz--n- &lt;78.50g    0</span><br></pre></td></tr></table></figure></blockquote><ol start="2" type="1"><li>Update the GPT / MBR partition table using <code>fdisk</code>. I will use an emulated disk <code>/dev/loop3</code> to demonstrate the whole process. Don't worry, you won't lose your data under normal circumstances. 
These commands only modify the partition table, but make sure you <strong>DO NOT remove the LVM signature</strong> when <code>fdisk</code> asks; otherwise the system may no longer recognize your LVM PV.</li></ol><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~$ sudo fdisk /dev/loop3 <span class="comment"># replace with your disk (the whole disk, not the partition), such as /dev/nvme0n1</span></span><br><span class="line"></span><br><span 
class="line">Welcome to fdisk (util-linux 2.36.1).</span><br><span class="line">Changes will remain <span class="keyword">in</span> memory only, until you decide to write them.</span><br><span class="line">Be careful before using the write <span class="built_in">command</span>.</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">Command (m <span class="keyword">for</span> <span class="built_in">help</span>): p</span><br><span class="line">Disk /dev/loop3: 1 GiB, 1073741824 bytes, 2097152 sectors</span><br><span class="line">Units: sectors of 1 * 512 = 512 bytes</span><br><span class="line">Sector size (logical/physical): 512 bytes / 512 bytes</span><br><span class="line">I/O size (minimum/optimal): 512 bytes / 512 bytes</span><br><span class="line">Disklabel <span class="built_in">type</span>: gpt</span><br><span class="line">Disk identifier: C5F55056-8C56-5448-81E4-567F59AD93ED</span><br><span class="line"></span><br><span class="line">Device       Start     End Sectors  Size Type</span><br><span class="line">/dev/loop3p1  2048 1026047 1024000  500M Linux filesystem</span><br><span class="line"></span><br><span class="line">Command (m <span class="keyword">for</span> <span class="built_in">help</span>): d</span><br><span class="line">Selected partition 1</span><br><span class="line">Partition 1 has been deleted.</span><br><span class="line"></span><br><span class="line">Command (m <span class="keyword">for</span> <span class="built_in">help</span>): n</span><br><span class="line">Partition number (1-128, default 1):</span><br><span class="line">First sector (2048-2097118, default 2048):</span><br><span class="line">Last sector, +/-sectors or +/-size&#123;K,M,G,T,P&#125; (2048-2097118, default 2097118):</span><br><span class="line"></span><br><span class="line">Created a new partition 1 of <span class="built_in">type</span> <span class="string">&#x27;Linux filesystem&#x27;</span> and of size 1023 MiB.</span><br><span 
class="line">Partition <span class="comment">#1 contains a LVM2_member signature.</span></span><br><span class="line"></span><br><span class="line">Do you want to remove the signature? [Y]es/[N]o: N</span><br><span class="line"></span><br><span class="line">Command (m <span class="keyword">for</span> <span class="built_in">help</span>): p</span><br><span class="line"></span><br><span class="line">Disk /dev/loop3: 1 GiB, 1073741824 bytes, 2097152 sectors</span><br><span class="line">Units: sectors of 1 * 512 = 512 bytes</span><br><span class="line">Sector size (logical/physical): 512 bytes / 512 bytes</span><br><span class="line">I/O size (minimum/optimal): 512 bytes / 512 bytes</span><br><span class="line">Disklabel <span class="built_in">type</span>: gpt</span><br><span class="line">Disk identifier: C5F55056-8C56-5448-81E4-567F59AD93ED</span><br><span class="line"></span><br><span class="line">Device       Start     End Sectors  Size Type</span><br><span class="line">/dev/loop3p1  2048 2097118 2095071 1023M Linux filesystem</span><br><span class="line"></span><br><span class="line">Command (m <span class="keyword">for</span> <span class="built_in">help</span>): w</span><br><span class="line">The partition table has been altered.</span><br><span class="line">Calling ioctl() to re-read partition table.</span><br><span class="line">Syncing disks.</span><br></pre></td></tr></table></figure><h2 id="update-lvm-configuration">Update LVM Configuration</h2><ol type="1"><li>Notify LVM there is an update on the partition table.</li></ol><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~$ sudo partprobe <span class="comment"># ask kernel to read the new partition table</span></span><br><span class="line">tonny@vm:~$ sudo pvresize /dev/loop3p1 <span class="comment"># replace 
with your partition</span></span><br><span class="line">  Physical volume <span class="string">&quot;/dev/loop3p1&quot;</span> changed</span><br><span class="line">  1 physical volume(s) resized or updated / 0 physical volume(s) not resized</span><br></pre></td></tr></table></figure><blockquote><p>At this moment, the LVM Volume Group status has changed to:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~$ sudo vgs</span><br><span class="line">  VG        <span class="comment">#PV #LV #SN Attr   VSize    VFree</span></span><br><span class="line">  test-vg     1   1   0 wz--n- 1020.00m 524.00m</span><br><span class="line">  ubuntu-vg   1   1   0 wz--n-  &lt;78.50g      0</span><br></pre></td></tr></table></figure><p>Observe that <code>VFree</code> of <code>test-vg</code> has increased from 0 to 524.00 MiB.</p></blockquote><ol start="2" type="1"><li>Resize the LVM Logical Volume. 
The following command will allocate all the free space of VG to the LV.</li></ol><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~$ sudo lvextend -l +100%FREE /dev/mapper/test--vg-test--lv</span><br><span class="line">  Size of logical volume test-vg/test-lv changed from 496.00 MiB (124 extents) to 1020.00 MiB (255 extents).</span><br><span class="line">  Logical volume test-vg/test-lv successfully resized.</span><br></pre></td></tr></table></figure><blockquote><p>The free space of VG is used up now.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~$ sudo vgs</span><br><span class="line">  VG        <span class="comment">#PV #LV #SN Attr   VSize    VFree</span></span><br><span class="line">  test-vg     1   1   0 wz--n- 1020.00m    0</span><br><span class="line">  ubuntu-vg   1   1   0 wz--n-  &lt;78.50g    0</span><br></pre></td></tr></table></figure></blockquote><h2 id="resize-ext4-file-system">Resize ext4 File System</h2><p>Up to now, although LVM LV is resized, the ext4 file system is not aware of the extra available space. 
Simply run <code>resize2fs</code> to let it know.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~$ sudo resize2fs /dev/mapper/test--vg-test--lv</span><br><span class="line">resize2fs 1.46.3 (27-Jul-2021)</span><br><span class="line">Filesystem at /dev/mapper/test--vg-test--lv is mounted on /home/tonny/mnt; on-line resizing required</span><br><span class="line">old_desc_blocks = 1, new_desc_blocks = 1</span><br><span class="line">The filesystem on /dev/mapper/test--vg-test--lv is now 261120 (4k) blocks long.</span><br></pre></td></tr></table></figure><blockquote><p>We can see the available space of <code>test--vg-test--lv</code> has been enlarged.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">tonny@vm:~$ df -h</span><br><span class="line">Filesystem                         Size  Used Avail Use% Mounted on</span><br><span class="line">/dev/mapper/test--vg-test--lv      973M  1.3M  917M   1% /home/tonny/mnt</span><br></pre></td></tr></table></figure></blockquote><h2 id="references">References</h2><ul><li><a href="https://www.linuxtechi.com/extend-lvm-partitions/">https://www.linuxtechi.com/extend-lvm-partitions/</a></li><li><a href="https://www.thegeekdiary.com/centos-rhel-how-to-extend-physical-volume-in-lvm-by-extending-the-disk-partition-used/">https://www.thegeekdiary.com/centos-rhel-how-to-extend-physical-volume-in-lvm-by-extending-the-disk-partition-used/</a></li><li><a href="https://i0.wp.com/manjaro.site/wp-content/uploads/2017/08/lvm-layout.png">https://i0.wp.com/manjaro.site/wp-content/uploads/2017/08/lvm-layout.png</a></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;When we resize the virtual hard disk of a virtual machine or restore a disk image to a larger disk, the free space of the partition detected by Ubuntu does not increase because the partition table is unchanged. In the past, we could easily resize the ext4 root partition with the help of &lt;code&gt;resize2fs&lt;/code&gt;. However, things get more complex now that Ubuntu uses an LVM volume as its default root partition.&lt;/p&gt;</summary>
    
    
    
    
  </entry>
  
  <entry>
    <title>Annual Blog Review and Notes on My Third Hexo Theme Overhaul</title>
    <link href="https://blog.mylab.cc/2022/01/31/%E5%8D%9A%E5%AE%A2%E5%B9%B4%E5%BA%A6%E6%80%BB%E7%BB%93%E6%97%A2-Hexo-%E7%AC%AC%E4%B8%89%E6%AC%A1%E9%AD%94%E6%94%B9%E8%AE%B0%E5%BD%95/"/>
    <id>https://blog.mylab.cc/2022/01/31/%E5%8D%9A%E5%AE%A2%E5%B9%B4%E5%BA%A6%E6%80%BB%E7%BB%93%E6%97%A2-Hexo-%E7%AC%AC%E4%B8%89%E6%AC%A1%E9%AD%94%E6%94%B9%E8%AE%B0%E5%BD%95/</id>
    <published>2022-01-30T18:18:52.000Z</published>
    <updated>2022-01-30T18:29:15.000Z</updated>
    
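    <content type="html"><![CDATA[<p>In the few days before Lunar New Year's Eve I finally had some free time. I had been unhappy with my old blog for half a year: the old theme looked ugly to me, and the blog system had been broken for a long time (although somehow people still managed to post comments, while I could not even log in myself). Then I suddenly noticed that the NexT theme had quietly moved to a new repository and had long since shipped a new major version, even switching its rendering backend to Nunjucks. In short, it was time to overhaul my blog's Remix theme.</p><h2 id="更新日志">Changelog</h2><figure><img data-src="/images/pasted-96.png" alt="upload successful" /><figcaption>upload successful</figcaption></figure><figure><img data-src="/images/pasted-97.png" alt="upload successful" /><figcaption>upload successful</figcaption></figure><figure><img data-src="/images/pasted-98.png" alt="upload successful" /><figcaption>upload successful</figcaption></figure><ul><li>2022.1 NexT.Remix v3 (Preview): the current version, based on NexT.Gemini v8, blending the styles of the <a href="https://dnocm.com/cake/">Hexo Cake</a> and <a href="https://github.com/CaiJimmy/hugo-theme-stack">Hugo Stack</a> themes</li><li>2021.8 NexT.Gemini (Remix v2): speed optimizations</li><li>2020.5 NexT.Gemini (Remix v1): the initial modded version, based on NexT.Gemini v7, with a style inspired by Hexo Terminal and a certain Markdown Resume template</li></ul><p>Because NexT v8 changed a great deal of code compared with v7, <del>and because my old way of modifying code was rather wild</del>, the NexT.Remix code does not inherit from the previous modded theme. The changes also follow NexT's customization conventions: the theme is modified through Theme Inject, which only adds a few files and leaves the original files untouched, so new upstream code can be merged (relatively) easily.</p><p>Besides updating upstream and Hexo and adjusting the look of the interface, I also replaced the broken utterance with giscus.</p><h3 id="todo">TODO</h3><ul><li>For now, some common JS libraries are served through jsdelivr; later I will serve these files from the blog's own CDN to improve access speed and stability within mainland China.</li><li>The giscus integration is still rough; once it is polished I will submit a PR upstream.</li></ul><h2 id="演进方向">Direction</h2><h3 id="现代">Modern</h3><p>NexT is a theme with a rich history (it has changed repositories twice, for example), yet it has stayed true to its roots and still looks the way it did at the beginning. I prefer the currently popular <strong>post-flat</strong> style, but I also crave NexT's rich feature set and am too lazy to migrate platforms, so the only option was to turn NexT into something I like.</p><p>Since I am no design guru, I had to "borrow" from existing good designs.</p><ul><li>Hexo Cake: a theme modded from NexT v7. It is great overall, and since both are modified NexT, "borrowing" from it is even easier</li><li>Hugo Stack: I like its shadows and color scheme</li></ul><p>In fact, NexT has a very solid foundation; a few tweaks are enough to fully satisfy my taste.</p><h3 id="简洁">Clean</h3><p>When configured properly, NexT's interface is not bloated. Simplicity has been a goal since Remix v1. This time I removed even more unnecessary elements, such as the underlines that were everywhere: the pile on the friend links, the pile on the table of contents, and the ones under dates. Pagination also took a heavy hit.</p><p>In addition, I have long been playing down tags and categories. I never look at a post's tags or categories myself, readers all arrive from search engines, and search engines can extract keywords from the text without tags, <del>and of course the more important reason is that I am too lazy to add them</del>. So the tag and category elements in the interface have been reduced as well.</p><h3 id="个性">Personal</h3><p>This is also why I mod the theme myself. One obvious reason is that I do not want my blog theme to collide with all the run-of-the-mill ones. Another is that the theme should fit my own writing style. Unlike <a href="https://www.whexy.com/">this author</a>, who pursues the ultimate reading experience and sense of reward for readers, what I hope to write is:</p><ul><li>Memoir-style content that only I, and sometimes the people involved, are expected to read</li><li>(Ideally one-of-a-kind) technical articles<ul><li>for my own reference when I forget things</li><li>and, <strong>incidentally</strong>, for people who <strong>cannot find any other material</strong></li></ul></li></ul><p>In other words, I will not put much thought into improving the reading experience. I dislike spending effort on things I do not care about, so less important matters are handled in whatever way takes the least work. Take illustrations: never mind a unified illustration style, if an image is only there for looks, I leave it out. So I need a theme that looks good without post images, and one that does not require writing an excerpt but automatically uses the first paragraph as the summary.</p><p>I want my posts to focus on questions that many people care about but that have no answer yet, or whose answers nobody has written up, <del>since readers grasping at a last straw will hardly be picky about the reading experience</del>. A basic reading experience is still necessary, of course, and modding the theme to make posts more readable is part of improving it.</p><p>Also, I am counting on the theme modding to reflect the blogger's skill to some extent... What, you say the real experts build their own blog frameworks? I am not a front-end specialist, so I will not take on a job I lack the tools for.</p><p>Oh, by the way, the blog now uses something called the Neko language: part of it is bilingual Chinese and English, and part of it is NexT's clumsy English as rewritten by me.</p><h2 id="总结">Summary</h2><p>I feel the blog has made notable progress this year and looks more professional.</p><h3 id="文章数量">Number of Posts</h3><p>I did not seem to keep up a pace of one post per month. Anyway, I think the quality is a little better than last year's. (<del>Maybe I slacked off so much that I never ran into any pitfalls.</del>)</p><h3 id="博客主题">Blog Theme</h3><p>After the overhaul I am satisfied: from its appearance to its code, it is much more elegant than before.</p><h3 id="访问速度">Access Speed</h3><p>The previous setup was both expensive and poor; the Azure CDN + GitHub Pages combination is excellent.</p><h3 id="访问量">Traffic</h3><p>As long as I write posts faster than they go stale, traffic is bound to grow. Baidu, however, still stubbornly indexes only the home page. What a piece of junk.</p>]]></content>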
    
    
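    <summary type="html">&lt;p&gt;In the few days before Lunar New Year&#39;s Eve I finally had some free time. I had been unhappy with my old blog for half a year: the old theme looked ugly to me, and the blog system had been broken for a long time (although somehow people still managed to post comments, while I could not even log in myself). Then I suddenly noticed that the NexT theme had quietly moved to a new repository and had long since shipped a new major version, even switching its rendering backend to Nunjucks. In short, it was time to overhaul my blog&#39;s Remix theme.&lt;/p&gt;</summary>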
    
    
    
    
  </entry>
  
  <entry>
    <title>Unattended Ubuntu 20.04 Server Offline Installation</title>
    <link href="https://blog.mylab.cc/2022/01/16/Unattended-Ubuntu-20-04-Server-Offline-Installation/"/>
    <id>https://blog.mylab.cc/2022/01/16/Unattended-Ubuntu-20-04-Server-Offline-Installation/</id>
    <published>2022-01-15T16:45:11.000Z</published>
    <updated>2022-09-03T11:22:28.638Z</updated>
    
    <content type="html"><![CDATA[<p>Last year, I wrote <a href="/2021/04/27/How-to-make-an-unattended-Ubuntu-Server-Installation-ISO-with-cloud-init/">a post</a> about how to install Ubuntu 18.04 Server automatically. The main reason I chose the older version was that, at the time, I could not get Ubuntu 20.04 to install without a key press, and the officially recommended approach to offline installation did not work.</p><p><a href="https://www.pugetsystems.com/labs/hpc/How-To-Make-Ubuntu-Autoinstall-ISO-with-Cloud-init-2213/">An article</a> already describes the detailed steps for installing Ubuntu 20.04 Server automatically, but according to its author, their blog system stripped out some important characters. By cross-checking with <a href="https://gist.github.com/s3rj1k/55b10cd20f31542046018fcce32f103e">this script</a>, I figured out the correct way to achieve our goal.</p><h2 id="download-the-image">Download the image</h2><p>Download the live CD image in whichever way you prefer. For users located in China, I suggest downloading from <a href="https://mirrors.tuna.tsinghua.edu.cn/ubuntu-releases/focal/ubuntu-20.04.3-live-server-amd64.iso">https://mirrors.tuna.tsinghua.edu.cn/ubuntu-releases/focal/ubuntu-20.04.3-live-server-amd64.iso</a>.</p><h2 id="update-some-files-in-iso-image">Update some files in ISO image</h2><p>The only thing we need to do is update several files. Here are some recommended editors.</p><ul><li>For Windows users, I strongly recommend <code>UltraISO</code> for editing the ISO file.</li><li>For Linux users, <code>ISO Master</code> should work. (I have not tried it.)</li><li>For users who want to deeply customize the ISO file, including unpacking the <code>rootfs</code> image, <code>cubic</code> is everything you need. 
You can refer to <a href="/2021/04/27/How-to-make-an-unattended-Ubuntu-Server-Installation-ISO-with-cloud-init/">my previous post</a> to learn how it works.</li></ul><h3 id="add-kernel-arguments">Add Kernel Arguments</h3><p>Assume the root directory of the ISO image is <code>/cdrom</code>. Two bootloader configuration files need to be modified: <code>isolinux/txt.cfg</code> for legacy BIOS boot and <code>boot/grub/grub.cfg</code> for UEFI boot. Append the kernel arguments like this:</p><ul><li><code>/cdrom/isolinux/txt.cfg</code></li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">label live</span><br><span class="line">  menu label ^Install Ubuntu Server</span><br><span class="line">  kernel /casper/vmlinuz</span><br><span class="line">  append   initrd=/casper/initrd quiet autoinstall ds=nocloud;s=/cdrom/  ---</span><br></pre></td></tr></table></figure><ul><li><code>/cdrom/boot/grub/grub.cfg</code></li></ul><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">menuentry <span class="string">&quot;Install Ubuntu Server&quot;</span> &#123;</span><br><span class="line"><span class="built_in">set</span> gfxpayload=keep</span><br><span class="line">linux /casper/vmlinuz   quiet autoinstall ds=nocloud\;s=/cdrom/ ---</span><br><span class="line">initrd /casper/initrd</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>If you would like to skip the integrity check, you can try appending the kernel argument <code>fsck.mode=skip</code>, as the following example shows.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 
class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># File: /cdrom/isolinux/txt.cfg</span></span><br><span class="line"></span><br><span class="line">label live</span><br><span class="line">  menu label ^Install Ubuntu Server</span><br><span class="line">  kernel /casper/vmlinuz</span><br><span class="line">  append   initrd=/casper/initrd quiet fsck.mode=skip autoinstall ds=nocloud;s=/cdrom/ ---</span><br><span class="line"></span><br><span class="line"><span class="comment"># File: /cdrom/boot/grub/grub.cfg</span></span><br><span class="line"></span><br><span class="line">menuentry <span class="string">&quot;Install Ubuntu Server&quot;</span> &#123;</span><br><span class="line"><span class="built_in">set</span> gfxpayload=keep</span><br><span class="line">linux /casper/vmlinuz   quiet fsck.mode=skip autoinstall ds=nocloud\;s=/cdrom/ ---</span><br><span class="line">initrd /casper/initrd</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><blockquote><p>Note: The HWE kernel is a newer version of the Linux kernel than the default one and ships with newer drivers. In theory, it supports the latest hardware better.</p></blockquote><h3 id="add-auto-install-configurations">Add Auto-install Configurations</h3><p>Two new files are also required so that the installer's questions are answered automatically.</p><ul><li><code>/cdrom/user-data</code></li></ul><p>This is the configuration I am using now, and it is intended for machines without Internet access. 
I have verified that it can make the installation procedure fully automatic.</p><figure class="highlight yaml"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">#cloud-config</span></span><br><span class="line"><span class="attr">autoinstall:</span></span><br><span class="line">  <span class="attr">version:</span> <span class="number">1</span></span><br><span class="line">  <span class="attr">storage:</span>  <span class="comment"># should set the interactive default but doesn&#x27;t seem to work??</span></span><br><span class="line">    <span class="attr">layout:</span></span><br><span class="line">      <span class="attr">name:</span> <span class="string">direct</span></span><br><span class="line">  <span class="attr">locale:</span> <span class="string">en_US.UTF-8</span></span><br><span class="line">  <span class="attr">keyboard:</span></span><br><span class="line">    <span class="attr">layout:</span> <span class="string">us</span></span><br><span class="line">  <span class="attr">identity:</span></span><br><span class="line">    <span class="attr">hostname:</span> <span class="string">ubuntu-server</span></span><br><span class="line">    <span class="attr">password:</span> <span class="string">&quot;$6$exDY1mhS4KUYCE/2$zmn9ToZwTKLhCw.b4/b.ZRTIZM30JZ4QrOQ2aOXJ8yk96xpcCof0kxKwuX1kqLG/ygbJ1f8wxED22bTL4F46P0&quot;</span></span><br><span class="line"> 
   <span class="attr">username:</span> <span class="string">tonny</span></span><br><span class="line">  <span class="attr">ssh:</span></span><br><span class="line">    <span class="attr">allow-pw:</span> <span class="literal">true</span></span><br><span class="line">    <span class="attr">install-server:</span> <span class="literal">true</span></span><br><span class="line">  <span class="attr">package_update:</span> <span class="literal">false</span></span><br><span class="line">  <span class="attr">package_upgrade:</span> <span class="literal">false</span></span><br></pre></td></tr></table></figure><blockquote><p>Note:</p><ul><li><code>direct</code> storage layout means using and erasing the whole disk. (The default option provided by the interactive installer.)</li><li>The password is <code>ubuntu</code>. This can be generated by <code>mkpasswd</code>.</li></ul></blockquote><p>For more usages, check <a href="https://gist.github.com/dbkinghorn/c236aea31d76028b2b6ccdf6d3c6f07e">this example</a>.</p><ul><li><code>/cdrom/meta-data</code></li></ul><p>Just create an empty file and put it there.</p>]]></content>
    
    
    <summary type="html">&lt;p&gt;Last year, I wrote &lt;a href=&quot;/2021/04/27/How-to-make-an-unattended-Ubuntu-Server-Installation-ISO-with-cloud-init/&quot;&gt;a post&lt;/a&gt; about how to install Ubuntu 18.04 Server automatically. I chose the older version mainly because, at that time, I could not get Ubuntu 20.04 to install without pressing any key, and the officially recommended approach to offline installation did not work.&lt;/p&gt;</summary>
    
    
    
    
  </entry>
  
  <entry>
    <title>SC21 Retrospective - We Won, but Only by a Little</title>
    <link href="https://blog.mylab.cc/2022/01/01/SC21%E5%9B%9E%E9%A1%BE-%E8%B5%A2%E4%BA%86%EF%BC%8C%E4%BD%86%E5%8F%AA%E8%B5%A2%E4%BA%86%E4%B8%80%E7%82%B9%E7%82%B9/"/>
    <id>https://blog.mylab.cc/2022/01/01/SC21%E5%9B%9E%E9%A1%BE-%E8%B5%A2%E4%BA%86%EF%BC%8C%E4%BD%86%E5%8F%AA%E8%B5%A2%E4%BA%86%E4%B8%80%E7%82%B9%E7%82%B9/</id>
    <published>2021-12-31T19:55:10.000Z</published>
    <updated>2021-12-31T19:56:08.000Z</updated>
    
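    <content type="html"><![CDATA[<p>SC21 was, yet again, held online. That makes the second year in a row I lost a free trip to the US!!! The second year!!! <del>Can a competition without flights, hotels, and lavish feasting even be called a competition?!</del> Still, the result was pretty good, far beyond my expectations (read on for why).</p>
<h2 id="sc20没有什么比这更绝望的了">SC20: Nothing Could Be More Hopeless Than This</h2><p>Rewind to last year. I never wrote a retrospective for SC20, because we played it so, so, SO badly! So badly that we earned the outstanding result of second to last; so badly that I was too embarrassed to put the competition on my resume; so badly that I did not even ask about our SC20 score until SC21 kicked off. SC20 was our first SCC, but that hardly excuses being bad in such a distinctive way. The last competition featured, among other things,</p><ul><li>Nonexistent pre-competition preparation. The captain at the time did his best to set up the CycleCloud environment (he even dug into <code>cluster-init</code>), but the other members seemingly never touched the cloud environment at all (foreshadowing #1)</li><li>Two members from the 2016 cohort bailed (although, when we recruited them, we had agreed they could just be placeholders)</li></ul><p>I was in charge of the Reproducibility Challenge back then, but I had no idea what it actually required, and I even poured my energy into modifying the code. I used GDR to optimize GPU P2P communication and felt pretty awesome about it (foreshadowing #2). I did not watch the webinars at all, so only during the competition did I learn that we had to "write a report of internationally publishable quality within 46 hours". As a contestant with zero papers (still zero as of now), the kind who cannot even plot properly in Excel, I only realized on the spot: uh-oh.</p><p>MemXCT, the program proposed in the paper to be reproduced, has a CPU version and a GPU version, and the competition required us to plot figures and write a report using data from both. On the GPU side, since I had never used a cloud cluster before, I had no idea what was preinstalled on the cloud. The captain was heavily promoting his HPCX OpenMPI library, while I had been testing with the OpenMPI from the NVIDIA HPC SDK. I figured: HPCX is from NVIDIA, the HPC SDK is from NVIDIA too, so their OpenMPI builds must be the same thing, right? Predictably, MPI blew up spectacularly and the program simply refused to run. After endless MPI tuning and recompiling, and after trying every permutation of MCA parameters, well, nothing helped. I single-stepped through it in GDB to find the source of the explosion (back then I did not know GDB could just print a backtrace) and found that the crash was in the few lines of GDR P2P code I had changed. But didn't HPCX's OpenMPI claim to be CUDA-aware and to support GDR? Young and naive, I never suspected HPCX. I flailed around from day one until the evening of day two before it occurred to me to try switching back to the NVIDIA HPC SDK. Once I swapped the MPI, every problem vanished. Thanks to the luxurious four hours of sleep I enjoyed during that period, my brain throttled so hard that "just drop this optimization" never crossed my mind (and the reproducibility task did not need any optimization in the first place). Even for the other applications, getting the program to run is the top priority.</p><p>On the CPU side, <del>as everyone knows (they don't)</del>, we can pin processes to cores with all sorts of weird <code>mpirun</code> options (such as <code>ppn</code> and <code>map-by</code>) or with a job scheduler (such as Slurm). But back then I understood neither Slurm nor the <code>mpirun</code> options, and to top it off the program is hybrid (MPI + OpenMP), so each MPI rank needs several CPU cores instead of one core per rank. I mashed away at the keyboard with great ferocity. The good news: the program ran. The bad news: it ran across the nodes in a very strange posture. For example, it was supposed to run on four nodes, but one node did all the suffering while the other three stood by and watched.</p><p>Both versions only ran properly when the competition was nearly over. Because we had NOT written ANY of the paper in advance and had NO idea how to draw academic figures, had it not been for a helper assigned at the last minute, never mind finishing the report before the deadline, we probably could not have produced a single figure. We submitted so late that the competition ended while this pile of academic garbage was halfway through uploading. Just as well: nobody else will ever see this barely finished piece of dark history.</p><p>The others did not fare much better. With pre-competition preparation basically nonexistent (and people bailing), the applications were not merely a mess but a hundred million messes, earning hardly any points: CESM would not run, GROMACS only got partway, though MiniVite seemed okay. In the end we could only pin our hopes on the benchmarks and go all in with the remaining funding, hoping to snag a single-category award. The team grinded HPL for ages and finally reached 120 TFLOPS, grabbing a <strong>temporary</strong> first place, only to be promptly reduced to ashes by ETH's 129T, which was in turn promptly blown away by THU, who wanted to make big news in the middle of the night and did: 300T, leaving not even ashes behind. Meanwhile, Team T's HPCG and IO500 scores also smashed the leaderboard, at 3.89x and 5.76x the scores of the then-leader (the poor target), as if Moore's Law had a hope of resurrection. Facing such a gap, we had no room to fight back and could only strongly condemn their 5 am noob-stomping.</p><p>Only after the competition did I learn that, for HPL and HPCG, Team T's plan, submitted to the committee months earlier, had already schemed to clean out the Azure datacenter: pile up enough GPUs and the benchmark runs so fast that other teams cannot even smell your exhaust; operate fast enough, booting and shutting down at lightning speed, and even a huge pile of GPUs costs very little. They had also fully anticipated a shortage of V100 nodes and prepared a fallback to other GPU nodes. The committee approved their plan<del> and pulled up a bench to watch the show</del>. Our HPL, on the other hand, was penalized for exceeding the spec and dropped to the bottom. IO500 was even more outrageous: they built MadFS, a file system dedicated to topping the leaderboard (see <a href="https://www.youtube.com/watch?v=NRZhFoBC_Ak&amp;t=8s">金枪鱼之夜——IO500 S: There is rjgg behind MadFS - YouTube</a>). Bringing research output down into a student cluster competition is a strike from a higher dimension. I have to say THU's systems group is just too strong; everyone in TUNA is a talent, and they talk nicely too.</p><p>In short, if SC21 had not gone well, I would never have brought up SC20 again.</p>
<h2 id="sc21-phase-0队友人呢">SC21 Phase 0: Where Are My Teammates?</h2><p>This year's recruitment produced a big surprise: we lured in some newcomers from the 2020 cohort of the "just hand them a master's degree already" kind. For recruitment, we asked anyone interested in joining to complete a written test, the goal being, of course, to select people willing to learn about the field and do some searching. Since no course in the CS department covers supercomputing, my expectation was "anything that makes sense is fine", aka "anything with words on it is fine". And then this happened,</p><blockquote><p>Neko.d&gt; Hi there. Since you have single-handedly made me suspect the test was too easy, I would like to talk with you in person today...</p></blockquote><p>How can someone know almost everything right at the start (freshman year)? What are we old-timers even for, then?</p><p>But another problem emerged: this round of recruitment did not bring in a single girl. Last year we had both female and transgender members, so our diversity was maxed out, which rounds up to a guaranteed ticket to the finals. This year the team is all smelly dudes, so the proposal could only awkwardly brag about our glorious past.</p><p>When writing the proposal, I had planned to split up the work: write an outline in Chinese, offload it to teammates to produce English paragraphs, then merge them into a complete proposal. But when it came time to offload, some were rushing homework and some were rushing papers. Fine, but some were out partying and then playing dead. By the time the burden of being captain landed on my head, the proposal deadline was only a few days away, and I still had to go get sponsorship from Inspur. Realizing I could count on no teammate and had no time to do it all myself, I could only rage helplessly,</p><blockquote><p>Neko.d&gt; I AM REALLY ANGRY!</p></blockquote><p>But being angry did not help; the proposal still had to be written. In the end I came up with a stroke of genius: run the outline through Google Translate, skim it, tweak it, and submit. As long as I firmly believed that reviewers do not like reading paragraph after paragraph of fluff, my conscience would not hurt.</p><p>A few months later the news came and, surprisingly, our proposal, written with our feet, was accepted, while Team T's was rejected (although they got back in thanks to their ISC championship; the boss is still the boss). I heard several other strong Chinese universities were rejected too, most of them, I would guess, killed by diversity. One reviewer gave us a low diversity score, but the other reviewers were apparently fooled successfully. <del>Make sure you have mastered American-style political correctness before competing.</del></p>
<h2 id="sc21-phase-1还能抢救一下">SC21 Phase 1: It Can Still Be Saved</h2><p>As everyone knows, I am the king of slacking and have never lost at it. I slacked so hard that the committee emailed us specifically to urge us to log in, <del>so hard that even my teammates no longer dared to slack.</del></p><blockquote><p>Committee&gt; Azure says you have never logged in. The competition is about to start. Is your network bad, so you cannot log in? Let us know if there is a problem.</p><p>Member 1&gt; So when do we practice?</p><p>Member 2&gt; So when do we practice?</p></blockquote><p>The trial period for the Oracle cluster was only five days, and I only remembered it existed on the second day after the account was activated (I suck). The cluster configuration was fairly conventional, mainly because there was no autoscaling to worry about, and we did not find many problems during the trial (mainly because there was a lot we did not get around to trying).</p><p>A medium-sized incident: because I wanted the team to get familiar with Linux account management during practice, I had them edit <code>passwd</code> and <code>sudoers.d</code>. However, I underestimated the risk of editing <code>sudoers.d</code>: break it and <code>sudo</code> stops working, and restoring the <code>sudoers.d</code> configuration to rescue <code>sudo</code> itself requires <code>sudo</code> privileges. Boom, deadlock. With no other choice, I called for outside help: Marcin, Oracle's Technical Leader. He took one look at my problem, entirely unsurprised, and expertly pitched the art of changing <code>bootargs init=/bin/bash</code> in grub over the serial console. Presumably his customers (oh wait, that's me, never mind) blow up their systems quite often. When the grub screen showed up inside Windows Terminal, I was deeply shaken: it was my first time seeing grub over a serial console, with the authentic TUI fully preserved. After editing the arguments, I logged in as root without trouble and simply deleted the broken configuration.</p><blockquote><p>Marcin&gt; :) Sysadmin Sunday</p><p>Neko.d&gt; Oh. Sorry to disturb your beautiful Sunday morning lol.</p></blockquote><p>Marcin is a really great guy (people who rescue us are good people; people who rescue us on a weekend are great people!). He also taught me quite a few practical things, such as,</p><ul><li>One-click disabling of hyper-threading</li><li>One-click creation of cluster accounts (the cluster comes with LDAP preinstalled)</li><li><code>playbook</code> automation tooling (reportedly usable for reinstalling some software)</li><li><code>ssh-agent</code></li><li>...</li></ul><p>Things were less lucky on the Azure side. Chatting with other teams' sysadmins before, during, and after the competition, I found that every one of us had run into,</p><ul><li>Updated <code>cloud-init</code> configuration not being applied. The generally accepted workaround: want to change the <code>cloud-init</code> script? Rebuild the cluster! It only takes ten-odd minutes! Super fast! As a result, I did not dare write complicated init scripts</li><li>A pile of image pitfalls<ul><li>The VM generation did not match the generation supported by the image, so the hidden image <code>OpenLogic:CentOS-HPC:7_9-gen2:latest</code> had to be specified manually (it later turned out to be a long-known issue; the previous sysadmin knew but never said)<ul><li>A kind soul shared a magic PowerShell command to dump the image list</li></ul></li><li>Only the CentOS 7.9 image ships with drivers; the 8.1 image does not<ul><li>You can install the drivers yourself. The GPU side needs not only the GPU driver but also the NVSwitch driver, called Fabric Manager. If the NVSwitch driver is not working, you get a <code>cudaErrorSystemNotReady</code> error<ul><li>Addendum: do not keep both the driver installer and the install directory on the NFS volume. Copy the whole bundle to a local disk before installing; it speeds up installation a lot</li></ul></li><li>For IB, you can install the whole OFED bundle (more pitfalls here; foreshadowing)</li></ul></li></ul></li><li>The heavily modified Slurm releases a hard-won allocated VM when the VM's initial configuration times out for unknown reasons, even though the VM is perfectly usable. (Every node launch means queuing for ten-odd minutes, followed by a likely configuration timeout. Who can put up with that?)</li></ul><p>It is worth mentioning that we sysadmins unanimously agreed that Andy, Azure's technical support, was a grandmaster of playing dead: all three of us were ignored by Andy. So I went to complain to Kathleen, the SCC chair,</p><blockquote><p>Kathleen&gt; Ah, Andy said back in the webinar that he would be busy lately. Did you not attend?</p><p>Kathleen&gt; Also, where were you during the stand-up meetings?</p></blockquote><p>Fiiine. The only useful thing Andy told me was that the <code>cloud-init</code> script in the CycleCloud Console does not have to be a <code>cloud-init</code> script.</p><p>In the actual trial, Azure struck us with lightning right at the opening: first came that generation mismatch. We only started practicing on GPU nodes when the competition was about to begin (A100s are expensive, 27 dollars an hour, and the team was not ready yet, still wrestling with the CPU versions). Only then did we discover that the CentOS 8 image had no drivers whatsoever, and installing them ourselves led straight into the NVSwitch pitfall. After struggling for ages without success, and realizing that we would have to install drivers every time a VM started, which was unbearable, we could only try our luck and see whether 7.9 came with drivers. Fortunately, the drivers were all there (it seems to be the only image with the full set). Downgrading the OS caused some minor problems, such as the missing Lmod breaking module load for the Intel suite and the ancient GCC breaking the C++ ABI, but fortunately these could be solved.</p><p>Apart from being tormented by the "features" of both clusters, there did not seem to be many problems. The ones I remember are</p><ul><li>Spack fills up the tmp directory; changing the TMPDIR environment variable fixes it</li><li>Spack does not update or revalidate stale compiler information on its own and even caches the wrong information; cleaning the bootstrap fixes it (<code>spack clean -b</code>)<ul><li>NVIDIA HPC SDK has a similar problem: any change to your gcc compilers requires you to reinstall the HPC SDK</li></ul></li><li><code>nfslock</code> is not started by default, which leads to "No locks available"</li></ul><p>As for the other teammates, at first everyone was supremely confident, as if each were a grandmaster of compilation. But once they had to build the applications themselves, especially when building dependencies with Spack (e.g., OpenBLAS and FFTW) or building GPU code, the compilers would blow up beyond recognition. They had to keep switching compilers and build approaches, going through GCC, ICC, ICX, and NVC. My young and naive teammates even had a shred of trust in AMD and tried AOCC and AOCL as drop-in substitutes. I will not spell out the result; those who know, know. In the end, the same old combo, ICC + MKL, proved rock solid. (Actually, I forgot to install MKL at first, so I told them to try OpenBLAS, and then enjoyed the compiler fireworks.) Also, NVCC often picks the wrong host compiler, so you need <code>--ccbin</code> or the like. In short, you never know how many compilers' output was stitched together to produce the final binary that appears to run but will most likely explode.</p>
<h2 id="sc21-phase-2演我们">SC21 Phase 2: Are You Playing Us?</h2><p>The competition officially started the day after our Azure trial credits ran out, <del>which tells you when we started preparing</del>. According to a plan that basically stated the obvious, we would open day one by running the Reproducibility Challenge and Cardioid; in short, get the Oracle cluster busy first, since it had no budget limit, and bring in QE and the mystery application, which cost Azure money, a bit later. There was a meeting at 9:30 am, and we assumed the instructions would be released, but apart from an excruciatingly awkward greeting, nothing happened, and we saw not a word of instructions (we later realized the meeting existed to lure us into breakout rooms so that our competition accident scenes could be livestreamed to everyone else).</p><p>I have to say this year's instructions were rather casual. Not only was the release time casual, but also,</p><ul><li>The submission instructions were simply last year's, without a single word changed</li><li>The instructions for one task were put on GitHub, then switched back to private for rework shortly after release</li></ul><p>These sloppy instructions made a competition that already felt unserious for being remote look even less serious. I really want to compete in a proper on-site SC<del>, but the chance is gone</del>. Anyway, the competition had to go on.</p>
<h3 id="cardioid">Cardioid</h3><p>The teammate assigned to this was incredibly badass (and from the 2020 cohort, no less), basically a bug-fixing automaton. Since he had experience messing around with NVRTC and LLVM + PTX, I tossed this task to him. Overall we did not hit many problems (because the bugs were fixed instantly, leaving me with no deep impression of them).</p><ul><li>During testing we found that multi-node GPU runs had problems, but Oracle only had one GPU node, so never mind. Cardioid seems to have several variants and only one of them runs, but the competition only uses that one, so never mind again</li><li>Cardioid looks like it can be built entirely with Spack, but everyone who tried said it sucks. Some teams apparently stayed stuck on Spack, when simply skipping Spack and building by hand works fine</li><li>The Oracle cluster runs Oracle Linux 7, based on RHEL 7, <del>which means the environment is about as old as CentOS 7</del>. So the C++ ABI problem showed up here too; fortunately our last-minute cramming paid off</li></ul><p>The competition required running dozens of parameter sets, varying the grid size, precision, and numerical method, for dozens of combinations in total. After one pass, we found that a certain class of combinations would crash, in several distinct ways. The first kind: a gdb backtrace showed that the file permissions were simply wrong, a rookie mistake. The second kind: the computation produced NaN. But reading the instructions carefully, the task authors had apparently anticipated that contestants would get NaN, because one of the questions asked something like "Have you seen NaN yet?", so it had to be a trap set on purpose. My young teammate, not yet wise to the wickedness of human nature, even considered modifying the code to fix the NaN. The third kind: the backtrace would not come out, so we suspected a smashed stack. We first changed ulimit to unlimited stack space; no luck. Single-stepping reached a for loop reading an array, and the program faulted. My brain seized up and I actually suspected this read had smashed the stack (reads do not write to the stack, hello!), because the array lived in the heap at a high address (<code>0x7ff</code>), close to the stack, so maybe an out-of-bounds access had reached the stack. I even computed whether base + size exceeded the address in esp; it turned out to be completely fine. Further debugging ran into more baffling things, such as irreproducible crashes: a value looked wrong the first time and fine the second. I simply gave up on thinking. It was 1 am, time for bed, and day 2 would be an all-nighter, so I left the teammate on night watch to keep debugging on his own.</p><p>On the first night, three teammates on campus kept watch, mainly because the committee completely ignored the fact that 50% of the teams were from China and required us to keep our cameras on from 12 am to 8 am for spectators (<del>livestreaming our sleep?</del>). I complained to the chair,</p><blockquote><p>Kathleen&gt; Why didn't you say so earlier?</p></blockquote><p>I had been counting on the other Chinese teams to notice this before I did and argue about it. How come I was the only one who got scolded? At the wrap-up meeting, the chair even remarked that some teams did not like to socialize and only wanted to sleep. (<del>No way, no way, surely nobody wants to sleep during a competition?</del>)</p><p>I had just showered and lain down when that teammate messaged me: debugged. gdb is unreliable and printf is still the way; the culprit turned out to be an integer overflow. Wow, overflow and underflow, the full set, so the task authors definitely did it on purpose. This overflow has a tricky workaround: run MPI on more nodes (that is, more ranks), because the expression that overflows divides by the number of nodes, so with a larger count it no longer overflows. Eight GPUs running eight processes, and no more? Not a problem: just run two MPI processes per GPU. (I still suspect a corrupted stack caused GDB's weird behavior, perhaps with RDMA scribbling over the stack, since the overflowed result seems to propagate into MPI arguments.)</p>
<h3 id="复现">Reproducibility Challenge</h3><p>I drafted the teammate for this task because he was well-read in all kinds of AI papers. Initially he thought the hard part of the live competition would be writing the paper, since everything had been rock solid when tested on Azure earlier. We had no time to test on Oracle, but it should not be a big problem... right?</p><p>Once we were actually on Oracle, I have to say it was full of surprises (meaning new bugs). Unlike Azure, Oracle does not ship MVAPICH, so we had to build one with Spack. The network adapters were also configured in a more complicated way, apparently dual-port 100G RoCE, unlike the single-port IB on Azure. The program seemed to compile and run normally and produced some results, except that... it would occasionally blow up late in the computation. According to my rich experience of not very many years, as long as an MPI program can produce results, the problem is not serious. At first I blindly guessed it was bad luck: the crash was random and a rerun would fix it. After several reruns, we found that specific parameter combinations crashed 100% of the time, apparently on an internal MPI error, dying in the same MPI function. Since the program always crashed after tens of minutes, a few gdb sessions could burn an hour or two, so we decided to try various folk remedies first while collecting the data points that did run.</p><p>Bizarrely, unlimited stack space, a different MVAPICH version, using Slurm, and adding MPI runtime parameters each <strong>sometimes</strong> got the program running, and then stopped working. We also found that some approaches did nothing at all, such as changing the fabric options when building MVAPICH or switching to HPCX MPI (the preinstalled one never ran successfully even once). From daytime until deep into the night, the program stayed in a superposition of runnable and not runnable. When I returned to the meeting room on the morning of day 2, I could see that teammate had been tortured all night: his body was completely drained and his brain had stopped thinking entirely. While I slept, he had asked Marcin, the great guy and Oracle engineer, who gave us <a href="https://blogs.oracle.com/cloud-infrastructure/post/running-applications-on-oracle-cloud-using-cluster-networking">a blog post packed with magic MPI parameters</a>. However, this magic post provides parameters for OpenMPI, for Intel MPI, and for Platform MPI, but none for MVAPICH. My teammate tried to derive a set of MVAPICH parameters modeled on the post, but it did not help.</p><blockquote><p>Fun fact: on the second night, Marcin the great guy also tried MVAPICH for us and gave us a set of parameters,</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">mpirun -hosts hpc-node-1,hpc-node-2 -env MV2_IBA_HCA=mlx5_2 -env MV2_USE_RoCE=1 /nfs/cluster/osu-micro-benchmarks-5.8/mpi/one-sided/osu_get_latency</span><br></pre></td></tr></table></figure><p>Apart from not specifying <code>MV2_IBA_HCA</code>, my teammate had written all the other parameters the same way. (I actually suspect adding <code>MV2_IBA_HCA</code> would have fixed it, but we never had time to test it.)</p><p>And so,</p><blockquote><p>Neko.d&gt; This MPI runs most of the time and crashes some of the time. Maybe it is an MVAPICH bug</p><p>Marcin&gt; we can take a look and report to Dr. DK Panda.</p></blockquote><p>In the old days, resumes went straight to the boss; nowadays, bug reports go straight to the author.</p></blockquote><p>By noon on day two, debugging had still made little progress. It occurred to me that we did not necessarily have to use MVAPICH; the instructions did not require it, so switching should be allowed. I asked other teams and, wow, some had been running OpenMPI from the very beginning. So we prepared to switch to Intel MPI, which is battle-tested. Intel MPI might single-handedly change the program's performance characteristics, alter the trend of the performance curves, and thereby overturn the conclusions of the original paper, but staying stuck on this problem was not an option either. We rebuilt ramBLe with Intel MPI + a self-built GCC 9. On the first run, without any MPI parameters, the program exploded immediately, before the computation even began. I thought we were doomed again. In a nothing-left-to-lose mood, I tried the parameters from the blog post, and the program actually ran! And finished successfully! It was also the first time I had seen Intel MPI fail to run without weird parameters (normally I feel that adding too many parameters and stacking too many buffs makes a program crash even harder).</p><blockquote><p>The buff combo. All I can say is that it is all magic with just a little bit of code:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">-iface ens800f0 -genv UCX_TLS rc,self,sm -genv UCX_NET_DEVICES mlx5_2:1 -genv I_MPI_FABRICS shm:ofi -genv I_MPI_FALLBACK 0</span><br></pre></td></tr></table></figure></blockquote><p>We reran the data points that had failed before and, hey, they produced results. By then, however, competition time was running short, and the earlier MVAPICH results could not be used. To catch up, my teammate considered running two data points on one node at the same time, but my project this semester studies exactly the interference between co-located programs, so I shot the idea down. Worst case, we would submit the original MVAPICH data with an explanatory note as filler; in the end, though, we got nearly all the results in time. There was also the core-binding issue: MVAPICH's default binding ends up in a very strange layout, but the instructions said nothing about binding, so we just let Intel MPI do its thing with its default binding order. Compared against the MVAPICH results in the end, neither the binding nor the switch to Intel MPI made much difference; the trend stayed consistent.</p>
<h3 id="qe-神秘应用">QE &amp; Mystery Application</h3><p>Both of these tasks cost money to run. As someone perennially stingy, especially with cloud servers (<del>though as an AWS intern I actually burned quite a lot of AWS's money</del>), I was naturally reluctant to let QE and the mystery application test lavishly on eight CPU nodes or eight GPUs. Stingy as I am, I gave them an Intel Haswell node with a single V100: test on this cheap node first, then move to the real environment. Although it is an Intel CPU, Haswell only supports up to AVX2, so code that runs on it should also run on the AMD platform.</p><p>These two were actually the applications I worried about most before the competition, because they looked less well prepared than the other two, especially since the teammate on the mystery application was in Hong Kong and could only communicate online. As it turned out, these two tasks ran into few problems (or rather, my teammates quietly fixed them on their own), to my surprise. The QE teammate first reported that the CPU version failed its unit tests; after some fixing (apparently a missing library or the like), he soon reported that the CPU unit tests passed. Then he started stacking optimization flags and the CPU unit tests failed again; he tinkered a bit and they passed again. Then he moved on to the GPU version, where the problem was finally a little thornier and kept him busy until the second day. To be safe, I had him use four nodes during the night between day 1 and day 2 to produce a baseline result, then think about which machines could raise the score, and go for a higher score once the benchmarks were done and there was spare money (in the end we had spare money but no time). If the GPU version worked, we could consider time-sharing the GPU node with the mystery application the next day. To save money, I did not even let him use eight nodes (on second thought, it would not have cost much and should have raised the score considerably).</p><p>As for the mystery application, on day 1 he started by listing <strong>all</strong> the commands needed to build and install the program (how did he get the list so complete? Can he see the future?). I took a quick look, added a few things, and he went off to work on it by himself. I expected all sorts of trouble, but he was as quiet as if he had bailed. In between, he only asked once each about something MPI-related (HPCX would not run again; rebuilding with a different MPI fixed it), missing Python packages, and environment variables for NVCC and MPI. It scared me into thinking he had failed to get it running and given up. On day 2, he ran into something that could only be tested on a multi-GPU node. Not long after I launched an A100 node for him, the next thing he said was,</p><blockquote><p>Neko.d&gt; You should be able to connect (to the eight-GPU node) now <span class="citation" data-cites="12:25">@12:25</span></p><p>Teammate&gt; It should be done running now <span class="citation" data-cites="14:48">@14:48</span></p></blockquote><p>That went absurdly smoothly. Considering that multi-node runs could go wrong in all kinds of ways and would require a lot of debugging time <strong>on two GPU nodes</strong>, and given my lack of confidence in my own multi-node, multi-GPU debugging experience and speed, we were better off saving the money for the benchmarks. While the mystery application was running on the A100s, QE's GPU version still did not seem to work well.</p><blockquote><p>All the tasks were completed on a single eight-GPU node.</p></blockquote>
<h2 id="sc21-phase-3演你们">SC21 Phase 3: We Are Playing You!</h2><p>The guy running the benchmarks was in the UK and finally remembered to log in around noon on day 2. Earlier, during practice,</p><blockquote><p>Teammate&gt; Is this practice or the competition right now?</p></blockquote><p><del>Looks like the network between China and the UK is down.</del></p><p>Running IO500 does not burn money, so he started by playing with that. At first we still had hopes for the distributed BeeGFS bundled with CycleCloud and ran a five-node score. First attempt: 8 points. No idea how good that is, except that it is less than 10% of Team T's score last year. For this year's IO500, simply giving up was the optimal strategy: wasting time to gain 10 points is indistinguishable from gaining nothing in the face of certain leaderboard-only file systems. Still, submitting an 8 seemed too embarrassing. We then ran a single node: 10 points. Wow, going distributed was a negative optimization (probably communication overhead? Or a misconfigured cluster?)</p><p>After warming up with IO500, he... went to sleep. It was already morning local time for him. By the time he logged in again, 10 hours had passed and it was the evening of day 2 for us. He seemed quite confident, sleeping so soundly.</p><p>QE and the mystery application ran very economically, saving over 1k dollars for the benchmarks to burn. We had anticipated that A100s might be impossible to allocate and planned to fall back to V100s, and then discovered that... V100s could not be allocated either. (Team T was probably running benchmarks on V100s at that moment.) When we finally got two V100 nodes, we found that the Azure image, maddeningly, has an IB driver, but only one that is completely unusable: the image ships only the driver for the new adapters, while the machines have only old adapters. Having never tried V100 nodes before the competition, we were caught completely off guard (though my teammate, aka the former sysadmin, used V100 nodes last year; how did he remember nothing?). In short, if we insisted on V100 nodes, we would have to install OFED by hand. (Team T later said they had found this issue back in testing and probably had automation scripts ready.)</p><p>However, I had never seen my teammate use clusterssh or any other terminal that can broadcast input, so I deeply doubted his claim that he could set up OFED on 10+ nodes in a short time. Besides, when I installed OFED myself, I once had the Linux kernel upgraded along the way, which wiped out the GPU kernel module; we might finally get OFED working only to find CUDA broken. So I was not comfortable letting him keep spending time on V100s.</p><p>While we were wrestling with the V100s, ShanghaiTech's sysadmin asked whether we wanted to take over their two A100 nodes, which they were about to release. At that point I did not yet know about the IB driver pitfall, so I declined. When I found out, I was full of regret, but then I discovered that I could casually allocate two A100 nodes myself, albeit after a long wait in the queue. The V100 route looked hopeless, so we might as well run on however many A100 nodes we could get. I did the math on the budget and found,</p><blockquote><p>Neko.d&gt; We now have the problem of having money we cannot spend</p></blockquote><p>We had enough money to wait for A100s; even if we ended up with 8 A100 nodes, we might not be able to spend it all. I also boldly assumed that most teams would give up their A100s near the end of the competition for lack of funds (as it later turned out, we also owe thanks to certain troublemaking teams for graciously not grabbing A100s). So I had my teammate tune parameters on the A100 cluster while we waited for machines. Starting with the third machine, Azure would not even let us into the queue, but with enough retries we could get in. On average it took 10 minutes to enter the queue and another 10 minutes in the queue, so one node every 20 minutes. Since everyone's progress was good overall and there was nothing for me as the sysadmin to do, I became an emotionless mouse-clicking machine.</p><blockquote><p>I wanted to write a script to poll automatically, but the company-issued Mac would not let Chrome access insecure sites (the CycleCloud Web Console's HTTPS certificate was not set up properly), so I could not use the developer tools to generate curl commands that replay the requests, and had to click by hand.</p></blockquote><p>After about two hours, we finally had 6 A100 nodes, but from then on we could never get into the queue again. I figured no team would be rich enough to hold A100s until the last second (yet such teams really existed), but we truly could not get a single node more after that.</p><p>My teammate, for his part, pulled out <strong>heirloom HPL and HPCG binaries</strong> of unknown origin, copy-pasted them, ran an <strong>heirloom script with heirloom parameters</strong> (probably from back in the ASC days), and started benchmarking! Is Linux binary compatibility that good these days? The 6-node HPL initially only reached about 100T, though, and htop showed CPU usage as a big sea of red, which was very wrong. It turned out the number of MPI processes/threads was simply set incorrectly: classic context switching. After adjusting the parameters, we saw a promising projected result during the run, but then it crashed. On a whim, my teammate lowered the thread count, and it worked. Too weird. (Another team even hit a truly magical "Verification failed", which was an eye-opener.)</p><p>My teammate claims to have crashed nodes outright while running HPL on our campus cluster, so thank goodness he did not crash Azure's physical machines here.</p><p>The rest of the story is fairly simple. With 48 A100s behind us and the money in place, the power of money never disappoints. Those sneaky devils on Team T first posted a very low HPL score to feign weakness, and only near the end of the competition released their real scores: HPL, HPCG, and IO500 all far ahead of the then-leader, like SC20 all over again. I glanced at the scores we had just produced: never mind, then. Let Team T be happy for a few dozen minutes and then let them taste the wickedness of capitalism (just kidding).</p><p>My sadistic teammate kept feeling that the A100s had not been squeezed dry. Judging by theoretical performance, HPL still had a chance of breaking 300T, but sadly it only reached 284T in the final minutes. In a brain glitch, I actually agreed to let them override the previously submitted 280T with this score in the final seconds, nearly replaying SC20. This was the closest we came to disaster in this competition. Fortunately, the teammate handling submission was smart enough to submit first and delete afterwards, rather than delete first and submit afterwards. The new result was submitted before the deadline, but there was no time left to delete the old one; had the update order been reversed, well, emmm. Even so, the organizers' Grafana got confused, apparently unable to handle multiple results, and could not display our HPL score. And thus we created the miracle of SC21's highest LINPACK score being a whopping 0 GFLOPS! (just kidding)</p><p>Also, IO500 had originally produced the "good" score of 10 points, but we forgot to save it. Rerunning on a single node gave only 8 again. My teammate wanted to install Lustre to boost IO500 right up to the last moment, but even a slightly higher score would not have been of any *bleep* use.</p>
<h2 id="sc21-phase-3.5interview">SC21 Phase 3.5: Interview</h2><p>Last year we had to give one complete presentation at the end; this year it became a poster plus a separate interview for each task. Before the poster session I was still chatting idly, <strong>completely unaware that the poster presentation counted toward the score</strong>. Then the iPad (which we kept connected to the online meeting during the competition) suddenly spoke up, asking me to present. I had "absolutely no preparation whatsoever" <del>(proudly)</del>, so naturally my talk was a mess. I could not even remember what I had written on the poster, so I could only tell the judges,</p><blockquote><p>Neko.d&gt; Mm-hmm, just have a look yourselves. This is not the final competition configuration anyway, so a glance will do</p></blockquote><p>Sure enough, our poster score turned out not to be high (I suck).</p><p>My teammates' per-task interviews were far more reliable than mine; they could answer basically every question (when did you all learn the physics behind these tasks?). Notably, the sophomore teammate in charge of Cardioid does not speak much English, but with the help of another teammate, a master with a TOEFL score of 100+, and with the interviewer kindly typing the questions into the chat box, the interview was completed even though</p><ul><li>I could not understand what the Cardioid teammate said to the translator</li><li>I could not understand what the translator said to the interviewer</li></ul><p>All I know is that</p><ul><li>The judges were very satisfied with our answers</li><li>They were laughing happily, but I did not know what they were laughing at</li><li>A judge asked: "If you had a terminal illness, would you be willing to use a heart built with my simulator?"<ul><li>Answer: "Screw the simulation. I will just become a cyborg; the flesh is unnecessary"</li></ul></li></ul>
<h2 id="ending">Ending</h2><p>In fact, this 10,000-character ramble was finished back in October; I just dragged my feet until the next year to publish it, heh. Come to think of it, this was also my last SC competition as an official member/captain. It makes me so mad: because of this wretched pandemic, I lost free trips to the US (2x), Singapore (2x), Beijing (2x), and Xiamen (1x). I feel deeply guilty for failing to help the university spend its money. I really do want to extend my student status and keep working hard at helping the university spend money. I hope to compete in ASC22 this year as a Year 0 PhD student (<del>surely the offer will come through, right?</del>), then return to the SC venue some year as a presenter, and be an eyewitness to the moment nike's successor crushes the other teams at the SC SCC.</p>]]></content>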
    
    
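    <summary type="html">&lt;p&gt;SC21 was, yet again, held online. That makes the second year in a row I lost a free trip to the US!!! The second year!!! &lt;del&gt;Can a competition without flights, hotels, and lavish feasting even be called a competition?!&lt;/del&gt; Still, the result was pretty good, far beyond my expectations (read on for why).&lt;/p&gt;</summary>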
    
    
    
    
  </entry>
  
  <entry>
    <title>Everything you need to know about Splitting NCCL Communicators</title>
    <link href="https://blog.mylab.cc/2021/12/15/Everything-you-need-to-know-about-Splitting-NCCL-Communicators/"/>
    <id>https://blog.mylab.cc/2021/12/15/Everything-you-need-to-know-about-Splitting-NCCL-Communicators/</id>
    <published>2021-12-14T18:47:00.000Z</published>
    <updated>2021-12-17T10:57:24.000Z</updated>
    
    <content type="html"><![CDATA[<p>MPI can create a new communicator by splitting an existing one into sub-communicators, which lets a program dynamically select a subset of compute nodes to take part in collective communication operations such as all-reduce and all-gather. NCCL has a similar feature, but it is not well documented yet.</p><h2 id="tldr">TL;DR</h2><p>Since NCCL programs commonly use MPI to bootstrap across multiple nodes, the following example code is based on the MPI programming model. Assume there are 4 CUDA GPUs and 4 corresponding MPI ranks. This code performs an all-reduce within the first two ranks and within the last two ranks simultaneously.</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span 
class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;nccl.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;mpi.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;stdio.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;stdint.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;assert.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;thrust/device_ptr.h&gt;</span></span></span><br><span class="line"><span class="meta">#<span class="meta-keyword">include</span> <span class="meta-string">&lt;thrust/fill.h&gt;</span></span></span><br><span class="line"></span><br><span class="line"><span class="keyword">using</span> <span class="keyword">namespace</span> std;</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">int</span> <span class="title">main</span><span class="params">()</span> </span>&#123;</span><br><span class="line">  <span class="built_in">MPI_Init</span>(<span class="literal">NULL</span>, <span class="literal">NULL</span>);</span><br><span class="line">  <span class="keyword">int</span> world_size, world_rank;</span><br><span class="line">  <span 
class="built_in">MPI_Comm_size</span>(MPI_COMM_WORLD, &amp;world_size);</span><br><span class="line">  <span class="built_in">MPI_Comm_rank</span>(MPI_COMM_WORLD, &amp;world_rank);</span><br><span class="line">  <span class="built_in">assert</span>(world_size == <span class="number">4</span>);</span><br><span class="line">  <span class="built_in">cudaSetDevice</span>(world_rank); <span class="comment">// GPU N binds to MPI rank N</span></span><br><span class="line"></span><br><span class="line">  ncclUniqueId nccl_id, nccl_ids[<span class="number">4</span>];</span><br><span class="line">  <span class="keyword">size_t</span> id_size = <span class="built_in"><span class="keyword">sizeof</span></span>(ncclUniqueId);</span><br><span class="line"></span><br><span class="line">  <span class="comment">/* Generate Unique ID */</span></span><br><span class="line">  <span class="comment">// nccl_id is a simple struct with the size of exact 128 bytes</span></span><br><span class="line">  <span class="comment">// so it can be transferred over MPI</span></span><br><span class="line">  <span class="built_in">ncclGetUniqueId</span>(&amp;nccl_id);</span><br><span class="line">  <span class="built_in">MPI_Allgather</span>(&amp;nccl_id, id_size, MPI_UINT8_T,</span><br><span class="line">                &amp;nccl_ids[<span class="number">0</span>], id_size, MPI_UINT8_T, MPI_COMM_WORLD);</span><br><span class="line"></span><br><span class="line">  <span class="comment">/* Create a sub-communicator */</span></span><br><span class="line">  ncclComm_t nccl_comm;</span><br><span class="line"></span><br><span class="line">  <span class="keyword">if</span> (world_rank &lt;= <span class="number">1</span>) &#123;</span><br><span class="line">    <span class="built_in">ncclCommInitRank</span>(&amp;nccl_comm, <span class="number">2</span>, nccl_ids[<span class="number">0</span>], world_rank);</span><br><span class="line">  &#125; <span class="keyword">else</span> <span class="keyword">if</span> 
(world_rank &gt;= <span class="number">2</span>) &#123;</span><br><span class="line">    <span class="built_in">ncclCommInitRank</span>(&amp;nccl_comm, <span class="number">2</span>, nccl_ids[<span class="number">2</span>], world_rank - <span class="number">2</span>);</span><br><span class="line">  &#125;</span><br><span class="line"></span><br><span class="line">  <span class="comment">/* Test */</span></span><br><span class="line">  <span class="keyword">constexpr</span> <span class="keyword">size_t</span> N = (<span class="keyword">size_t</span>)<span class="number">1e3</span>;</span><br><span class="line">  <span class="keyword">constexpr</span> <span class="keyword">size_t</span> arr_size = <span class="built_in"><span class="keyword">sizeof</span></span>(<span class="keyword">int64_t</span>) * N;</span><br><span class="line">  <span class="keyword">void</span> *arr, *arr_host;</span><br><span class="line">  <span class="built_in">cudaMalloc</span>(&amp;arr, arr_size);</span><br><span class="line">  <span class="built_in">cudaMallocHost</span>(&amp;arr_host, arr_size);</span><br><span class="line">  </span><br><span class="line">  <span class="comment">/* Init the array on local GPU */</span></span><br><span class="line">  <span class="function">thrust::device_ptr&lt;<span class="keyword">int64_t</span>&gt; <span class="title">arr_ptr</span><span class="params">((<span class="keyword">int64_t</span>*)arr)</span></span>;</span><br><span class="line">  thrust::<span class="built_in">fill</span>(arr_ptr, arr_ptr + N, world_rank);</span><br><span class="line"></span><br><span class="line">  <span class="built_in">ncclAllReduce</span>(arr, arr, N, ncclInt64, ncclSum, nccl_comm, <span class="literal">NULL</span>);</span><br><span class="line">  <span class="built_in">cudaMemcpy</span>(arr_host, arr, arr_size, cudaMemcpyDeviceToHost);</span><br><span class="line">  <span class="built_in">printf</span>(<span class="string">&quot;[rank%d] result: %ld\n&quot;</span>, 
world_rank, ((<span class="keyword">int64_t</span>*)arr_host)[<span class="number">0</span>]);</span><br><span class="line"></span><br><span class="line">  <span class="built_in">MPI_Finalize</span>();</span><br><span class="line">  <span class="keyword">return</span> <span class="number">0</span>;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>This code can be compiled and run on my machine with these commands,</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">nvcc -ccbin mpic++ test.cu -o <span class="built_in">test</span> -L/usr/<span class="built_in">local</span>/cuda/lib -lnccl</span><br><span class="line">mpirun -n 4 ./<span class="built_in">test</span></span><br></pre></td></tr></table></figure><blockquote><p>Note: Using <code>nvcc</code> to compile MPI code is not a common practice. It is recommended to compile it with <code>mpic++</code> from a CUDA-Aware MPI variant.</p></blockquote><p>The output of this program should be,</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">[rank0] result: 1 <span class="comment"># 0 + 1 = 1</span></span><br><span class="line">[rank1] result: 1</span><br><span class="line">[rank2] result: 5 <span class="comment"># 2 + 3 = 5</span></span><br><span class="line">[rank3] result: 5</span><br></pre></td></tr></table></figure><p><strong>The key is <code>ncclCommInitRank</code>. Suppose only a subset of ranks initializes the communicator with the same unique ID belonging to one of them. 
In that case, this communicator will ignore other ranks that are not in this subset.</strong></p><h2 id="usage-of-ncclcomminitrank">Usage of ncclCommInitRank</h2><blockquote><p>Official API explanation:</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line"><span class="function">ncclResult_t <span class="title">ncclCommInitRank</span><span class="params">(ncclComm_t *comm, <span class="keyword">int</span> nranks, ncclUniqueId commId, <span class="keyword">int</span> rank)</span></span></span><br></pre></td></tr></table></figure><p>Creates a new communicator (multi thread/process version). rank must be between <code>0</code> and <code>nranks-1</code> and unique within a communicator clique. Each rank is associated to a CUDA device, which has to be set before calling <code>ncclCommInitRank</code>. <code>ncclCommInitRank</code> implicitly synchronizes with other ranks, so it must be called by different threads/processes or use <code>ncclGroupStart</code>/<code>ncclGroupEnd</code>.</p></blockquote><p>In addition to the official instructions, we should also know,</p><ul><li>Each unique ID should only be used once.</li><li><code>ncclGetUniqueId</code> can be invoked multiple times, and it will return a different unique ID each time. 
Meanwhile, previously generated unique IDs remain valid.</li><li>It is safe to communicate within disjoint subsets of nodes simultaneously.</li><li>Using NCCL to perform inter-GPU communication concurrently with CUDA-aware MPI may create deadlocks.</li></ul><h2 id="performance">Performance</h2><p>I also evaluated the performance impact of sub-grouping.</p><p>The testbed is,</p><ul><li>AWS <code>g4dn.metal</code> instance with 8x NVIDIA Tesla T4 GPUs.</li><li>Shipped with AWS Deep Learning AMI<ul><li>OS: Ubuntu 18.04 (Kernel Version: Linux 5.4)</li><li>CUDA Toolkit: 11.0 (Driver Version: 450.119.03)</li></ul></li></ul><p>First of all, I would like to highlight the GPU topology of this bare-metal machine.</p><blockquote><p>Note: We should extract the topology information from physical machines instead of virtual machines, since the hypervisor may obscure the result for security reasons.</p></blockquote><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># nvidia-smi topo -m</span></span><br><span class="line">        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity</span><br><span class="line">GPU0     X      PHB     NODE    NODE    SYS     SYS     SYS     SYS     0-23,48-71      0</span><br><span class="line">GPU1    PHB      X      NODE    NODE    SYS     SYS     SYS     SYS     0-23,48-71      0</span><br><span class="line">GPU2    NODE    NODE     X      PHB     SYS     SYS     SYS     SYS     0-23,48-71      0</span><br><span class="line">GPU3    NODE    NODE    PHB      X      SYS     SYS     SYS     
SYS     0-23,48-71      0</span><br><span class="line">GPU4    SYS     SYS     SYS     SYS      X      PHB     NODE    NODE    24-47,72-95     1</span><br><span class="line">GPU5    SYS     SYS     SYS     SYS     PHB      X      NODE    NODE    24-47,72-95     1</span><br><span class="line">GPU6    SYS     SYS     SYS     SYS     NODE    NODE     X      PHB     24-47,72-95     1</span><br><span class="line">GPU7    SYS     SYS     SYS     SYS     NODE    NODE    PHB      X      24-47,72-95     1</span><br></pre></td></tr></table></figure><p>It looks like a balanced tree topology. We can expect two neighboring GPUs to communicate more efficiently.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line">UPI                   </span><br><span class="line"> |--CPU0              </span><br><span class="line"> |   |--PCIe Switch   </span><br><span class="line"> |   |   |--GPU0      </span><br><span class="line"> |   |   |--GPU1      </span><br><span class="line"> |   |--PCIe Switch   </span><br><span class="line"> |       |--GPU2      </span><br><span class="line"> |       |--GPU3      </span><br><span class="line"> |--CPU1              </span><br><span class="line">     |--PCIe Switch   </span><br><span class="line">     |   |--GPU4      </span><br><span class="line">     |   |--GPU5      </span><br><span class="line">     |--PCIe Switch   </span><br><span class="line">         |--GPU6      </span><br><span class="line">         
|--GPU7      </span><br></pre></td></tr></table></figure><p>The results below were measured on the root rank, and each experiment was repeated 5 times. The environment variable <code>CUDA_VISIBLE_DEVICES</code> was set to reorder the GPUs bound to MPI ranks. CPU binding was left unset.</p><p>The notations for communicators mean,</p><ul><li><code>0/1</code>: Only one communicator performs all-reduce, on physical GPUs 0 and 1.</li><li><code>0/1 + 2/3</code>: Two communicators work at the same time, and each of them performs all-reduce on two GPUs independently.</li><li><code>0-7</code>: Equivalent to <code>0/1/2/.../6/7</code>.</li></ul><figure><img data-src="/images/pasted-95.png" alt="upload successful" /></figure><p>From the results above, we can conclude that,</p><ul><li>The GPUs run in PCIe Gen3 x8 mode because each PCIe switch splits one PCIe x16 slot into two x8 slots.<ul><li>Double-checked with <code>nvidia-smi --query-gpu=pcie.link.gen.current --format=csv</code> and <code>sudo lspci -vvv</code></li></ul></li><li>The GPU topology significantly affects the performance of all-reduce.<ul><li>The topology that NVIDIA DGX adopts should therefore accelerate collective communication operations.</li></ul></li><li>The interference between two concurrent communicators is barely noticeable.</li><li>The UPI bus is not a bottleneck when two PCIe Gen3 x16 devices (the PCIe switches) transmit large chunks of data over it.</li></ul><h2 id="reference">Reference</h2><ul><li><a href="https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/mpi.html#inter-gpu-communication-with-cuda-aware-mpi">NCCL and MPI — NCCL 2.11.4 documentation (nvidia.com)</a></li><li><a href="https://developer.nvidia.com/blog/fast-multi-gpu-collectives-nccl/">Fast Multi-GPU collectives with NCCL</a></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;MPI allows us to create a new communicator by splitting an existing one into sub-communicators, which lets a program dynamically select a subset of computing nodes to take part in collective communication operations such as all-reduce and all-gather. NCCL has a similar feature, but it is not well documented yet.&lt;/p&gt;</summary>
    
    
    
    
  </entry>
  
  <entry>
    <title>How to Abuse Docker Containers as Virtual Machines</title>
    <link href="https://blog.mylab.cc/2021/09/06/%E6%8A%8ADocker%E5%AE%B9%E5%99%A8%E5%BD%93%E8%99%9A%E6%8B%9F%E6%9C%BA%E7%94%A8/"/>
    <id>https://blog.mylab.cc/2021/09/06/%E6%8A%8ADocker%E5%AE%B9%E5%99%A8%E5%BD%93%E8%99%9A%E6%8B%9F%E6%9C%BA%E7%94%A8/</id>
    <published>2021-09-06T07:01:43.000Z</published>
    <updated>2021-10-14T16:04:39.000Z</updated>
    
    <content type="html"><![CDATA[<p>Using Docker as a virtual machine is really inelegant, and the resulting images are crude and dirty, but it is really convenient.</p><h2 id="注意事项">Caveats</h2><h3 id="best-practice">Best Practice</h3><p>Ideally, a Docker image should be built from a Dockerfile. Treating a Docker container as a virtual machine in order to build an image saves a lot of effort, but it easily produces a huge, far-from-lightweight image. It is therefore recommended only for testing, not for serious use (such as an enterprise production environment).</p><h3 id="权限问题">Permissions</h3><p>Installing and using Docker (creating and destroying containers, etc.) requires superuser privileges. If you are not the system administrator, make sure that Docker is already installed and that you are allowed to use it (that is, you have been added to the <code>docker</code> user group).</p><p>Note: Docker also has a rootless mode, but it requires extra configuration.</p><h2 id="基础知识">Basics</h2><h3 id="容器与镜像">Containers and Images</h3><p>An image can be seen as all of a container's file data at a certain moment, including the runtime environment, programs, temporary files, and so on. A container, on the other hand, is what actually spawns processes and runs programs. So an image is static, while a container is dynamic. Their lifecycle and the conversions between them are as follows,</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">      |---------push--&gt; (Docker Hub)</span><br><span class="line">      |---------save--&gt; (tar File)</span><br><span class="line">      |</span><br><span class="line">Docker Image----run---&gt; Docker Container    </span><br><span class="line">      ↑                       |</span><br><span class="line">      |---------commit--------|</span><br><span class="line">      |---------pull----(Docker Hub)</span><br><span class="line">      |---------load----(tar File)</span><br><span class="line">      |---------build---(Dockerfile)</span><br></pre></td></tr></table></figure><h3 id="命名">Naming</h3><p>There are few rules for container names. An image name has the form <code>image_name:tag</code>; for example, in <code>ubuntu:18.04</code> the image name is <code>ubuntu</code> and the tag is <code>18.04</code>.</p><h3 id="其他实用命令">Other Useful Commands</h3><ul><li><code>docker ps</code> lists running containers<ul><li><code>docker ps -a</code> lists all containers (including stopped ones)</li></ul></li><li><code>docker rm -f</code> removes a container</li><li><code>docker images</code> lists downloaded images</li><li><code>docker pull</code> downloads an image</li><li><code>docker rmi</code> removes an image</li></ul><h3 id="dockerhub下载加速">Speeding Up Docker Hub Downloads</h3><p>I have not found a good way to speed this up yet.</p><h2 id="创建容器">Creating a Container</h2><p>Recommended command:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker run -id --name your_ct_name --privileged --network host --restart always ubuntu:18.04 bash</span><br></pre></td></tr></table></figure><blockquote><p>Meaning of the options:</p><ul><li>The combination of <code>-d</code> + <code>-i</code> + <code>bash</code> starts <code>bash</code> inside the container so that the container keeps running in the background</li><li><code>--restart always</code> automatically starts the container after the host reboots and keeps it in the background</li><li><code>ubuntu:18.04</code> the Ubuntu 18.04 image is recommended</li><li><code>--privileged</code> allows the container to use more kernel features</li><li><code>--network host</code> uses the host network (disables network namespace isolation)</li></ul></blockquote><p>You can also use NVIDIA's CUDA development environment image:</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker run -id --name your_ct_name --privileged --network host --restart always --gpus all nvidia/cuda:11.0.3-devel-ubuntu18.04 bash</span><br></pre></td></tr></table></figure><blockquote><p>Meaning of the options:</p><ul><li><code>nvidia/cuda:11.0.3-devel-ubuntu18.04</code> is an Ubuntu 18.04 image that includes the CUDA 11.0.3 toolchain<ul><li>The CUDA version of the image must match what the driver supports; the top-right corner of the <code>nvidia-smi</code> output shows the highest supported CUDA version</li><li>The <code>devel</code> images include the toolchain, such as the <code>nvcc</code> compiler, while the <code>runtime</code> images do not</li></ul></li><li><code>--gpus all</code> uses all available GPUs</li></ul></blockquote><p>List of all NVIDIA CUDA images: <a href="https://hub.docker.com/r/nvidia/cuda/tags?page=1&amp;ordering=last_updated">https://hub.docker.com/r/nvidia/cuda/tags?page=1&amp;ordering=last_updated</a></p><h2 id="进入容器交互式bash">Entering a Container (Interactive bash)</h2><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker <span class="built_in">exec</span> -it your_ct_name bash</span><br></pre></td></tr></table></figure><blockquote><p>Meaning of the options:</p><ul><li><code>-i</code> + <code>-t</code> starts interactive mode</li></ul></blockquote><blockquote><p>If you want to switch the apt mirror after entering the container, it is recommended to use the command below, because images usually do not include a text editor in order to save space</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">sed -i <span class="string">&quot;s#archive.ubuntu.com#mirrors.sustech.edu.cn#g&quot;</span> /etc/apt/sources.list</span><br></pre></td></tr></table></figure></blockquote><h2 id="容器镜像文件的转换">Converting Between Containers, Images, and Files</h2><p>Container to image</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker commit your_ct_name your_images_name:your_tag</span><br></pre></td></tr></table></figure><p>Image to file</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker save your_images_name:your_tag -o your_file.tar</span><br></pre></td></tr></table></figure><p>File to image</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker load -i your_file.tar</span><br></pre></td></tr></table></figure>]]></content>
    
    
    <summary type="html">&lt;p&gt;Using Docker as a virtual machine is really inelegant, and the resulting images are crude and dirty, but it is really convenient.&lt;/p&gt;</summary>
    
    
    
    
  </entry>
  
  <entry>
    <title>A Guide to CDN Selection and Advanced Tricks for Personal Blogs</title>
    <link href="https://blog.mylab.cc/2021/08/12/%E4%B8%AA%E4%BA%BA%E5%8D%9A%E5%AE%A2CDN%E9%80%89%E5%9E%8B%E5%92%8C%E8%BF%9B%E9%98%B6%E7%8E%A9%E6%B3%95%E6%8C%87%E5%8C%97/"/>
    <id>https://blog.mylab.cc/2021/08/12/%E4%B8%AA%E4%BA%BA%E5%8D%9A%E5%AE%A2CDN%E9%80%89%E5%9E%8B%E5%92%8C%E8%BF%9B%E9%98%B6%E7%8E%A9%E6%B3%95%E6%8C%87%E5%8C%97/</id>
    <published>2021-08-11T18:49:03.000Z</published>
    <updated>2021-08-11T18:50:44.000Z</updated>
    
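    <content type="html"><![CDATA[<p>Almost every article on the Internet talks you straight into adopting a CDN, but is a CDN really the optimal way to speed up a site? After nearly two years of detours with CDNs, trying a round of CDNs that do not require ICP filing and stepping into all sorts of pitfalls, I finally saw the light, so here I am, shooting my mouth off in this post. It mainly covers how to use a CDN properly, plus a website acceleration setup that is cheap, effective, and excellent value for money.</p><h2 id="tldr">TL;DR</h2><p>Under the following conditions, you are better off moving your site to a Tencent Cloud Hong Kong lightweight server at 24 RMB per month than using a CDN,</p><ul><li>The origin fetch path is terrible</li><li>Visitors from outside mainland China are not a concern</li><li>The CDN itself is not very good</li><li>Tencent Cloud Hong Kong has not been ruined yet</li></ul><p>The most cost-effective setup I know of is Azure CDN (Microsoft Standard) + GitHub Pages. However, the website itself is what affects access speed the most.</p><p>Note: this post assumes the site owner cannot be bothered to complete an ICP filing. With a filing, a mainland China origin plus a mainland China CDN would undoubtedly crush the setup above.</p><h2 id="上cdn一定提速吗">Does a CDN Always Make a Site Faster?</h2><p>Some may ask: "Is this question even worth asking?" But think about it carefully. The nearest edge nodes of a filing-free CDN are in Hong Kong, and since an edge node is a shared bus that also serves other customers, its network route is not necessarily better than Tencent Cloud's. Worse still, a personal blog is more or less synonymous with low traffic, so it is perfectly normal for your site to be evicted from the cache, and a cache miss makes the latency blow up even further. The most likely outcome of blindly adopting a CDN is that it loses to the 24 RMB Tencent Cloud Hong Kong server, which has direct connections to all three major Chinese carriers.</p><p>So how do we make the CDN faster? By treating each symptom where it hurts, of course,</p><ul><li>Origin fetch: just get a better origin. Of course, the CDN's own caching policy and capability also matter a lot</li><li>Edge node capacity: just get a better CDN</li></ul><p>So the problem reduces to choosing the origin and choosing the CDN.</p><h2 id="源站的选择">Choosing the Origin</h2><p>Getting a better origin is easier said than done; choosing one actually takes some care.</p><h3 id="源站的类型">Types of Origin</h3><p>For a static site (such as one built with Hexo), the origin can be hosted on a VPS, but I would rather recommend putting it on object storage, because</p><ul><li>Higher SLA: a poorly maintained VPS of your own crashing again and again is all too common, while object storage SLAs are routinely 99.9% or higher</li><li>(Possibly) better value for money: a personal blog usually does not take much space unless it stores a pile of videos. Performance may also benefit from the big vendors' mysterious optimizations<ul><li>Azure Blob Storage: measured at roughly 0.1-0.2 USD per day, a bit pricey</li><li>AWS S3 Bucket: the free tier should cover it, and the pricing is also cheaper than Azure</li></ul></li></ul><p>The main drawbacks of object storage are</p><ul><li>The initial setup is fairly complex; enterprise-grade clouds usually require repeated fiddling with IAM permissions</li><li>To nudge you into buying their CDN, object storage tends to have somewhat flawed support for custom domains and HTTPS</li><li>There are pitfalls (fewer if you stick to one vendor's full suite)<ul><li>Azure's official documentation will not tell you that the <code>$</code> in the <code>$web</code> folder (container) you upload to must be escaped on Linux</li><li>AWS S3 buckets are not compatible with GeoDNS</li><li>...</li></ul></li></ul><p>Once it is working, though, syncing data to object storage is as simple as using a cloud drive (because that is exactly what it is).</p><h3 id="地理位置选择">Choosing the Location</h3><p>As mentioned earlier, a personal blog behind a CDN must always be prepared for a cache miss and an origin fetch, so shortening the time an edge node takes to download data from the origin is critical. The best approach is probably to shorten the geographic distance between the edge node and the origin, ideally placing them in the same region, because</p><ul><li>Lower latency: not much to say here. However fast light is, a round trip between China and the US takes at least 140 ms even in the ideal case</li><li>Higher bandwidth: generally, bandwidth within a metropolitan area network is far greater than that of international links</li></ul><p>Another benefit is that, since only intra-region communication matters, the origin's international link quality is irrelevant. There is no need for anything like CN2 GIA; the origin just needs to be online.</p><p>Since one origin can only take care of one region, a single Hong Kong origin should be enough if you only care about visitors from mainland China. But if you want to</p><ul><li>Take care of people all over the world</li><li>Take care of yourself when browsing through a US proxy</li><li>Boost your PageSpeed score for SEO (Google presumably visits your site from the US)</li></ul><p>then you may need more than one origin.</p><h3 id="多地区延迟优化">Multi-Region Latency Optimization</h3><p>Multi-region optimization is one of the watersheds between toy-grade and enterprise-grade solutions. Why? Judging from the pricing of the related services, cloud vendors have basically never cared whether individual users live or die. For a CDN, it means configuring multiple origins (both to reduce latency and for disaster recovery) so that the CDN can pick the nearest origin based on the visitor's location. There are roughly two ways to implement this,</p><ul><li>The CDN itself supports multiple origins and can select the best one<ul><li>Azure Front Door directly supports multiple backends and can automatically choose a backend by latency</li><li>The Rule Engine of Azure CDN (Standard Microsoft) can assign an origin to each region</li></ul></li><li>(GeoDNS) Let the DNS resolve a domain name to different backends by geographic location, and have the CDN fetch from the origin through that domain name. DNS services that support this include<ul><li>Azure Traffic Manager: about 4 HKD per additional origin and 30 HKD per million queries</li><li>AWS Route 53: I stopped looking after seeing tens of USD per month for one policy record</li><li>DNSPod (Tencent Cloud): 360 RMB per year, no thanks</li><li>Alibaba Cloud: the free tier can distinguish between mainland China and overseas (but Hong Kong counts as overseas too, so that is no distinction at all); starting from the enterprise tier at 198 RMB per year, overseas countries and regions can be told apart</li></ul></li></ul><h3 id="我都要但我没钱咋办">I Want It All! But What If I Am Broke?</h3><p>Multiple origins and GeoDNS both burn money (enterprises have deep pockets and do not care), so what can ordinary people do? This is where the invincible GitHub Pages comes in. Besides storing your files for free, it also comes with the Fastly CDN. We do not know the architecture of GitHub Pages, but judging from speed tests, probes in many locations report very low latency, so the data is presumably geo-replicated. In other words, there is no DNS to buy and multi-region storage costs nothing. The only flaw is that access from mainland China is hit-or-miss, but for a CDN origin that is no big deal. Moreover, static blog generators such as Hexo even have plugins that sync the blog to GitHub Pages with one command.</p><h2 id="cdn的选择">Choosing the CDN</h2><h3 id="国内访问速度">Access Speed from Mainland China</h3><p>Based on my observations over the past two years, I <strong>subjectively</strong> rank the CDNs I have used into several tiers by access speed from mainland China.</p><ul><li>T0: on par with Tencent Cloud Hong Kong<ul><li>Azure CDN: for whatever reason, its Hong Kong nodes are both stable and fast</li></ul></li><li>T1: may not beat Tencent Cloud, but may outrun a US VPS on CN2<ul><li>AWS CloudFront</li><li>UDomain</li><li>CloudCone</li><li>All three have Hong Kong nodes, but their performance is inconsistent</li></ul></li><li>T1.5: may outrun a US VPS on a garbage route<ul><li>Cloudflare: it is free, so what more could you ask for? The main issue is that the free plan does not provide Hong Kong nodes, but the US nodes do not perform badly</li></ul></li></ul><p>As for access speed from outside China, they are probably all about the same.</p><h3 id="定价">Pricing</h3><ul><li>T0: just for window-shopping<ul><li>Azure Front Door: backed by Azure CDN (Standard Microsoft); a single rule costs a mere 170 HKD per month (there is always at least one rule, no way around it)</li></ul></li><li>T0.9: barely acceptable<ul><li>AWS Lightsail Distribution: backed by CloudFront; 5 USD for 50 GB, which sadly I cannot use up</li></ul></li><li>T1: friendly to the poor<ul><li>Azure CDN (Standard): truly pay-as-you-go, 1 HKD per GB, 5 free rules; one of the few inexpensive things on Azure</li><li>CloudFront: pay-as-you-go, 0.12 USD per GB</li><li>UDomain: pay-as-you-go, 1.2 HKD per GB; the top-up process is odd and very old-fashioned</li><li>CloudCone: pay-as-you-go, 0.045 USD per GB; seems to require an initial top-up of about 20 USD</li></ul></li><li>T2: charity<ul><li>Cloudflare: plans start at 0</li></ul></li></ul><h3 id="结论">Conclusion</h3><p>The conclusion is quite obvious: my top recommendation is definitely Azure CDN (Standard Microsoft), because Front Door's price is outrageous and the other CDNs make you wonder why you are paying for a decelerator (although Cloudflare paired with a cheap US VPS does save money). Admittedly, Azure is rather aloof. First you need a foreign-currency credit card; then there are all the "because it is enterprise-grade" designs as well as the baffling ones. For example, it is unfriendly to apex domains: CNAME verification keeps failing, the official documentation only tells you to buy Azure's own DNS and never mentions the <code>cdnverify</code> workaround, and it will not automatically issue TLS certificates for apex domains (AWS can). It also cannot serve HTTPS on the CDN frontend with HTTP to the backend (Front Door can), and so on. Still, its CDN is simply fast from mainland China, so I forgive it given the modest price. The 5 free rules are also generous: they can be used to configure forced HTTPS redirects and HSTS, even though other CDNs may let you set these up with one click in their dashboards.</p><h2 id="summary">Summary</h2><p>The Azure CDN (Microsoft Standard) + GitHub Pages setup may be somewhat convoluted, but it costs next to nothing per month (probably under 5 RMB) and is very fast. There is still one question worth asking, though: does this setup make a site unbeatably fast? Not really. Judging by Google PageSpeed scores, switching from a single origin (Singapore) + AWS CDN to this setup raised my PageSpeed score by only about 3 points (for access from outside China). The other 20-plus points came from tuning the site itself, such as reducing the number of external files loaded. I once came across a highly optimized site that got a perfect PageSpeed score over both domestic and overseas networks even though it used Cloudflare (tested with Chrome Lighthouse).</p><p>Only at this point did I realize that the greatest value of this setup is that it saves me a bit of money, being considerably cheaper than Tencent Cloud, while also improving access speed from mainland China along the way.</p>]]></content>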
    
    
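    <summary type="html">&lt;p&gt;Almost every article on the Internet talks you straight into adopting a CDN, but is a CDN really the optimal way to speed up a site? After nearly two years of detours with CDNs, trying a round of CDNs that do not require ICP filing and stepping into all sorts of pitfalls, I finally saw the light, so here I am, shooting my mouth off in this post. It mainly covers how to use a CDN properly, plus a website acceleration setup that is cheap, effective, and excellent value for money.&lt;/p&gt;</summary>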
    
    
    
    
  </entry>
  
  <entry>
    <title>Design of TensorFlow XLA Sharding System</title>
    <link href="https://blog.mylab.cc/2021/08/04/Design-of-TensorFlow-XLA-Sharding-System/"/>
    <id>https://blog.mylab.cc/2021/08/04/Design-of-TensorFlow-XLA-Sharding-System/</id>
    <published>2021-08-04T13:47:59.000Z</published>
    <updated>2021-08-05T01:38:56.000Z</updated>
    
    <content type="html"><![CDATA[<p>Recently, a state-of-the-art sharding approach, GSPMD/GShard, was proposed. It provides an intuitive interface for partitioning a large array along arbitrary dimensions, and uses sharding propagation algorithms to automatically infer the partitioning strategy for tensors without user-specified sharding specifications. This document introduces the design and implementation of the XLA sharding system.</p><figure><img data-src="/images/pasted-92.png" alt="upload successful" /></figure><h2 id="hlosharding-object"><code>HloSharding</code> Object</h2><p>First of all, <strong>we need a way to represent sharding specifications</strong> in a programming language. XLA defines a class for this purpose, which holds a number of member variables and a set of supporting functions for configuring them. Some attributes of <code>HloSharding</code> are listed below.</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span 
class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment">// File: tensorflow/compiler/xla/service/hlo_sharding.h</span></span><br><span class="line"></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">HloSharding</span> &#123;</span></span><br><span class="line">  <span class="keyword">bool</span> replicated_;</span><br><span class="line">  <span class="keyword">bool</span> maximal_;</span><br><span class="line">  <span class="keyword">bool</span> tuple_;</span><br><span class="line">  <span class="keyword">bool</span> manual_;</span><br><span class="line">  <span class="comment">// This field is only used if replicated_ is false. If maximal_ is true, then</span></span><br><span class="line">  <span class="comment">// the field contains a rank 1 array with a single element, which is the</span></span><br><span class="line">  <span class="comment">// device the HLO is assigned to. If maximal_ is false, the field contains an</span></span><br><span class="line">  <span class="comment">// array with the same rank as the corresponding HLO. The dimension sizes of</span></span><br><span class="line">  <span class="comment">// the array describe the number of ways the HLO is partitioned along each</span></span><br><span class="line">  <span class="comment">// dimension. The values of the array specify which device each tile of</span></span><br><span class="line">  <span class="comment">// the HLO is assigned to. 
The index of each value determines which tile it</span></span><br><span class="line">  <span class="comment">// takes.</span></span><br><span class="line">  <span class="comment">// For example, &#123;&#123;&#123;2, 3&#125;&#125;, &#123;&#123;5, 7&#125;&#125;&#125; (whose ToString representation is</span></span><br><span class="line">  <span class="comment">// &quot;&#123;devices=[2,1,2]2,3,5,7&#125;&quot;), means that dimension 1 is split two way and</span></span><br><span class="line">  <span class="comment">// dimension 3 is split 2 way. Core 5, whose index is [2,1,1] will take the</span></span><br><span class="line">  <span class="comment">// tile that contains the 2nd half of dimension 1 and the 1st half of</span></span><br><span class="line">  <span class="comment">// dimension 3.</span></span><br><span class="line">  Array&lt;int64&gt; tile_assignment_;</span><br><span class="line">  <span class="comment">// Only non-empty when tuple_ is true. If a tuple is empty then one entry is</span></span><br><span class="line">  <span class="comment">// present for the root. This is a flattened list of all the leaf shardings in</span></span><br><span class="line">  <span class="comment">// a tuple shape, by pre-order walk (ShapeTree iterator order).</span></span><br><span class="line">  std::vector&lt;HloSharding&gt; tuple_elements_;</span><br><span class="line">  <span class="comment">// This flag is to support partial replication and partial sharding. 
If it is</span></span><br><span class="line">  <span class="comment">// true, tile_assignment_ will have an extra dimension in addition to the data</span></span><br><span class="line">  <span class="comment">// shape rank, and the added last dimension represents the subgroups of</span></span><br><span class="line">  <span class="comment">// replications, i.e., elements in slice [..., :] will be replicated.</span></span><br><span class="line">  <span class="keyword">bool</span> replicate_on_last_tile_dim_;</span><br><span class="line">  <span class="comment">// This field is used to track the source of this sharding, usually derived</span></span><br><span class="line">  <span class="comment">// from instructions. Multiple metadata may be populated if sharding is</span></span><br><span class="line">  <span class="comment">// combined with other shardings. Metadata are to not be populated when</span></span><br><span class="line">  <span class="comment">// tuple_ == true and instead metadata should be set on individual tuple</span></span><br><span class="line">  <span class="comment">// elements.</span></span><br><span class="line">  std::vector&lt;OpMetadata&gt; metadata_;</span><br><span class="line">&#125;;</span><br></pre></td></tr></table></figure><p><code>Array&lt;int64&gt; tile_assignment_</code> is a multi-dimensional array of arbitrary shape. <code>&#123;devices=[2,1,2]2,3,5,7&#125;</code> means the shape of <code>tile_assignment_</code> is <code>[2,1,2]</code>, while the values are <code>&#123;2,3,5,7&#125;</code>.</p><p><code>std::vector&lt;HloSharding&gt; tuple_elements_</code> was probably designed to hold the sharding specifications of outputs.</p><p><em>I am not sure what the roles of <code>maximal_</code> and <code>tuple_elements_</code> are. Does anybody know?</em></p><p>Note that a single <code>HloSharding</code> object can be shared by multiple instructions. 
By doing this, the cost of creating and maintaining several instances with the exact same contents could be eliminated.</p><h2 id="extended-hlo-ir-attribute">Extended HLO IR Attribute</h2><p>The original implementation of XLA added the attribute <code>std::shared_ptr&lt;const HloSharding&gt; sharding_</code> to the class <code>xla::HloInstruction</code>, which is declared in <code>tensorflow/compiler/xla/service/hlo_instruction.h</code>. A common usage of this HLO Instruction Attribute is to <strong>declare sharded tensors</strong>. Here is a sample HLO IR code with sharding attributes. Note that the Propagation Algorithm may fill in this attribute for those instructions without it.</p><figure class="highlight rust"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line">primitive_computation_add.<span class="number">6</span> &#123;</span><br><span class="line">  parameter.<span class="number">7</span> = <span class="built_in">f32</span>[] parameter(<span class="number">0</span>)</span><br><span class="line">  parameter.<span class="number">8</span> = <span class="built_in">f32</span>[] parameter(<span class="number">1</span>)</span><br><span class="line">  ROOT add.<span class="number">9</span> = <span class="built_in">f32</span>[] add(parameter.<span class="number">7</span>, parameter.<span class="number">8</span>)</span><br><span class="line">&#125;</span><br><span class="line"></span><br><span class="line">ENTRY xmap__lambda_.<span class="number">12</span> 
&#123;</span><br><span class="line">  constant.<span class="number">2</span> = pred[] constant(<span class="literal">false</span>)</span><br><span class="line">  parameter.<span class="number">1</span> = <span class="built_in">f32</span>[<span class="number">8</span>]&#123;<span class="number">0</span>&#125; parameter(<span class="number">0</span>), parameter_replication=&#123;<span class="literal">false</span>&#125;, sharding=&#123;replicated&#125;</span><br><span class="line">  custom-call.<span class="number">3</span> = <span class="built_in">f32</span>[<span class="number">8</span>]&#123;<span class="number">0</span>&#125; custom-call(parameter.<span class="number">1</span>), custom_call_target=<span class="string">&quot;Sharding&quot;</span>, sharding=&#123;devices=[<span class="number">4</span>]<span class="number">0</span>,<span class="number">1</span>,<span class="number">2</span>,<span class="number">3</span>&#125;</span><br><span class="line">  sine.<span class="number">4</span> = <span class="built_in">f32</span>[<span class="number">8</span>]&#123;<span class="number">0</span>&#125; sine(custom-call.<span class="number">3</span>)</span><br><span class="line">  constant.<span class="number">5</span> = <span class="built_in">f32</span>[] constant(<span class="number">0</span>)</span><br><span class="line">  reduce.<span class="number">10</span> = <span class="built_in">f32</span>[] reduce(sine.<span class="number">4</span>, constant.<span class="number">5</span>), dimensions=&#123;<span class="number">0</span>&#125;, to_apply=primitive_computation_add.<span class="number">6</span></span><br><span class="line">  ROOT tuple.<span class="number">11</span> = (<span class="built_in">f32</span>[]) tuple(reduce.<span class="number">10</span>), sharding=&#123;&#123;replicated&#125;&#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Note: this HLO IR code is compiled from this JAX Frontend code</p><figure class="highlight 
python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line"><span class="meta">@jtu.with_mesh(<span class="params">[(<span class="params"><span class="string">&#x27;x&#x27;</span>, <span class="number">4</span></span>)]</span>)</span></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">test</span>():</span></span><br><span class="line">    f = pjit(<span class="keyword">lambda</span> x: jnp.sin(x).<span class="built_in">sum</span>(),</span><br><span class="line">             in_axis_resources=(P(<span class="string">&#x27;x&#x27;</span>),),</span><br><span class="line">             out_axis_resources=<span class="literal">None</span>)</span><br><span class="line">    x = jnp.arange(<span class="number">8</span>, dtype=jnp.float32)</span><br><span class="line">    f(x)</span><br></pre></td></tr></table></figure><p>This example shows a lambda function that takes a replicated tensor as input, splits it by invoking <code>custom-call</code>, and then performs the calculation.</p><h2 id="spmd-partitioner">SPMD Partitioner</h2><p>You might notice that, in the previous example, the instructions that invoke operators (e.g. <code>reduce.10</code>) do not carry sharding attributes. That leads to a critical question: <strong>how does a regular operator react to sharded tensors</strong>? 
XLA's solution is the SPMD Partitioner. It is mainly responsible for converting a full-sized operator into a partition-sized operator by adding the necessary collective communication primitives to the lower-level IR code. The partitioner also converts the inputs of operators from global tensor symbols with sharding to local tensor symbols without sharding specifications.</p><p>We can find some clues in <code>tensorflow/compiler/xla/service/spmd/spmd_partitioner_test.cc</code>.</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">TEST_F</span>(SpmdPartitioningTest, DotPartialContracting2) &#123;</span><br><span class="line">  absl::string_view hlo_string = <span class="string">R&quot;(</span></span><br><span class="line"><span class="string">HloModule module</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">ENTRY entry &#123;</span></span><br><span class="line"><span class="string">  %lhs = f32[24,100] 
parameter(0),</span></span><br><span class="line"><span class="string">    sharding=&#123;devices=[1,2,2]0,1,2,3 last_tile_dim_replicate&#125;</span></span><br><span class="line"><span class="string">  %rhs = f32[32,100] parameter(1),</span></span><br><span class="line"><span class="string">    sharding=&#123;devices=[1,2,2]0,1,2,3 last_tile_dim_replicate&#125;</span></span><br><span class="line"><span class="string">  ROOT %dot = f32[24,32] dot(%lhs, %rhs),</span></span><br><span class="line"><span class="string">    lhs_batch_dims=&#123;&#125;, rhs_batch_dims=&#123;&#125;,</span></span><br><span class="line"><span class="string">    lhs_contracting_dims=&#123;1&#125;, rhs_contracting_dims=&#123;1&#125;,</span></span><br><span class="line"><span class="string">    sharding=&#123;devices=[2,1,2]0,2,1,3 last_tile_dim_replicate&#125;</span></span><br><span class="line"><span class="string">&#125;)&quot;</span>;</span><br><span class="line"></span><br><span class="line">  <span class="built_in">TF_ASSERT_OK_AND_ASSIGN</span>(<span class="keyword">auto</span> <span class="keyword">module</span>,</span><br><span class="line">                          <span class="built_in">PartitionComputation</span>(hlo_string, <span class="comment">/*num_devices=*/</span><span class="number">4</span>));</span><br><span class="line">  <span class="built_in">VLOG</span>(<span class="number">1</span>) &lt;&lt; <span class="keyword">module</span>-&gt;<span class="built_in">ToString</span>();</span><br><span class="line"></span><br><span class="line">  <span class="keyword">auto</span> lhs = <span class="built_in">AllOf</span>(op::<span class="built_in">Shape</span>(<span class="string">&quot;f32[24,50]&quot;</span>), op::<span class="built_in">Parameter</span>(<span class="number">0</span>));</span><br><span class="line">  <span class="keyword">auto</span> rhs = <span class="built_in">AllOf</span>(op::<span class="built_in">Shape</span>(<span class="string">&quot;f32[32,50]&quot;</span>), 
op::<span class="built_in">Parameter</span>(<span class="number">1</span>));</span><br><span class="line">  <span class="keyword">auto</span> dot =</span><br><span class="line">      <span class="built_in">AllOf</span>(op::<span class="built_in">Shape</span>(<span class="string">&quot;f32[12,32]&quot;</span>),</span><br><span class="line">            op::<span class="built_in">Dot</span>(<span class="built_in">AllOf</span>(op::<span class="built_in">Shape</span>(<span class="string">&quot;f32[12,50]&quot;</span>), op::<span class="built_in">DynamicSlice</span>(lhs, _, _)),</span><br><span class="line">                    rhs));</span><br><span class="line">  <span class="keyword">auto</span> root = <span class="keyword">module</span>-&gt;<span class="built_in">entry_computation</span>()-&gt;<span class="built_in">root_instruction</span>();</span><br><span class="line">  <span class="built_in">EXPECT_THAT</span>(root, op::<span class="built_in">AllReduce</span>(dot));</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>The two inputs, <code>lhs</code> and <code>rhs</code>, are tensors partitioned in the way that the figure describes. After the computation is partitioned, <code>lhs</code> is unwrapped and its shape changes from <code>f32[24,100]</code> to <code>f32[24,50]</code>. An <code>AllReduce</code> is added as the root of the computation to collect the partial results.</p><figure><img data-src="/images/pasted-93.png" alt="upload successful" /><figcaption>upload successful</figcaption></figure><h2 id="sharding-propagation-algorithm">Sharding Propagation Algorithm</h2><p>The system should be able to figure out optimal sharding specifications for the remaining tensors without user annotations. 
An ideal partitioning plan can reduce the communication amount, reduce memory footprint, and improve the performance.</p><figure><img data-src="/images/pasted-94.png" alt="upload successful" /><figcaption>upload successful</figcaption></figure><p>Some unit tests written in <code>tensorflow/compiler/xla/service/sharding_propagation_test.cc</code> are intuitive examples.</p><figure class="highlight c++"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br></pre></td><td class="code"><pre><span class="line"><span class="built_in">TEST_P</span>(ParameterizedMetadataTest, BroadcastForwardPass) &#123;</span><br><span class="line">  <span class="keyword">const</span> <span class="keyword">char</span>* <span class="keyword">const</span> hlo_string = <span class="string">R&quot;(</span></span><br><span class="line"><span class="string">HloModule module</span></span><br><span class="line"><span class="string">ENTRY %broadcast &#123;</span></span><br><span class="line"><span class="string">  %param0 = f32[3,2048,2048]&#123;2,1,0&#125; parameter(0),</span></span><br><span class="line"><span 
class="string">    sharding=&#123;devices=[1,2,2]0,1,2,3 metadata=&#123;op_name=&quot;a&quot;&#125;&#125;</span></span><br><span class="line"><span class="string">  %broadcast = f32[3,2048,2048,3]&#123;3,2,1,0&#125; broadcast(%param0), dimensions=&#123;0,1,2&#125;</span></span><br><span class="line"><span class="string">  ROOT %copy = f32[3,2048,2048,3]&#123;3,2,1,0&#125; copy(%broadcast)</span></span><br><span class="line"><span class="string">&#125;)&quot;</span>;</span><br><span class="line">  <span class="built_in">TF_ASSERT_OK_AND_ASSIGN</span>(<span class="keyword">auto</span> <span class="keyword">module</span>,</span><br><span class="line">                          <span class="built_in">ParseAndReturnVerifiedModule</span>(hlo_string));</span><br><span class="line">  <span class="keyword">if</span> (<span class="built_in">GetParam</span>().clear_metadata) &#123;</span><br><span class="line">    <span class="built_in">ClearMetadata</span>(<span class="keyword">module</span>.<span class="built_in">get</span>());</span><br><span class="line">  &#125;</span><br><span class="line">  <span class="built_in">TF_ASSERT_OK_AND_ASSIGN</span>(</span><br><span class="line">      <span class="keyword">bool</span> changed,</span><br><span class="line">      <span class="built_in">ShardingPropagation</span>(<span class="comment">/*is_spmd=*/</span><span class="literal">false</span>, <span class="built_in">GetParam</span>().propagate_metadata)</span><br><span class="line">          .<span class="built_in">Run</span>(<span class="keyword">module</span>.<span class="built_in">get</span>()));</span><br><span class="line">  <span class="built_in">EXPECT_TRUE</span>(changed);</span><br><span class="line">  <span class="keyword">auto</span>* instruction = <span class="built_in">FindInstruction</span>(<span class="keyword">module</span>.<span class="built_in">get</span>(), <span class="string">&quot;broadcast&quot;</span>);</span><br><span class="line">  <span 
class="built_in">ASSERT_NE</span>(instruction, <span class="literal">nullptr</span>);</span><br><span class="line">  <span class="built_in">EXPECT_THAT</span>(instruction, op::<span class="built_in">Sharding</span>(<span class="string">&quot;&#123;devices=[1,2,2,1]0,1,2,3&#125;&quot;</span>));</span><br><span class="line">  <span class="keyword">if</span> (<span class="built_in">GetParam</span>().propagate_metadata &amp;&amp; !<span class="built_in">GetParam</span>().clear_metadata) &#123;</span><br><span class="line">    <span class="built_in">EXPECT_THAT</span>(instruction-&gt;<span class="built_in">sharding</span>(),</span><br><span class="line">                <span class="built_in">ShardingMetadata</span>(&#123;<span class="built_in">CreateMetadata</span>(<span class="string">&quot;a&quot;</span>)&#125;));</span><br><span class="line">  &#125; <span class="keyword">else</span> &#123;</span><br><span class="line">    <span class="built_in">EXPECT_THAT</span>(instruction-&gt;<span class="built_in">sharding</span>(), <span class="built_in">ShardingMetadata</span>(&#123;&#125;));</span><br><span class="line">  &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>It clearly shows that the system infers the sharding specification of <code>broadcast</code> to be <code>&#123;devices=[1,2,2,1]0,1,2,3&#125;</code> from its input, which carries the attribute <code>&#123;devices=[1,2,2]0,1,2,3&#125;</code>. Note that this test is called <code>BroadcastForwardPass</code>; there is also a test named <code>BroadcastBackwardPass</code>, which is to say that propagation runs in both directions.</p><h2 id="reference">Reference</h2><ul><li><p>GShard: https://arxiv.org/abs/2006.16668</p></li><li><p>GSPMD: https://arxiv.org/abs/2105.04663</p></li><li><p>Julia DistributedArrays.jl: https://juliaparallel.github.io/DistributedArrays.jl/latest/index.html</p></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;GSPMD/GShard, a recently proposed state-of-the-art sharding approach, provides an intuitive interface for partitioning a large array along arbitrary dimensions, and uses sharding propagation algorithms to automatically infer the partitioning strategy for tensors without user-specified sharding specifications. This document introduces the design and implementation of the XLA sharding system.&lt;/p&gt;</summary>
    
    
    
    
  </entry>
  
  <entry>
    <title>Easy way to debug TensorFlow XLA Compiler using VSCode</title>
    <link href="https://blog.mylab.cc/2021/08/04/Easy-way-to-debug-TensorFlow-XLA-Compiler-using-VSCode/"/>
    <id>https://blog.mylab.cc/2021/08/04/Easy-way-to-debug-TensorFlow-XLA-Compiler-using-VSCode/</id>
    <published>2021-08-04T13:26:12.000Z</published>
    <updated>2021-08-04T13:29:06.000Z</updated>
    
    <content type="html"><![CDATA[<p>Source code is easier to read when we have runtime information at hand, including call stacks and variable values. This tutorial shows how to use VSCode to trace the XLA compiler.</p><figure><img data-src="/images/pasted-82.png" alt="upload successful" /><figcaption>upload successful</figcaption></figure><h2 id="preparing-environment">Preparing Environment</h2><p>First, we need to download the TensorFlow source code and install all the dependencies. I suggest using Conda to manage the environment and the built-in GCC on Ubuntu 18.04 (or above, maybe) to build the code. Note that building from source requires about 50 GiB of free space.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line"><span class="comment"># Fetch Source Code</span></span><br><span class="line">git <span class="built_in">clone</span> https://github.com/tensorflow/tensorflow.git</span><br><span class="line"><span class="built_in">cd</span> tensorflow</span><br><span class="line"></span><br><span class="line"><span class="comment"># Install dependencies</span></span><br><span class="line">conda create -n tf_dev python numpy wheel -y</span><br><span class="line">conda activate tf_dev</span><br><span class="line">pip install keras_preprocessing</span><br><span class="line">conda install -c conda-forge bazel -y</span><br></pre></td></tr></table></figure><h2 id="compile-the-source-code">Compile the source code</h2><p>First of all, configure the project and build it.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 
class="line">2</span><br></pre></td><td class="code"><pre><span class="line">./configure</span><br><span class="line">bazel build --config=dbg //tensorflow/tools/pip_package:build_pip_package</span><br></pre></td></tr></table></figure><p>During configuration, it is recommended to choose <strong>ALL</strong> the default options unless you must debug on GPU, since enabling GPU support needs additional configuration (please refer to <a href="https://github.com/tensorflow/tensorflow/blob/master/CONTRIBUTING.md">this article</a>) and much more time to compile.</p><p>As for the bazel build flags:</p><ul><li><code>--config=dbg</code> adds debugging symbols. Required.</li><li><code>--config=monolithic</code> should generate the binary code as a single dynamic library, but this option seems to be buggy. Not recommended.</li></ul><p>Compiling TensorFlow is quite time-consuming; it took about 20 minutes using 48 CPU threads on my server. Time for coffee now.</p><h2 id="pick-a-unit-test-to-compile">Pick a unit test to compile</h2><p>In fact, we don't have to write anything in the Python frontend to trigger breakpoints inside the XLA compiler, as there are already tons of unit tests that cover most of the code and demonstrate the capabilities of the compiler.</p><p>Let's pick a simple test first to validate that the code compiles correctly.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">bazel <span class="built_in">test</span> --config=dbg //tensorflow/compiler/xla/tests:tuple_test_cpu</span><br></pre></td></tr></table></figure><p>From the build log, we can see that the executable is located at <code>bazel-bin/tensorflow/compiler/xla/tests/tuple_test_cpu</code>. Execute it! 
If everything works well, the program will print out the message below.</p><figure class="highlight bash"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">[----------] Global <span class="built_in">test</span> environment tear-down</span><br><span class="line">[==========] 25 tests from 2 <span class="built_in">test</span> suites ran. (3618 ms total)</span><br><span class="line">[  PASSED  ] 25 tests.</span><br></pre></td></tr></table></figure><p>Then pick a test you are interested in and repeat the steps above.</p><h2 id="fix-broken-dependency-optional">Fix broken dependency (Optional)</h2><p>Take <code>spmd_partitioner_test</code> as an example. This unit test compiles without any error message, but when you run the executable directly, you will see this message.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">[ RUN      ] SpmdPartitioningTest.BroadcastAsReplicate3</span><br><span class="line">2021-08-04 10:44:13.324501: I tensorflow/compiler/xla/service/platform_util.cc:72] platform Host present but no XLA compiler available: could not find registered compiler for platform Host -- check target linkage (hint: try adding tensorflow/compiler/jit:xla_cpu_jit as a dependency)</span><br><span class="line">[       OK ] SpmdPartitioningTest.BroadcastAsReplicate3 (6 ms)</span><br></pre></td></tr></table></figure><p>This is because the executable is not linked to a valid backend, which means it doesn't contain the code of the JIT execution environment. The solution is to modify the <code>BUILD</code> file manually to fix the dependency, as the message suggests.</p><p>Open the <code>BUILD</code> file in the directory where the unit test is located. 
In this example, the test <code>tensorflow/compiler/xla/service/spmd/spmd_partitioner_test.cc</code> corresponds to <code>tensorflow/compiler/xla/service/spmd/BUILD</code>. And add this dependency <code>//tensorflow/compiler/jit:xla_cpu_jit</code>.</p><figure class="highlight python"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line">tf_cc_test(</span><br><span class="line">    name = <span class="string">&quot;spmd_partitioner_test&quot;</span>,</span><br><span class="line">    srcs = [<span class="string">&quot;spmd_partitioner_test.cc&quot;</span>],</span><br><span class="line">    deps = [</span><br><span class="line">        <span class="string">&quot;:spmd_partitioner&quot;</span>,</span><br><span class="line">        <span class="string">&quot;//tensorflow/compiler/xla:util&quot;</span>,</span><br><span class="line">        <span class="string">&quot;//tensorflow/compiler/xla:xla_data_proto_cc&quot;</span>,</span><br><span class="line">        <span class="string">&quot;//tensorflow/compiler/xla/service:hlo&quot;</span>,</span><br><span class="line">        <span class="string">&quot;//tensorflow/compiler/xla/service:hlo_casting_utils&quot;</span>,</span><br><span class="line">        <span class="string">&quot;//tensorflow/compiler/xla/service:hlo_matchers&quot;</span>,</span><br><span class="line">        <span 
class="string">&quot;//tensorflow/compiler/xla/service:hlo_parser&quot;</span>,</span><br><span class="line">        <span class="string">&quot;//tensorflow/compiler/xla/service:hlo_pass_pipeline&quot;</span>,</span><br><span class="line">        <span class="string">&quot;//tensorflow/compiler/xla/service:hlo_verifier&quot;</span>,</span><br><span class="line">        <span class="string">&quot;//tensorflow/compiler/xla/tests:hlo_test_base&quot;</span>,</span><br><span class="line">        <span class="string">&quot;//tensorflow/compiler/xla/tests:xla_internal_test_main&quot;</span>,</span><br><span class="line">        <span class="string">&quot;//tensorflow/compiler/jit:xla_cpu_jit&quot;</span>,</span><br><span class="line">        <span class="string">&quot;//tensorflow/core:test&quot;</span>,</span><br><span class="line">    ],</span><br><span class="line">)</span><br></pre></td></tr></table></figure><h2 id="configuring-vscode">Configuring VSCode</h2><p>Since the unit test was built as an executable with debugging symbols, there is nothing special about the configuration of VSCode. 
Install the <code>C/C++</code> extension, and write the following lines to <code>.vscode/launch.json</code>.</p><blockquote><p>You can open that JSON file by pressing <code>ctrl/command</code>+<code>shift</code>+<code>p</code>, typing <code>launch.json</code>, and selecting <code>Add Configuration</code> -&gt; <code>C/C++: (gdb) Launch</code></p></blockquote><figure class="highlight json"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line">&#123;</span><br><span class="line">  <span class="attr">&quot;name&quot;</span>: <span class="string">&quot;(gdb) Launch&quot;</span>,</span><br><span class="line">  <span class="attr">&quot;type&quot;</span>: <span class="string">&quot;cppdbg&quot;</span>,</span><br><span class="line">  <span class="attr">&quot;request&quot;</span>: <span class="string">&quot;launch&quot;</span>,</span><br><span class="line">  <span class="attr">&quot;program&quot;</span>: <span class="string">&quot;$&#123;workspaceFolder&#125;/bazel-bin/tensorflow/compiler/xla/service/spmd/spmd_partitioner_test&quot;</span>,</span><br><span class="line">  <span class="attr">&quot;args&quot;</span>: [],</span><br><span class="line">  <span class="attr">&quot;stopAtEntry&quot;</span>: <span class="literal">false</span>,</span><br><span class="line">  <span class="attr">&quot;cwd&quot;</span>: <span 
class="string">&quot;$&#123;workspaceFolder&#125;&quot;</span>,</span><br><span class="line">  <span class="attr">&quot;environment&quot;</span>: [],</span><br><span class="line">  <span class="attr">&quot;externalConsole&quot;</span>: <span class="literal">false</span>,</span><br><span class="line">  <span class="attr">&quot;MIMode&quot;</span>: <span class="string">&quot;gdb&quot;</span>,</span><br><span class="line">  <span class="attr">&quot;setupCommands&quot;</span>: [</span><br><span class="line">    &#123;</span><br><span class="line">      <span class="attr">&quot;description&quot;</span>: <span class="string">&quot;Enable pretty-printing for gdb&quot;</span>,</span><br><span class="line">      <span class="attr">&quot;text&quot;</span>: <span class="string">&quot;-enable-pretty-printing&quot;</span>,</span><br><span class="line">      <span class="attr">&quot;ignoreFailures&quot;</span>: <span class="literal">true</span></span><br><span class="line">    &#125;</span><br><span class="line">  ]</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Everything is all set! Press <code>F5</code> to start debugging.</p><h2 id="reference">Reference</h2><ul><li><a href="https://www.tensorflow.org/install/source#ubuntu">https://www.tensorflow.org/install/source#ubuntu</a></li><li><a href="https://github.com/tensorflow/tensorflow/blob/master/CONTRIBUTING.md">https://github.com/tensorflow/tensorflow/blob/master/CONTRIBUTING.md</a></li></ul>]]></content>
    
    
    <summary type="html">&lt;p&gt;Source code is easier to read when we have runtime information at hand, including call stacks and variable values. This tutorial shows how to use VSCode to trace the XLA compiler.&lt;/p&gt;</summary>
    
    
    
    
  </entry>
  
</feed>
