Extracting the contents and href values of a tags from HTML with hxselect

Posted on: 30 September 2022
By: claire_gill

curl

hxselect

sed

paste

Input data

$ head -20 links.html
<html>
<body>
<table >
  <tbody>
      <tr>
              <td>1</td>
              <td><a class="text-nowrap" href="https://list.of.urls/youtube.com"><img class="mr-2" src="https://youtube.com/favicon.ico" width="16" height="16" loading="lazy" onerror="this.classList.add(&quot;d-none&quot;)">Youtube</a><a class="ml-2" href="https://youtube.com" title="Go to youtube.com" target="_blank"><svg class="icon small"><use xlink:href="#icon-external"></use></svg></a></td>
              <td>22,504,566</td>
              <td><div class="progress"><div class="progress-bar bg-success" style="width: 100%;">100</div></div></td>
          </tr>
      <tr>
              <td>2</td>
              <td><a class="text-nowrap" href="https://list.of.urls/apple.com"><img class="mr-2" src="https://apple.com/favicon.ico" width="16" height="16" loading="lazy" onerror="this.classList.add(&quot;d-none&quot;)">Apple</a><a class="ml-2" href="https://apple.com" title="Go to apple.com" target="_blank"><svg class="icon small"><use xlink:href="#icon-external"></use></svg></a></td>
              <td>6,454,109</td>
              <td><div class="progress"><div class="progress-bar bg-success" style="width: 100%;">100</div></div></td>
          </tr>
      <tr>
              <td>3</td>
              <td><a class="text-nowrap" href="https://list.of.urls/www.google.com"><img class="mr-2" src="https://www.google.com/favicon.ico" width="16" height="16" loading="lazy" onerror="this.classList.add(&quot;d-none&quot;)">Google</a><a class="ml-2" href="https://www.google.com" title="Go to www.google.com" target="_blank"><svg class="icon small"><use xlink:href="#icon-external"></use></svg></a></td>
              <td>14,299,269</td>

Command and output

$ paste <(cat links.html | hxnormalize -x | hxselect -c -s '\n' 'tbody a.text-nowrap' | sed -e 's/<[^>]*>//g') <(cat links.html | hxnormalize -x | hxselect -s '\n' 'tbody a.text-nowrap::attr(href)' | grep -oP '(?<=href=")[^"]*') | head -3
Youtube https://list.of.urls/youtube.com
Apple   https://list.of.urls/apple.com
Google  https://list.of.urls/www.google.com

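The first hxselect call uses `hxselect -c` to get the contents of the a tags with the class name text-nowrap, and sed strips the leftover HTML tags (the img element) from that output. The second hxselect call selects only the href attribute of the same a tags, and grep pulls out the URL part with a regex positive lookbehind, which has the general form (?<=a)b and here is (?<=href=")[^"]*. The paste command then joins the two outputs line by line, separated by a tab. On Cygwin, redirecting the result with > /dev/clipboard lets you paste it straight into Excel.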
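
As a minimal sketch of the two cleanup steps described above, the following applies the same sed and grep expressions to inline sample lines, so it can be tried without hxselect installed. The function names strip_tags and extract_href and the simplified sample lines are assumptions made for illustration and are not part of the original command. The -P option assumes GNU grep built with PCRE support.

```shell
# Remove anything that looks like an HTML tag (same sed expression as in the command).
strip_tags() {
  sed -e 's/<[^>]*>//g'
}

# Keep only the text between href=" and the next double quote.
# (?<=href=") is a positive lookbehind: it must precede the match
# but is not included in what grep -o prints.
extract_href() {
  grep -oP '(?<=href=")[^"]*'
}

# Hypothetical sample lines, shaped like the intermediate output of each hxselect call.
printf '%s\n' '<img class="mr-2" src="https://youtube.com/favicon.ico">Youtube' | strip_tags
printf '%s\n' 'href="https://list.of.urls/youtube.com"' | extract_href
```

Wrapping each step in a small function makes it easy to test the expressions separately before combining them with paste.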
