[Python] A Cheat Sheet for Extracting Elements with BeautifulSoup's select Method

September 18, 2024

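When scraping the web with Python's BeautifulSoup class, there are two families of methods for locating elements: the find methods and the select methods. Because the select methods accept CSS selectors, which are highly versatile, I suspect most people end up using them.

This article summarizes the extraction patterns worth remembering when specifying elements with CSS selectors through the select methods.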

Basic usage of the select methods

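The "select methods" mentioned so far are two methods: select and select_one.

Both take a CSS selector as their argument.

The difference is that select returns every element matching the CSS selector as a list, whereas select_one returns only the first matching element.

In addition, when no element matches the CSS selector, select returns an empty list, while select_one returns None.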

Method	Return value	When no element matches
select	All matching elements, as a list	Empty list []
select_one	The first matching element	None

Sample code follows. Each method is called once with the p tag, which exists in the HTML, and once with the span tag, which does not.

from bs4 import BeautifulSoup

html = """
<p>p1</p>
<p>p2</p>
<p>p3</p>
"""

soup = BeautifulSoup(html, 'lxml')

print(soup.select('p'))
# [<p>p1</p>, <p>p2</p>, <p>p3</p>]

print(soup.select_one('p'))
# <p>p1</p>

print(soup.select('span'))
# []

print(soup.select_one('span'))
# None
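
Because select_one returns None when nothing matches, calling a method on its result directly can raise an AttributeError. The following is a minimal sketch of one way to guard against that. The helper name first_text and its default parameter are hypothetical, and the sketch uses the standard-library 'html.parser' instead of 'lxml' only so that it runs without an extra dependency.

```python
from bs4 import BeautifulSoup

html = "<p>p1</p><p>p2</p>"

# Assumption: 'html.parser' (built in) is used here; the article itself uses 'lxml'.
soup = BeautifulSoup(html, "html.parser")


def first_text(soup, selector, default=""):
    """Hypothetical helper: text of the first match, or `default` if none."""
    element = soup.select_one(selector)
    if element is None:
        # select_one found nothing, so fall back instead of raising
        return default
    return element.get_text(strip=True)


p_text = first_text(soup, "p")
span_text = first_text(soup, "span", default="(not found)")
print(p_text)
print(span_text)

# select returns an empty list when nothing matches, so iterating is always safe
span_texts = [el.get_text(strip=True) for el in soup.select("span")]
print(span_texts)
```

With select, no such guard is needed for iteration, which is one reason to prefer it when zero matches is a normal outcome.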

Cheat sheet for extracting elements

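Pattern	Example code	Description
By tag name	soup.select('h2')	Extracts h2 tags
By ID	soup.select('#example')	Extracts the element whose ID is example
By class name	soup.select('.example')	Extracts elements whose class is example
By attribute	soup.select('[name="example"]')	Extracts elements whose name attribute is example
Descendants	soup.select('div p')	Extracts p tags that are descendants of a div tag
Children	soup.select('div > p')	Extracts p tags that are direct children of a div tag
Following siblings	soup.select('div ~ p')	Extracts p tags that are siblings following a div tag
Next sibling	soup.select('div + p')	Extracts the sibling immediately following a div tag, if it is a p tag
First element	soup.select('div:first-of-type')	Extracts the first div tag among its siblings
Last element	soup.select('div:last-of-type')	Extracts the last div tag among its siblings
n-th element	soup.select('div:nth-of-type(n)')	Extracts the n-th div tag among its siblings (n counts from 1)
AND condition	soup.select('div.example')	Extracts elements that are div tags and have the class example
	soup.select('a[rel="example"]')	Extracts elements that are a tags and have a rel attribute of example
	soup.select('.example1.example2')	Extracts elements that have both the example1 and example2 classes
OR condition	soup.select('div, p')	Extracts div tags or p tags
NOT condition	soup.select('div:not(#example)')	Extracts div tags whose ID is not example
By contained text	soup.select('div:-soup-contains("example")')	Extracts div tags whose text contains example. Note: this syntax is specific to BeautifulSoup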
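
The difference between the combinators in the table is easiest to see side by side. The sketch below runs several of the patterns against a small, made-up HTML fragment. The fragment, the texts helper, and the variable names are all assumptions for illustration, and 'html.parser' is used in place of 'lxml' so that no extra install is needed.

```python
from bs4 import BeautifulSoup

# Made-up fragment: one p directly inside the div, one nested deeper,
# and two p tags that follow the div as siblings.
html = """
<section>
  <div id="box">
    <p>child</p>
    <blockquote><p>grandchild</p></blockquote>
  </div>
  <p>next</p>
  <p class="example note">later</p>
</section>
"""

soup = BeautifulSoup(html, "html.parser")


def texts(selector):
    """Hypothetical helper: the text of every element matching `selector`."""
    return [el.get_text(strip=True) for el in soup.select(selector)]


descendants = texts("div p")    # ['child', 'grandchild']
children = texts("div > p")     # ['child']
following = texts("div ~ p")    # ['next', 'later']
adjacent = texts("div + p")     # ['next']

# nth-of-type counts only siblings of the same tag, so the div is skipped
first_p = texts("section > p:nth-of-type(1)")   # ['next']

both_classes = texts("p.example.note")          # ['later']
not_example = texts("p:not(.example)")          # ['child', 'grandchild', 'next']
contains = texts('p:-soup-contains("late")')    # ['later']

for name in ("descendants", "children", "following", "adjacent",
             "first_p", "both_classes", "not_example", "contains"):
    print(name, globals()[name])
```

Note that 'div p' picks up the nested p while 'div > p' does not, and that 'div ~ p' matches every later sibling while 'div + p' matches only the one immediately after the div.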

Usage examples

A cheat sheet alone may not give a concrete picture, so the usage examples below work against a simple sample HTML file like the following.

<h1>2023年プロ野球順位表</h1>
<section>
    <h2>セリーグ</h2>
    <img src="img/central.jpg" alt="central-logo">
    <table class="rank-table" id="central-table">
        <thead>
            <tr>
                <th>順位</th><th>チーム名</th><th>ゲーム差</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td class="rank">1</td><td>阪神</td><td>-</td>
            </tr>
            <tr>
                <td class="rank">2</td><td>広島</td><td>11.5</td>
            </tr>
            <tr>
                <td class="rank">3</td><td>DeNA</td><td>12.0</td>
            </tr>
            <tr>
                <td class="rank">4</td><td>巨人</td><td>15.5</td>
            </tr>
            <tr>
                <td class="rank">5</td><td>ヤクルト</td><td>29.0</td>
            </tr>
            <tr>
                <td class="rank">6</td><td>中日</td><td>29.0</td>
            </tr>
        </tbody>
    </table>
</section>
<section>
    <h2>パリーグ</h2>
    <img src="img/pacific.jpg" alt="pacific-logo">
    <table class="rank-table" id="pacific-table">
        <thead>
            <tr>
                <th>順位</th><th>チーム名</th><th>ゲーム差</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td class="rank">1</td><td>オリックス</td><td>-</td>
            </tr>
            <tr>
                <td class="rank">2</td><td>ロッテ</td><td>15.5</td>
            </tr>
            <tr>
                <td class="rank">3</td><td>ソフトバンク</td><td>15.5</td>
            </tr>
            <tr>
                <td class="rank">4</td><td>楽天</td><td>17.0</td>
            </tr>
            <tr>
                <td class="rank">5</td><td>西武</td><td>22.5</td>
            </tr>
            <tr>
                <td class="rank">6</td><td>日本ハム</td><td>27.5</td>
            </tr>
        </tbody>
    </table>
</section>

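from bs4 import BeautifulSoup


# Load the sample HTML shown above (saved as index.html)
with open('index.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'lxml')


# Get the page title
soup.select_one('h1')

# Get the Central League (セリーグ) standings table
soup.select_one('#central-table')

# Get all standings tables (a class selector needs the leading dot)
soup.select('.rank-table')

# Get the Pacific League (パリーグ) logo
soup.select_one('h2:-soup-contains("パリーグ") + img')
soup.select_one('img[alt="pacific-logo"]')

# Get every tr tag in the Central League table
soup.select('#central-table tr')

# Get the tr tags in the Central League table, excluding the header row
soup.select('#central-table > tbody > tr')
# tr:not(:first-of-type) would also drop the first tbody row, so match rows that contain td cells
soup.select('#central-table tr:has(> td)')

# Get the games-behind value for 楽天 (Rakuten)
soup.select_one('td:-soup-contains("楽天") + td')

# Get the tr tags of the 4th-place team in each league
soup.select('tbody > tr:nth-of-type(4)')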
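
The selectors above return Tag objects, so a typical next step is to pull the text out of them. The following is a rough sketch of turning table rows into dictionaries. It assumes a trimmed-down copy of the Central League table (three rows only) embedded as a string, uses 'html.parser' so it runs without lxml, and the key names rank, team, and games_behind are made up for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: a shortened copy of the sample table, inlined so the sketch is self-contained
html = """
<table class="rank-table" id="central-table">
  <thead>
    <tr><th>順位</th><th>チーム名</th><th>ゲーム差</th></tr>
  </thead>
  <tbody>
    <tr><td class="rank">1</td><td>阪神</td><td>-</td></tr>
    <tr><td class="rank">2</td><td>広島</td><td>11.5</td></tr>
    <tr><td class="rank">3</td><td>DeNA</td><td>12.0</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

standings = []
# The child combinators skip the header row, which lives in thead
for row in soup.select("#central-table > tbody > tr"):
    # select can also be called on a Tag to search only inside that element
    rank, team, games_behind = [td.get_text(strip=True) for td in row.select("td")]
    standings.append({"rank": int(rank), "team": team, "games_behind": games_behind})

print(standings)

# Team name in 2nd place: the td right after the rank cell whose text contains "2"
second_place = soup.select_one('td.rank:-soup-contains("2") + td').get_text(strip=True)
print(second_place)
```

The games-behind value is kept as a string here because the first-place row holds "-" rather than a number.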
