Crawling Images in JupyterLab - Freelance Tech Research Blog

Reference page

Crawling images with icrawler: icrawler.readthedocs.io

Installation

Install icrawler from the JupyterLab console
by running the pip install icrawler command.
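
If you would rather install from a notebook cell than from the console, something like the following could work. It is only a sketch: ensure_installed is a hypothetical helper name (not part of icrawler or JupyterLab), and it assumes pip is available to the kernel's Python interpreter.

```python
import importlib.util
import subprocess
import sys


def ensure_installed(module_name, package_name=None):
    """Install a package with pip unless its module is already importable.

    Hypothetical helper for illustration. Returns True if the module was
    already present, False if pip had to be invoked.
    """
    if importlib.util.find_spec(module_name) is not None:
        return True
    # Run pip with the same interpreter the kernel uses, so the package
    # lands in the environment the notebook actually imports from.
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", package_name or module_name]
    )
    return False
```

Calling ensure_installed("icrawler") in a cell would then install the package only when it is missing.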

Running

Run the sample code below.

from icrawler.builtin import GoogleImageCrawler

google_crawler = GoogleImageCrawler(storage={'root_dir': 'test'})
google_crawler.crawl(keyword='海', max_num=100)

The following error occurs and the crawl fails.

Exception in thread parser-001:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/site-packages/icrawler/parser.py", line 104, in worker_exec
    for task in self.parse(response, **kwargs):
TypeError: 'NoneType' object is not iterable
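
The last frame of the traceback shows that the value returned by parse() was iterated directly, and that value was None. The sketch below reproduces the same kind of failure in isolation. The parse, worker_exec and safe_worker_exec functions here are simplified stand-ins written for illustration, not icrawler's actual code.

```python
def parse(response):
    """Stand-in parser: returns a list of tasks, or None when nothing matches."""
    if "img" not in response:
        # Falling through without a list is what triggers the error below.
        return None
    return [{"file_url": response}]


def worker_exec(response):
    """Iterates the parser result directly, as in the traceback."""
    tasks = []
    for task in parse(response):  # raises TypeError when parse() returns None
        tasks.append(task)
    return tasks


def safe_worker_exec(response):
    """Same loop, but treats a None result as 'no tasks'."""
    return list(parse(response) or [])
```

With these definitions, worker_exec fails on a response that contains no images, while safe_worker_exec returns an empty list for the same input.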

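For images, it seems that the package below needs to be imported instead (see icrawler.readthedocs.io).
Running the code as follows with BingImageCrawler downloaded images of the sea into the test folder.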

from icrawler.builtin import BingImageCrawler

# downloader_threads: number of downloader threads
# storage: name of the directory to download into
bing_crawler = BingImageCrawler(downloader_threads=1, storage={'root_dir': 'test'})

# max_num: maximum number of images to download
# keyword: search keyword ("海" means "sea")
bing_crawler.crawl(keyword="海", max_num=100)
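
If you want to collect several keywords at once, the same two calls can be wrapped in a loop that gives each keyword its own folder. This is a minimal sketch: crawl_keywords and its parameters are hypothetical names introduced here, and the crawler class is passed in as an argument so that only the constructor and crawl() arguments shown above are relied on.

```python
from pathlib import Path


def crawl_keywords(keywords, base_dir, crawler_factory, max_num=100):
    """Crawl each keyword into its own sub-directory of base_dir.

    keywords maps a folder name to a search keyword. crawler_factory is
    expected to be a class such as BingImageCrawler. Returns a dict that
    maps each keyword to the directory it was saved in.
    """
    saved = {}
    for folder_name, keyword in keywords.items():
        root_dir = str(Path(base_dir) / folder_name)
        # A fresh crawler per keyword so each one gets its own storage dir.
        crawler = crawler_factory(
            downloader_threads=1, storage={"root_dir": root_dir}
        )
        crawler.crawl(keyword=keyword, max_num=max_num)
        saved[keyword] = root_dir
    return saved
```

For example, crawl_keywords({"sea": "海"}, "test", BingImageCrawler, max_num=10) would save up to 10 images under test/sea, assuming BingImageCrawler has been imported as above.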

Result

[Image: screenshot of the execution result]
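
The result can also be checked from a notebook cell by counting the image files in the download folder. The sketch below uses only the standard library. count_images and the list of extensions are assumptions made for illustration, not something icrawler provides.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever the crawl produced.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}


def count_images(root_dir):
    """Return the number of image files directly inside root_dir."""
    root = Path(root_dir)
    if not root.is_dir():
        # Nothing was downloaded (or the crawl failed before creating it).
        return 0
    return sum(
        1
        for path in root.iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )
```

Calling count_images("test") after the crawl gives the number of images saved. This can be lower than max_num if some downloads failed.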