Python Web Scraping Examples

2022. 1. 23. 22:25

Scraper: a program that fetches HTML data, parses it, stores the information you want, and repeats the process
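The four steps in that definition can be sketched roughly as follows. This is only a minimal illustration: FAKE_SITE, fetch() and scrape() are hypothetical names, and the "network" is a dictionary of inline HTML so the sketch runs offline. In a real scraper, fetch() would presumably wrap requests.get(url).text.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the network: URL -> HTML text (assumption, not real pages)
FAKE_SITE = {
    'https://example.com/a': '<html><head><title>Page A</title></head></html>',
    'https://example.com/b': '<html><head><title>Page B</title></head></html>',
}

def fetch(url):
    # In a real scraper this would be something like requests.get(url).text
    return FAKE_SITE[url]

def scrape(urls):
    results = []
    for url in urls:                                  # 4. repeat for every page
        html = fetch(url)                             # 1. fetch the HTML
        soup = BeautifulSoup(html, 'html.parser')     # 2. parse it
        results.append((url, soup.title.get_text()))  # 3. store what we want
    return results

titles = scrape(list(FAKE_SITE))
print(titles)
```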

Python modules used: requests, BeautifulSoup4, lxml (HTML parsing and normalization)

# Chrome DevTools, ... - use Copy selector to copy the path to an element

@ BeautifulSoup4
- Parses HTML (or XML) and returns it as Python objects that are easy to work with.
- Repairs malformed HTML in the tree it returns.
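A small sketch of both points, assuming the built-in 'html.parser' backend (other parsers may repair the same input differently). The broken_html string is made up for illustration.

```python
from bs4 import BeautifulSoup

broken_html = '<p>first paragraph<b>bold text'   # neither tag is closed

soup = BeautifulSoup(broken_html, 'html.parser')

# The markup becomes a tree of Python objects that can be navigated by name
print(type(soup.p).__name__)   # Tag
print(soup.b.get_text())       # bold text

# Serializing the tree shows the closing tags that were added
repaired = str(soup)
print(repaired)
```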

@ Finding tags
- find(), find_all()
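As a rough illustration of the difference between the two: find() gives back the first match (or None), while find_all() gives back every match. The HTML snippet below is invented for the example.

```python
from bs4 import BeautifulSoup

html = '''
<ul id="books">
  <li class="book">Python Basics</li>
  <li class="book sale">Web Scraping</li>
  <li class="magazine">Monthly Code</li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

first_li = soup.find('li')                    # first matching tag only
books = soup.find_all('li', class_='book')    # every tag whose classes include "book"
missing = soup.find('table')                  # no match -> None

book_titles = [li.get_text() for li in books]
print(first_li.get_text())
print(book_titles)
print(missing)
```

Because find() can return None, it is worth checking the result before calling methods on it.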

@ Finding with CSS selectors
- soup.select("head > title") : the return value is a list
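A short sketch of select() on an invented page. The second selector only imitates the general shape of a path copied with Copy selector in Chrome DevTools; a real copied selector will look different for each page.

```python
from bs4 import BeautifulSoup

html = '''
<html><head><title>Book Store</title></head>
<body>
  <div id="content">
    <ul class="list"><li><a href="/b/1">First</a></li><li><a href="/b/2">Second</a></li></ul>
  </div>
</body></html>
'''
soup = BeautifulSoup(html, 'html.parser')

titles = soup.select('head > title')                 # a list, even for a single match
links = soup.select('#content > ul.list > li > a')   # shaped like a copied selector
first_link = soup.select_one('#content a')           # one element, or None
nothing = soup.select('table')                       # no match -> empty list

hrefs = [a['href'] for a in links]
print(titles[0].get_text())
print(hrefs)
```

Since select() always returns a list, index into it (or use select_one()) when only one element is expected.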

Naver Books

Get the first div tag from https://book.naver.com

import requests
import pprint
from bs4 import BeautifulSoup

res = requests.get('https://book.naver.com')
# res.text: the full response body as text, res.headers: response headers, res.encoding: encoding (UTF-8)
# soup = BeautifulSoup(res.text, 'html.parser')
soup = BeautifulSoup(res.text, 'lxml') # lxml is the more powerful parser
# pprint.pprint(soup)
div_tag = soup.div  # the first div tag
print(div_tag)

lxml must be installed first: pip install lxml
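To see what soup.div does without depending on the live site, here is an offline sketch with made-up HTML. It uses 'html.parser' so that it runs without lxml.

```python
from bs4 import BeautifulSoup

html = '<body><div id="top">first</div><div id="next">second</div></body>'
soup = BeautifulSoup(html, 'html.parser')

div_tag = soup.div                 # attribute access returns the first <div>
same_tag = soup.find('div')        # equivalent to the line above
no_span = soup.span                # a tag that does not exist gives None

print(div_tag['id'])               # top
print(div_tag == same_tag)         # True
print(no_span)                     # None
```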

Getting information from a table

An example that reads the table at https://pythonscraping.com/pages/page3.html and saves its contents to a CSV file

import requests
from bs4 import BeautifulSoup
import csv
import pprint

BASE_URL = 'https://www.pythonscraping.com'

def create_list_from_table(table_tag):
  # The Image column is not included.
  # Only the first 5 characters of each Description value are kept.

  # Build a list of lists so the result can be written as a CSV file
  gifts = []

  # Build the first row from the header cells
  headers = []
  header_tag = table_tag.find('tr')
  for th_tag in header_tag.find_all('th'):
    if th_tag.text.strip() == 'Image':
      continue
    headers.append(th_tag.text.strip()) # strip surrounding whitespace
  gifts.append(headers)

  # Build the gift records
  # Strip surrounding whitespace and replace \n inside the text with a space
  for tr_tag in table_tag.find_all('tr'):
    gift = []
    i = 0
    for td_tag in tr_tag.find_all('td'):
      if (headers[i] == 'Item Title' or headers[i] == 'Cost') and td_tag.text.strip() != '':
        gift.append(td_tag.text.strip().replace('\n',' '))
      elif headers[i] == 'Description' and td_tag.text.strip() != '':  # Description: keep only the first 5 characters
        gift.append(td_tag.text.strip().replace('\n',' ')[:5])

      i = i + 1
      if i == len(headers): # stop before the Image column
        break

    if not gift: # the header row has no td cells, so it is skipped here
      continue

    gifts.append(gift)
  pprint.pprint(gifts)
  return gifts


def create_csv_file(lol, filename):
  # Save the contents of a list of lists to a CSV file
  with open(filename, 'w', encoding='utf-8', newline='') as file:
    writer = csv.writer(file)
    writer.writerows(lol)


def main():
  res = requests.get(BASE_URL + '/pages/page3.html')
  soup = BeautifulSoup(res.text, 'lxml') # requires lxml to be installed; 'html.parser' also works

  # Get the table tag
  table_tag = soup.find(id='giftList')

  gifts = create_list_from_table(table_tag)
  # create_csv_file(gifts, 'gifts.csv')  # uncomment to write the CSV file

  print('job completed..')

if __name__ == '__main__':
    main()
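For trying the same idea without network access, here is a simplified offline sketch. SAMPLE_HTML is a made-up stand-in for the giftList table (the real page may differ), table_to_rows() is a hypothetical helper rather than the function above, and unlike the original it keeps empty cells. The CSV is written to an in-memory buffer instead of a file.

```python
import csv
import io
from bs4 import BeautifulSoup

# Hypothetical sample data shaped like the giftList table (assumption)
SAMPLE_HTML = '''
<table id="giftList">
  <tr><th>Item Title</th><th>Description</th><th>Cost</th><th>Image</th></tr>
  <tr><td>Vegetable Basket</td><td>This vegetable basket
is great</td><td>$15.00</td><td><img src="img1.jpg"></td></tr>
  <tr><td>Russian Dolls</td><td>Hand-painted dolls</td><td>$10,000.52</td><td><img src="img2.jpg"></td></tr>
</table>
'''

def table_to_rows(table_tag):
    # Collect the text of every cell, row by row (header row first)
    rows = []
    for tr_tag in table_tag.find_all('tr'):
        cells = tr_tag.find_all(['th', 'td'])
        rows.append([c.get_text().strip().replace('\n', ' ') for c in cells])

    header, records = rows[0], rows[1:]
    keep = [i for i, name in enumerate(header) if name != 'Image']  # drop the Image column

    result = [[header[i] for i in keep]]
    for record in records:
        # Description values are cut to their first 5 characters
        result.append([record[i][:5] if header[i] == 'Description' else record[i]
                       for i in keep])
    return result

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = table_to_rows(soup.find(id='giftList'))

buffer = io.StringIO()               # in-memory file, nothing is written to disk
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Note that the csv module quotes the cost field containing a comma, so the value survives a round trip through a CSV reader.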

https://replit.com/@dhshin38/Tutorial-Python#web_scraping.py
