
Crawler์˜ ๋œป

  • something that crawls
  • a reptile

Why is crawling around called "crawling"?

ํฌ๋กค๋ง์€ ์ธํ„ฐ๋„ท์„ ๊ธฐ์–ด๋‹ค๋‹ˆ๋ฉด์„œ ๋ฐ์ดํ„ฐ๋ฅผ ์ˆ˜์ง‘ํ•˜๋Š” ๊ณผ์ •์ด๋‹ค. ๊ทธ๋ž˜์„œ ํฌ๋กค๋ง!

์›น ํฌ๋กค๋Ÿฌ

์›น ํฌ๋กค๋Ÿฌ ๋Š” ์›น ํŽ˜์ด์ง€์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋ชจ์•„์ฃผ๋Š” ์†Œํ”„ํŠธ์›จ์–ด์ž„

So then? Web crawling is the act of extracting data from web pages using a crawler!

url์—์„œ html ๋ฐ์ดํ„ฐ ๊ฐ€์ ธ์˜ค๊ธฐ

import requests

# send an HTTP GET request to the URL and receive the response
url = "http://www.daum.net"
response = requests.get(url)

# the response body (the page's HTML) as a string
print(response.text)
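Not every request succeeds, so it helps to check the response before parsing it. A minimal sketch using requests' standard status_code attribute and raise_for_status():

import requests

url = "http://www.daum.net"
response = requests.get(url)

# 200 means the request succeeded
print(response.status_code)

# alternatively, raise an exception on any 4xx/5xx response
response.raise_for_status()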

Using BeautifulSoup

import requests
from bs4 import BeautifulSoup

url = "http://www.daum.net/"
response = requests.get(url)
# print(response.text)

# parse the HTML string with Python's built-in html.parser
print(BeautifulSoup(response.text, 'html.parser'))
  • It prints the same output as response.text!
  • But! response.text and the data parsed by BeautifulSoup are different kinds of data: the former is a plain string, the latter a searchable parsed object (see the type check below).

html์—์„œ ํƒœ๊ทธ ๋ฐ์ดํ„ฐ ๊ฐ€์ ธ์˜ค๊ธฐ

import requests
from bs4 import BeautifulSoup

url = "http://www.daum.net/"
response = requests.get(url)
# print(response.text[:500])

soup = BeautifulSoup(response.text, 'html.parser')

# the whole <title> tag, then just the text inside it
print(soup.title)
print(soup.title.string)
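The same accessors work on any HTML string, which makes them easy to try offline. A self-contained sketch with made-up markup (the HTML here is an example, not Daum's):

from bs4 import BeautifulSoup

html = "<html><head><title>Hello</title></head><body></body></html>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.title)         # <title>Hello</title>
print(soup.title.string)  # Hello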

Parsing the span tag

print(soup.span)
  • This fetches only the first (topmost) span in the document, as the sketch below confirms.
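soup.span is shorthand for soup.find('span'), which also returns only the first match. A self-contained check (the two-span markup is made up for the example):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<span>one</span><span>two</span>", 'html.parser')

print(soup.span)          # <span>one</span>
print(soup.find('span'))  # the same first match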

Parsing all spans

print(soup.find_all('span'))  # find_all is the modern name; findAll is a legacy alias

Parsing all tags

from bs4 import BeautifulSoup
import requests

url = "http://www.daum.net/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# file = open("daum.html","w")
# file.write(response.text)
# file.close()

# print(soup.title)
# print(soup.title.string)
# print(soup.span)
# print(soup.find_all('span'))

# html ๋ฌธ์„œ์—์„œ ๋ชจ๋“  aํƒœ๊ทธ๋ฅผ ๊ฐ€์ ธ์˜ค๋Š” ์ฝ”๋“œ
print(soup.findAll("a","link_favorsch"))
  • ๋ฌธ์„œ์—์„œ ๋ชจ๋“  aํƒœ๊ทธ ์ค‘์— link_favorsch๋ฅผ ๊ฐ€์ง„ ๊ฒƒ๋งŒ ๊ฐ€์ ธ์™€๋ผ
