Crawling Naver News (2)


와이드윤 2020. 12. 21. 17:34

2020/12/16 - [🐍Python Beginner's Notes/🌐Web Crawling] - Crawling Naver News (1) - Importing modules, setting up the search

notanymoremungwa.tistory.com

I recommend reading the previous post before this one.

Part (1) covered why we crawl Naver News, how to import the modules, and how to set the search conditions.

Now let's write the actual crawling code.

2. crawler function

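def crawler(maxpage, query, s_date, e_date):
    # Strip the dots from the dates: "2020.12.01" -> "20201201"
    s_from = s_date.replace(".", "")
    e_to = e_date.replace(".", "")
    page = 1
    # 'start' value of the last result page to fetch (pages start at 1, 11, 21, ...)
    maxpage_t = (int(maxpage) - 1) * 10 + 1
    # Replace file_path with your own directory.
    # newline='' keeps csv.writer from adding blank rows on Windows.
    f = open("file_path/contents_text.csv", 'w', encoding='utf-8', newline='')

    wr = csv.writer(f)
    wr.writerow(['years', 'company', 'title', 'contents', 'link'])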

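The dates entered in the main function are converted to a second format: the dots are removed, so 2020.12.01 becomes 20201201.

Each article is written to the contents_text file as one row with the columns 'publication date, press company, article title, body text, link'.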
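The date conversion described above can be sketched on its own. This is a minimal sketch, not part of the original crawler: to_compact_date is a hypothetical helper name, and it assumes the input always follows the YYYY.MM.DD prompt format. Unlike a plain replace, it raises an error on malformed dates instead of passing them on to the search URL.

```python
from datetime import datetime


def to_compact_date(date_text):
    """Convert 'YYYY.MM.DD' to 'YYYYMMDD' (hypothetical helper, not in the original code)."""
    # strptime raises ValueError if the text does not match the expected format
    parsed = datetime.strptime(date_text, "%Y.%m.%d")
    return parsed.strftime("%Y%m%d")


# Same result as "2020.12.01".replace(".", "") for well-formed input
print(to_compact_date("2020.12.01"))
```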

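    # '<=' so that the last requested page (start = maxpage_t) is fetched too;
    # with '<', asking for 5 pages would stop after 4
    while page <= maxpage_t:

        # Search result URL: keyword, sort order, date range, and start offset
        url = 'https://search.naver.com/search.naver?where=news&query=' + query + '&sort=0&ds=' + s_date + '&de=' + e_date + '&nso=so%3Ar%2Cp%3Afrom' + s_from + 'to' + e_to + '%2Ca%3A&start=' + str(page)

        # ua = UserAgent()
        # headers = {'User-Agent' : ua.random}

        req = requests.get(url)

        cont = req.content
        soup = BeautifulSoup(cont, 'html.parser')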

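The url variable is assembled to match the URL pattern Naver uses for the article list when you search for a keyword: the search term, the date range in both formats, and the start value that selects the result page.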
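The same URL can also be built with urlencode from the standard library, which percent-encodes the search term and the nso value automatically. This is a sketch under assumptions: build_search_url is a hypothetical helper, and the parameter names are simply the ones that appear in the url string above, not a documented Naver API.

```python
from urllib.parse import urlencode


def build_search_url(query, s_date, e_date, start):
    """Build the search URL from the same parameters as the url variable (hypothetical helper)."""
    s_from = s_date.replace(".", "")
    e_to = e_date.replace(".", "")
    params = {
        "where": "news",
        "query": query,
        "sort": 0,
        "ds": s_date,
        "de": e_date,
        # Unencoded form of 'so%3Ar%2Cp%3Afrom...to...%2Ca%3A'
        "nso": f"so:r,p:from{s_from}to{e_to},a:",
        "start": start,
    }
    return "https://search.naver.com/search.naver?" + urlencode(params)


# Result pages 1, 2, 3 correspond to start = 1, 11, 21
for page_number in range(1, 4):
    start = (page_number - 1) * 10 + 1
    print(build_search_url("코로나 백신", "2020.12.01", "2020.12.21", start))
```

Encoding the query this way also keeps spaces and Korean characters in the search term from producing a malformed URL.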

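        # Every <a class="info"> link on the search result page
        for urls in soup.select("a.info"):

            try:
                # Only follow links to articles hosted on Naver News
                if urls["href"].startswith("https://news.naver.com"):
                    news_detail = []

                    # Send a random browser-like User-Agent with the request
                    ua = UserAgent()
                    headers = {"User-Agent": ua.random}

                    breq = requests.get(urls["href"], headers=headers)
                    bsoup = BeautifulSoup(breq.content, 'html.parser')

                    # news_detail[0]: article title
                    title = bsoup.select('h3#articleTitle')[0].text
                    news_detail.append(title)

                    # news_detail[1]: first 11 characters of the date text
                    pdate = bsoup.select('.t11')[0].get_text()[:11]
                    news_detail.append(pdate)

                    # news_detail[2]: body text, without newlines and without
                    # the inline script text embedded in the article body
                    _text = bsoup.select('#articleBodyContents')[0].get_text().replace('\n', " ")
                    btext = _text.replace("// flash 오류를 우회하기 위한 함수 추가 function _flash_removeCallback() {}", "")

                    news_detail.append(btext.strip())
                    # news_detail[3]: article link
                    news_detail.append(urls["href"])

                    # news_detail[4]: press company name
                    pcompany = bsoup.select('#footer address')[0].a.get_text()
                    news_detail.append(pcompany)

                    # Write in header order: years, company, title, contents, link
                    wr.writerow([news_detail[1].replace(',', ''), news_detail[4].replace(',', ''), news_detail[0].replace(',', ''),
                                 news_detail[2].replace(',', ''), news_detail[3].replace(',', '')])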

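for loop: selects every a tag with the class info and assigns each one in turn to urls.

You can look up the class in Chrome's Developer Tools (Mac shortcut: option + command + I).

'a.info' is the selector for the element that carries the hyperlink to Naver News.

The rest of the loop body therefore runs only when href starts with https://news.naver.com.

Without fake_useragent, Naver identifies the request as a bot and an error occurs.

fake_useragent is used to set a browser-like User-Agent header on the request and get around this.

The title, date, body text, link, and press company are appended to a list called news_detail, then written to the csv in header order.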
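The crawler strips commas from every field and picks values out of news_detail by index. As an alternative sketch (not the author's method), csv.DictWriter from the standard library matches values to columns by name, and the csv module quotes any field that contains a comma, so the text can be kept intact. The article values below are made up for illustration.

```python
import csv
import io

FIELDNAMES = ["years", "company", "title", "contents", "link"]

# Hypothetical article values, made up for illustration
article = {
    "title": "Vaccines, schedules, and supply",
    "years": "2020.12.21.",
    "contents": "Body text, with commas.",
    "link": "https://news.naver.com/example",
    "company": "Example Press",
}

# An in-memory buffer stands in for the csv file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
writer.writeheader()
writer.writerow(article)  # column order comes from FIELDNAMES, not from the dict

# Reading the data back shows the commas survived the round trip
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["title"])
```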

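            except Exception as e:
                # Skip any article that fails to download or parse
                continue
        # Move on to the next result page (start = 11, 21, ...)
        page += 10

    print('Completed!')

    f.close()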

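The continue statement lets the loop carry on when an article fails, and once the for loop with its try ~ except block has gone through every link, page is increased by 10 to move on to the next result page.

When the while loop has finished, Completed! is printed and contents_text.csv will have been saved to the chosen path.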
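Because the except branch above only continues, failed articles disappear silently. A small variation on the same skip-and-continue pattern counts and reports them. This is a sketch only: parse_article and the sample records are hypothetical stand-ins for the download-and-parse step, not code from the crawler.

```python
def parse_article(raw):
    """Hypothetical stand-in for the per-article parsing step; fails on bad input."""
    title, body = raw.split("|")  # raises ValueError if the separator is missing
    return {"title": title, "contents": body}


# Made-up records: the second one cannot be parsed
raw_articles = ["Title A|Body A", "broken record", "Title B|Body B"]
parsed = []
skipped = 0

for raw in raw_articles:
    try:
        parsed.append(parse_article(raw))
    except Exception as e:
        # Same idea as the crawler's except branch, but the failure is recorded
        skipped += 1
        print(f"skipped one article: {e}")
        continue

print(len(parsed), "parsed,", skipped, "skipped")  # prints "2 parsed, 1 skipped"
```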

3. Test

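Let's check that the crawling code works.

The search term was set to '코로나 백신' (COVID-19 vaccine), and the crawler was set to fetch only 5 pages of news written between December 1 and December 21.

Completed! was printed, which confirms that the crawl finished, so next let's check that contents_text.csv was written correctly.

The articles were crawled as expected.

Because the file is encoded as utf-8, it displays correctly when opened in Notepad or after uploading it to Google Drive.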
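If the file also needs to open cleanly in a spreadsheet program that relies on a byte order mark to detect UTF-8, one option is Python's utf-8-sig codec, which writes that mark at the start of the file. A minimal sketch, assuming a temporary sample file and made-up row values rather than the crawler's real output:

```python
import csv
import os
import tempfile

# Temporary sample file; the crawler itself writes contents_text.csv
path = os.path.join(tempfile.mkdtemp(), "contents_text_sample.csv")

# utf-8-sig writes a byte order mark before the first row
with open(path, "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["years", "company", "title", "contents", "link"])
    # Made-up values for illustration
    writer.writerow(["2020.12.21.", "예시일보", "코로나 백신 예시 제목", "예시 본문",
                     "https://news.naver.com/example"])

# The file starts with the three BOM bytes
with open(path, "rb") as f:
    bom = f.read(3)
print(bom)

# Reading with utf-8-sig strips the mark again
with open(path, encoding="utf-8-sig", newline="") as f:
    rows = list(csv.reader(f))
print(len(rows), "rows read back")
```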

4. Full code

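import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
from fake_useragent import UserAgent
import csv
import time

RESULT_PATH = '/path/where/you/want/to/save/'
now = datetime.now()

def crawler(maxpage, query, s_date, e_date):
    # Strip the dots from the dates: "2020.12.01" -> "20201201"
    s_from = s_date.replace(".", "")
    e_to = e_date.replace(".", "")
    page = 1
    # 'start' value of the last result page to fetch (pages start at 1, 11, 21, ...)
    maxpage_t = (int(maxpage) - 1) * 10 + 1
    # Replace /save_path/ with your own directory.
    # newline='' keeps csv.writer from adding blank rows on Windows.
    f = open("/save_path/contents_text.csv", 'w', encoding='utf-8', newline='')

    wr = csv.writer(f)
    wr.writerow(['years', 'company', 'title', 'contents', 'link'])

    # '<=' so that the last requested page (start = maxpage_t) is fetched too
    while page <= maxpage_t:

        # Search result URL: keyword, sort order, date range, and start offset
        url = 'https://search.naver.com/search.naver?where=news&query=' + query + '&sort=0&ds=' + s_date + '&de=' + e_date + '&nso=so%3Ar%2Cp%3Afrom' + s_from + 'to' + e_to + '%2Ca%3A&start=' + str(page)

        # ua = UserAgent()
        # headers = {'User-Agent' : ua.random}

        req = requests.get(url)

        cont = req.content
        soup = BeautifulSoup(cont, 'html.parser')

        # Every <a class="info"> link on the search result page
        for urls in soup.select("a.info"):

            try:
                # Only follow links to articles hosted on Naver News
                if urls["href"].startswith("https://news.naver.com"):
                    news_detail = []

                    # Send a random browser-like User-Agent with the request
                    ua = UserAgent()
                    headers = {"User-Agent": ua.random}

                    breq = requests.get(urls["href"], headers=headers)
                    bsoup = BeautifulSoup(breq.content, 'html.parser')

                    # news_detail[0]: article title
                    title = bsoup.select('h3#articleTitle')[0].text
                    news_detail.append(title)

                    # news_detail[1]: first 11 characters of the date text
                    pdate = bsoup.select('.t11')[0].get_text()[:11]
                    news_detail.append(pdate)

                    # news_detail[2]: body text, without newlines and without
                    # the inline script text embedded in the article body
                    _text = bsoup.select('#articleBodyContents')[0].get_text().replace('\n', " ")
                    btext = _text.replace("// flash 오류를 우회하기 위한 함수 추가 function _flash_removeCallback() {}", "")

                    news_detail.append(btext.strip())
                    # news_detail[3]: article link
                    news_detail.append(urls["href"])

                    # news_detail[4]: press company name
                    pcompany = bsoup.select('#footer address')[0].a.get_text()
                    news_detail.append(pcompany)

                    # Write in header order: years, company, title, contents, link
                    wr.writerow([news_detail[1].replace(',', ''), news_detail[4].replace(',', ''), news_detail[0].replace(',', ''),
                                 news_detail[2].replace(',', ''), news_detail[3].replace(',', '')])
            except Exception as e:
                # Skip any article that fails to download or parse
                continue
        # Move on to the next result page (start = 11, 21, ...)
        page += 10

    print('Completed!')

    f.close()

def main():
    maxpage = input("Number of pages to search: ")
    query = input("Search term: ")
    s_date = input("Start date (YYYY.MM.DD): ")
    e_date = input("End date (YYYY.MM.DD): ")
    crawler(maxpage, query, s_date, e_date)

main()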

Next time, I plan to run text mining on this csv file and then build a word cloud from it.

License: Attribution, NonCommercial, NoDerivatives
