Python Web Scraping for Beginners: Scraping Douban Content, Part 3 (with Complete Code)


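In the previous part we analysed the page data and located the content we want. In this part we save that content to a database so that we can look it up at any time.

We save the data in two ways: to a MySQL database through PyMySQL, and to a txt file.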

1. Complete code

import re
import requests
import pymysql
from bs4 import BeautifulSoup
qy = open('C:/Users/輕煙/Desktop/db.txt', mode='a', encoding='utf-8')  # path of the output txt file
movies = []  # collects every parsed movie across all pages
for i in range(1):  # only the first page; range(10) would cover all 250 entries
    headers = {  # imitate a browser when sending the request
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.82 Safari/537.36',
        'Host': 'movie.douban.com'
    }
    res = 'https://movie.douban.com/top250?start=' + str(25 * i)  # each page lists 25 movies
    r = requests.get(res, headers=headers, timeout=10)  # set a timeout
    soup = BeautifulSoup(r.text, "html.parser")  # choose the parser; other parsers also work
    div_list = soup.find_all('div', class_='item')
    for each in div_list:
        movie = {}
        moviename = each.find('div', class_='hd').a.span.text.strip()
        movie['title'] = moviename
        rank = each.find('div', class_='pic').em.text.strip()
        movie['rank'] = rank
        info = each.find('div', class_='bd').p.text.strip()
        info = info.replace('\n', "")
        info = info.replace(" ", "")
        info = info.replace("\xa0", "")
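        # Capture the text between the page's '导演:' (director) label and the '主演:' (cast) label;
        # the fallback for a cast label truncated to '主...' is an assumption about the page layout.
        match = re.search(r'导演:(.+?)(?:主演:|主?\.\.\.)', info)
        director = match.group(1) if match else ''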
        movie['director'] = director
        release_date = re.findall(r'[0-9]{4}', info)[0]
        movie['release_date'] = release_date
        plot = re.findall(r'[0-9]*[/].+[/].+', info)[0]
        plot = plot[1:]
        plot = plot[plot.index('/') + 1:]
        plot = plot[plot.index('/') + 1:]
        movie['plot'] = plot
        star = each.find('div', class_='star')
        star = star.find('span', class_='rating_num').text.strip()
        movie['star'] = star
        movies.append(movie)
        print(movie, file=qy)  # write the record to the txt file
con = pymysql.connect(host='localhost', user='root', password='123456', database='python',
                      charset='utf8', port=3306)
print('Connected ->')
cursor = con.cursor()  # create a cursor
print('Creating table ->')
cursor.execute("""create table douban
                ( title char(40),
                  ranks char(40),
                  director char(40),
                  release_date char(40), 
                  plot char(100),
                  star char(40))
               """)
print('Table created, inserting data ->')  # the inserts start below
for i in movies:
    cursor.execute("insert into douban(title,ranks,director,release_date,plot,star) "
                   "values(%s,%s,%s,%s,%s,%s)",(i['title'],i['rank'],i['director'],
                   i['release_date'],i['plot'],i['star']))
print('Data inserted')
cursor.close()
con.commit()  # make the inserted rows permanent
con.close()
qy.close()  # flush and close the txt file
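
The txt side of the listing above writes each dict with print(movie, file=qy), which stores Python's dict representation. As a possible alternative, the sketch below writes one JSON object per line so the file can be read back reliably. The helper names save_movies_txt and load_movies_txt are hypothetical and not part of the original listing.

```python
import json


def save_movies_txt(movies, path):
    """Append one JSON object per line (hypothetical helper)."""
    # The with-block closes the file even if writing fails part-way.
    with open(path, mode='a', encoding='utf-8') as f:
        for movie in movies:
            # ensure_ascii=False keeps Chinese titles readable in the file.
            f.write(json.dumps(movie, ensure_ascii=False) + '\n')


def load_movies_txt(path):
    """Read the file back into a list of dicts (hypothetical helper)."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]
```

Calling save_movies_txt(movies, 'db.txt') after the scraping loop would replace the per-record print call; the path here is only an example.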

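2. Code analysis

We already analysed the crawler part of the code in the previous part, so here we focus on the database part.

The first step is connecting to the database. The connection details must match your own database; for more on how to connect, see the earlier database chapter.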

con = pymysql.connect(host='localhost', user='root', password='123456', database='python',
                      charset='utf8', port=3306)
print('Connected ->')
cursor = con.cursor()  # create a cursor
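
Since the connection details differ from machine to machine, one option is to read them from environment variables instead of hard-coding the password in the script. The sketch below is only an illustration: the helper build_db_config and the DOUBAN_DB_* variable names are assumptions, not something the original code defines. The defaults mirror the values used above, except that the password defaults to an empty string.

```python
import os


def build_db_config(env=None):
    """Build keyword arguments for pymysql.connect (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        'host': env.get('DOUBAN_DB_HOST', 'localhost'),
        'user': env.get('DOUBAN_DB_USER', 'root'),
        'password': env.get('DOUBAN_DB_PASSWORD', ''),
        'database': env.get('DOUBAN_DB_NAME', 'python'),
        'charset': 'utf8',
        'port': int(env.get('DOUBAN_DB_PORT', '3306')),
    }

# Usage (needs PyMySQL installed and a running MySQL server):
#   import pymysql
#   con = pymysql.connect(**build_db_config())
```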

Next we create a table to hold the data.

print('Creating table ->')
cursor.execute("""create table douban
                ( title char(40),
                  ranks char(40),
                  director char(40),
                  release_date char(40), 
                  plot char(100),
                  star char(40))
               """)

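Since the data is already stored in the list named movies, we only need to iterate over that list to insert it. One thing to note when inserting: in the earlier database chapter, the values were written directly inside values as quoted literals. Here the values are held in variables rather than written out as literal strings, so we pass them through %s placeholders and let the driver fill them in. The insert takes the following form.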

for i in movies:
    cursor.execute("insert into douban(title,ranks,director,release_date,plot,star) "
                   "values(%s,%s,%s,%s,%s,%s)",(i['title'],i['rank'],i['director'],
                   i['release_date'],i['plot'],i['star']))
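
The loop above issues one execute call per movie. As a possible variation, the rows can be built first and handed to the cursor's executemany method in a single call. This is a sketch under assumptions: insert_movies, movie_rows, INSERT_SQL and FIELDS are hypothetical names introduced here, and the cursor is expected to be a PyMySQL cursor such as the one created earlier.

```python
INSERT_SQL = ("insert into douban(title,ranks,director,release_date,plot,star) "
              "values(%s,%s,%s,%s,%s,%s)")

# Dict keys in the same order as the columns in INSERT_SQL.
# The dict key is 'rank' while the table column is 'ranks'.
FIELDS = ('title', 'rank', 'director', 'release_date', 'plot', 'star')


def movie_rows(movies):
    """Turn each movie dict into a tuple ordered like the SQL columns."""
    return [tuple(movie[field] for field in FIELDS) for movie in movies]


def insert_movies(cursor, movies):
    """Insert all movies with one executemany call; returns the row count."""
    rows = movie_rows(movies)
    if rows:  # skip the call entirely when there is nothing to insert
        cursor.executemany(INSERT_SQL, rows)
    return len(rows)
```

The placeholders stay the same as in the loop version; only the number of calls changes. A missing key in a movie dict raises KeyError before anything is sent to the database.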

This completes saving the data; we can now browse the records directly in the database.
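
In the complete listing, con.commit() is called once at the end, so an exception raised during the inserts would leave nothing committed. A minimal sketch of making that explicit is shown below, assuming a connection object with cursor, commit and rollback methods, as PyMySQL connections have. The helper run_in_transaction is a hypothetical name, not part of the original code.

```python
def run_in_transaction(con, work):
    """Run work(cursor); commit on success, roll back on any error."""
    cursor = con.cursor()
    try:
        result = work(cursor)
        con.commit()      # make the changes permanent
        return result
    except Exception:
        con.rollback()    # discard partial changes, then re-raise
        raise
    finally:
        cursor.close()    # always release the cursor
```

The insert loop could then be wrapped in a small function taking the cursor and passed as work, which keeps the commit and rollback handling in one place.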

3. Results

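In the database:

[Figure: scraped Douban movie information, screenshot 5]

In the txt file:

[Figure: scraped Douban movie information, screenshot 6]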

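4. Summary

In this example we combined three parts: the crawler, BeautifulSoup, and the database. Data checking is used only rarely here, so a general familiarity with it is enough. A crawler project broadly follows this workflow. This is of course only a fairly basic crawler exercise; interested readers can pick a project from the following site and practise hands-on: https://www.jb51.net/article/164829.htm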
