RJ Blog

Scraping Dynamic Pages with Python - selenium+PhantomJS

Environment Setup

selenium

pip install selenium
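One caveat worth checking first: as far as I know, selenium 4 dropped webdriver.PhantomJS (it had already been deprecated in the later 3.x releases), so the code in this post needs a 3.x release, for example pip install "selenium<4". Below is a minimal sketch for checking what is installed. The helper names are made up for illustration, and it assumes Python 3.8+ for importlib.metadata.

```python
from importlib import metadata  # standard library since Python 3.8


def supports_phantomjs(version):
    """Return True if this selenium version string is older than 4.0."""
    major = int(version.split(".")[0])
    return major < 4


def installed_selenium_version():
    """Return the installed selenium version, or None if it is not installed."""
    try:
        return metadata.version("selenium")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_selenium_version()
    if version is None:
        print("selenium is not installed")
    else:
        print(version, "| PhantomJS driver expected:", supports_phantomjs(version))
```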


PhantomJS

Each operating system has its own build. Download the phantomjs package for your OS from the official site:

http://phantomjs.org/download.html


Windows

After downloading, unzip the archive. The executable phantomjs.exe is in the bin directory.


Linux

Install the dependencies first:

yum install gcc gcc-c++ make flex bison gperf ruby openssl-devel freetype-devel fontconfig-devel libicu-devel sqlite-devel libpng-devel libjpeg-devel

Download the tar.bz2 file and extract it:

tar -xjvf phantomjs-2.1.1-linux-x86_64.tar.bz2

Here too the bin directory contains the executable, phantomjs. From that directory, copy it to a location on the system PATH:

cp phantomjs /usr/local/bin

Alternatively, create a symlink (use an absolute path for the source):

ln -s /download/phantomjs-2.1.1-linux-x86_64/bin/phantomjs /usr/local/bin/

Check that the installation worked:

phantomjs --version
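The same check can be done from Python, which is handy before launching a scraper. This is only a sketch: phantomjs_version is a name invented here, and it assumes Python 3.7+ and that the binary prints its version on stdout.

```python
import shutil
import subprocess


def phantomjs_version(binary="phantomjs"):
    """Return the version printed by the binary, or None if it is not on the PATH."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(phantomjs_version() or "phantomjs not found on PATH")
```

If this prints "phantomjs not found on PATH", the copy or symlink step above did not take effect for the current shell.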



Test:

# coding: utf-8
import unittest

from bs4 import BeautifulSoup
from selenium import webdriver


class SeleniumTest(unittest.TestCase):
    def setUp(self):
        # Path to the phantomjs executable installed above
        self.driver = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')

    def testEle(self):
        driver = self.driver
        driver.get('http://www.douyu.com/directory/all')
        # The rendered page is HTML, so parse it with an HTML parser
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        while True:
            # Room titles and viewer counts on the current page
            titles = soup.find_all('h3', {'class': 'ellipsis'})
            nums = soup.find_all('span', {'class': 'dy-num fr'})
            for title, num in zip(titles, nums):
                print(num.get_text().strip(), ' | ', title.get_text().strip())
            # On the last page the "next" button carries this class
            if driver.page_source.find('shark-pager-disable-next') != -1:
                break
            elem = driver.find_element_by_class_name('shark-pager-next')
            elem.click()
            soup = BeautifulSoup(driver.page_source, 'html.parser')

    def tearDown(self):
        # Shut down the PhantomJS process so it does not linger after the test
        self.driver.quit()
        print('down')

if __name__ == "__main__":
    unittest.main()
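One thing the test glosses over: right after elem.click() the page may not have re-rendered yet, so page_source can still hold the previous page. Selenium ships WebDriverWait for this kind of waiting. The sketch below is a plain-Python alternative that polls until a value changes. wait_for_change is a hypothetical helper, and "the page source changed" is only a rough stand-in for "the next page has loaded".

```python
import time


def wait_for_change(read_value, old_value, timeout=10.0, interval=0.5):
    """Poll read_value() until it differs from old_value, then return the new value.

    Raises TimeoutError if nothing changes within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = read_value()
        if value != old_value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError("value did not change within %.1f s" % timeout)
        time.sleep(interval)


# Hypothetical use inside the pagination loop:
#     before = driver.page_source
#     elem.click()
#     html = wait_for_change(lambda: driver.page_source, before)
```

Polling the whole page source is crude, since scripts on the page can change it for unrelated reasons. Waiting for a specific element to appear is usually more reliable.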


If you also need to set headers or a proxy, see:

http://blog.csdn.net/tcorpion/article/details/70213435

https://www.zhihu.com/question/35547395/answer/145214771

For example, to set the User-Agent:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Copy the default PhantomJS capabilities and override the User-Agent
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = "Mozilla/5.0"
browser = webdriver.PhantomJS(desired_capabilities=dcap)
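Extra request headers and a proxy can be handled along the same lines. The sketch below rests on two assumptions that should be checked against your versions: that the PhantomJS driver honors capabilities prefixed with phantomjs.page.customHeaders., and that the phantomjs binary accepts --proxy and --proxy-type on the command line. The helper names are illustrative only.

```python
def build_capabilities(base, user_agent=None, headers=None):
    """Return a copy of `base` with PhantomJS page settings added."""
    caps = dict(base)
    if user_agent:
        caps["phantomjs.page.settings.userAgent"] = user_agent
    for name, value in (headers or {}).items():
        # Assumed capability prefix for per-request custom headers
        caps["phantomjs.page.customHeaders." + name] = value
    return caps


def build_service_args(proxy=None, proxy_type="http"):
    """Return PhantomJS command-line arguments for an optional proxy ("host:port")."""
    if not proxy:
        return []
    return ["--proxy=" + proxy, "--proxy-type=" + proxy_type]


def make_browser(user_agent=None, headers=None, proxy=None):
    """Start PhantomJS with the given settings (needs a selenium release that still has it)."""
    # Imported lazily so the two helpers above work without selenium installed
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    caps = build_capabilities(DesiredCapabilities.PHANTOMJS, user_agent, headers)
    return webdriver.PhantomJS(
        desired_capabilities=caps, service_args=build_service_args(proxy)
    )
```

A call might look like make_browser(user_agent="Mozilla/5.0", headers={"Referer": "http://example.com/"}, proxy="127.0.0.1:8080"), where the header value and the proxy address are placeholders.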





