Python: Web Crawler Series 01


南郭竽


南郭竽 · Published 2017-06-25 02:06:31

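I have been reading Learning Python for a while and have gotten roughly as far as the chapters on classes, but I had never actually put any of it into practice.
So I decided to write something small. I was not sure what to write, so I settled on starting with a web crawler.

Following crawler tutorials found online, I wrote a small exercise that collects the links in a web page.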
- common_var.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @author : cat
# @date   : 2017/6/25.

user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
headers = {"User-Agent": user_agent}

if __name__ == '__main__':
    pass
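
As a quick sketch of how this dict is meant to be used, the snippet below attaches headers to a urllib.request.Request without sending anything over the network. The URL http://example.com/ is only a placeholder, and the values from common_var.py are repeated locally so the snippet runs on its own.

```python
from urllib import request

# Same values as common_var.py, repeated here so the snippet is self-contained.
user_agent = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36")
headers = {"User-Agent": user_agent}

# Building a Request does not open a connection; example.com is a placeholder URL.
req = request.Request("http://example.com/", headers=headers)

# urllib stores header names via str.capitalize(), hence the key "User-agent".
sent_user_agent = req.get_header("User-agent")
print(sent_user_agent == user_agent)
```

Some sites respond differently to the default Python user agent, which is presumably why the exercise sends a browser-like one.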

- http_file.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @author : cat
# @date   : 2017/6/24.
from urllib import request
import ssl
from web.common_var import headers
import re

# regex from Django
regex = re.compile(
    r'^(?:http|ftp)s?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)

csdn = 'http://www.csdn.com'


def get_urls(url_in=csdn, key="href="):
    """
    Crawl the URLs found on the page behind an entry URL.
    :param url_in: the entry URL
    :param key: marker that precedes each link, 'href='
    :return: a set of URLs
    """
    url_sets = set()
    # Private helper that turns off certificate verification; tolerable for an exercise only.
    ssl_context = ssl._create_unverified_context()
    req = request.Request(url_in, headers=headers)
    resp_bytes = request.urlopen(req, context=ssl_context)
    for line in resp_bytes:
        line_html = line.decode('utf-8')
        # print(line_html)
        if key in line_html:
            # print(line_html)
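            index = line_html.index(key)
            # Take the text inside the first pair of double quotes after href= (a '#' also ends it).
            parts = line_html[index + len(key):].replace('"', "#").split('#')
            if len(parts) < 2:
                continue  # no double-quoted value on this line; avoids an IndexError
            sub_url = parts[1]
            match = regex.search(sub_url)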
            if match:
                # print(match.group())
                # yield match.group()
                url_sets.add(match.group())
                # print(url_sets)
    return url_sets


if __name__ == '__main__':
    # print(list(get_urls("http://news.baidu.com/?tn=news")))
    baidu_news = "http://news.baidu.com/?tn=news"
    urls = get_urls(baidu_news)
    # print(urls)
    for u in urls:
        print(u)
    print("total url size in this website({}) = {}"
          .format(baidu_news, len(urls)))

The code is not especially concise, but it is fairly easy to follow.
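
To make the extraction step concrete, here is a minimal sketch that applies the same replace/split trick and the same regex to a few hand-written lines instead of a live page. The helper name extract_first_url and the sample lines are made up for illustration; the helper returns None where indexing [1] directly would raise an IndexError.

```python
import re

# Same pattern as in http_file.py, repeated so the snippet runs on its own.
regex = re.compile(
    r'^(?:http|ftp)s?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
    r'(?::\d+)?'  # optional port
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)


def extract_first_url(line_html, key="href="):
    """Hypothetical helper: first double-quoted absolute URL after key, or None."""
    if key not in line_html:
        return None
    rest = line_html[line_html.index(key) + len(key):]
    # Turning '"' into '#' and splitting leaves the quoted value at index 1.
    parts = rest.replace('"', "#").split('#')
    if len(parts) < 2:
        return None  # value is not double-quoted
    match = regex.search(parts[1])
    return match.group() if match else None


samples = [
    '<a href="http://tech.baidu.com/">absolute</a>',
    '<a href="/relative/path">relative</a>',
    "<a href='http://tech.baidu.com/'>single quotes</a>",
    '<a href="http://a.example.com/x#top">fragment</a>',
    '<a href="http://a.example.com/">1</a><a href="http://b.example.com/">2</a>',
]
results = [extract_first_url(s) for s in samples]
for sample, result in zip(samples, results):
    print(result, "<-", sample)
```

The samples show the limits of the approach: relative and single-quoted links are skipped, a fragment after '#' is cut off, and only the first link on a line is kept.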

Running http_file.py prints the following:

/web/http_file.py

https://baijia.baidu.com/s?id=1571043179126899
http://net.china.cn/chinese/index.htm
http://newsalert.baidu.com/na?cmd=0
http://tech.baidu.com/
http://tv.cctv.com/2017/06/24/VIDE9KYKPMTmLLENgIgdhyut170624.shtml
http://xinwen.eastday.com/a/170624122900408.html
http://shehui.news.baidu.com/
… # many more URLs follow; not all of them are pasted here.
…

total url size in this website(http://news.baidu.com/?tn=news) = 116

Process finished with exit code 0
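
Since the line-based extraction keeps at most one double-quoted link per line, the count above may understate the links on the page. As a possible alternative (not what this exercise used), the sketch below collects links with the standard library's html.parser.HTMLParser and resolves relative ones with urllib.parse.urljoin. The class name LinkCollector, the function collect_links, and the sample HTML are all made up for illustration, and the sketch only looks at a tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value:
                self.links.add(urljoin(self.base_url, value))


def collect_links(html_text, base_url):
    """Parse already-decoded HTML text and return the set of link URLs."""
    parser = LinkCollector(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links


# Hand-written sample: two links on one line, one single-quoted and relative,
# plus a duplicate that the set collapses.
sample_html = (
    '<a href="http://tech.baidu.com/">a</a>'
    "<a href='/relative/path'>b</a>"
    '<a href="http://tech.baidu.com/">dup</a>'
)
links = collect_links(sample_html, "http://news.baidu.com/")
for link in sorted(links):
    print(link)
```

Unlike the regex filter in http_file.py, this sketch does not restrict results to http or ftp URLs, so that check would still need to be applied afterwards if only such links are wanted.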