Overview

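The core of this crawler is a regular expression that matches hyperlinks. It captures the link target (the `href` value) and the link text: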

r'<a.*?href="([^\s]+?)">(\w+?)</a>'
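To see what the pattern captures, it can be tried on a small fragment before pointing it at a real page. The sketch below is only an illustration: `SAMPLE_HTML`, `LINK_PATTERN`, and `extract_links` are hypothetical names, and the example.com URLs are made up.

```python
import re

# The pattern from the article: group 1 is the href value, group 2 the link text.
LINK_PATTERN = re.compile(r'<a.*?href="([^\s]+?)">(\w+?)</a>', re.DOTALL)

# Hypothetical HTML fragment, made up for illustration only.
SAMPLE_HTML = (
    '<ul>'
    '<li><a class="nav" href="https://example.com/news">News</a></li>'
    '<li><a href="https://example.com/video">视频</a></li>'
    '<li><a href="https://example.com/img"><img src="x.png"></a></li>'
    '</ul>'
)


def extract_links(html):
    """Return a list of (href, text) tuples found by the pattern."""
    return LINK_PATTERN.findall(html)


links = extract_links(SAMPLE_HTML)
for href, name in links:
    print('{}:{}'.format(name, href))
```

In Python 3, `\w` matches Unicode word characters, so Chinese link text is captured. The third link is skipped because its content is an `<img>` tag rather than plain word characters; link text containing spaces or punctuation would be skipped for the same reason.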

Source code

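#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import re
import urllib.request


def crawl():
    url = 'https://www.hao123.com/'
    # Send a browser-like User-Agent; some sites reject the default urllib one.
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
    }
    req = urllib.request.Request(url=url, headers=headers)
    # The with-block closes the connection once the body has been read.
    with urllib.request.urlopen(req) as resp:
        # Only parse the page if the server answered 200 OK.
        if resp.status != 200:
            return
        html = resp.read().decode('utf-8')

    # Group 1 captures the href value, group 2 the link text.
    pattern = r'<a.*?href="([^\s]+?)">(\w+?)</a>'
    result = re.compile(pattern, re.DOTALL).findall(html)

    # A set drops duplicate name/link pairs.
    data = set()
    for href, name in result:
        data.add('{}:{}'.format(name, href))

    # Write one "name:href" entry per block, separated by a blank line.
    with open('/home/brandon/PythonProjects/MySpider/data/hao123.txt', mode='w', encoding='utf-8') as f:
        for item in data:
            f.write(item + '\n\n')


if __name__ == '__main__':
    crawl()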
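
The regular expression only matches links whose text consists purely of word characters. As a possible alternative, and not the approach used in the script above, the standard library's `html.parser` can collect links whose text contains spaces or nested tags. This is a minimal sketch; `LinkCollector`, `collect_links`, and the sample markup are hypothetical names and data introduced for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (text, href) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) tuples.
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...>.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            text = ''.join(self._text).strip()
            if text:  # skip links with no visible text, e.g. image links
                self.links.append((text, self._href))
            self._href = None


def collect_links(html):
    """Parse an HTML string and return the collected (text, href) pairs."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical markup, made up for illustration only.
sample = (
    '<p><a class="nav" href="https://example.com/news">Hot <b>News</b></a> '
    '<a href="/img"><img src="x.png"></a></p>'
)
print(collect_links(sample))
```

The pairs come back in document order, so they could be fed into the same `set` and file-writing steps as in `crawl()`.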

Run results

