Learning urllib - A Link Crawler for hao123
Overview
The core of the crawler is the regular expression that matches links:
r'<a.*?href="([^\s]+?)">(\w+?)</a>'
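To see what the pattern captures, here is a small sketch that runs it on a made-up HTML fragment. The `sample_html` string and its example.com URLs are assumptions for illustration, not content from hao123. Group 1 is the `href` value and group 2 is the link text. In this sample the third link is not captured, because the pattern expects `">` immediately after the URL and here `href` is followed by another attribute.

```python
import re

# Same pattern as in the article; re.DOTALL lets `.` also match newlines.
LINK_PATTERN = re.compile(r'<a.*?href="([^\s]+?)">(\w+?)</a>', re.DOTALL)

# Hypothetical sample markup, invented for this example.
sample_html = (
    '<a href="https://example.com/news">News</a>'
    '<a class="nav" href="https://example.com/video">视频</a>'
    '<a href="https://example.com/x" target="_blank">Skipped</a>'
)

# findall returns one (href, name) tuple per match.
links = LINK_PATTERN.findall(sample_html)
for href, name in links:
    print('{}:{}'.format(name, href))
```

Because the link text is matched with `\w+?`, text that contains spaces, punctuation, or nested tags is also left out, so the pattern likely collects only part of the links on a real page.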
Source code
#!/usr/bin/env python3
# -*- coding:utf-8 -*-
import re
import urllib.request


def crawl():
    url = 'https://www.hao123.com/'
    # Send a browser-like User-Agent header with the request.
    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36'
    }
    req = urllib.request.Request(url=url, headers=headers)
    resp = urllib.request.urlopen(req)
    if resp.reason.lower() == 'ok':
        html = resp.read().decode('utf-8')
        # Group 1 captures the href value, group 2 the link text.
        pattern = r'<a.*?href="([^\s]+?)">(\w+?)</a>'
        result = re.compile(pattern, re.DOTALL).findall(html)
        # A set drops duplicate name:href entries.
        data = set()
        for href, name in result:
            data.add('{}:{}'.format(name, href))
        # Write one entry per line, separated by blank lines.
        with open('/home/brandon/PythonProjects/MySpider/data/hao123.txt', mode='w', encoding='utf-8') as f:
            for entry in data:
                f.write(entry + '\n\n')


if __name__ == '__main__':
    crawl()
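One detail worth noting about the request step: `urllib.request.urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses instead of returning them, so checking `reason` alone does not cover failed requests. Below is a minimal sketch of a fetch helper that handles this. `fetch_html`, its `timeout` default, and the shortened User-Agent are hypothetical choices for this example, not part of the original script.

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Return the decoded body of `url`, or None if the request fails.

    Hypothetical helper for illustration; assumes the page is UTF-8.
    """
    headers = {'User-Agent': 'Mozilla/5.0'}  # shortened placeholder User-Agent
    req = urllib.request.Request(url=url, headers=headers)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode('utf-8')
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        print('HTTP error {} for {}'.format(err.code, url))
    except urllib.error.URLError as err:
        # No usable response: DNS failure, refused connection, unknown scheme.
        print('Failed to reach {}: {}'.format(url, err.reason))
    return None
```

With a helper like this, `crawl()` could call `fetch_html(url)` and return early when it gets `None`. `HTTPError` is caught first because it is a subclass of `URLError`.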
Output