Scraping Jokes from Qiushibaike with urllib and BeautifulSoup


善良超锅锅 · Published 2015-07-24 18:13:06

Python version:

chao@chao-machine:~/python_study$ python
Python 3.4.3 (default, May 31 2015, 17:07:22) 
[GCC 4.9.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

Code:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import urllib.request
import urllib.error


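def print_qiushi(item):
	# Skip posts that contain an image
	if item.find('div', class_='thumb'):
		return
	# Skip posts that contain a video
	if item.find('div', class_='video_holder'):
		return
	# Get the name of the user who posted this joke
	author = item.find('div', class_='author')
	if author is not None:
		author = author.get_text().strip()
	else:
		author = 'anonymous'
	# The joke body lives in the "content" div; skip the post if it is missing
	content_div = item.find('div', class_='content')
	if content_div is None:
		return
	# Get the publication time, assumed to be the second-to-last child node
	# of the content div; fall back to an empty string if it is absent
	if len(content_div.contents) >= 2:
		times = str(content_div.contents[-2]).strip()
	else:
		times = ''
	# Get the joke text
	content = content_div.get_text().strip()

	print('-_-:', author, "  ", times, '\n')
	print(content)
	print("\n\n")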


# "http://www.qiushibaike.com/" also works, because posts with images or videos are filtered out anyway
url = "http://www.qiushibaike.com/text"
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
heads = {'User-Agent': user_agent}
try:
	request = urllib.request.Request(url,headers=heads)
	response = urllib.request.urlopen(request)
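	# Name the parser explicitly so the result does not depend on which parsers are installed
	soup = BeautifulSoup(response.read(), 'html.parser')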
	items = soup.find_all(name='div',class_='article block untagged mb15')
	# Process each post: username, content, and publication time
	for item in items:
		print_qiushi(item)
except urllib.error.URLError as e:
	if hasattr(e,'code'):
		print(e.code)
	if hasattr(e,'reason'):
		print(e.reason)
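
The request setup and the error handling in the script can be checked without touching the network. The following is only a minimal sketch: `describe_error` is a hypothetical helper that mirrors the script's `hasattr` checks, and the two exceptions are constructed by hand for illustration rather than coming from a real failed request.

```python
import io
import urllib.error
import urllib.request

# Build the same kind of request as the script, but never send it.
url = "http://www.qiushibaike.com/text"
heads = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib.request.Request(url, headers=heads)

# Request stores header names capitalized, so the key is "User-agent".
print(request.get_header('User-agent'))
# With no data attached, the request method defaults to GET.
print(request.get_method())

# HTTPError is a subclass of URLError, so one except clause catches both.
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))


def describe_error(e):
    """Hypothetical helper: collect whichever of code/reason the error has."""
    parts = []
    if hasattr(e, 'code'):
        parts.append('code={}'.format(e.code))
    if hasattr(e, 'reason'):
        parts.append('reason={}'.format(e.reason))
    return ', '.join(parts)


# A plain URLError (for example a refused connection) carries only a reason.
plain_error = urllib.error.URLError('connection refused')
# An HTTPError (for example a 404 response) carries both a code and a reason.
http_error = urllib.error.HTTPError(url, 404, 'Not Found', {}, io.BytesIO())

print(describe_error(plain_error))
print(describe_error(http_error))
```

This is why the script tests for `code` and `reason` separately: only HTTP-level failures have a status code, while lower-level failures have just a reason.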

Output:

This post was written in a bit of a hurry; I will walk through how it works next time.
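
As a rough illustration of the filtering step in `print_qiushi`, here is a minimal offline sketch. The HTML in `SAMPLE_HTML` is made up: it only imitates the class names the script searches for and is not the site's real markup. `extract_text_posts` is a hypothetical helper that returns the posts instead of printing them.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the class names the script looks for.
SAMPLE_HTML = """
<div class="article block untagged mb15">
  <div class="author">alice</div>
  <div class="content">A text-only joke.</div>
</div>
<div class="article block untagged mb15">
  <div class="author">bob</div>
  <div class="content">A joke with a picture.</div>
  <div class="thumb"><img src="pic.jpg"></div>
</div>
<div class="article block untagged mb15">
  <div class="content">An anonymous joke.</div>
</div>
"""


def extract_text_posts(html):
    """Return (author, content) pairs for posts without images or videos."""
    soup = BeautifulSoup(html, 'html.parser')
    posts = []
    # A multi-word class_ string matches the exact value of the class attribute.
    for item in soup.find_all('div', class_='article block untagged mb15'):
        # Same filters as the script: skip posts with a thumb or video_holder div.
        if item.find('div', class_='thumb') or item.find('div', class_='video_holder'):
            continue
        author = item.find('div', class_='author')
        name = author.get_text().strip() if author is not None else 'anonymous'
        content = item.find('div', class_='content').get_text().strip()
        posts.append((name, content))
    return posts


for name, content in extract_text_posts(SAMPLE_HTML):
    print(name, '|', content)
```

In this sample the post by bob is dropped because it contains a `thumb` div, and the post without an `author` div is reported as `anonymous`.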
