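The Beautiful Soup Library

Run pip install beautifulsoup4 to install the Beautiful Soup library.

Introduction to Beautiful Soup

Beautiful Soup is a Python library for extracting data from HTML or XML files.

It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying a document.

Beautiful Soup is a library for parsing, traversing, and maintaining the "tag tree".

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.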
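The encoding behaviour can be seen in a minimal sketch. It assumes a Latin-1 byte string as input; the sample markup and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical input: bytes encoded as Latin-1 rather than UTF-8
raw = "<p>caf\u00e9</p>".encode("latin-1")

# from_encoding tells Beautiful Soup how to decode the bytes
soup = BeautifulSoup(raw, "html.parser", from_encoding="latin-1")

# Inside the tree, text is held as Unicode strings
text = soup.p.string
print(repr(text))

# encode() serializes the document to bytes, here as UTF-8
out = soup.encode("utf-8")
print(out)
```

If the encoding is not passed in, Beautiful Soup tries to detect it, and the guess is then available as soup.original_encoding.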

Importing Beautiful Soup

from bs4 import BeautifulSoup

import bs4

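The BeautifulSoup class is the one used most often.

Parsers for Beautiful Soup

Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers.

Parser                          Usage                              Requirement
Python's built-in HTML parser   BeautifulSoup(mk, 'html.parser')   install the bs4 library
lxml's HTML parser              BeautifulSoup(mk, 'lxml')          pip install lxml
lxml's XML parser               BeautifulSoup(mk, 'xml')           pip install lxml
html5lib's parser               BeautifulSoup(mk, 'html5lib')      pip install html5lib

If you do not specify a parser, Beautiful Soup picks the best one available among those installed, so the result can vary between environments. If you specify one explicitly, Beautiful Soup uses that parser.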

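from bs4 import BeautifulSoup

# Parse a local file with the standard-library parser
soup1 = BeautifulSoup(open("E://index.html"), "html.parser")

# Parse a string with lxml's HTML parser
soup2 = BeautifulSoup("<html>data</html>", "lxml")

# No parser specified: Beautiful Soup chooses one itself
soup3 = BeautifulSoup("<html>data</html>")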
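Because third-party parsers may not be installed everywhere, one option is to try them in order of preference and fall back to the standard-library parser. The sketch below is one way to do that, not a recommended recipe; make_soup and its preferred argument are hypothetical names introduced here. It relies on Beautiful Soup raising FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one
            continue
    raise RuntimeError("none of the requested parsers is available")

soup, used = make_soup("<p>data</p>")
print(used, soup.p.string)
```

Naming the parser explicitly this way keeps the choice visible in the code instead of leaving it to whatever happens to be installed.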

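Beautiful Soup Objects

Tag, NavigableString, BeautifulSoup, Comment

Object            Description
Tag               A tag, the most basic unit of information; its start and end are marked with <> and </>
NavigableString   A subclass of Python's str; in practice it behaves like an ordinary string
BeautifulSoup     Represents the entire content of a document; most of the time it can be treated as a Tag object
Comment           The comment portions of the document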

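Tag

Attributes of a Tag:
Name:

The name of the tag; for example, the name of <p>…</p> is 'p'. Format: <tag>.name.

Changing a tag's name affects all HTML markup subsequently generated by the current Beautiful Soup object.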

>>> import requests
>>> from bs4 import BeautifulSoup
>>> r=requests.get("http://python123.io/ws/demo.html")
>>> demo=r.text
>>> soup=BeautifulSoup(demo,'html.parser')
>>> soup.a.name
u'a'
>>> soup.a.name='aaa'
>>> soup.aaa.name
'aaa'
Attributes:

The tag's attributes, organized as a dictionary. Format: <tag>.attrs.

>>> r=requests.get("http://python123.io/ws/demo.html")
>>> demo=r.text
>>> soup=BeautifulSoup(demo,'html.parser')
>>> tag=soup.a
>>> tag.attrs
{u'href': u'http://www.icourse163.org/course/BIT-268001', u'class': [u'py1'], u'id': u'link1'}
>>> tag.attrs['class']
[u'py1']
>>> tag.attrs['href']
u'http://www.icourse163.org/course/BIT-268001'
>>> type(tag.attrs)
<type 'dict'>
>>> type(tag)
<class 'bs4.element.Tag'>
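The same two properties can be tried without a network request. The sketch below uses a small inline snippet modelled on the demo page; the markup and the new tag name "span" are chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Inline markup standing in for the downloaded demo page
html = ('<p class="course"><a href="http://www.icourse163.org/course/BIT-268001" '
        'class="py1" id="link1">Basic Python</a></p>')
soup = BeautifulSoup(html, "html.parser")

tag = soup.a
original_name = tag.name        # name of the tag
href = tag.attrs["href"]        # single-valued attribute: a string
classes = tag.attrs["class"]    # multi-valued attribute: a list
print(original_name, href, classes)

# Renaming the tag changes the markup generated from the tree
tag.name = "span"
print(soup.p)
```

Note that class comes back as a list because an HTML element may carry several classes.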
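Operations on a Tag:

Tag attributes are accessed the same way as dictionary entries.

tag=soup.a
#print(tag)
print(tag['class'])
print(tag['id'])
print(tag['href'])

# Tag attributes can be added, deleted, and modified, just like dict entries
tag['class']='xiaodeng'
tag['id']=123

# Delete an attribute
del tag['class']
print(tag.get('class'))  # None: the attribute no longer exists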
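A self-contained version of these dictionary-style operations might look like the sketch below. The inline markup and the new attribute values ("link-basic", "course link") are assumptions made for the example.

```python
from bs4 import BeautifulSoup

html = ('<a class="py1" id="link1" '
        'href="http://www.icourse163.org/course/BIT-268001">Basic Python</a>')
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

tag["id"] = "link-basic"        # modify an existing attribute
tag["title"] = "course link"    # add a new attribute
del tag["class"]                # delete an attribute

has_class = tag.has_attr("class")   # membership test
fallback = tag.get("class", [])     # get() with a default avoids KeyError
print(tag)
print(has_class, fallback)
```

Indexing with tag["class"] after the deletion would raise KeyError, which is why get() or has_attr() is the safer choice when an attribute may be absent.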

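NavigableString

Strings are usually contained within tags. Beautiful Soup uses the NavigableString class to wrap the non-attribute string inside a tag, that is, the text between <> and </>. Format: <tag>.string.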

>>> soup.a
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
>>> soup.a.string
u'Basic Python'
>>> soup.p
<p class="title"><b>The demo python introduces several python courses.</b></p>
>>> soup.p.string
u'The demo python introduces several python courses.'
>>> type(soup.p.string)
<class 'bs4.element.NavigableString'>
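One detail worth knowing is that .string only returns a value when a tag has a single child; otherwise it is None and get_text() is the usual alternative. The sketch below illustrates this on made-up markup.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup("<p>Learn <b>Python</b> here</p>", "html.parser")

bold_text = soup.b.string       # <b> has exactly one string child
para_string = soup.p.string     # <p> has several children, so this is None
para_text = soup.p.get_text()   # concatenates all strings below <p>

# Convert to a plain str to drop the reference to the parse tree
plain = str(bold_text)

print(repr(bold_text), para_string, repr(para_text))
```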

BeautifulSoup

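A BeautifulSoup object is not a real HTML or XML tag, so it has no name or attributes of its own. It is sometimes convenient to inspect its .name, however, so the BeautifulSoup object carries a special .name whose value is "[document]".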

>>> soup.name
u'[document]'
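A short sketch of how the BeautifulSoup object relates to Tag, using made-up markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<html><head><title>demo</title></head></html>",
                     "html.parser")

is_tag = isinstance(soup, Tag)   # BeautifulSoup is a subclass of Tag
doc_name = soup.name             # the special name of the document object
top_parent = soup.parent         # the document itself has no parent
title_text = soup.title.string   # tag-style navigation works on it

print(is_tag, doc_name, top_parent, title_text)
```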

Comment

The comment portion of a document. A Comment object is a special type of NavigableString object.

>>> nsoup=BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>","html.parser")

>>> nsoup.b.string
u'This is a comment'
>>> type(nsoup.b.string)
<class 'bs4.element.Comment'>
>>> nsoup.p.string
u'This is not a comment'
>>> type(nsoup.p.string)
<class 'bs4.element.NavigableString'>
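Since .string returns comment text and ordinary text alike, code that needs to tell them apart has to check the type. The sketch below does that with isinstance on the same markup as above; the list names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

markup = "<b><!--This is a comment--></b><p>This is not a comment</p>"
soup = BeautifulSoup(markup, "html.parser")

comments = []
texts = []
for node in soup.descendants:
    if isinstance(node, Comment):
        # Comment must be tested first: it is a subclass of NavigableString
        comments.append(node)
    elif isinstance(node, NavigableString):
        texts.append(node)

print(comments)
print(texts)
```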

Pretty-printing with Beautiful Soup

prettify()

>>> soup = BeautifulSoup(demo,"html.parser")
>>> soup.p.next_sibling.next_sibling
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
>>> print(soup.p.next_sibling.next_sibling.prettify())
<p class="course">
 Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
 <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
  Basic Python
 </a>
 and
 <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
  Advanced Python
 </a>
 .
</p>
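prettify() can be called on any tag or on the whole document and returns a string. The sketch below compares it with the compact form produced by str(); the sample markup is made up.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hi <b>there</b></p>", "html.parser")

compact = str(soup)        # markup on a single line
pretty = soup.prettify()   # one node per line, indented by depth

print(compact)
print(pretty)
```

The indented form is meant for reading. It adds whitespace to the markup, so it is better not to write it back out in place of the original document.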


Traversing HTML Content with Beautiful Soup

Basic HTML structure

graph TD
html-->head
html-->body
head-->title
body-->p1
body-->p2
p1-->b
p2-->a1
p2-->a2

<>…</>

Downward traversal of the tag tree

graph LR
html-->head
head-->title
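Attribute      Description
.contents      A list of child nodes; all direct children are stored in the list
.children      An iterator over child nodes; like .contents, used to loop over direct children
.descendants   An iterator over all descendant nodes, used to loop over the whole subtree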
>>> soup=BeautifulSoup(demo,"html.parser")
>>> soup.head
<head><title>This is a python demo page</title></head>
>>> soup.head.contents
[<title>This is a python demo page</title>]
>>> soup.body.contents
[u'\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, u'\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, u'\n']
>>> len(soup.head.contents)
1
>>> len(soup.body.contents)
5
>>> soup.body.contents[1]
<p class="title"><b>The demo python introduces several python courses.</b></p>
Iterate over child nodes
>>> for child in soup.body.children:
...     print(child)
Iterate over descendant nodes
>>> for child in soup.body.descendants:
...     print(child)
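The difference between the three attributes is easiest to see on a small tree. The sketch below uses made-up markup without whitespace between tags, and filters descendants down to Tag objects, since the iterators also yield strings.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup("<body><p><b>x</b></p><p>y</p></body>", "html.parser")
body = soup.body

child_list = body.contents                        # a real list, supports len() and indexing
child_names = [c.name for c in body.children]     # direct children only
all_nodes = list(body.descendants)                # tags and strings, depth first
tag_names = [n.name for n in all_nodes if isinstance(n, Tag)]

print(len(child_list), child_names)
print(len(all_nodes), tag_names)
```

In real pages the newlines between tags show up as extra string nodes, which is why len(soup.body.contents) in the transcript above is 5 rather than 2.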

Upward traversal of the tag tree

graph LR
b-->p1
p1-->body
body-->html
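Attribute   Description
.parent     The node's parent tag
.parents    An iterator over the node's ancestor tags, used to loop over ancestors

Iterate over all ancestor nodes, including the soup object itself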

>>> soup = BeautifulSoup(demo,"html.parser")
>>> for parent in soup.a.parents:
...     if parent is None:
...             print(parent)
...     else:
...             print(parent.name)
...
p
body
html
[document]
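The same walk can be collected into a list, which makes the order explicit. The sketch below uses a minimal made-up document.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p><a>link</a></p></body></html>",
                     "html.parser")

direct_parent = soup.a.parent.name                  # nearest enclosing tag
ancestors = [parent.name for parent in soup.a.parents]
root_parent = soup.parent                           # the document has no parent

print(direct_parent)
print(ancestors)
print(root_parent)
```

The iterator stops after the BeautifulSoup object, so None is never yielded by .parents itself.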

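Sibling traversal of the tag tree

Attribute            Description
.next_sibling        The next sibling node in HTML document order
.previous_sibling    The previous sibling node in HTML document order
.next_siblings       An iterator over all following sibling nodes in HTML document order
.previous_siblings   An iterator over all preceding sibling nodes in HTML document order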
graph LR
p1-->p2
a1-->a2
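Iterate over following siblings
>>> for sibling in soup.a.next_siblings:
...     print(sibling)
...
 and 
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
.
Iterate over preceding siblings
>>> for sibling in soup.a.previous_siblings:
...     print(sibling)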
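Siblings are any nodes that share a parent, so the iterators return strings as well as tags. The sketch below, on made-up markup, collects both directions and then keeps only the Tag objects.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = ('<p>Courses: <a id="link1">Basic Python</a> and '
        '<a id="link2">Advanced Python</a>.</p>')
soup = BeautifulSoup(html, "html.parser")
first = soup.a

following = list(first.next_siblings)       # string, tag, string
preceding = list(first.previous_siblings)   # the text before the first <a>
tag_siblings = [s for s in first.next_siblings if isinstance(s, Tag)]

print(following)
print(preceding)
print([t["id"] for t in tag_siblings])
```

Iterating over the singular .next_sibling instead would loop over the characters of a string node, which is rarely what is wanted.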

References

Beautiful Soup 4.2.0 documentation
Beautiful Soup 4.4.0 documentation
