Python Web Scraping in Practice (3): Zhubajie (zbj.com) with XPath

WFForstar · Published 2022-01-13 12:06:07

Contents

1. Introduction
2. Notes
3. Code

1. Introduction

Compared with re and bs4, XPath is the approach I see used most often in real projects,
so it is worth learning well.

2. Notes

You need to install the lxml library first; otherwise the etree API cannot be imported.
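
The library can be installed with `pip install lxml`. As a quick check that etree is available, here is a minimal sketch that parses a small HTML string and runs one query. The snippet and the variable names are made up for illustration and have nothing to do with the real zbj.com page.

```python
from lxml import etree

# Hypothetical HTML snippet, used only for illustration
snippet = "<html><body><ul><li>saas</li><li>crm</li></ul></body></html>"

# etree.HTML parses a string of HTML into an element tree
tree = etree.HTML(snippet)

# xpath() returns a list, even when only one node matches
items = tree.xpath("/html/body/ul/li/text()")
print(items)  # ['saas', 'crm']
```

If the import fails with ModuleNotFoundError, lxml is most likely not installed in the interpreter that is running the script.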
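Python basics: strip()
Removes the specified characters from both ends of a string. The argument is treated as a set of characters, not as a fixed prefix or suffix.

text = "123abcrunoob321"
print(text.strip('12'))  # characters to strip: '1' and '2'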

3abcrunoob3
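
Because the argument to strip() is a set of characters, it is easy to misread it as a prefix. The sketch below shows the difference and the way strip() is used on price strings later in this post. The sample strings are invented for illustration.

```python
# Hypothetical price string, similar to what text() returns for a price node
price_text = "¥1200"
amount = price_text.strip("¥")
print(amount)  # 1200

# The argument is a set of characters: every leading or trailing
# 'x' or 'y' is removed, in any order, until another character is hit
padded = "xyxhelloyx"
cleaned = padded.strip("xy")
print(cleaned)  # hello

# With no argument, strip() removes surrounding whitespace
spaced = "  ¥88 \n".strip()
print(spaced)  # ¥88
```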

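Python basics: join()
Joins the elements of a list into a single string, placing the given separator between them.

items = ['1', '2', '3', '4']
sep = "-"
joined = sep.join(items)
print(joined)

1-2-3-4  # output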
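
In scraping code, join() is handy because a text() query can return several fragments for one element. A minimal sketch, assuming a made-up list of fragments rather than real page data:

```python
# Hypothetical fragments, as text() might return them when a tag
# contains nested children that split its text
fragments = ["SaaS ", "system", " development"]

# An empty separator glues the fragments back together
title = "".join(fragments).strip()
print(title)  # SaaS system development

# join() only accepts strings, so other types must be converted first
numbers_joined = "-".join(str(n) for n in [1, 2, 3])
print(numbers_joined)  # 1-2-3
```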

The last div returns an empty list for the price query, so skip it with an if check.
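
The skip can be sketched without any network access. The nested list below stands in for the results of calling div.xpath(...) on three divs; the values are hypothetical.

```python
# Hypothetical xpath results for three divs; the last one matched nothing
results = [["¥1200"], ["¥800"], []]

prices = []
for price in results:
    if not price:  # an empty list is falsy, so this div is skipped
        continue
    prices.append(price[0].strip("¥"))

print(prices)  # ['1200', '800']
```

Without the check, indexing price[0] on the empty list would raise IndexError.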

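3. Code

# XPath is a language for locating content in XML documents
# HTML has a similar tree structure, so lxml can query it with XPath as well
# Step 1: fetch the page source
# Step 2: parse it and extract the data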
import requests
from lxml import etree

url = "https://taizhou.zbj.com/search/f/?kw=saas"
res = requests.get(url)
# print(res.text)
# Parse the page source into an element tree
html = etree.HTML(res.text)
# Select the div of each service provider
divs = html.xpath("/html/body/div[6]/div/div/div[2]/div[5]/div")

for div in divs:  # one service provider per div
    company_name = div.xpath('./div/div[1]/div/div/a[2]/div[2]/div[1]/span[1]/text()')
    price = div.xpath('./div/div/div/a[2]/div[2]/div[1]/span[1]/text()')
    service = div.xpath('./div/div/div/a[2]/div[2]/div[2]/p/text()')
    if not price:  # skip divs without a price (the last one is empty)
        continue
    # print(company_name)
    for p in price:
        print(p.strip("¥"))
    # print(service)
    # print(divs)
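
Absolute paths such as /html/body/div[6]/... depend on the exact page layout and stop matching when the site changes. As one possible alternative, here is a minimal offline sketch that selects nodes by attribute and collects the results into dictionaries. The markup, the class names, and the function parse_items are all invented for illustration; they are not the real structure of zbj.com.

```python
from lxml import etree

# Hypothetical markup: the class names are assumptions, not zbj.com's real ones
SAMPLE = """
<div class="list">
  <div class="item"><span class="name">Acme Studio</span><span class="price">¥1200</span></div>
  <div class="item"><span class="name">Blue Lab</span><span class="price">¥800</span></div>
  <div class="item"></div>
</div>
"""


def parse_items(page_source):
    """Return one dict per item that has a price."""
    tree = etree.HTML(page_source)
    rows = []
    # Select items by attribute instead of by position in the page
    for div in tree.xpath('//div[@class="list"]/div[@class="item"]'):
        # A leading "./" makes the query relative to the current div
        name = div.xpath('./span[@class="name"]/text()')
        price = div.xpath('./span[@class="price"]/text()')
        if not price:  # the empty item has no price, so skip it
            continue
        rows.append({"name": "".join(name).strip(), "price": price[0].strip("¥")})
    return rows


rows = parse_items(SAMPLE)
print(rows)
```

Keeping the parsing in a function that takes the page source as a string means it can be tried on a saved page before pointing it at a live request.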