Scraping Job Listings from 51job


浅汐王 · Published 2017-10-08 15:01:47 · 1250 views

  #!/usr/bin/env python3
  # -*- coding: utf-8 -*-
  # Pipeline: site -> page source -> job fields -> match with findall -> write to file

  import re
  import urllib.request

  # Search URL for the keyword "java"; {} is replaced by the page number.
  URL_TEMPLATE = 'http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=000000%2C00&district=000000&funtype=0000&industrytype=00&issuedate=9&providesalary=99&keyword=java&keywordtype=2&curr_page={}&lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&list_type=0&fromType=14&dibiaoid=0&confirmdate=9'

  # Captures the five columns (t1 to t5) of each result row.
  # re.S lets "." match newlines, so one row can span several lines.
  REG = re.compile(r'class="t1 ">.*?<a target="_blank" title="(.*?)".*?<span class="t2"><a target="_blank" title="(.*?)".*?<span class="t3">(.*?)</span>.*?<span class="t4">(.*?)</span>.*?<span class="t5">(.*?)</span>', re.S)


  # Open the page and fetch its source
  def get_content(page):
      url = URL_TEMPLATE.format(page)
      response = urllib.request.urlopen(url)  # open the page
      html = response.read()                  # read the source as bytes
      return html.decode('gbk')               # decode from GBK to str


  # Match the listing fields
  def get(html):
      return REG.findall(html)  # list of 5-tuples, one per job row


  # Walk through the pages and append each row to the file
  with open('51job.txt', 'a', encoding='utf-8') as f:
      for page in range(1, 2000):
          html = get_content(page)  # fetch the page source
          for item in get(html):
              print(*item)
              f.write('\t'.join(item) + '\n')  # tab-separated, one job per line
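Since the script depends on the live site, it can help to check the regular expression offline first. Below is a minimal sketch that runs the same pattern as get() against a small HTML fragment. The fragment is made up for illustration: it only mimics the class names the pattern looks for (t1 to t5) and is not copied from 51job. The helper names parse_jobs and to_tsv_line are also hypothetical.

```python
import re

# Same pattern as in the script, split across lines for readability.
JOB_PATTERN = re.compile(
    r'class="t1 ">.*?<a target="_blank" title="(.*?)".*?'
    r'<span class="t2"><a target="_blank" title="(.*?)".*?'
    r'<span class="t3">(.*?)</span>.*?'
    r'<span class="t4">(.*?)</span>.*?'
    r'<span class="t5">(.*?)</span>',
    re.S,  # let "." match newlines so a row may span several lines
)

# Hypothetical markup (an assumption, not real 51job output).
SAMPLE_HTML = '''
<div class="el">
  <p class="t1 ">
    <span><a target="_blank" title="Java Developer" href="#">Java Developer</a></span>
  </p>
  <span class="t2"><a target="_blank" title="Example Co." href="#">Example Co.</a></span>
  <span class="t3">Shanghai</span>
  <span class="t4">10-15k/month</span>
  <span class="t5">10-08</span>
</div>
'''


def parse_jobs(html):
    """Return one 5-tuple per matched row."""
    return JOB_PATTERN.findall(html)


def to_tsv_line(item):
    """Join the captured fields with tabs, as the script does when writing."""
    return '\t'.join(item) + '\n'


rows = parse_jobs(SAMPLE_HTML)
for row in rows:
    print(to_tsv_line(row), end='')
```

If parse_jobs returns an empty list for a real page, the site's markup has most likely changed and the pattern needs to be adjusted.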
