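A while ago it occurred to me that I could use the programming I have learned to do something for my hobby, such as publishing hydrological information for the Yangtze, Jialing, and Wu rivers. It took me one week to go from scraping the data to building the website.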

 

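The scraper currently pulls its data from the Chongqing Water Resources Bureau (重庆水利局). The data is public, and we should thank the government for providing such good, accurate data: it gives us swimmers the river conditions we need and makes swimming more convenient.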

 

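Step one of the scraper is to inspect the data source. Open http://www.sxzhongrui.com/swxx/jrbssq/Pages/Default.aspx and press F12 to enter the browser's developer tools. Click the element-picker arrow in the tools, then click the element you want to inspect on the page on the left; the inspector jumps automatically to that element in the HTML. Right-click it and choose Copy XPath to copy the element's path: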

//*[@id="ctl00_m_g_89dd06cd_51bf_42cd_9c1e_144caac9529d_g_89dd06cd_51bf_42cd_9c1e_144caac9529d_ASP_wpresources_jrbssqdefault_ascx_gv"]/tbody/tr[2]/td[2]

Note that if you use this XPath directly in Python, it will not return the "江北区" (Jiangbei District) cell.

The XPath that actually works is:

//*[@id="ctl00_m_g_89dd06cd_51bf_42cd_9c1e_144caac9529d_g_89dd06cd_51bf_42cd_9c1e_144caac9529d_ASP_wpresources_jrbssqdefault_ascx_gv"]/tr[2]/td[3]/font  

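It took me a long time of trial and error to find this. After that, the paths for the station name and the other fields were easy to work out.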
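
A likely reason for the difference, stated here as an assumption rather than something verified against this page: browsers insert a <tbody> element into the DOM of every table even when the HTML source has none, so a path copied from the developer tools contains a tbody step that the raw page lacks. The sketch below shows the effect on a made-up miniature table. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml, and the id `gv` and the cell values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the page source: note there is no <tbody>.
RAW_HTML = """
<div>
  <table id="gv">
    <tr><th>City</th><th>River</th><th>Station</th></tr>
    <tr>
      <td><font>Jiangbei</font></td>
      <td><font>Yangtze</font></td>
      <td><font>Cuntan</font></td>
    </tr>
  </table>
</div>
"""

root = ET.fromstring(RAW_HTML)

# Path as copied from the browser, which shows a <tbody> in its DOM view.
copied = root.findall(".//*[@id='gv']/tbody/tr[2]/td[1]")

# Path that matches the markup actually sent by the server.
actual = root.findall(".//*[@id='gv']/tr[2]/td[1]/font")

print(len(copied))     # 0: nothing matches the tbody step
print(actual[0].text)  # Jiangbei
```

If a copied path returns an empty list, dropping the tbody step and checking the raw source (View Source rather than the element inspector) is a reasonable first thing to try.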

 

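Next I set up the database. I used SQLite, which ships with Python, so nothing else needs to be installed. I store each hydrological station's data in its own table and keep one more table for the mapping. This makes the data easy for program code to read, especially for the front-end program later. Of course, you could also put all the data in a single table.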
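
The layout described above can be sketched with the standard sqlite3 module. This is only an illustration under assumptions: the mapping table name `station_map`, the reduced column list, and the UNIQUE constraint on MD5 are hypothetical, and only the `station_info` + index naming comes from the scraper code.

```python
import sqlite3


def create_schema(conn, station_names):
    """Create one readings table per station plus a mapping table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS station_map ("
        "table_index INTEGER PRIMARY KEY, station_name TEXT NOT NULL)"
    )
    for index, name in enumerate(station_names):
        # Table names cannot be bound as SQL parameters, so the name is
        # built from an integer that this function generates itself.
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS station_info{index} ("
            "id INTEGER PRIMARY KEY, water_levev REAL, flowrate REAL, "
            "levev_time INTEGER, MD5 TEXT UNIQUE)"
        )
        conn.execute(
            "INSERT OR REPLACE INTO station_map (table_index, station_name) "
            "VALUES (?, ?)",
            (index, name),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
create_schema(conn, ["Cuntan", "Beibei"])
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name LIKE 'station%' ORDER BY name"
    )
]
print(tables)
```

A front end can then look up a station's index in the mapping table and read from the matching `station_info` table.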

 

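There was one difficulty in writing this program: dynamically changing the database table behind a SQLAlchemy ORM model class. The database has many tables with the same structure, and it is impractical to write one Python class per table. My idea was to use a single class for multiple tables and change only the class's table name each time it is used.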

 

That led to the following snippet, which changes the class's __table__.name and __tablename__. The database session also has to be closed after every use.

# Point the shared model class at this station's table.
StationInfo.__table__.name = "station_info" + str(i - 2)
StationInfo.__tablename__ = "station_info" + str(i - 2)
u = models.StationInfo(station_city=station_city, station_river=station_river,
                       station_name=station_name, water_levev=water_levev,
                       flowrate=flowrate, warnning_levev=warnning_levev,
                       ensure_levev=ensure_levev, levev_time=intTime, MD5=str_md5)
try:
    db.session.add(u)
    db.session.commit()
except Exception:
    print("insert error")
    db.session.rollback()
finally:
    # Close the session after every use.
    db.session.close()
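
For comparison, here is the same "one code path, many tables" idea without the ORM, as a minimal sketch using the standard sqlite3 module. The helpers `table_for` and `insert_reading`, the reduced column list, and the UNIQUE constraint on MD5 are assumptions made for illustration; they are not part of the program above.

```python
import sqlite3


def table_for(index):
    """Build the per-station table name; only non-negative ints are accepted."""
    if not isinstance(index, int) or index < 0:
        raise ValueError(f"bad table index: {index!r}")
    return f"station_info{index}"


def insert_reading(conn, index, water_levev, levev_time, md5):
    """Insert one reading; return False if the row is rejected as a duplicate."""
    sql = (
        f"INSERT INTO {table_for(index)} "
        "(water_levev, levev_time, MD5) VALUES (?, ?, ?)"
    )
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, (water_levev, levev_time, md5))
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE station_info0 "
    "(water_levev REAL, levev_time INTEGER, MD5 TEXT UNIQUE)"
)
first = insert_reading(conn, 0, 163.2, 20190829, "abc")
second = insert_reading(conn, 0, 163.2, 20190829, "abc")
print(first, second)  # True False
```

Because a table name cannot be passed as a bound parameter, the sketch validates the index before formatting it into the SQL string.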

Below is the complete code of the scraper:

# coding=utf-8
import hashlib
import json
import time
import urllib.request

import lxml.etree

from app import db, g_tabelindex, models
from app.models import StationInfo

"""
nStart = html.find('[{')
nEnd = html.find('}];')
html = html[nStart:nEnd+2]
json_data = json.loads(html)
cuntan = json_data[0]['stnm']
"""

# Common prefix of the element ids on the page.
ID_PREFIX = "ctl00_m_g_89dd06cd_51bf_42cd_9c1e_144caac9529d_g_89dd06cd_51bf_42cd_9c1e_144caac9529d_ASP_wpresources_jrbssqdefault_ascx_"
GRID_ID = ID_PREFIX + "gv"
LABEL_ID = ID_PREFIX + "Label3"


def cell(tree, row, col):
    """Return the stripped text of the <font> inside cell (row, col) of the grid."""
    path = '//*[@id="%s"]/tr[%d]/td[%d]/font/text()' % (GRID_ID, row, col)
    return tree.xpath(path)[0].strip('\r\n ')


if __name__ == '__main__':
    url = "http://www.sxzhongrui.com/swxx/jrbssq/Pages/Default.aspx"
    response = urllib.request.urlopen(url)
    html = response.read().decode('utf-8')
    mytree = lxml.etree.HTML(html)

    # The timestamp is listed separately from the table.
    levev_time = mytree.xpath('//*[@id="%s"]/text()' % LABEL_ID)[0].strip('\r\n ')
    # "2019年08月29日08时江河实时水情" -> "20190829"
    levev_time = levev_time.replace("08时江河实时水情", "")
    levev_time = levev_time.replace("年", "")
    levev_time = levev_time.replace("月", "")
    levev_time = levev_time.replace("日", "")
    intTime = int(levev_time)

    ensure_levev = '111'
    for i in range(2, 41):
        # Parse one row. In rows 2, 8, 23, 26 and 36 the fields start one column later.
        offset = 1 if i in (2, 8, 23, 26, 36) else 0
        station_city = cell(mytree, i, 1 + offset)
        station_river = cell(mytree, i, 2 + offset)
        station_name = cell(mytree, i, 3 + offset)
        water_levev = cell(mytree, i, 4 + offset)
        flowrate = cell(mytree, i, 5 + offset)
        yesterday_levev = cell(mytree, i, 6 + offset)
        warnning_levev = cell(mytree, i, 7 + offset)
        ensure_levev = cell(mytree, i, 8 + offset)

        # The page shows "--" for missing values; store 0 instead.
        if water_levev == '--':
            water_levev = 0
        if flowrate == '--':
            flowrate = 0
        if yesterday_levev == '--':
            yesterday_levev = 0
        if warnning_levev == '--':
            warnning_levev = 0
        if ensure_levev == '--':
            ensure_levev = 0

        # MD5 over all fields of the row.
        MD5 = (station_city + station_river + station_name + str(water_levev)
               + str(flowrate) + str(yesterday_levev) + str(warnning_levev)
               + str(ensure_levev) + levev_time)
        m = hashlib.md5()
        m.update(MD5.encode(encoding='utf-8'))
        str_md5 = m.hexdigest()

        # Point the shared model class at this station's table.
        StationInfo.__table__.name = "station_info" + str(i - 2)
        StationInfo.__tablename__ = "station_info" + str(i - 2)
        u = models.StationInfo(station_city=station_city, station_river=station_river,
                               station_name=station_name, water_levev=water_levev,
                               flowrate=flowrate, warnning_levev=warnning_levev,
                               ensure_levev=ensure_levev, levev_time=intTime, MD5=str_md5)
        try:
            db.session.add(u)
            db.session.commit()
        except Exception:
            print("insert error")
            db.session.rollback()
        finally:
            # Close the session after every use.
            db.session.close()
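
The date handling in the script removes the literal string "08时江河实时水情", so it relies on the header carrying the 08:00 report. A possible variation, sketched below with only the standard library, extracts the date with a regular expression and gathers the "--" handling and the MD5 step into small helpers. The function names `parse_header_date`, `clean`, and `fingerprint` are hypothetical, and treating the MD5 as a duplicate-row fingerprint is an assumption about its purpose.

```python
import hashlib
import re


def parse_header_date(text):
    """Turn '2019年08月29日08时江河实时水情' into the integer 20190829."""
    match = re.search(r"(\d{4})年(\d{1,2})月(\d{1,2})日", text)
    if match is None:
        raise ValueError(f"no date found in {text!r}")
    year, month, day = (int(part) for part in match.groups())
    return year * 10000 + month * 100 + day


def clean(value):
    """Map the page's '--' placeholder to 0 and leave other values alone."""
    return 0 if value == "--" else value


def fingerprint(*fields):
    """Return the MD5 hex digest of the concatenated fields."""
    joined = "".join(str(field) for field in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


print(parse_header_date("2019年08月29日08时江河实时水情"))  # 20190829
print(parse_header_date("2019年08月29日14时江河实时水情"))  # 20190829
print(clean("--"), clean("163.20"))
print(len(fingerprint("Cuntan", clean("--"), 20190829)))  # 32
```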

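The program also needs a minimal Flask setup with Flask's SQLAlchemy extension to configure the database, because that makes working with the database really convenient. I will put the full code in the QQ group files.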

If you have more questions or want to discuss, please join the Python study QQ group: 475733139.