Author: 马梦杰 (School of Statistics and Mathematics, 中南财经政法大学)
Editor: 梁莹
Managing editor: 兰博文
Stata and Python 数据分析
爬虫俱乐部's courses, "Stata Data Processing and Empirical Research in Practice" and "Python Fundamentals and Advanced Text Analysis", are available on the Xiaoe-tech platform. Your subscriptions and support are very welcome! For details, visit the course link (https://appbqiqpzi66527.h5.xiaoeknow.com/homepage/10) or scan the course QR code.
1. Introduction
Recently, the author needed some information on Hong Kong-listed stocks, such as the business scope of listed companies, and went looking for websites to scrape it from. This post uses the Hong Kong stock listings on the AskCI (中商产业研究院) website as an example to walk through the approach of a Python scraper.
2. Fetching the first page of Hong Kong stock information
Step 1: find the page that carries the information. On the AskCI site, navigate 产业数据 → 数据库 → 上市企业数据 → 港股; the page is https://s.askci.com/stock/h/. It lists more than 4,300 stocks, 20 per page, and each entry includes the stock code, stock short name, company name, and other fields, as shown in Figure 1:
[Figure 1: the Hong Kong stock listing page]
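A quick back-of-envelope check on the crawl size (assuming the 4,300+ figure shown on the page) tells us how many pages to expect:

```python
import math

total_stocks = 4300  # approximate count displayed on the page
per_page = 20
print(math.ceil(total_stocks / per_page))  # 215 -> the loop bound of 220 used later leaves headroom
```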
Step 2: fetch the first page. Right-click → Inspect → Network, then refresh the page (or click to page 2 and back to page 1). The request highlighted in Figure 2 gives the first page's URL and request headers. The code, together with the information it retrieves (Figure 3), is as follows:
```python
import requests
from lxml import etree

session = requests.session()  # a session is faster for repeated requests
url = 'https://s.askci.com/stock/h/0-0?reportTime=2025-06-30&pageNum=1'
headers = {
    "authority": "s.askci.com",
    "method": "GET",
    "path": "/stock/h/0-0?reportTime=2025-06-30&pageNum=1",
    "scheme": "https",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "accept-encoding": "gzip, deflate, br, zstd",
    "accept-language": "zh-CN,zh;q=0.9",
    "cookie": "__bid_n=19a4ed90c95747258a96be; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2219a4ed90cdb231c-0e491cdda7ed758-1e525631-1405320-19a4ed90cdc3512%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%7D%2C%22identities%22%3A%22eyIkaWRlbnRpdHlfY29va2llX2lkIjoiMTlhNGVkOTBjZGIyMzFjLTBlNDkxY2RkYTdlZDc1OC0xZTUyNTYzMS0xNDA1MzIwLTE5YTRlZDkwY2RjMzUxMiJ9%22%2C%22history_login_id%22%3A%7B%22name%22%3A%22%22%2C%22value%22%3A%22%22%7D%2C%22%24device_id%22%3A%2219a4ed90cdb231c-0e491cdda7ed758-1e525631-1405320-19a4ed90cdc3512%22%7D; sajssdk_2015_cross_new_user=1",
    "priority": "u=0, i",
    "referer": "https://s.askci.com/stock/h/0-0?reportTime=2025-06-30&pageNum=2",
    "sec-ch-ua": '"Google Chrome";v="141", "Not?A_Brand";v="8", "Chromium";v="141"',
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "macOS",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "same-origin",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36"
}
html = session.get(url, headers=headers)
print(html.text)
```
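Before moving on to parsing, a small check (our own addition, using only standard `requests` attributes) helps confirm the request was not blocked:

```python
# minimal sanity check (not part of the original script)
print(html.status_code)  # expect 200
html.encoding = 'utf-8'  # set the encoding explicitly so html.text decodes correctly
print(len(html.text))    # a near-empty body usually means the request was rejected
```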
[Figure 2: the request in the browser's Network panel]
[Figure 3: the information to be retrieved]
Step 3: locate the needed information. Use XPath for this (an XPath browser plugin makes it easier). For example, to locate the stock code, as shown in Figure 4, extract the values from the matched text nodes. The code, together with the extracted information (Figure 5), is as follows:
```python
tree = etree.HTML(html.text)

# XPath expressions for the text nodes holding each field
stkcd_xpath = "//td[2]/a"        # stock code
stkcdname_xpath = "//td[3]/a"    # stock short name
companyname_xpath = "//td[4]"    # company name
shouru_xpath = "//td[6]"         # operating revenue
lirun_xpath = "//td[7]"          # net profit
worknumber_xpath = "//td[8]"     # number of employees
shangshidate_xpath = "//td[9]"   # listing date
industry_xpath = "//td[12]"      # industry classification
yewufanwei_xpath = "//td[@class='text_left']"  # business scope

stkcd_list = tree.xpath(stkcd_xpath)
stkcdname_list = tree.xpath(stkcdname_xpath)
companyname_list = tree.xpath(companyname_xpath)
shouru_list = tree.xpath(shouru_xpath)
lirun_list = tree.xpath(lirun_xpath)
worknumber_list = tree.xpath(worknumber_xpath)
shangshidate_list = tree.xpath(shangshidate_xpath)
industry_list = tree.xpath(industry_xpath)
yewufanwei_list = tree.xpath(yewufanwei_xpath)

# pull the text out of each matched node
# (the full script in Section 3 additionally guards against missing text)
stkcd_list = [stkcd.text for stkcd in stkcd_list]
stkcdname_list = [stkcdname.text for stkcdname in stkcdname_list]
companyname_list = [companyname.text for companyname in companyname_list]
shouru_list = [shouru.text for shouru in shouru_list]
lirun_list = [lirun.text for lirun in lirun_list]
worknumber_list = [worknumber.text for worknumber in worknumber_list]
shangshidate_list = [shangshidate.text for shangshidate in shangshidate_list]
industry_list = [industry.text for industry in industry_list]
yewufanwei_list = [yewufanwei.text for yewufanwei in yewufanwei_list]

# print one comma-separated line per company
for stkcd, stkcdname, companyname, shouru, lirun, worknumber, shangshidate, industry, yewufanwei in zip(
        stkcd_list, stkcdname_list, companyname_list, shouru_list, lirun_list,
        worknumber_list, shangshidate_list, industry_list, yewufanwei_list):
    info = (stkcd + "," + stkcdname + "," + companyname + "," + shouru + "," + lirun + ","
            + worknumber + "," + shangshidate + "," + industry + "," + yewufanwei)
    print(info)
```
[Figure 4: locating the stock code with XPath]
[Figure 5: extracted information]
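The column-indexed XPaths above assume every data row has exactly the same layout. A more defensive variant, shown here only as a sketch of an alternative (the row-walking approach is our own, not the original method), keeps all the fields of one company together by iterating row by row:

```python
# row-wise sketch (assumes each data row is a <tr> whose cells match the column
# positions used above; td[2] in XPath corresponds to cells[1] in 0-based Python)
rows = tree.xpath("//table//tr")
for row in rows:
    cells = row.xpath("./td")
    if len(cells) < 12:  # skip header rows or malformed rows
        continue
    code = cells[1].xpath("string(./a)")
    name = cells[2].xpath("string(./a)")
    company = cells[3].xpath("string(.)")
    print(code, name, company)
```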
Finally, export the collected information to Excel. The code and the final result (Figure 6) are as follows:
```python
import pandas as pd

file = "/Users/mamengjie/Documents/02文件库/1梦梦电脑材料转移/statapython学习/python获取港股上市信息/港股上市信息第1页.xlsx"
# assemble the lists into a DataFrame; .T turns each list into a column
df = pd.DataFrame(data=[stkcd_list, stkcdname_list, companyname_list, shouru_list, lirun_list,
                        worknumber_list, shangshidate_list, industry_list, yewufanwei_list]).T
# column names: stock code, short name, company name, revenue, net profit,
# employees, listing date, industry, business scope
df.columns = ["股票代码", "股票简称", "公司名称", "主营业务收入", "净利润",
              "员工人数", "上市日期", "行业分类", "业务范围"]
df.to_excel(file, index=None)
```
[Figure 6: the exported Excel file]
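Note that pandas relies on the openpyxl package to read and write .xlsx files, so it must be installed first (pip install openpyxl). Reading the file back, a check we add here ourselves, is a quick way to confirm the export worked:

```python
# read the exported file back to verify it (requires openpyxl for .xlsx)
check = pd.read_excel(file)
print(check.shape)  # expect (20, 9): 20 rows on the first page, 9 columns
print(check.head())
```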
3. Fetching all Hong Kong stock information
Once fetching the first page works, the next step is to find the pattern in the page URLs and loop over it to collect every page. Clicking through pages 1, 2, and 3 shows that the only thing that changes is the number after pageNum= (Figure 7), so looping over that number walks through all the pages.
[Figure 7: the pageNum parameter in the URL]
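In code, the pattern is just a matter of substituting the page number into the URL, e.g.:

```python
# only the pageNum value changes from page to page
for page in range(1, 4):
    url = f"https://s.askci.com/stock/h/0-0?reportTime=2025-06-30&pageNum={page}"
    print(url)
```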
When scraping all the pages, two precautions are worth building in: first, add random pauses between requests; second, retry any page that fails at least three times, and record the pages that still fail in the end. These safeguards save a great deal of rework later; a minimal sketch follows.
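Here is a minimal sketch of those two safeguards (the helper name fetch_with_retries and the 1-3 second pause range are our own choices, not taken from the original script):

```python
import random
import time

import requests

def fetch_with_retries(session, url, headers, max_tries=3):
    """Try a page up to max_tries times; return the response, or None if every try fails."""
    for _ in range(max_tries):
        try:
            resp = session.get(url, headers=headers, timeout=10)
            if resp.status_code == 200:
                return resp
        except requests.exceptions.RequestException:
            pass
        time.sleep(random.uniform(1, 3))  # random pause between attempts
    return None

failed_pages = []  # pages that still fail after all retries, kept for a later re-run
```

The full script below takes a slightly different route (it stops after several consecutive failed or empty pages), but the retry-and-record idea is the same.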
The full Python script for collecting every page is as follows:
```python
# Fetch all Hong Kong stock information
import os
import time

import pandas as pd
import requests
from lxml import etree

# create a session object (reused across all requests)
session = requests.session()

# output file path
file = "/Users/mamengjie/Documents/02文件库/1梦梦电脑材料转移/statapython学习/python获取港股上市信息/港股上市信息.xlsx"

# make sure the directory exists
os.makedirs(os.path.dirname(file), exist_ok=True)

# XPath expressions for each field
stkcd_xpath = "//td[2]/a"        # stock code
stkcdname_xpath = "//td[3]/a"    # stock short name
companyname_xpath = "//td[4]"    # company name
shouru_xpath = "//td[6]"         # operating revenue
lirun_xpath = "//td[7]"          # net profit
worknumber_xpath = "//td[8]"     # number of employees
shangshidate_xpath = "//td[9]"   # listing date
industry_xpath = "//td[12]"      # industry classification
yewufanwei_xpath = "//td[@class='text_left']"  # business scope

# lists accumulating data across all pages
all_stkcd = []
all_stkcdname = []
all_companyname = []
all_shouru = []
all_lirun = []
all_worknumber = []
all_shangshidate = []
all_industry = []
all_yewufanwei = []

# counters
empty_page_count = 0
max_empty_pages = 3  # stop once this many consecutive pages come back empty

for page in range(1, 220):
    print(f"Scraping page {page}")
    # build the URL
    url = f"https://s.askci.com/stock/h/0-0?reportTime=2025-06-30&pageNum={page}#QueryCondition"
    headers = {
        "authority": "s.askci.com",
        "method": "GET",
        "scheme": "https",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
        "accept-encoding": "gzip, deflate, br, zstd",
        "accept-language": "zh-CN,zh;q=0.9",
        "cookie": "__bid_n=19a4ed90c95747258a96be; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2219a4ed90cdb231c-0e491cdda7ed758-1e525631-1405320-19a4ed90cdc3512%22%2C%22first_id%22%3A%22%22%2C%22props%22%3A%7B%7D%2C%22identities%22%3A%22eyIkaWRlbnRpdHlfY29va2llX2lkIjoiMTlhNGVkOTBjZGIyMzFjLTBlNDkxY2RkYTdlZDc1OC0xZTUyNTYzMS0xNDA1MzIwLTE5YTRlZDkwY2RjMzUxMiJ9%22%2C%22history_login_id%22%3A%7B%22name%22%3A%22%22%2C%22value%22%3A%22%22%7D%2C%22%24device_id%22%3A%2219a4ed90cdb231c-0e491cdda7ed758-1e525631-1405320-19a4ed90cdc3512%22%7D; sajssdk_2015_cross_new_user=1",
        "priority": "u=0, i",
        "referer": f"https://s.askci.com/stock/h/0-0?reportTime=2025-06-30&pageNum={page-1 if page > 1 else 1}#QueryCondition",
        "sec-ch-ua": '"Google Chrome";v="141", "Not?A_Brand";v="8", "Chromium";v="141"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": '"macOS"',
        "sec-fetch-dest": "document",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36"
    }
    try:
        # send the request
        html = session.get(url, headers=headers, timeout=10)
        html.encoding = 'utf-8'
        # check the status code
        if html.status_code != 200:
            print(f"Page {page} request failed, status code: {html.status_code}")
            empty_page_count += 1
            if empty_page_count >= max_empty_pages:
                print("Several consecutive requests failed; stopping")
                break
            continue
        tree = etree.HTML(html.text)
        # check whether the page returned any data
        stkcd_list = tree.xpath(stkcd_xpath)
        if not stkcd_list:
            print(f"Page {page} has no data")
            empty_page_count += 1
            if empty_page_count >= max_empty_pages:
                print(f"{max_empty_pages} consecutive empty pages; stopping")
                break
            continue
        else:
            empty_page_count = 0  # reset the empty-page counter
        stkcdname_list = tree.xpath(stkcdname_xpath)
        companyname_list = tree.xpath(companyname_xpath)
        shouru_list = tree.xpath(shouru_xpath)
        lirun_list = tree.xpath(lirun_xpath)
        worknumber_list = tree.xpath(worknumber_xpath)
        shangshidate_list = tree.xpath(shangshidate_xpath)
        industry_list = tree.xpath(industry_xpath)
        yewufanwei_list = tree.xpath(yewufanwei_xpath)
        # extract text safely, substituting "" for missing values
        stkcd_list = [stkcd.text if stkcd.text else "" for stkcd in stkcd_list]
        stkcdname_list = [stkcdname.text if stkcdname.text else "" for stkcdname in stkcdname_list]
        companyname_list = [companyname.text if companyname.text else "" for companyname in companyname_list]
        shouru_list = [shouru.text if shouru.text else "" for shouru in shouru_list]
        lirun_list = [lirun.text if lirun.text else "" for lirun in lirun_list]
        worknumber_list = [worknumber.text if worknumber.text else "" for worknumber in worknumber_list]
        shangshidate_list = [shangshidate.text if shangshidate.text else "" for shangshidate in shangshidate_list]
        industry_list = [industry.text if industry.text else "" for industry in industry_list]
        yewufanwei_list = [yewufanwei.text if yewufanwei.text else "" for yewufanwei in yewufanwei_list]
        # check that all columns have the same length
        data_lengths = [len(stkcd_list), len(stkcdname_list), len(companyname_list),
                        len(shouru_list), len(lirun_list), len(worknumber_list),
                        len(shangshidate_list), len(industry_list), len(yewufanwei_list)]
        if len(set(data_lengths)) != 1:
            print(f"Page {page} column lengths differ: {data_lengths}")
            # truncate to the shortest length to avoid index errors
            min_length = min(data_lengths)
            stkcd_list = stkcd_list[:min_length]
            stkcdname_list = stkcdname_list[:min_length]
            companyname_list = companyname_list[:min_length]
            shouru_list = shouru_list[:min_length]
            lirun_list = lirun_list[:min_length]
            worknumber_list = worknumber_list[:min_length]
            shangshidate_list = shangshidate_list[:min_length]
            industry_list = industry_list[:min_length]
            yewufanwei_list = yewufanwei_list[:min_length]
        # append to the accumulated lists
        all_stkcd.extend(stkcd_list)
        all_stkcdname.extend(stkcdname_list)
        all_companyname.extend(companyname_list)
        all_shouru.extend(shouru_list)
        all_lirun.extend(lirun_list)
        all_worknumber.extend(worknumber_list)
        all_shangshidate.extend(shangshidate_list)
        all_industry.extend(industry_list)
        all_yewufanwei.extend(yewufanwei_list)
        print(f"Page {page}: got {len(stkcd_list)} records")
    except requests.exceptions.Timeout:
        print(f"Page {page} timed out")
        continue
    except requests.exceptions.ConnectionError:
        print(f"Page {page} connection error")
        continue
    except Exception as e:
        print(f"Page {page} failed: {e}")
        continue
    # pause between pages to avoid getting the IP banned
    time.sleep(1)
    # report progress every 10 pages
    if page % 10 == 0:
        print(f"{page} pages done, {len(all_stkcd)} records so far")

print("All pages done; exporting to Excel")
# build the DataFrame (column names: stock code, short name, company name, revenue,
# net profit, employees, listing date, industry, business scope)
df = pd.DataFrame({
    "股票代码": all_stkcd,
    "股票简称": all_stkcdname,
    "公司名称": all_companyname,
    "主营业务收入": all_shouru,
    "净利润": all_lirun,
    "员工人数": all_worknumber,
    "上市日期": all_shangshidate,
    "行业分类": all_industry,
    "业务范围": all_yewufanwei
})
# save to Excel
try:
    df.to_excel(file, index=False)
    print(f"Export complete: {len(df)} records")
    print(f"Saved to: {file}")
    # preview the data
    print("\nPreview:")
    print(df.head())
except Exception as e:
    print(f"Excel export failed: {e}")
    # fall back to CSV
    try:
        csv_file = file.replace('.xlsx', '.csv')
        df.to_csv(csv_file, index=False, encoding='utf-8-sig')
        print(f"Saved CSV fallback instead: {csv_file}")
    except Exception as csv_error:
        print(f"CSV save failed too: {csv_error}")
```

The final result is shown in the figure below:
[Figure: the final exported results]
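One optional hardening step, our own suggestion rather than part of the original script: checkpoint the partial results every 20 pages, so a crash late in the run does not lose everything collected so far.

```python
# checkpoint sketch (our own addition): place this inside the page loop,
# after the all_* lists have been extended for the current page
# (only two columns shown here for brevity)
if page % 20 == 0:
    checkpoint = pd.DataFrame({"股票代码": all_stkcd, "股票简称": all_stkcdname})
    checkpoint.to_csv(file.replace(".xlsx", "_checkpoint.csv"),
                      index=False, encoding="utf-8-sig")
```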
4. Conclusion
Research work constantly produces new data-collection needs; how to scrape a given piece of information, and with which tool, comes down to personal preference and the particulars of the page. This Hong Kong stock information could just as well be collected with Stata, and interested readers are encouraged to give that a try.
Disclaimer: the code is for learning purposes only. Please do not use it for any commercial purpose!
A big bonus! To better serve everyone's research, 爬虫俱乐部 will keep publishing, on the Xiaoe-tech platform, the indicators that financial research needs, including top-ten shareholders of listed companies, stock price crash risk, investment efficiency, financing constraints, corporate tax avoidance, analyst coverage, return on equity, return on assets, Big Four audits, Tobin's Q, largest-shareholder ownership, book-to-market ratio, common control variables for research on Shanghai and Shenzhen A-share listed companies, and a range of other processed data. Built with Stata on data disclosed by the exchanges, the datasets are updated in real time, and more indicators will keep coming online. With state-of-the-art data processing, the best service quality, and our full sincerity, we hope to support your research! For the data, please visit (https://appbqiqpzi66527.h5.xiaoeknow.com/homepage/10) or scan the QR code:
[QR code]
If your cumulative tips on our posts exceed 1,000 yuan, we can issue you an invoice under the category "consulting fees". We work with care and will not let your support down!
About us: The WeChat public account "Stata and Python数据分析" shares practical data-processing knowledge for Stata, Python, and other software; reposts and tips are welcome. We are a big-data processing and analysis team of graduate and undergraduate students led by Professor 李春涛.
Our team has long provided data collection and analysis services. If you have such needs, please email statatraining@163.com.
We also warmly welcome submissions introducing data processing and analysis techniques in Stata and Python.
Submission email: statatraining@163.com
Submission requirements: (1) must be original, no plagiarism; (2) must be accurate and detailed, with examples and screenshots.
Notes: (1) all submissions are reviewed by the account's operations team, and accepted posts credit the author by name and share the tip revenue; (2) mark the email as a submission, with the subject "投稿+post title"; (3) by popular request, we now offer a paid Q&A service: if you run into problems with data processing or analysis, raise them via the public account for a small tip and we will answer them in later posts.