Preface
Data Acquisition
# Start crawling: open the Zhihu question page and scroll
# repeatedly so that all answers lazy-load
from selenium import webdriver
import time

driver = webdriver.Chrome()
driver.maximize_window()
url = 'https://www.zhihu.com/question/291506148'
driver.get(url)
for i in range(1000):
    time.sleep(1)
    driver.execute_script("document.documentElement.scrollTop = 10000000")
    print(i)
# Organize the data: pull the answers' HTML and extract every
# linked Bilibili space ID
import re
import json
import requests
import pandas as pd
from selenium.webdriver.common.by import By

all_html = [k.get_property('innerHTML')
            for k in driver.find_elements(By.CLASS_NAME, 'AnswerItem')]
all_text = ''.join(all_html)
pat = r'/space\.bilibili\.com/\d+'
spaces = list(set(re.findall(pat, all_text)))

# cookie and header must be defined beforehand
# (copied from a logged-in browser session)
rows = []
for i in range(len(spaces)):
    try:
        time.sleep(1)
        space_id = spaces[i].replace('/space.bilibili.com/', '')
        url = 'https://api.bilibili.com/x/web-interface/card?mid={}&jsonp=jsonp&article=true'.format(space_id)
        html = requests.get(url=url, cookies=cookie, headers=header).content
        data = json.loads(html.decode('utf-8'))['data']
        this_name = data['card']['name']
        this_fans = data['card']['fans']
        this_face = data['card']['face']
        this_video = int(data['archive_count'])
        total_page = (this_video - 1) // 30 + 1   # 30 videos per page
        video_list = []
        for j in range(total_page):
            url = 'https://api.bilibili.com/x/space/arc/search?mid={}&ps=30&tid=0&pn={}&keyword=&order=click&jsonp=jsonp'.format(space_id, str(j + 1))
            html = requests.get(url=url, cookies=cookie, headers=header).content
            data = json.loads(html.decode('utf-8'))
            if j == 0:
                type_list = data['data']['list']['tlist']
            this_list = data['data']['list']['vlist']
            video_list.extend(this_list)
        # The category with the most uploads becomes the UP's main type
        type_list = list(type_list.values())
        type_list = {t['name']: int(t['count']) for t in type_list}
        this_type = max(type_list, key=type_list.get)
        this_play = sum(v['play'] for v in video_list if v['play'] != '--')
        this_comment = sum(v['comment'] for v in video_list if v['comment'] != '--')
        rows.append({'name': this_name, 'fans': this_fans, 'face': this_face,
                     'main_type': this_type, 'total_video': this_video,
                     'total_play': this_play, 'total_comment': this_comment})
        print('success:' + str(i))
    except Exception:
        print('fail:' + str(i))
        continue

upstat = pd.DataFrame(rows, columns=['name', 'fans', 'face', 'main_type',
                                     'total_video', 'total_play', 'total_comment'])
Overview
# Stitch all avatars into one mosaic, 20 per row
import cv2
import numpy as np
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
# from urllib import request   # needed if downloading the avatars first

for i in range(upstat.shape[0]):
    loc = 'D:/爬虫/惊为天人/' + upstat['name'][i] + '.jpg'
    # request.urlretrieve(upstat['face'][i], loc)   # download avatar once
    img = mpimg.imread(loc)[:, :, 0:3]
    img = cv2.resize(img, (500, 500), interpolation=cv2.INTER_CUBIC)
    if i % 20 == 0:            # start a new row
        row_img = img
    elif i == 19:              # first row complete: it becomes the mosaic
        row_img = np.hstack((row_img, img))
        all_img = row_img
    elif i % 20 == 19:         # later rows complete: stack below the mosaic
        row_img = np.hstack((row_img, img))
        all_img = np.vstack((all_img, row_img))
    else:
        row_img = np.hstack((row_img, img))

plt.axis('off')
plt.margins(0, 0)
plt.imshow(all_img)
plt.savefig('头像.png', dpi=1000)
Overall Ranking
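No ranking code survives in this section, but the overall ranking can be sketched by sorting the `upstat` table built above. A minimal sketch follows; the rows here are toy placeholders standing in for the scraped values, not real data:

```python
import pandas as pd

# Toy stand-in rows: in the article, this table comes from the scraping step.
upstat = pd.DataFrame([
    {'name': 'up_a', 'fans': 120, 'total_play': 5000},
    {'name': 'up_b', 'fans': 300, 'total_play': 9000},
    {'name': 'up_c', 'fans': 80,  'total_play': 1200},
])

# Sort by follower count, most-followed first
overall = upstat.sort_values('fans', ascending=False).reset_index(drop=True)
print(overall[['name', 'fans']])
```

The same call with `'total_play'` instead of `'fans'` ranks by cumulative plays.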
Category Ranking
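A per-category ranking can be sketched with a `groupby` on the `main_type` column of `upstat`, ranking UPs by fans within each category. Again, the rows below are illustrative placeholders, not scraped data:

```python
import pandas as pd

# Toy stand-in rows; main_type values are illustrative.
upstat = pd.DataFrame([
    {'name': 'up_a', 'main_type': '生活', 'fans': 120},
    {'name': 'up_b', 'main_type': '科技', 'fans': 300},
    {'name': 'up_c', 'main_type': '生活', 'fans': 80},
])

# Rank UPs by follower count within each main_type
upstat['rank_in_type'] = (upstat.groupby('main_type')['fans']
                          .rank(ascending=False, method='first')
                          .astype(int))
print(upstat.sort_values(['main_type', 'rank_in_type']))
```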