Crawler Day 02


0. References

  https://blog.csdn.net/qq_28304687/article/details/78678814

I. Approach

  Originally I had imagined cross-comparing several music sites to recommend a handful of popular songs. As things stand, that is not a good plan in terms of either technical difficulty or time cost. After studying the existing articles, I think NetEase Cloud Music is still the best entry point, so the priority is to build something working first.

So today I pinned down the project's main goals:

  1. The overall idea is to crawl data from NetEase Cloud Music and build a small platform of my own to display it. This is not for commercial use, only for general research.

  2. Final scope of the crawl: the data needs to be refreshed once a day, so I chose the New Songs chart and will crawl the song rankings and their comments.

  3. On the self-built platform, display the top 10 songs of the New Songs chart (updated daily, with statistics over the last 7 days of rankings across the 100 songs on the chart); display each song's top 15 hot comments, with a link back to NetEase Cloud Music; and display the top 10 songs by comment count. (A rough sketch of the 7-day tally follows this list.)
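As a rough sketch of the 7-day statistic in item 3 (the function name and data shape here are my assumptions, not final code): each day's chart snapshot would be a list of song dicts like the ones getSongInfo collects, and the weekly statistic just counts chart appearances.

from collections import Counter

def weeklyRankCounts(dailyCharts):
    # dailyCharts: up to 7 daily snapshots, each a list of dicts with
    # 'sId' and 'sName' keys (the shape getSongInfo produces).
    counts = Counter()
    for day in dailyCharts:
        for song in day:
            counts[(song['sId'], song['sName'])] += 1
    # The 10 songs that appeared on the chart most often this week.
    return counts.most_common(10)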

II. Scraping Work

  1. Crawling the chart listings

  Building on the first day's work, I adjusted the chart-scraping code. The main change, based on packet-capture analysis, was to the headers. A test run successfully grabbed the songs' information from the chart, including the id (used in the next step to crawl comments), the song name, the artist, and other related fields.

import json
import time
import requests
from bs4 import BeautifulSoup

def getHtmlText(url, code='utf-8'):
    # Headers based on the packet capture; modifying these was the key
    # change from day one.
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Host': 'music.163.com',
        'Referer': 'http://music.163.com/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
    }
    try:
        r = requests.get(url, headers=headers, timeout=30)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except requests.RequestException:
        return ""

def getSongInfo(url, sList):
    try:
        html = getHtmlText(url)
        soup = BeautifulSoup(html, 'html.parser')
        # The chart page embeds the full song list as JSON inside a hidden
        # <textarea> under div#song-list-pre-cache.
        sInfo = soup.find('div', id="song-list-pre-cache").textarea.text
        infoList = json.loads(sInfo)
        for song in infoList:
            sDict = {}
            sDict["sId"] = song["id"]  # used in the next step to crawl comments
            sDict["sName"] = song["name"]
            sDict["sArtists"] = song["artists"]
            sDict["sPicUrl"] = song["album"]["picUrl"]
            sList.append(sDict)
    except (AttributeError, ValueError):
        sList.append("")
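For reference, a minimal driver for the two functions above. The chart URL is an assumption on my part: id=3779629 is the playlist id commonly cited for the New Songs chart (新歌榜); verify it against your own packet capture.

if __name__ == '__main__':
    chartUrl = 'http://music.163.com/discover/toplist?id=3779629'  # assumed 新歌榜 id
    songs = []
    getSongInfo(chartUrl, songs)
    for s in songs[:10]:
        if s:  # skip the "" placeholder appended on failure
            print(s['sId'], s['sName'])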

  2. Crawling each song's comments

  References: https://blog.csdn.net/CGS_______/article/details/79046520 (shows that the comments can be crawled) and https://www.zhihu.com/question/36081767

def getCommentText(sId, sName):
    try:
        url = 'http://music.163.com/weapi/v1/resource/comments/R_SO_4_' + sId + '?csrf_token='
        # Headers captured from a real browser request. Content-Length is
        # omitted because requests computes it from the body.
        headers = {
            'Accept': '*/*',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Connection': 'keep-alive',
            'Content-Type': 'application/x-www-form-urlencoded',
            'Cookie': '_ntes_nnid=75b675c145b42901e0750baafa0a9ed0,1514509689429; _ntes_nuid=75b675c145b42901e0750baafa0a9ed0; __f_=1517374037843; vjuids=8c4d95a5e.1614b098e27.0.b63e85e8338af; __gads=ID=56236686338c1238:T=1517382488:S=ALNI_Mam9amISyWOhRSQj9mEQXjn8PsmHA; _iuqxldmzr_=32; UM_distinctid=16178c7902fabc-034a9f6135dd6e-464a0129-1aeaa0-16178c790308b4; P_INFO=yuuwee@163.com|1519462429|0|mail163|11&17|hub&1519344599&mailsettings#hub&420100#10#0#0|&0|mailsettings|yuuwee@163.com; nts_mail_user=yuuwee@163.com:-1:1; NTES_CMT_USER_INFO=66455334%7Cyu%E8%A7%81wei%E6%9D%A5%7Chttp%3A%2F%2Fimg1.cache.netease.com%2Ft%2Fimg%2Fdefault80.png%7Cfalse%7CeXV1d2VlQDE2My5jb20%3D; vjlast=1517382373.1519464464.11; vinfo_n_f_l_n3=6727d1213b6fca94.1.1.1518149624980.1518149813330.1519464694096; WM_TID=UArr%2BifvtwHK31KHz0JO1qVXZK0I64sY; JSESSIONID-WYYY=t%2BoPsQlMYbQO%2Fx2OSs2Q9O2Ri3j0c1tOX5bR7U%2F%2BGVmlQuND7FuiEBHu%5Cgz%5Cauy8z3DycXlM%2Bbn3a0Ay9Yo77m25%5CJh3w7DSiimtp68OiuWJ1hpgvM2sEqm9w9Oy9EnH7q9HyB2DbdTacTxh6nEAJeCXRwU8AtqEwt4%5CVc%5Ch7dUGyosk%3A1524103613403; __utma=94650624.1139027543.1523417245.1524014652.1524101814.4; __utmc=94650624; __utmz=94650624.1524101814.4.3.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __utmb=94650624.6.10.1524101814',
            'Host': 'music.163.com',
            'Origin': 'http://music.163.com',
            'Referer': 'http://music.163.com/song?id=553543014',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
        }
        # params/encSecKey are the encrypted form fields the web client sends;
        # reusing one captured pair returns the first page of hot comments.
        data = {
            'params': '7QgO81Rzo14KfZVo7TA3864Hcp17mVtnwei5olykQtyzIowGUHSC3J21/y9O23heLoK6FZ//J5XXYq6JnQT2AwTPhDS841PgP3sRwLoOdR+87NkdUd3ek8qKlqmYInkja2C4pH9Gaq+elQli6xYHnAY7pd3oyJz0KD8yp/htEyJYUpj/xrkPSUz3onsVMUEA',
            'encSecKey': 'a90e24632c6579ca91333cf61b1bba59472b176931916dde17ae38268740e0b0f32b1c06dd8f38b0df8284abb98c09ff0fead160254052e56253053567733006ad7a821dcccd12e92b768b0176ca2c9cd1ed7747964c6947b6f42209da029552f89cafec79417ef207c9fd6e179d0000391374ad879f355783d451e5d4ab8619',
        }
        response = requests.post(url, data=data, headers=headers)
        cInfo = json.loads(response.text)['hotComments']
        clist = []
        for dic in cInfo:
            cDic = {}
            cDic['uName'] = dic['user']['nickname']
            cDic['likedCount'] = dic['likedCount']
            cDic['content'] = dic['content']
            clist.append(cDic)
        return (sName, clist)
    except Exception:
        # On any failure, return None for the comment list so callers can skip it.
        return (sName, None)
This defines a function that scrapes one song's comments and returns a tuple of the song name and all of its hot comments. The main point is to make it easy to pull the song name back out later, which is why I chose this structure; how well it works in practice remains to be seen.
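To illustrate the tuple structure, a hypothetical call (the song id 553543014 is simply the one that appears in the captured Referer header above):

sName, comments = getCommentText('553543014', 'example song')
if comments is None:
    print('comment crawl failed for', sName)
else:
    for c in comments:
        print(c['likedCount'], c['uName'], c['content'][:30])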

  3. Crawling the comments of every song on the chart

def getAllCom(url):
    # The original called getSongInfo() with no arguments, but getSongInfo
    # fills a list supplied by the caller, so the chart URL is passed in here.
    slist = []
    getSongInfo(url, slist)
    sInfoList = []
    allComm = []
    for dic in slist:
        if not dic:  # skip the "" placeholder appended on failure
            continue
        sInfoList.append((dic['sId'], dic['sName']))
    for sId, sName in sInfoList:
        trump = getCommentText(str(sId), sName)
        allComm.append(trump)
    return allComm
This builds the data set of the top 15 hot comments for every song in the song list.
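Putting it together (note that getAllCom now takes the chart URL, since the original no-argument getSongInfo() call would not have run; the URL is the same assumed one as before):

allComm = getAllCom('http://music.163.com/discover/toplist?id=3779629')  # assumed chart URL
print(len(allComm), 'songs fetched with their hot comments')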

  4. Storing the data

This really ought to go into a database, but since I haven't started that part yet, I'm saving to TXT files for now (a database sketch follows at the end of this section).

def savesongInfo(sList):
    num = 0
    with open('./song_list.txt', 'a', encoding='utf-8') as f:
        T = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
        f.write('《' + T + '》:' + '\n')
        for i in sList:
            num += 1
            # ensure_ascii=False keeps the Chinese song info readable in the file
            strt = json.dumps(i, ensure_ascii=False)
            f.write(str(num) + '.' + strt + '\n')
        f.write('\n====================================================\n\n')
        # no explicit f.close(): the with-block closes the file

def savecommentInfo(url):
    als = getAllCom(url)
    with open('./song_coment.txt', 'a', encoding='utf-8') as f:
        T = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
        f.write('《' + T + '》:' + '\n')
        num = 0
        for item in als:
            num += 1
            f.write(str(num) + '.' + '《' + item[0] + '》' + '\n')
            if item[1] is None:  # comment crawl failed for this song
                continue
            n = 0
            for i in item[1]:
                n += 1
                strt = json.dumps(i, ensure_ascii=False)
                f.write('(' + str(n) + ')' + strt + '\n')
        f.write('\n====================================================\n\n')
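Since the plan is to move from TXT files to a database eventually, here is a minimal sqlite3 sketch of what comment storage could look like; the table name and columns are my assumptions, not a finished schema.

import sqlite3

def saveCommentsToDb(als, dbPath='./music.db'):
    # als: the (sName, clist) tuples produced by getAllCom()
    conn = sqlite3.connect(dbPath)
    conn.execute('''CREATE TABLE IF NOT EXISTS hot_comments (
                        song_name   TEXT,
                        user_name   TEXT,
                        liked_count INTEGER,
                        content     TEXT,
                        crawled_at  TEXT DEFAULT CURRENT_TIMESTAMP)''')
    for sName, clist in als:
        if clist is None:  # skip songs whose comment crawl failed
            continue
        conn.executemany(
            'INSERT INTO hot_comments (song_name, user_name, liked_count, content) '
            'VALUES (?, ?, ?, ?)',
            [(sName, c['uName'], c['likedCount'], c['content']) for c in clist])
    conn.commit()
    conn.close()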