I’m trying to avoid writing nested for-loops, because they are hard to read:
from bs4 import BeautifulSoup
import requests
import time

class bs_spider:
    def __init__(self, url):
        '''A simple BeautifulSoup-based spider for paginated search results.'''
        self.url = url

    def get_title_list(self, number, tag1, tag2):
        '''Scrape `number` pages, collecting the `tag2` attribute of every `tag1` tag.'''
        title_list = []
        for c in range(1, number + 1):
            time.sleep(3)
            print('Scraping page', str(c))
            url1 = self.url + str(c)
            try:
                wb = requests.get(url1)
                soup = BeautifulSoup(wb.text, 'lxml')
                for i in soup.find_all(tag1):
                    term = i.get(tag2)
                    if term is not None:  # skip tags without the attribute
                        title_list.append(term)
            except OSError:  # requests' exceptions subclass OSError, so this catches network errors
                print('Sorry, the URL you entered cannot be reached!')
        print('Done! Scraped', str(len(title_list)), 'titles in total!')
        return list(set(title_list))  # drop duplicates
url = 'https://search.bilibili.com/all?keyword=冬泳怪鸽&page='
bbilititle = bs_spider(url).get_title_list(5,'a','title')
#> Scraping page 1
#> Scraping page 2
#> Scraping page 3
#> Scraping page 4
#> Scraping page 5
#> Done! Scraped 200 titles in total!
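As an aside, `list(set(title_list))` discards the order in which titles were scraped. If first-seen order matters, `dict.fromkeys` dedupes while preserving it, since dict keys are unique and keep insertion order in Python 3.7+. A minimal sketch with toy data standing in for scraped titles:

```python
# Toy titles standing in for scraped results (hypothetical data)
titles = ['B', 'A', 'B', 'C', 'A']

# dict keys are unique and keep insertion order (Python 3.7+),
# so this dedupes while preserving first-seen order
unique_ordered = list(dict.fromkeys(titles))
print(unique_ordered)  # → ['B', 'A', 'C']
```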
This is a simple crawler for a Chinese web page. The outer for-loop walks through the pages; on each iteration, the inner for-loop collects the `title` attribute of every `a` tag. The logic is clear enough, but I bet it is still hard to follow: a double for-loop is unpleasant to read, and having to pre-create an empty list with title_list = []
is inelegant. So I tried to refactor my code. First I wrote a function:
def find_tag(url_1, tag1, tag2):
    '''Fetch one page and return the `tag2` attribute of every `tag1` tag.'''
    try:
        wb = requests.get(url_1)
    except OSError:
        print('Sorry, the URL you entered cannot be reached!')
        return []  # without this, `wb` would be unbound below and raise UnboundLocalError
    soup = BeautifulSoup(wb.text, 'lxml')
    return [i.get(tag2) for i in soup.find_all(tag1) if i.get(tag2) is not None]
Here find_tag
replaces the inner for-loop. Using a list comprehension to build the list makes my code clearer and tidier. Then I rebuilt the bs_spider
class:
class bs_spider:
    def __init__(self, url):
        '''A spider that delegates per-page parsing to find_tag.'''
        self.url = url

    def get_title_list(self, number, tag1, tag2):
        '''Scrape `number` pages and return the unique titles.'''
        title_list = []
        for c in range(1, number + 1):
            time.sleep(3)
            print('Scraping page', str(c))
            url1 = self.url + str(c)
            term = find_tag(url1, tag1, tag2)
            title_list.extend(term)
        print('Done! Scraped', str(len(title_list)), 'titles in total!')
        return list(set(title_list))  # return unique titles
url = 'https://search.bilibili.com/all?keyword=冬泳怪鸽&page='
bbilititle = bs_spider(url).get_title_list(5,'a','title')
#> Scraping page 1
#> Scraping page 2
#> Scraping page 3
#> Scraping page 4
#> Scraping page 5
#> Done! Scraped 200 titles in total!
I didn’t replace the outer loop with a list comprehension as well, because that would make the code more complex. Note that I use the list.extend()
method to merge the captions from all pages.
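The difference matters: `append` would nest each page's list inside `title_list`, while `extend` splices the elements in flat. A quick sketch with hypothetical per-page results:

```python
# Hypothetical per-page results
page1 = ['title A', 'title B']
page2 = ['title B', 'title C']

nested = []
nested.append(page1)  # append adds the whole list as ONE element
nested.append(page2)  # → [['title A', 'title B'], ['title B', 'title C']]

flat = []
flat.extend(page1)    # extend splices the elements in one by one
flat.extend(page2)    # → ['title A', 'title B', 'title B', 'title C']
```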
Prefer small functions and list comprehensions to stacked for-loops. I came to feel this way after reading other people’s code; in that moment, I seriously doubted whether I had ever really learned programming. Code readability matters a great deal to me.
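The advice above can be sketched with a toy example, using dicts as stand-ins for the Tag objects `soup.find_all` returns (both support `.get`):

```python
# Toy stand-ins for soup.find_all results (hypothetical data)
tags = [{'title': 'A'}, {'href': '/x'}, {'title': 'B'}]

# Nested-loop style: more lines, mutable state
titles = []
for t in tags:
    term = t.get('title')
    if term is not None:
        titles.append(term)

# Comprehension style: same result in one line
titles2 = [t.get('title') for t in tags if t.get('title') is not None]
print(titles2)  # → ['A', 'B']
```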