How to Scrape the Content of Two URLs with Python

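Scraping the content of two URLs is a fairly common task in Python, and it is usually done with the requests and BeautifulSoup libraries. The steps below walk through the process one piece at a time.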

First, install the required libraries. In Python, requests is commonly used to send HTTP requests and BeautifulSoup to parse HTML documents. The install commands are:

pip install requests
pip install beautifulsoup4

With the libraries installed, we can start writing the code.

Import the required libraries:

import requests
from bs4 import BeautifulSoup

Define a function that fetches a page:

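def get_html(url):
    """Return the HTML of url as text, or None if the request fails."""
    try:
        # Send a browser-like User-Agent instead of the default one used by requests
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
        response = requests.get(url, headers=headers, timeout=10)
        # Guess the encoding from the response body so non-UTF-8 pages decode correctly
        response.encoding = response.apparent_encoding
        if response.status_code == 200:
            return response.text
        else:
            return None
    except requests.RequestException as e:
        # Covers connection errors, timeouts, invalid URLs, and similar failures
        print(e)
        return None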

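This function sends an HTTP request and returns the page's HTML. It returns None if the request raises an exception or the status code is not 200.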

Fetch the content of the two URLs:

# Replace these placeholders with the real URLs
url1 = 'first URL'
url2 = 'second URL'
html1 = get_html(url1)
html2 = get_html(url2)

Here the two URLs are assigned to the variables url1 and url2, and get_html is called on each to retrieve its HTML.
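If the number of URLs grows beyond two, the same step can be written as a loop. The following is a minimal sketch, not part of the original code: fetch_all, its delay parameter, and fake_fetch are hypothetical names introduced for illustration. It assumes the fetch argument behaves like get_html above (it takes a URL and returns HTML text or None).

```python
import time


def fetch_all(urls, fetch, delay=1.0):
    """Fetch each URL in turn and return a dict mapping URL -> HTML (or None)."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay)  # pause between requests to go easy on the server
        results[url] = fetch(url)
    return results


# Hypothetical stand-in for get_html so the sketch runs without network access.
def fake_fetch(url):
    return f"<html><title>{url}</title></html>"


pages = fetch_all(
    ["https://example.com/a", "https://example.com/b"],
    fake_fetch,
    delay=0,
)
for url, html in pages.items():
    print(url, "->", "failed" if html is None else f"{len(html)} characters")
```

In real use you would pass get_html instead of fake_fetch and keep a non-zero delay.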

Parse the HTML and extract the data you need:

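def parse_html(html):
    # get_html returns None on failure, so there is nothing to parse in that case
    if html is None:
        return None
    soup = BeautifulSoup(html, 'html.parser')
    # Extract whatever you need here; the page title is used as an example
    title_tag = soup.find('title')
    # find() returns None when the page has no <title> tag
    return title_tag.get_text() if title_tag is not None else None

title1 = parse_html(html1)
title2 = parse_html(html2)
print('Title of the first URL:', title1)
print('Title of the second URL:', title2)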

In this step we define a parse_html function that parses the HTML. It extracts the page title as an example; you can modify the function to extract other data as needed.
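As an illustration of extracting something other than the title, here is a small sketch that collects top-level headings and links. The function name parse_details and the SAMPLE_HTML string are made up for this example; the BeautifulSoup calls used (find_all, get_text, and attribute access on a tag) are standard ones.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page so the example runs without network access.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Main heading</h1>
    <a href="/about">About</a>
    <a href="https://example.com/contact">Contact</a>
    <a>No href here</a>
  </body>
</html>
"""


def parse_details(html):
    """Return the <h1> texts and the (text, href) pairs of all links with an href."""
    soup = BeautifulSoup(html, "html.parser")
    headings = [h.get_text(strip=True) for h in soup.find_all("h1")]
    # href=True skips <a> tags that have no href attribute
    links = [(a.get_text(strip=True), a["href"]) for a in soup.find_all("a", href=True)]
    return {"headings": headings, "links": links}


details = parse_details(SAMPLE_HTML)
print(details["headings"])
print(details["links"])
```

The same pattern applies to other tags and attributes: pick the elements with find_all, then read their text or attributes.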

Complete code:

Putting the pieces above together gives the following complete scraper:

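import requests
from bs4 import BeautifulSoup

def get_html(url):
    """Return the HTML of url as text, or None if the request fails."""
    try:
        # Send a browser-like User-Agent instead of the default one used by requests
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
        response = requests.get(url, headers=headers, timeout=10)
        # Guess the encoding from the response body so non-UTF-8 pages decode correctly
        response.encoding = response.apparent_encoding
        if response.status_code == 200:
            return response.text
        else:
            return None
    except requests.RequestException as e:
        print(e)
        return None

def parse_html(html):
    """Return the page title, or None if the page is missing or has no <title>."""
    if html is None:
        return None
    soup = BeautifulSoup(html, 'html.parser')
    title_tag = soup.find('title')
    return title_tag.get_text() if title_tag is not None else None

# Replace these placeholders with the real URLs
url1 = 'first URL'
url2 = 'second URL'
html1 = get_html(url1)
html2 = get_html(url2)
title1 = parse_html(html1)
title2 = parse_html(html2)
print('Title of the first URL:', title1)
print('Title of the second URL:', title2)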

Running this code prints the title of each of the two URLs. The title is only an example; you can modify parse_html to extract whatever other data is more useful to you.

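Keep in mind that when scraping, you must comply with the applicable laws and regulations, avoid placing unnecessary load on the target site, and respect the site's robots.txt rules, so that the data is collected in a reasonable and lawful way.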
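One way to check robots.txt rules from code is the standard library's urllib.robotparser module. The sketch below is only an illustration: the helper build_parser, the user agent name "my-crawler", and the sample rules are assumptions made for this example, and the rules are parsed from an inline string so that it runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site publishes its own at /robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Build a RobotFileParser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
# "my-crawler" is a made-up user agent name for this example
print(parser.can_fetch("my-crawler", "https://example.com/index.html"))
print(parser.can_fetch("my-crawler", "https://example.com/private/data.html"))
```

For a live site, you would instead call set_url() with the site's robots.txt address followed by read(), then call can_fetch() before requesting each page.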