Python Requests Response [text/content]

09 Apr 2015 - evi1m0

During a code review I noticed that some people read the Requests library's Response objects through the `text` property while others prefer `content`. Both return the body of the page, so what exactly is the difference between the two?
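
A quick look at the types already hints at the answer (the URL below is only illustrative): `content` gives back the raw bytes of the body, while `text` gives back a decoded unicode string. The library source explains how that decoding happens.

import requests

r = requests.get('http://example.com')   # illustrative URL
print(type(r.content))   # str (raw bytes) on Python 2, bytes on Python 3
print(type(r.text))      # unicode on Python 2, str on Python 3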

/Library/Python/2.7/site-packages/requests-2.2.1-py2.7.egg/requests/models.py :

@property
def content(self):
    """Content of the response, in bytes."""

    if self._content is False:
        # Read the contents.
        try:
            if self._content_consumed:
                raise RuntimeError(
                    'The content for this response was already consumed')

            if self.status_code == 0:
                self._content = None
            else:
                self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()

        except AttributeError:
            self._content = None

    self._content_consumed = True
    # don't need to release the connection; that's been handled by urllib3
    # since we exhausted the data.
    return self._content

@property
def text(self):
    """Content of the response, in unicode.

    If Response.encoding is None, encoding will be guessed using
    ``chardet``.

    The encoding of the response content is determined based soley on HTTP
    headers, following RFC 2616 to the letter. If you can take advantage of
    non-HTTP knowledge to make a better guess at the encoding, you should
    set ``r.encoding`` appropriately before accessing this property.
    """

    # Try charset from content-type
    content = None
    encoding = self.encoding

    if not self.content:
        return str('')

    # Fallback to auto-detected encoding.
    if self.encoding is None:
        encoding = self.apparent_encoding

    # Decode unicode from given encoding.
    try:
        content = str(self.content, encoding, errors='replace')
    except (LookupError, TypeError):
        # A LookupError is raised if the encoding was not found which could
        # indicate a misspelling or similar mistake.
        #
        # A TypeError can be raised if encoding is None
        #
        # So we try blindly encoding.
        content = str(self.content, errors='replace')

    return content

Reading the two properties above carefully, you can see that `text` additionally decodes the raw bytes with an encoding (a manual equivalent is sketched after `apparent_encoding` below):

encoding = self.apparent_encoding

@property
def apparent_encoding(self):
    """The apparent encoding, provided by the lovely Charade library
    (Thanks, Ian!)."""
    return chardet.detect(self.content)['encoding']
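
Put together, when the server did not announce an encoding, `text` behaves roughly like the following sketch (the UTF-8 fallback here is this sketch's choice, not the library's):

import requests
import chardet

r = requests.get('http://example.com')            # illustrative URL
guess = chardet.detect(r.content)['encoding']     # the same detection apparent_encoding performs
text = r.content.decode(guess or 'utf-8', errors='replace')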

/Library/Python/2.7/site-packages/requests-2.2.1-py2.7.egg/requests/adapters.py :

def build_response(self, req, resp):
    """Builds a :class:`Response <requests.Response>` object from a urllib3
    response. This should not be called from user code, and is only exposed
    for use when subclassing the
    :class:`HTTPAdapter <requests.adapters.HTTPAdapter>`

    :param req: The :class:`PreparedRequest <PreparedRequest>` used to generate the response.
    :param resp: The urllib3 response object.
    """
    response = Response()

    # Fallback to None if there's no status_code, for whatever reason.
    response.status_code = getattr(resp, 'status', None)

    # Make headers case-insensitive.
    response.headers = CaseInsensitiveDict(getattr(resp, 'headers', {}))

    # Set encoding.
    response.encoding = get_encoding_from_headers(response.headers)
    response.raw = resp
    response.reason = response.raw.reason

    if isinstance(req.url, bytes):
        response.url = req.url.decode('utf-8')
    else:
        response.url = req.url

    # Add new cookies from the server.
    extract_cookies_to_jar(response.cookies, req, resp)

    # Give the Response some context.
    response.request = req
    response.connection = self

    return response

Inside `build_response` above, `response.encoding` is filled in by `get_encoding_from_headers` (defined in requests/utils.py):

def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    """

    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'
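
That last branch is the usual source of mojibake for Chinese pages: when a `text/*` response carries no `charset`, the encoding falls back to ISO-8859-1 and `r.text` is decoded with the wrong codec. A minimal sketch of one workaround, assuming a page that is really UTF-8 but served without a charset (URL is illustrative):

import requests

r = requests.get('http://example.com/page.html')  # assume: Content-Type: text/html, no charset, body is UTF-8
print(r.encoding)                                  # 'ISO-8859-1', from the header rule above
r.encoding = 'utf-8'                               # set the real encoding before touching r.text
html = r.text                                      # now decoded with UTF-8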

In short, `get_encoding_from_headers` derives the encoding purely from the response's Content-Type header. I therefore still recommend fetching the body with `content`; if the encoding turns out wrong for a particular page, convert the bytes yourself as the situation requires.
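
A sketch of that explicit conversion, assuming a page known to be GBK-encoded (URL is illustrative):

import requests

r = requests.get('http://example.com/gbk_page.html')   # assume we know this page is GBK-encoded
html = r.content.decode('gbk', errors='replace')        # decode the raw bytes ourselves instead of relying on r.text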
