Python Requests Response [text/content]

09 Apr 2015 - evi1m0

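During code review I noticed that some people read the page body from a Requests Response object through text, while others use content. Both return the body of the page, so what exactly is the difference between the two?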

/Library/Python/2.7/site-packages/requests-2.2.1-py2.7.egg/requests/ :

@property
def content(self):
    """Content of the response, in bytes."""

    if self._content is False:
        # Read the contents.
        try:
            if self._content_consumed:
                raise RuntimeError(
                    'The content for this response was already consumed')

            if self.status_code == 0:
                self._content = None
            else:
                self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()

        except AttributeError:
            self._content = None

    self._content_consumed = True
    # don't need to release the connection; that's been handled by urllib3
    # since we exhausted the data.
    return self._content

@property
def text(self):
    """Content of the response, in unicode.

    If Response.encoding is None, encoding will be guessed using
    ``charade``.

    The encoding of the response content is determined based solely on HTTP
    headers, following RFC 2616 to the letter. If you can take advantage of
    non-HTTP knowledge to make a better guess at the encoding, you should
    set ``r.encoding`` appropriately before accessing this property.
    """

    # Try charset from content-type
    content = None
    encoding = self.encoding

    if not self.content:
        return str('')

    # Fallback to auto-detected encoding.
    if self.encoding is None:
        encoding = self.apparent_encoding

    # Decode unicode from given encoding.
    try:
        content = str(self.content, encoding, errors='replace')
    except (LookupError, TypeError):
        # A LookupError is raised if the encoding was not found which could
        # indicate a misspelling or similar mistake.
        # A TypeError can be raised if encoding is None
        # So we try blindly encoding.
        content = str(self.content, errors='replace')

    return content

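Reading the two functions closely, text is simply content plus a decoding step. When no encoding has been set on the response, it falls back to a guessed one: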

encoding = self.apparent_encoding

@property
def apparent_encoding(self):
    """The apparent encoding, provided by the lovely Charade library
    (Thanks, Ian!)."""
    return chardet.detect(self.content)['encoding']
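
Taken together, the decoding path of text can be sketched as a standalone function. This is a simplified Python 3 sketch rather than the library's own code: `decode_like_text` is a hypothetical name, and the `apparent_encoding` argument stands in for whatever the detector would have guessed.

```python
def decode_like_text(raw, encoding=None, apparent_encoding=None):
    """Hypothetical stand-in for the decoding steps of Response.text.

    raw               -- the body as bytes (what content returns)
    encoding          -- the encoding taken from the headers, or None
    apparent_encoding -- the detector's guess, or None (assumed input)
    """
    if not raw:
        return ''

    # Fall back to the guessed encoding when the headers gave none.
    if encoding is None:
        encoding = apparent_encoding

    try:
        return str(raw, encoding, errors='replace')
    except (LookupError, TypeError):
        # LookupError: unknown codec name; TypeError: encoding is None.
        # Decode with Python's default (UTF-8), replacing bad bytes.
        return str(raw, errors='replace')


body = 'café'.encode('utf-8')
print(decode_like_text(body, 'utf-8'))
print(decode_like_text(body, 'no-such-codec'))
```

Note that errors='replace' means a wrong encoding never raises: it silently produces replacement characters or mojibake, which is why the choice of encoding matters.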

/Library/Python/2.7/site-packages/requests-2.2.1-py2.7.egg/requests/ :

def build_response(self, req, resp):
    """Builds a :class:`Response <requests.Response>` object from a urllib3
    response. This should not be called from user code, and is only exposed
    for use when subclassing the
    :class:`HTTPAdapter <requests.adapters.HTTPAdapter>`

    :param req: The :class:`PreparedRequest <PreparedRequest>` used to generate the response.
    :param resp: The urllib3 response object.
    """
    response = Response()

    # Fallback to None if there's no status_code, for whatever reason.
    response.status_code = getattr(resp, 'status', None)

    # Make headers case-insensitive.
    response.headers = CaseInsensitiveDict(getattr(resp, 'headers', {}))

    # Set encoding.
    response.encoding = get_encoding_from_headers(response.headers)
    response.raw = resp
    response.reason = response.raw.reason

    if isinstance(req.url, bytes):
        response.url = req.url.decode('utf-8')
    else:
        response.url = req.url

    # Add new cookies from the server.
    extract_cookies_to_jar(response.cookies, req, resp)

    # Give the Response some context.
    response.request = req
    response.connection = self

    return response

Note that build_response calls get_encoding_from_headers:

def get_encoding_from_headers(headers):
    """Returns encodings from given HTTP Header Dict.

    :param headers: dictionary to extract encoding from.
    """

    content_type = headers.get('content-type')

    if not content_type:
        return None

    content_type, params = cgi.parse_header(content_type)

    if 'charset' in params:
        return params['charset'].strip("'\"")

    if 'text' in content_type:
        return 'ISO-8859-1'
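
To see the effect of these rules without installing anything, here is a rough standard-library approximation. It is a sketch under assumptions, not the Requests implementation: `encoding_from_content_type` is a hypothetical helper, and it parses the header with `email.message.Message` instead of `cgi.parse_header`.

```python
from email.message import Message


def encoding_from_content_type(content_type):
    """Hypothetical helper that approximates the header rules shown above."""
    if not content_type:
        return None

    # Let the stdlib parse the header parameters (charset, boundary, ...).
    msg = Message()
    msg['Content-Type'] = content_type
    charset = msg.get_param('charset')
    if charset:
        return str(charset).strip("'\"")

    # No charset given: any "text" media type defaults to ISO-8859-1.
    media_type = content_type.split(';', 1)[0].strip().lower()
    if 'text' in media_type:
        return 'ISO-8859-1'

    return None


for header in ('text/html; charset=UTF-8', 'text/html', 'application/json'):
    print(header, '->', encoding_from_content_type(header))
```

The middle case is the one that bites in practice: a UTF-8 or GBK page served as plain text/html with no charset is treated as ISO-8859-1.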

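Having read get_encoding_from_headers, you can see that it picks the encoding from the Content-Type response header: an explicit charset wins, and a text type without one falls back to ISO-8859-1. That encoding is what text later decodes with. Even so, I still recommend reading the body through content; if the encoding turns out to be a problem, convert the bytes yourself according to the specific case.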
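
A minimal illustration of why working from the raw bytes is the safer default. The sample string and variable names are made up for the example; `raw` stands in for what content would return for a UTF-8 page served as text/html without a charset.

```python
# Bytes as they would arrive for a UTF-8 encoded page.
raw = '中文页面'.encode('utf-8')

# With no charset in the header, the text/* default is ISO-8859-1,
# so decoding that way yields mojibake instead of an error.
garbled = raw.decode('ISO-8859-1')

# Working from the bytes, decode with the encoding you know to be right.
correct = raw.decode('utf-8')

# If only the garbled string is left, the mistake is reversible, because
# ISO-8859-1 maps every byte to exactly one code point.
repaired = garbled.encode('ISO-8859-1').decode('utf-8')

print(garbled == correct)
print(repaired == correct)
```

The alternative named in the text docstring is to set r.encoding to the correct value before accessing r.text.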

