jwm.robotstxt.googlebot.RobotsMatcher¶
- class jwm.robotstxt.googlebot.RobotsMatcher¶
Bases: RobotsParseHandler
RobotsMatcher - matches robots.txt against URLs.
The Matcher uses a default match strategy for Allow/Disallow patterns, which is the official way Google's crawler matches robots.txt. It is also possible to provide a custom match strategy.
The entry point for the user is to call one of the *AllowedByRobots() methods, which return whether a URL is allowed according to the robots.txt file and the crawl agent. The RobotsMatcher can be re-used across URLs and robots.txt files, but it is not thread-safe.
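A minimal usage sketch, assuming the class is importable from the documented module path; the robots.txt body, agent name, and URLs below are illustrative:

    from jwm.robotstxt.googlebot import RobotsMatcher

    robots_body = (
        "User-agent: *\n"
        "Disallow: /private/\n"
    )

    matcher = RobotsMatcher()
    # Expected True: no rule matches /public/page.html.
    print(matcher.AllowedByRobots(robots_body, ["FooBot"], "https://example.com/public/page.html"))
    # Expected False: "Disallow: /private/" matches the URL path.
    print(matcher.AllowedByRobots(robots_body, ["FooBot"], "https://example.com/private/secret.html"))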
- __init__(self: jwm.robotstxt.googlebot.RobotsMatcher) → None ¶
Create a RobotsMatcher with the default matching strategy. The default matching strategy is longest-match, as opposed to the former internet draft, which prescribed a first-match strategy. Analysis shows that longest-match, while more restrictive for crawlers, is what webmasters assume when writing directives. For example, in case of conflicting matches (both Allow and Disallow), the longest match is the one the user wants. Consider a robots.txt file with the following rules:
    Allow: /
    Disallow: /cgi-bin
It is pretty obvious what the webmaster wants: they want to allow crawling of every URI except /cgi-bin. However, under the expired internet draft, crawlers should be allowed to crawl everything with such a rule, since the first rule that matches (Allow: /) would apply.
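A sketch of how the longest-match default resolves the conflicting rules above (a User-agent line is added so the rules form a group; the agent name and URLs are illustrative):

    from jwm.robotstxt.googlebot import RobotsMatcher

    robots_body = (
        "User-agent: *\n"
        "Allow: /\n"
        "Disallow: /cgi-bin\n"
    )

    matcher = RobotsMatcher()
    # "Disallow: /cgi-bin" is the longest match for this URL, so it wins over "Allow: /".
    print(matcher.AllowedByRobots(robots_body, ["FooBot"], "https://example.com/cgi-bin/script"))  # expected False
    # Everything else only matches "Allow: /" and stays crawlable.
    print(matcher.AllowedByRobots(robots_body, ["FooBot"], "https://example.com/index.html"))      # expected True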
- AllowedByRobots(self: jwm.robotstxt.googlebot.RobotsMatcher, robots_body: str, user_agents: list[str], url: str) → bool ¶
Returns true iff 'url' is allowed to be fetched by any member of the user_agents list. 'url' must be %-encoded according to RFC 3986.
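A sketch with several user agents and a %-encoded URL; urllib is used here only to produce the encoding, and the robots.txt body and names are illustrative:

    from urllib.parse import quote
    from jwm.robotstxt.googlebot import RobotsMatcher

    robots_body = (
        "User-agent: FooBot\n"
        "User-agent: BarBot\n"
        "Disallow: /private/\n"
    )

    # Percent-encode the path so the URL conforms to RFC 3986.
    url = "https://example.com/" + quote("some page/result")

    matcher = RobotsMatcher()
    print(matcher.AllowedByRobots(robots_body, ["FooBot", "BarBot"], url))  # expected True
    print(matcher.AllowedByRobots(robots_body, ["FooBot"],
                                  "https://example.com/private/report"))    # expected False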
- static ExtractUserAgent(user_agent: str) → str ¶
Extract the matchable part of a user agent string, essentially stopping at the first invalid character. Example: 'Googlebot/2.1' becomes 'Googlebot'.
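For illustration (the agent strings are arbitrary):

    from jwm.robotstxt.googlebot import RobotsMatcher

    print(RobotsMatcher.ExtractUserAgent("Googlebot/2.1"))             # "Googlebot"
    print(RobotsMatcher.ExtractUserAgent("Mozilla/5.0 (compatible)"))  # "Mozilla"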
- HandleAllow(self: jwm.robotstxt.googlebot.RobotsParseHandler, line_num: int, value: str) → None ¶
- HandleDisallow(self: jwm.robotstxt.googlebot.RobotsParseHandler, line_num: int, value: str) → None ¶
- HandleRobotsEnd(self: jwm.robotstxt.googlebot.RobotsParseHandler) → None ¶
- HandleRobotsStart(self: jwm.robotstxt.googlebot.RobotsParseHandler) → None ¶
- HandleSitemap(self: jwm.robotstxt.googlebot.RobotsParseHandler, line_num: int, value: str) → None ¶
- HandleUnknownAction(self: jwm.robotstxt.googlebot.RobotsParseHandler, line_num: int, action: str, value: str) → None ¶
Handles any other unrecognized name/value pairs.
- HandleUserAgent(self: jwm.robotstxt.googlebot.RobotsParseHandler, line_num: int, value: str) → None ¶
- InitUserAgentsAndPath(self: jwm.robotstxt.googlebot.RobotsMatcher, user_agents: list[str], path: str) → None ¶
Initialize the next path and user-agents to check. The path must contain only the path, params, and query (if any) of the URL and must start with a '/'.
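A sketch of deriving a conforming path argument from a full URL; the robots_path helper and its inputs are illustrative, not part of this API, and urllib does the splitting:

    from urllib.parse import urlsplit
    from jwm.robotstxt.googlebot import RobotsMatcher

    def robots_path(url: str) -> str:
        """Reduce a full URL to the path[?query] form expected by InitUserAgentsAndPath."""
        parts = urlsplit(url)
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        return path

    matcher = RobotsMatcher()
    # Only "/search?q=robots" is passed; scheme, host, and fragment are dropped.
    matcher.InitUserAgentsAndPath(["FooBot"], robots_path("https://example.com/search?q=robots#top"))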
- static IsValidUserAgentToObey(user_agent: str) → bool ¶
Verifies that the given user agent is valid to be matched against robots.txt. Valid user agent strings contain only the characters [a-zA-Z_-].
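For illustration:

    from jwm.robotstxt.googlebot import RobotsMatcher

    print(RobotsMatcher.IsValidUserAgentToObey("Googlebot"))      # True
    print(RobotsMatcher.IsValidUserAgentToObey("Googlebot/2.1"))  # False: '/' and '.' are outside [a-zA-Z_-]
    print(RobotsMatcher.IsValidUserAgentToObey(""))               # False: empty string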
- OneAgentAllowedByRobots(self: jwm.robotstxt.googlebot.RobotsMatcher, robots_txt: str, user_agent: str, url: str) → bool ¶
Performs the robots check for 'url' when there is only one user agent. 'url' must be %-encoded according to RFC 3986.
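A sketch of the single-agent variant (the robots.txt body, agent name, and URLs are illustrative):

    from jwm.robotstxt.googlebot import RobotsMatcher

    robots_txt = (
        "User-agent: FooBot\n"
        "Disallow: /downloads/\n"
    )

    matcher = RobotsMatcher()
    print(matcher.OneAgentAllowedByRobots(robots_txt, "FooBot",
                                          "https://example.com/index.html"))       # expected True
    print(matcher.OneAgentAllowedByRobots(robots_txt, "FooBot",
                                          "https://example.com/downloads/a.zip"))  # expected False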
- ReportLineMetadata(self: jwm.robotstxt.googlebot.RobotsParseHandler, line_num: int, metadata: jwm.robotstxt.googlebot.RobotsParseHandler.LineMetaData) → None ¶
- disallow(self: jwm.robotstxt.googlebot.RobotsMatcher) → bool ¶
Returns true if we are disallowed from crawling a matching URI.
- disallow_ignore_global(self: jwm.robotstxt.googlebot.RobotsMatcher) → bool ¶
Returns true if we are disallowed from crawling a matching URI. Ignores any rules specified for the default user agent, and bases its results only on the specified user agents.
- ever_seen_specific_agent(self: jwm.robotstxt.googlebot.RobotsMatcher) → bool ¶
Returns true iff, when AllowedByRobots() was called, the robots file referred explicitly to one of the specified user agents.
- matching_line(self: jwm.robotstxt.googlebot.RobotsMatcher) → int ¶
Returns the line that matched or 0 if none matched.
- seen_any_agent(self: jwm.robotstxt.googlebot.RobotsMatcher) → bool ¶
Returns true if any user-agent was seen.
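The accessors above report on the most recent check, so they are typically read right after an *AllowedByRobots() call. A sketch, with illustrative contents and expected values:

    from jwm.robotstxt.googlebot import RobotsMatcher

    robots_body = (
        "User-agent: FooBot\n"   # line 1
        "Disallow: /private/\n"  # line 2
    )

    matcher = RobotsMatcher()
    allowed = matcher.AllowedByRobots(robots_body, ["FooBot"], "https://example.com/private/x")

    print(allowed)                             # expected False
    print(matcher.disallow())                  # expected True: a Disallow rule matched
    print(matcher.matching_line())             # expected 2: the "Disallow: /private/" line
    print(matcher.seen_any_agent())            # expected True
    print(matcher.ever_seen_specific_agent())  # expected True: "FooBot" is named explicitly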