jwm.robotstxt.googlebot.RobotsMatcher

class jwm.robotstxt.googlebot.RobotsMatcher

Bases: RobotsParseHandler

RobotsMatcher - matches robots.txt against URLs.

The Matcher uses a default match strategy for Allow/Disallow patterns, which is the official strategy Google's crawler uses to match robots.txt. It is also possible to provide a custom match strategy.

The entry point for the user is to call one of the *AllowedByRobots() methods, which return directly whether a URL is allowed according to the robots.txt and the crawl agent. The RobotsMatcher can be re-used across URLs and robots.txt files, but it is not thread-safe.

__init__(self: jwm.robotstxt.googlebot.RobotsMatcher) → None

Create a RobotsMatcher with the default matching strategy. The default matching strategy is longest-match, as opposed to the former internet draft that provisioned a first-match strategy. Analysis shows that longest-match, while more restrictive for crawlers, is what webmasters assume when writing directives. In case of conflicting matches (both Allow and Disallow), the longest match is the one the user wants. For example, given a robots.txt file with the following rules

    Allow: /
    Disallow: /cgi-bin

it’s pretty obvious what the webmaster wants: they want to allow crawl of every URI except /cgi-bin. However, according to the expired internet standard, crawlers should be allowed to crawl everything with such a rule.
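
A minimal usage sketch of the longest-match behavior described above, assuming the bindings are importable as jwm.robotstxt.googlebot; a User-agent: * line is added so the rules form a complete group, and the URLs and expected results are illustrative:

    from jwm.robotstxt.googlebot import RobotsMatcher

    robots_body = (
        "User-agent: *\n"
        "Allow: /\n"
        "Disallow: /cgi-bin\n"
    )

    matcher = RobotsMatcher()
    # Longest match wins: /cgi-bin/foo is matched by the longer Disallow rule.
    print(matcher.OneAgentAllowedByRobots(robots_body, "Googlebot",
                                          "https://example.com/cgi-bin/foo"))   # expected False
    # Everything else is matched only by the shorter Allow rule.
    print(matcher.OneAgentAllowedByRobots(robots_body, "Googlebot",
                                          "https://example.com/index.html"))    # expected True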

AllowedByRobots(self: jwm.robotstxt.googlebot.RobotsMatcher, robots_body: str, user_agents: list[str], url: str) → bool

Returns true iff ‘url’ is allowed to be fetched by any member of the “user_agents” vector. ‘url’ must be %-encoded according to RFC3986.
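
A short sketch of checking a URL against several user agents at once; the robots.txt content, agent names, and URLs are illustrative:

    from jwm.robotstxt.googlebot import RobotsMatcher

    robots_body = (
        "User-agent: *\n"
        "Disallow: /private/\n"
    )

    matcher = RobotsMatcher()
    # True iff any of the listed agents may fetch the (already %-encoded) URL.
    print(matcher.AllowedByRobots(robots_body, ["FooBot", "BarBot"],
                                  "https://example.com/private/doc.html"))  # expected False
    print(matcher.AllowedByRobots(robots_body, ["FooBot", "BarBot"],
                                  "https://example.com/public/doc.html"))   # expected True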

static ExtractUserAgent(user_agent: str) → str

Extract the matchable part of a user agent string, essentially stopping at the first invalid character. Example: ‘Googlebot/2.1’ becomes ‘Googlebot’
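
For illustration, calling the static helper on the value from the docstring above:

    from jwm.robotstxt.googlebot import RobotsMatcher

    print(RobotsMatcher.ExtractUserAgent("Googlebot/2.1"))  # "Googlebot"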

HandleAllow(self: jwm.robotstxt.googlebot.RobotsParseHandler, line_num: int, value: str) → None
HandleDisallow(self: jwm.robotstxt.googlebot.RobotsParseHandler, line_num: int, value: str) → None
HandleRobotsEnd(self: jwm.robotstxt.googlebot.RobotsParseHandler) → None
HandleRobotsStart(self: jwm.robotstxt.googlebot.RobotsParseHandler) → None
HandleSitemap(self: jwm.robotstxt.googlebot.RobotsParseHandler, line_num: int, value: str) → None
HandleUnknownAction(self: jwm.robotstxt.googlebot.RobotsParseHandler, line_num: int, action: str, value: str) → None

Handles any other unrecognized name/value pairs.

HandleUserAgent(self: jwm.robotstxt.googlebot.RobotsParseHandler, line_num: int, value: str) → None
InitUserAgentsAndPath(self: jwm.robotstxt.googlebot.RobotsMatcher, user_agents: list[str], path: str) → None

Initialize next path and user-agents to check. Path must contain only the path, params, and query (if any) of the url and must start with a ‘/’.
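
A sketch of building a path that meets this requirement from a full URL; urllib is used only for illustration and is not part of these bindings:

    from urllib.parse import urlparse

    def robots_path(url: str) -> str:
        """Return the path + params + query of a URL, starting with '/'."""
        parts = urlparse(url)
        path = parts.path or "/"
        if parts.params:
            path += ";" + parts.params
        if parts.query:
            path += "?" + parts.query
        return path

    print(robots_path("https://example.com/folder/page;v=2?x=1"))  # "/folder/page;v=2?x=1"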

static IsValidUserAgentToObey(user_agent: str) → bool

Verifies that the given user agent is valid to be matched against robots.txt. Valid user agent strings only contain the characters [a-zA-Z_-].
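
Illustrative checks following the character-set rule stated above; the expected values are inferred from that rule:

    from jwm.robotstxt.googlebot import RobotsMatcher

    print(RobotsMatcher.IsValidUserAgentToObey("Googlebot"))      # expected True
    print(RobotsMatcher.IsValidUserAgentToObey("Googlebot/2.1"))  # expected False: '/', '.' and digits are outside [a-zA-Z_-]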

OneAgentAllowedByRobots(self: jwm.robotstxt.googlebot.RobotsMatcher, robots_txt: str, user_agent: str, url: str) → bool

Do robots check for ‘url’ when there is only one user agent. ‘url’ must be %-encoded according to RFC3986.
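
A sketch of re-using one matcher for several URLs against the same robots.txt, as noted in the class description (single-threaded use only); the rules and URLs are illustrative:

    from jwm.robotstxt.googlebot import RobotsMatcher

    robots_txt = (
        "User-agent: FooBot\n"
        "Disallow: /private/\n"
    )

    matcher = RobotsMatcher()
    for url in ("https://example.com/",
                "https://example.com/private/report.html"):
        print(url, matcher.OneAgentAllowedByRobots(robots_txt, "FooBot", url))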

ReportLineMetadata(self: jwm.robotstxt.googlebot.RobotsParseHandler, line_num: int, metadata: jwm.robotstxt.googlebot.RobotsParseHandler.LineMetaData) → None
disallow(self: jwm.robotstxt.googlebot.RobotsMatcher) → bool

Returns true if we are disallowed from crawling a matching URI.

disallow_ignore_global(self: jwm.robotstxt.googlebot.RobotsMatcher) → bool

Returns true if we are disallowed from crawling a matching URI. Ignores any rules specified for the default user agent and bases its result only on the specified user agents.

ever_seen_specific_agent(self: jwm.robotstxt.googlebot.RobotsMatcher) → bool

Returns true iff, when AllowedByRobots() was called, the robots file referred explicitly to one of the specified user agents.

matching_line(self: jwm.robotstxt.googlebot.RobotsMatcher) → int

Returns the line that matched or 0 if none matched.

seen_any_agent(self: jwm.robotstxt.googlebot.RobotsMatcher) → bool

Returns true if any user-agent was seen.
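
A sketch of inspecting match state after a check; the method names come from this page, but the values printed for this particular robots.txt are assumptions:

    from jwm.robotstxt.googlebot import RobotsMatcher

    robots_txt = (
        "User-agent: FooBot\n"
        "Disallow: /private/\n"
    )

    matcher = RobotsMatcher()
    allowed = matcher.OneAgentAllowedByRobots(robots_txt, "FooBot",
                                              "https://example.com/private/doc.html")

    print(allowed)                             # whether the URL may be fetched
    print(matcher.matching_line())             # line of the matching rule, or 0 if none matched
    print(matcher.ever_seen_specific_agent())  # True if a group addressed one of the given agents
    print(matcher.seen_any_agent())            # True if any user-agent line was seen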