Improve the regexp used in _update_content

a html tag always starts with <[a-z], < [a-z] is invalid
a space can be found after the = in href='bleh'

This function is taking 10% of the compilation time, with caching enabled,
maybe it's worth optimising the regexp a bit more, I don't know.
This commit is contained in:
jvoisin 2017-03-08 14:05:57 +01:00
commit ad38d602c7
2 changed files with 23 additions and 2 deletions

View file

@ -219,8 +219,8 @@ class Content(object):
instrasite_link_regex = self.settings['INTRASITE_LINK_REGEX']
regex = r"""
(?P<markup><\s*[^\>]* # match tag with all url-value attributes
(?:href|src|poster|data|cite|formaction|action)\s*=)
(?P<markup><[^\>]+ # match tag with all url-value attributes
(?:href|src|poster|data|cite|formaction|action)\s*=\s*)
(?P<quote>["\']) # require value to be quoted
(?P<path>{0}(?P<value>.*?)) # the url value