Some content parsers escape link paths in their html output
(i.e. docutils uses HTML escaping and markdown uses URL encoding.
Intrasite link discovery is refactored to also attempt HTML or URL
unescaped versions of the path in order to match more permissively.
This commit removes Six as a dependency for Pelican, replacing the
relevant aliases with the proper Python 3 imports. It also removes
references to Python 2 logic that did not require Six.
The refresh_metadata_intersite_links() is called again later, so this
bit was only causing everything to be processed twice. Besides
that, it makes the processing dependent on file order -- in particular,
when metadata references a file that was not parsed yet, it reported an
"Unable to find" warning. But everything is found in the second pass, so
this only causes a superflous false warning and no change to the output.
The related test now needs to call the
refresh_metadata_intersite_links() explicitly. That function is called
from Pelican.run() and all generators but the test processes just one
page so it has no chance of being called implicitly.
Related discussion: https://github.com/getpelican/pelican/pull/2288/files#r204337359
Fix intrasite links for non-'summary' metadata
Metadata like `MyArticleBanner: ` would be properly parsed (as defined in `FORMATTED_FIELDS`), but the intrasite links would not be processed.
Only the summary gets its intrasite links processed, has its value is either generated from the content (calling self._update_content at some point) or self._update_content is explicitly called if summary is passed as metadata.
This PR expands the paths as soon as possible in (`Content.__init__`) for metadata defined in `FORMATTED_FIELDS`.
Previously, with RELATIVE_URLS disabled, when both SITEURL and
STATIC_URL were absolute, the final generate data URLs looked wrong like
this (two absolute URLs joined by `/`):
http://your.site/http://static.your.site/image.png
With this patch, the data URLs are correctly:
http://static.your.site/image.png
This also applies to all *_URL configuration options (for example,
ability to have pages and articles on different domains) and behaves
like one expects even with URLs starting with just `//`, thanks to
making use of urllib.parse.urljoin().
However, when RELATIVE_URLS are enabled, urllib.parse.urljoin() doesn't
handle the relative base correctly. In that case, simple os.path.join()
is used. That, however, breaks the above case, but as RELATIVE_URLS are
meant for local development (thus no data scattered across multiple
domains), I don't see any problem.
Just to clarify, this is a fully backwards-compatible change, it only
enables new use cases that were impossible before.
* Consolidate validation of content
Previously we validated content outside of the content class via
calls to `is_valid_content` and some additional checks in page /
article generators (valid status).
This commit moves those checks all into content.valid() resulting
in a cleaner code structure.
This allows us to restructure how generators interact with content,
removing several old bugs in pelican (#1748, #1356, #2098).
- move verification function into content class
- move generator verifying content to contents class
- remove unused quote class
- remove draft class (no more rereading drafts)
- move auto draft status setter into Article.__init__
- add now parsing draft to basic test output
- remove problematic DEFAULT_STATUS setting
- add setter/getter for content.status
removes need for lower() calls when verifying status
* expand c4b184fa32
Mostly implement feedback by @iKevinY.
* rename content.valid to content.is_valid
* rename valid_* functions to has_valid_*
* update tests and function calls in code accordingly
a html tag always starts with <[a-z], < [a-z] is invalid
a space can be found after the = in href='bleh'
This function is taking 10% of the compilation time, with caching enabled,
maybe it's worth optimising the regexp a bit more, I don't know.
There was no test case checking the behaviour of what happens when
trying to {filename} an unknown file.
Also changed the braces to `'` chars that we use elseware.
Fixes two bugs that was introduced by #1743:
- First was the unnecessary exception name output when skipping
content files if a required metadata wasn't available.
It was
`ERROR: Skipping ./demo.rst: could not find information about 'NameError: date'`
and now should be
`ERROR: Skipping ./demo.rst: could not find information about 'date'`
- Second was a more serious issue. Improper string formatting in the logger
resulted in implicit decoding and would break for non-ascii error messages.
When memoizing summary a dummy get_summary function was introduced to
properly work. This has multiple problems:
* The siteurl argument is actually not used in the function, it is only
used so it properly memoizes.
* It differs from how content is accessed and memoized.
This commit brings summary inline with how content is accessed and
processed. It also removes the old get_summary function with the unused
siteurl argument.
Additionally _get_summary is marked as deprecated and calls .summary
instead while also issueing a warning.
If an unknown replacement indicator {bla} was used, it was ignored
silently. This commit adds a warning when an unmatched indicator occurs
to help identify the issue.
closes#1794
* speed up via reduced slugify calls (only call when needed)
* fix __repr__ to not contain str, should call repr on name
* add test_urlwrappers and move URLWrappers tests there
* add new equality test
* cleanup header
additionally:
* Content is now decorated with python_2_unicode_compatible
instead of treating __str__ differently
* better formatting for test_article_metadata_key_lowercase
to actually output the conflict instead of a non descriptive
error
* Fix {filename} links on Windows.
Otherwise '{filename}/foo/bar.jpg' doesn't work
* Clean up relative Posix path handling in contents.
* Use Posix paths in readers
* Environment for Popen must be strs, not unicodes.
* Ignore Git CRLF warnings.
* Replace CRLFs with LFs in inputs on Windows.
* Fix importer tests
* Fix test_contents
* Fix one last backslash in paginated output
* Skip the remaining failing locale tests on Windows.
* Document the use of forward slashes on Windows.
* Add some Fabric and ghp-import notes
Until now, making static files end up in the same output directory as an
article that links to them has been difficult, especially when the article's
output path is generated based on metadata. This changeset introduces the
{attach} link syntax, which works like the {filename} syntax, but also
overrides the static file's output path with the directory of the
linking document.
It also clarifies and expands the documentation on linking to internal content.