Improve _HTMLWordTruncator by using more than one unicode block in
_word_regex, making word count function behave properly with CJK,
Cyrillic, and more Latin characters when generating summary.
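As a rough illustration of the idea (not the actual `_word_regex`, whose exact blocks are not listed here), a pattern can combine several Unicode blocks so that each CJK ideograph counts as one word, while a run of Latin or Cyrillic letters counts as a single word. The names `WORD_REGEX` and `count_words` and the chosen ranges are assumptions made for this sketch:

```python
import re

# Hypothetical pattern, not Pelican's own: the ranges are illustrative only.
WORD_REGEX = re.compile(
    "[\u4e00-\u9fff]"                        # one CJK Unified Ideograph per match
    "|[A-Za-z\u00c0-\u024f\u0400-\u04ff]+"   # run of Latin (incl. extended) or Cyrillic
)


def count_words(text):
    """Count words, treating every CJK ideograph as its own word."""
    return len(WORD_REGEX.findall(text))


mixed = count_words("Hello мир 你好")   # "Hello", "мир", "你", "好"
latin = count_words("café au lait")     # accented letters stay inside the word
```

With a single Basic Latin block, "мир" and "你好" would not be counted at all, which is the kind of miscount the summary code has to avoid.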
Combined file and folder watchers under a class and moved
common watcher-related code from __init__.py into the class.
This simplifies the main and autoreload functions in __init__
and fixes the crashes related to multiprocessing on systems
where the default start method is "spawn" instead of "fork".
Adds a use_unicode kwarg to slugify to keep unicode
characters as is (no ASCII-fying) and adds tests for
it. Also reworks the slugification logic.
slugify started with the Django method for slugifying:
- Normalize to compatibility decomposed form (NFKD)
- Encode and decode with 'ascii'
This works fine if the decomposed form contains ASCII
characters (e.g. ç can be changed into c+CEDILLA and
ASCII would keep c only), but fails when decomposition
doesn't result in ASCII characters (e.g. Chinese). To
solve that, 'unidecode' was added, which works fine for
both cases. That made the old method redundant, but it
was kept. This commit removes the old method and
adjusts the logic slightly.
Now slugify normalizes all text with composition
mode (NFKC) to unify the format for regex substitutions.
Then, if use_unicode is False, it uses unidecode to
convert the text to ASCII.
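The limitation of the old NFKD-plus-ASCII approach can be seen with the standard library alone. This is only a sketch: `ascii_via_nfkd` is a hypothetical name, and passing `"ignore"` to `encode` is an assumption about how the non-ASCII remainder was dropped. The unidecode step is left out because it is a third-party package.

```python
import unicodedata


def ascii_via_nfkd(text):
    """Old approach: decompose, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Assumption: non-ASCII code points are silently ignored.
    return decomposed.encode("ascii", "ignore").decode("ascii")


accented = ascii_via_nfkd("ç")      # decomposes to c + combining cedilla, keeps "c"
chinese = ascii_via_nfkd("中文")    # nothing decomposes to ASCII, so nothing is left

# NFKC goes the other way and composes c + combining cedilla into one code point.
composed = unicodedata.normalize("NFKC", "c\u0327")
```

Here `chinese` ends up as an empty string, which is why a transliteration step is needed when ASCII output is wanted.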
Adds a `preserve_case` parameter to the `slugify()` function and uses it
to preserve capital letters in category names when using the Pelican
importer.
This commit removes Six as a dependency for Pelican, replacing the
relevant aliases with the proper Python 3 imports. It also removes
references to Python 2 logic that did not require Six.
Invalid references, such as those missing semicolons (e.g. `&mdash`) or
numeric references whose values cause overflows, are now handled
gracefully and no exception is thrown.
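One way to make the overflow case harmless is to catch the errors that `chr()` raises and fall back to a replacement character. This is a sketch under that assumption, not necessarily what Pelican does; `char_from_numeric_ref` is a hypothetical helper name.

```python
def char_from_numeric_ref(digits, base=10):
    """Hypothetical helper: turn the digits of a numeric reference into a character."""
    try:
        return chr(int(digits, base))
    except (ValueError, OverflowError):
        # ValueError: not a number, or outside the Unicode range.
        # OverflowError: the number is too large for chr() to handle at all.
        return "\ufffd"


dash = char_from_numeric_ref("8212")                 # valid decimal reference
too_big = char_from_numeric_ref("1" + "0" * 16)      # falls back instead of raising
```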
This commit also adds tests and comments where needed.
Also update HTML output by running (after making sure to have the fr_FR.utf8
locale installed):
```sh
LC_ALL=en_US.utf8 pelican -o pelican/tests/output/custom/ -s samples/pelican.conf.py samples/content/
LC_ALL=fr_FR.utf8 pelican -o pelican/tests/output/custom_locale/ -s samples/pelican.conf_FR.py samples/content/
LC_ALL=en_US.utf8 pelican -o pelican/tests/output/basic/ samples/content/
```
as described at
http://docs.getpelican.com/en/3.6.3/contribute.html#running-the-test-suite
* Fix {filename} links on Windows.
Otherwise '{filename}/foo/bar.jpg' doesn't work
* Clean up relative Posix path handling in contents.
* Use Posix paths in readers
* Environment for Popen must be strs, not unicodes.
* Ignore Git CRLF warnings.
* Replace CRLFs with LFs in inputs on Windows.
* Fix importer tests
* Fix test_contents
* Fix one last backslash in paginated output
* Skip the remaining failing locale tests on Windows.
* Document the use of forward slashes on Windows.
* Add some Fabric and ghp-import notes
reverts getpelican/pelican@ddcccfeaa9
If one used a locale that made use of unicode characters (like fr_FR.UTF-8),
the files on disk would be named using the correct locale while the links
would be generated using the C locale.
Uses a SafeDatetime class that works with unicode format strings
by using a custom strftime to prevent ASCII decoding errors with Python 2.
Also adds unicode decoding for the calendar module to fix period
archives.
The locale is global state, and it was not properly reset to
whatever it was before the unittest possibly changed it.
This is now fixed.
Not restoring the locale led to weird issues: depending on
the order chosen by "python -m unittest discover" to run
the unit tests, some tests would apparently randomly fail
due to the locale not being what was expected.
For example, test_period_in_timeperiod_archive would
call mock('posts/1970/ 1月/index.html',...) instead of
expected mock('posts/1970/Jan/index.html',...) and fail.
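A common way to keep locale changes from leaking between tests is to save the current value and restore it in a `finally` block. The context manager below is a minimal sketch of that pattern; `temporary_locale` is a hypothetical name, and the actual fix may instead use setUp/tearDown.

```python
import contextlib
import locale


@contextlib.contextmanager
def temporary_locale(name, category=locale.LC_ALL):
    """Switch the locale for the duration of a block, then restore it."""
    old = locale.setlocale(category)  # calling without a value only queries
    try:
        locale.setlocale(category, name)
        yield
    finally:
        locale.setlocale(category, old)


before = locale.setlocale(locale.LC_ALL)
with temporary_locale("C"):
    inside = locale.setlocale(locale.LC_ALL)
after = locale.setlocale(locale.LC_ALL)
```

Because the restore happens in `finally`, the previous locale comes back even when the test body raises.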
`copy('', 'a/b.ext0', 'c/d.ext1')` is copying `a/b.ext0` into `c/d.ext1/b.ext0`
(creating folder `c/d.ext1` in the process) instead of `c/d.ext1`.
Bug introduced by e03cf3f517.
Add a `Readers` class which contains a dict of file extensions / `Reader`
instances. This dict can be overwritten with a `READERS` setting, for instance
to avoid processing *.html files:
    READERS = {'html': None}
Or to add a custom reader for the `foo` extension:
    READERS = {'foo': FooReader}
This dict no longer stores the Reader classes, as was done before with
`EXTENSIONS`. It stores instances of the Reader classes, to avoid
instantiating a reader for each file that is read.
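The extension-to-instance mapping could look roughly like the sketch below. Everything except the `READERS` setting name and the `Readers`/`FooReader` names is an assumption made for illustration; the reader classes here are stubs and the real readers do much more.

```python
class BaseReader:
    """Hypothetical minimal reader; stands in for the real reader classes."""
    file_extensions = []

    def __init__(self, settings):
        self.settings = settings


class MarkdownReader(BaseReader):
    file_extensions = ["md", "markdown"]


class HTMLReader(BaseReader):
    file_extensions = ["html", "htm"]


class FooReader(BaseReader):
    file_extensions = ["foo"]


class Readers:
    """Maps file extensions to reader instances that are created once."""

    def __init__(self, settings):
        reader_classes = {}
        for cls in (MarkdownReader, HTMLReader):
            for ext in cls.file_extensions:
                reader_classes[ext] = cls
        # READERS overrides the defaults; a value of None disables an extension.
        reader_classes.update(settings.get("READERS", {}))
        self.readers = {
            ext: cls(settings)
            for ext, cls in reader_classes.items()
            if cls is not None
        }


readers = Readers({"READERS": {"html": None, "foo": FooReader}})
extensions = sorted(readers.readers)
```

In this sketch only the `html` key is disabled, so `htm` files would still be handled unless that extension is also set to `None`.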
The `slugify()` function used by Pelican is in general very good at
coming up with something both readable and URL-safe. However, there are
a few specific cases where it causes conflicts. One that I've run into
is using the strings `C++` and `C` as tags, both of which transform to
the slug `c`. This commit adds an optional `SLUG_SUBSTITUTIONS` setting
which is a list of 2-tuples of substitutions to be carried out
case-insensitively just prior to stripping out non-alphanumeric
characters. This allows cases like `C++` to be transformed to `CPP` or
similar. This can also improve the readability of slugs.
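A simplified sketch of how such substitutions might be applied before the non-alphanumeric characters are stripped. This is not Pelican's actual `slugify()`; the signature and the regexes are assumptions chosen to keep the example short.

```python
import re


def slugify(value, substitutions=()):
    """Simplified slugify: apply substitutions, strip, lowercase, hyphenate."""
    for src, dst in substitutions:
        # Case-insensitive, literal replacement (no regex syntax in src or dst).
        value = re.sub(
            re.escape(src), lambda match, dst=dst: dst, value, flags=re.IGNORECASE
        )
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    return re.sub(r"[-\s]+", "-", value)


SLUG_SUBSTITUTIONS = [("C++", "CPP")]

plain = slugify("C++")                                # collides with the slug for "C"
substituted = slugify("C++", SLUG_SUBSTITUTIONS)
phrase = slugify("c++ tips", SLUG_SUBSTITUTIONS)
```

Without the substitution, `C++` and `C` both become `c`; with it, `C++` becomes `cpp`.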
If DELETE_OUTPUT_DIRECTORY is set to True, all files and directories are
deleted from the output directory. There are, however, several reasons
one might want to retain certain files/directories and avoid their
deletion from the output directory. One such use case is version control
system data: a versioned output directory can facilitate deployment via
Heroku and/or allow the user to easily revert to a prior version of the
site without having to rely on regeneration via Pelican.
This change introduces the OUTPUT_RETENTION setting, a tuple of
filenames that will be preserved when the clean_output_dir function in
pelican.utils is run. Setting OUTPUT_RETENTION = (".hg", ".git") would,
for example, prevent the relevant VCS data from being deleted when the
output directory is cleaned.
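The retention check can be sketched as follows. This is a simplified stand-in for `clean_output_dir` in pelican.utils, with an assumed signature, exercised against a temporary directory so it is safe to run.

```python
import os
import shutil
import tempfile


def clean_output_dir(path, retention=()):
    """Delete everything directly under path except names listed in retention."""
    for name in os.listdir(path):
        if name in retention:
            continue
        target = os.path.join(path, name)
        if os.path.isdir(target) and not os.path.islink(target):
            shutil.rmtree(target)
        else:
            os.remove(target)


OUTPUT_RETENTION = (".hg", ".git")

with tempfile.TemporaryDirectory() as output:
    os.mkdir(os.path.join(output, ".git"))
    os.mkdir(os.path.join(output, "theme"))
    with open(os.path.join(output, "index.html"), "w") as fh:
        fh.write("<html></html>")
    clean_output_dir(output, retention=OUTPUT_RETENTION)
    remaining = os.listdir(output)
```

Only the top-level names are compared here, so a `.git` directory nested deeper in the output tree would still be removed along with its parent.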
We'll get better failure messages if we use an assertion method that
understands the comparison we're trying to make. If you make the
comparison by hand and assertTrue(), you don't get much constructive
feedback ;).
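The difference in feedback is easy to see by capturing the failure messages. In this small demonstration, `Demo` and `failure_message` are throwaway names used only to trigger the assertions outside a test run.

```python
import unittest


class Demo(unittest.TestCase):
    def runTest(self):
        pass


def failure_message(assertion, *args):
    """Run an assertion method and return its failure message, if any."""
    try:
        assertion(*args)
    except AssertionError as exc:
        return str(exc)
    return None


case = Demo()
vague = failure_message(case.assertTrue, 1 in [2, 3])
detailed = failure_message(case.assertIn, 1, [2, 3])
```

`vague` only reports that `False is not true`, while `detailed` names both the missing member and the container that was searched.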
Support the forms listed by the W3C [1]. I also removed the
'%Y-%d-%m' form, which can be confused with the '%Y-%m-%d' ISO form.
The new ISO forms can use 'Z' to designate UTC or '[+-]HHMM' to
specify offsets from UTC. Other time zone designators are not
supported.
The '%z' directive has only been supported since Python 3.2 [2], so if
you're running Pelican on Python 2.7, you're stuck with 'Z' for UTC.
Conveniently, we get ValueErrors for both invalid directives and
data/format mismatches, so we don't need special handling for the 2.7
case inside get_date().
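The try-each-format approach can be sketched like this. The format list is an illustrative subset, not the list used by the real get_date(), and the example targets Python 3, where '%z' is available.

```python
from datetime import datetime, timedelta

# Illustrative subset of ISO forms; the real list may differ.
ISO_FORMATS = [
    "%Y-%m-%d",
    "%Y-%m-%dT%H:%MZ",
    "%Y-%m-%dT%H:%M%z",
    "%Y-%m-%dT%H:%M:%SZ",
    "%Y-%m-%dT%H:%M:%S%z",
]


def get_date(string):
    """Return a datetime for the first format that matches the string."""
    for fmt in ISO_FORMATS:
        try:
            return datetime.strptime(string, fmt)
        except ValueError:
            # Raised for unsupported directives and for non-matching data alike.
            continue
    raise ValueError("{0!r} is not a valid date".format(string))


day = get_date("1997-07-16")
offset = get_date("1997-07-16T19:20+0100").utcoffset()
is_plus_one = offset == timedelta(hours=1)
```

In this sketch a trailing 'Z' is matched as a literal character, so those results are naive datetimes. Attaching an explicit UTC tzinfo would be a separate step.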
[1]: http://www.w3.org/TR/NOTE-datetime
[2]: http://bugs.python.org/issue6641