Document encoding behaviour.

author Jelmer Vernooĳ <jelmer@jelmer.uk>

Mon, 24 Oct 2016 23:10:42 +0000 (23:10 +0000)

committer Jelmer Vernooĳ <jelmer@jelmer.uk>

Mon, 24 Oct 2016 23:10:42 +0000 (23:10 +0000)
diff --git a/docs/tutorial/encoding.txt b/docs/tutorial/encoding.txt
new file mode 100644
index 0000000..0dd0d7e
--- /dev/null
+++ b/docs/tutorial/encoding.txt
@@ -0,0 +1,26 @@
+Encoding
+========
+
+You will notice that all lower-level functions in Dulwich take byte strings
+rather than unicode strings. This is intentional.
+
+Although `C git`_ recommends the use of UTF-8 for encoding, this is not
+strictly enforced and C git treats filenames as sequences of non-NUL bytes.
+There are repositories in the wild that use non-UTF-8 encoding for filenames
+and commit messages.
+
+.. _C git: https://github.com/git/git/blob/master/Documentation/i18n.txt
+
+The library should be able to read *all* existing git repositories,
+regardless of what encoding they use. This is the main reason why Dulwich
+does not convert paths to unicode strings.
+
+A further consideration is that converting back and forth to unicode
+incurs an extra performance penalty. For example, if you are just iterating
+over file contents, there is no need to decode any strings. Users of the
+library may have specific assumptions they can make about the encoding; for
+example, they could just decide that all their data is latin-1, or the
+default Python encoding.
+
+Higher-level functions, such as the porcelain in dulwich.porcelain, will
+automatically convert unicode strings to UTF-8 byte strings.
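
The behaviour described in the new tutorial page can be illustrated with a minimal, standard-library-only sketch. It does not call Dulwich. The filename and the helper `to_utf8_bytes` are hypothetical, chosen only to show why a byte-string API can represent names that a UTF-8-only API could not, and roughly what a higher-level wrapper might do before calling lower-level code.

```python
# Hypothetical illustration; none of these names are part of Dulwich.

# A filename as a client using Latin-1 might have stored it in a tree.
raw_name = "caf\u00e9.txt".encode("latin-1")  # the bytes b'caf\xe9.txt'

# These bytes are not valid UTF-8, so an unconditional decode would fail.
try:
    raw_name.decode("utf-8")
    decodable_as_utf8 = True
except UnicodeDecodeError:
    decodable_as_utf8 = False

# Kept as bytes, the name needs no decoding at all. A caller who knows
# (or assumes) the repository's encoding can decode explicitly.
display_name = raw_name.decode("latin-1")


def to_utf8_bytes(value):
    """Hypothetical helper: accept str or bytes, return bytes.

    Text is encoded as UTF-8; bytes are passed through untouched.
    """
    if isinstance(value, str):
        return value.encode("utf-8")
    return value
```

Passing bytes through unchanged is the important design choice here: it lets callers who already hold raw repository data avoid any lossy round trip through text.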
diff --git a/docs/tutorial/index.txt b/docs/tutorial/index.txt
index 7d085a1f27c3776cea1ca27ee009618120443c2f..5a249de334361776f1a0ad172613e29efc986c25 100644
--- a/docs/tutorial/index.txt
+++ b/docs/tutorial/index.txt
@@ -8,6 +8,7 @@ Tutorial
     :maxdepth: 2
  
     introduction
+   encoding 
     file-format
     repo
     object-store