A portable book metadata exercise
#################################

:date: 2022-10-01 12:40
:modified: 2022-10-01 12:40
:category: Archiving
:author: Louis Holbrook
:tags: hash,kitab,literature,metadata,dublincore,libgen
:slug: portable-book-metadata
:summary: Structured approach to generate portable metadata files for bibliographies and literature files using cryptographic hash mapping.
:lang: en
:status: published

One of the things I have been working on over the last few weeks is a Rust application I have dubbed `kitab <https://git.defalsify.net/kitab>`_ [1]_.

In short, the application makes it easy to extract literary metadata to a separate file structure.

The metadata can in turn be applied as *extended attributes* recursively on a directory for files that match.

The way it's accomplished is simple: The file name of the metadata is the hex representation of the digest of the file. The same digest is used to match files to metadata when applying it back to the file.

There are two advantages to this:

1. The digest of the media file is not affected by the metadata, as it would be if the metadata were embedded in the file itself.

2. You do not need to use the file name to keep record of what a file is.


Yarr, ye matey-data
===================

Let's demonstrate with an example.

The fabulous `Library Genesis <https://libgen.rs>`_ project has made available an endpoint to retrieve :literal:`bibtex` entries based on the :literal:`md5` hash of the book media file.

A version of the `Bitcoin White Paper <https://libgen.rs/book/index.php?md5=BCD99F1AB4155F2A2A362E5B7938A852>`_, under the :code:`md5` hash :code:`bcd99f1ab4155f2a2a362e5b7938a852`, can be found there.

If you download this file using a synchronous download link, the browser will provide you with a filename to go with the download.
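
Since the hash is what identifies the file, it can be checked locally no matter what the file is called. A minimal sketch, assuming GNU coreutils' :code:`md5sum` is installed; the helper names :code:`hex_md5` and :code:`has_md5` are hypothetical and not part of kitab:

.. code:: bash

    # Sketch only: hex_md5 and has_md5 are hypothetical helpers, not kitab commands.

    # Print the lowercase hex md5 digest of the file given as $1.
    hex_md5() {
        md5sum "$1" | cut -d ' ' -f 1
    }

    # Succeed if the file $1 has the md5 digest $2 (upper or lower case hex).
    has_md5() {
        local want
        want=$(printf '%s' "$2" | tr 'A-F' 'a-f')
        [ "$(hex_md5 "$1")" = "$want" ]
    }

For example, :code:`has_md5 ~/Downloads/paper.epub BCD99F1AB4155F2A2A362E5B7938A852 && echo match` (the path is hypothetical) confirms that a download is the file listed under that hash, whatever name the browser gave it.
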

However, if you use the torrent alternative, the filename will be the :literal:`md5` hash itself. If you are torrenting a bunch of those files, it quickly becomes a nuisance to distinguish them.

And, of course: In either case there is no guarantee that any metadata comes with the file.


Inside the book
---------------

Kitab (v0.0.2) is able to read metadata from both a bibtex source and xattr entries on a file, as well as its native `rdf-turtle <https://www.w3.org/TR/turtle/>`_ format.

In kitab's data store, every media file entity in rdf-turtle is keyed with a `URN <https://www.rfc-editor.org/info/rfc8141>`_ specifying a digest for the file.

To see exactly what that looks like, let's download and import the bibtex metadata for the paper [2]_:

.. code:: bash

    bibtex_file=`mktemp`
    kitab_dir=`mktemp -d`
    # quote the URL so the shell does not treat the "?" as a glob character
    curl -s -X GET "https://libgen.rs/book/bibtex.php?md5=BCD99F1AB4155F2A2A362E5B7938A852" -o "$bibtex_file"
    kitab --store "$kitab_dir" import --digest md5:BCD99F1AB4155F2A2A362E5B7938A852 "$bibtex_file"
    cat "$kitab_dir"/*

The output of the above should be:

.. code:: turtle

    <URN:md5:bcd99f1ab4155f2a2a362e5b7938a852> <https://purl.org/dc/terms/title> "Bitcoin: A Peer-to-Peer Electronic Cash System" ;
        <https://purl.org/dc/terms/creator> "Satoshi Nakamoto" ;
        <https://purl.org/dc/terms/type> "book" .


Now let's say the media file itself has been downloaded to :literal:`~/.local/share/transmission`. We can apply this metadata as extended attributes.

This time we turn on logging to see what's going on:

.. code:: console

    $ RUST_LOG=info kitab --store $kitab_dir apply --digest md5 ~/.local/share/transmission
    [2022-10-01T11:14:59Z INFO kitab] have index directory "/tmp/tmp.r0jBm6q4hW"
    [2022-10-01T11:14:59Z INFO kitab] using digest type md5
    [2022-10-01T11:14:59Z INFO kitab] apply from path "/home/lash/.local/share/transmission/"
    [2022-10-01T11:14:59Z INFO kitab] apply DirEntry("/home/lash/.local/share/transmission/bcd99f1ab4155f2a2a362e5b7938a852") -> title "Bitcoin: A Peer-to-Peer Electronic Cash System" author "Satoshi Nakamoto" digest md5:bcd99f1ab4155f2a2a362e5b7938a852

    $ find ~/.local/share/transmission -type f -regextype sed -regex ".*/[a-f0-9]\{32\}$" -exec getfattr -d {} \;
    # file: .local/share/transmission/bcd99f1ab4155f2a2a362e5b7938a852
    user.dcterms:creator="Satoshi Nakamoto"
    user.dcterms:title="Bitcoin: A Peer-to-Peer Electronic Cash System"
    user.dcterms:type="book"


Let the right one in
--------------------

Conversely, the metadata can be re-imported directly from the extended attributes. And this time, let's store it both under the :literal:`md5` and the :literal:`sha512` hash:

.. code:: bash

    $ kitab_dir_new=`mktemp -d`
    $ kitab --store $kitab_dir_new import --digest md5 --digest sha512 .local/share/transmission/bcd99f1ab4155f2a2a362e5b7938a852
    $ find $kitab_dir_new -type f -exec cat {} \;
    /tmp/tmp.B6j41YMmEM/493f2a720d63156d77187bcd5f0715e4e765a38d616ef47f24e0df817ee6b4f601d47a06ffae10ef1f6ba60bb5d2e99a26318f035f9cd56e30bfe7bcdf64a792
    <URN:sha512:493f2a720d63156d77187bcd5f0715e4e765a38d616ef47f24e0df817ee6b4f601d47a06ffae10ef1f6ba60bb5d2e99a26318f035f9cd56e30bfe7bcdf64a792> <https://purl.org/dc/terms/title> "Bitcoin: A Peer-to-Peer Electronic Cash System" ;
        <https://purl.org/dc/terms/creator> "Satoshi Nakamoto" ;
        <https://purl.org/dc/terms/type> "book" ;
        <https://purl.org/dc/terms/MediaType> "application/epub+zip" .
    /tmp/tmp.B6j41YMmEM/bcd99f1ab4155f2a2a362e5b7938a852
    <URN:md5:bcd99f1ab4155f2a2a362e5b7938a852> <https://purl.org/dc/terms/title> "Bitcoin: A Peer-to-Peer Electronic Cash System" ;
        <https://purl.org/dc/terms/creator> "Satoshi Nakamoto" ;
        <https://purl.org/dc/terms/type> "book" ;
        <https://purl.org/dc/terms/MediaType> "application/epub+zip" .


Level up
========

Finally, an example bash script [3]_ that lets you retrieve and apply metadata for a batch of files found in the directory given as the *first positional arg*.

This script even renames the files according to the metadata applied.

.. include:: code/portable-book-metadata/batch.sh
    :code: bash
    :number-lines: 0

This last example will result in:

- A media file named :literal:`$outdir/Bitcoin: A Peer-to-Peer Electronic Cash System.epub`
- ... with metadata applied as extended attributes
- An rdf-turtle metadata entry in :literal:`~/.local/share/kitab/idx/bcd99f1ab4155f2a2a362e5b7938a852`

..

.. [1] The relevant documentation for :literal:`kitab` at the time of writing is `here <https://defalsify.org/doc/crates/kitab/0.0.2/kitab/>`_. To build kitab, simply *clone* the repository and build with :code:`cargo build --all-features`.

..

.. [2] The :code:`kitab` command in the script assumes you have built the *kitab binary* and made it available in your path.

..

.. [3] The script uses :code:`xmllint`, which on Arch Linux is provided by the :literal:`libxml2` package.