A portable book metadata exercise
#################################

:date: 2022-10-01 12:40
:modified: 2022-10-01 12:40
:category: Archiving
:author: Louis Holbrook
:tags: hash,kitab,literature,metadata,dublincore,libgen
:slug: portable-book-metadata
:summary: Structured approach to generate portable metadata files for bibliographies and literature files using cryptographic hash mapping.
:lang: en
:status: published

One of the things I have been working on over the last few weeks is a Rust application I have dubbed `kitab <https://git.defalsify.net/kitab>`_ [1]_.

In short, the application makes it easy to extract literary metadata to a separate file structure.

The metadata can in turn be applied as *extended attributes* recursively on a directory for files that match.

The way it's accomplished is simple: the file name of each metadata file is the hex representation of the digest of the media file it describes. The same digest is used to match files to metadata when applying it back to the files.

There are two advantages to this:

1. The digest of the media file is not affected by the metadata, as it would be if the metadata were embedded in the file itself.

2. You do not need to use the file name to keep a record of what a file is.

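The mapping from a media file to its metadata file can be sketched in a few lines of bash. This is only an illustration using :code:`md5sum` from coreutils, not kitab's actual implementation; the function name :code:`metadata_path` and the store path :literal:`/tmp/store` are made up for the example.

.. code:: bash

        # Print the path of the metadata file for a media file:
        # <store>/<lowercase hex md5 digest of the file contents>
        metadata_path() {
                local store=$1
                local file=$2
                local digest
                digest=$(md5sum "$file" | cut -d ' ' -f 1)
                printf '%s/%s\n' "$store" "$digest"
        }

        # Renaming the media file does not change where its metadata lives.
        workdir=$(mktemp -d)
        printf 'hello\n' > "$workdir/book.txt"
        before=$(metadata_path /tmp/store "$workdir/book.txt")
        mv "$workdir/book.txt" "$workdir/renamed.txt"
        after=$(metadata_path /tmp/store "$workdir/renamed.txt")
        echo "$before"
        [ "$before" = "$after" ] && echo "same metadata path after rename"

Since the path depends only on the file contents, it is the same before and after the rename.
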

Yarr, ye matey-data
===================

Let's demonstrate with an example.

The fabulous `Library Genesis <https://libgen.rs>`_ project has made available an endpoint to retrieve :literal:`bibtex` entries based on the :literal:`md5` hash of the book media file.

A version of the `Bitcoin White Paper <https://libgen.rs/book/index.php?md5=BCD99F1AB4155F2A2A362E5B7938A852>`_, under the :code:`md5` hash :code:`bcd99f1ab4155f2a2a362e5b7938a852`, can be found there.

If you download this file using a synchronous download link, the browser will provide you with a filename to go with the download.

However, if you use the torrent alternative, the filename will be the :literal:`md5` hash itself. If you are torrenting a bunch of those files, it quickly becomes a nuisance to distinguish them.

And, of course: in either case there is no guarantee that any metadata comes with the file.

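For files that arrive named by their hash, a quick sanity check is to confirm that the name really is the :literal:`md5` digest of the contents. The following is a minimal sketch of that check; the function name :code:`name_matches_md5` and the temporary download directory are hypothetical and stand in for wherever your torrent client stores its files.

.. code:: bash

        # Succeed if the file's base name equals the md5 digest of its contents.
        name_matches_md5() {
                local file=$1
                local digest
                digest=$(md5sum "$file" | cut -d ' ' -f 1)
                [ "$(basename "$file")" = "$digest" ]
        }

        # Hypothetical download directory with one hash-named file and one other file.
        downloads=$(mktemp -d)
        printf 'hello\n' > "$downloads/b1946ac92492d2347c6235b4d2611184"
        printf 'hello\n' > "$downloads/notes.txt"

        for f in "$downloads"/*; do
                if name_matches_md5 "$f"; then
                        echo "hash-named and intact: $f"
                else
                        echo "name is not the md5 digest: $f"
                fi
        done
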

Inside the book
---------------

Kitab (v0.0.2) is able to read metadata from both a bibtex source and xattr entries on a file, as well as its native `rdf-turtle <https://www.w3.org/TR/turtle/>`_ format.

In kitab's data store, every media file entity in rdf-turtle is keyed with a `URN <https://www.rfc-editor.org/info/rfc8141>`_ specifying a digest for the file.

To see exactly what that looks like, let's download and import the bibtex metadata for the paper [2]_:

.. code:: bash

        bibtex_file=`mktemp`
        kitab_dir=`mktemp -d`
        # quote the URL so the shell does not treat "?" as a glob character
        curl -s -X GET "https://libgen.rs/book/bibtex.php?md5=BCD99F1AB4155F2A2A362E5B7938A852" -o "$bibtex_file"
        kitab --store "$kitab_dir" import --digest md5:BCD99F1AB4155F2A2A362E5B7938A852 "$bibtex_file"
        cat "$kitab_dir"/*

The output of the above should be:

.. code:: turtle

        <URN:md5:bcd99f1ab4155f2a2a362e5b7938a852> <https://purl.org/dc/terms/title> "Bitcoin: A Peer-to-Peer Electronic Cash System" ;
        <https://purl.org/dc/terms/creator> "Satoshi Nakamoto" ;
        <https://purl.org/dc/terms/type> "book" .

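The subject key in that output has the shape :literal:`URN:<digest type>:<hex digest>`, with the digest in lowercase even though it was passed in uppercase on the command line. A small sketch of building such a key by hand follows; the helper :code:`digest_urn` is a hypothetical name, and the lowercasing simply mirrors the output shown here rather than any documented kitab behavior.

.. code:: bash

        # Build a key of the form URN:<digest type>:<lowercase hex digest>.
        digest_urn() {
                local digest_type=$1
                local hex
                hex=$(printf '%s' "$2" | tr 'A-F' 'a-f')
                printf 'URN:%s:%s\n' "$digest_type" "$hex"
        }

        digest_urn md5 BCD99F1AB4155F2A2A362E5B7938A852
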

Now let's say the media file itself has been downloaded to :literal:`~/.local/share/transmission`. We can apply this metadata as extended attributes.

This time we turn on logging to see what's going on:

.. code:: console

        $ RUST_LOG=info kitab --store $kitab_dir apply --digest md5 ~/.local/share/transmission
        [2022-10-01T11:14:59Z INFO  kitab] have index directory "/tmp/tmp.r0jBm6q4hW"
        [2022-10-01T11:14:59Z INFO  kitab] using digest type md5
        [2022-10-01T11:14:59Z INFO  kitab] apply from path "/home/lash/.local/share/transmission/"
        [2022-10-01T11:14:59Z INFO  kitab] apply DirEntry("/home/lash/.local/share/transmission/bcd99f1ab4155f2a2a362e5b7938a852") -> title "Bitcoin: A Peer-to-Peer Electronic Cash System" author "Satoshi Nakamoto" digest md5:bcd99f1ab4155f2a2a362e5b7938a852

        $ find ~/.local/share/transmission -type f -regextype sed -regex ".*/[a-f0-9]\{32\}$" -exec getfattr -d {} \;
        # file: .local/share/transmission/bcd99f1ab4155f2a2a362e5b7938a852
        user.dcterms:creator="Satoshi Nakamoto"
        user.dcterms:title="Bitcoin: A Peer-to-Peer Electronic Cash System"
        user.dcterms:type="book"

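The matching step of the apply run above can be approximated as follows: walk the directory, hash every file, and look for a store entry with that name. This is a rough sketch and not kitab's code. The function :code:`match_metadata` and the temporary directories are assumptions for the example, and it only reports matches instead of writing extended attributes.

.. code:: bash

        # For every regular file under a directory, look for a metadata entry
        # named after the file's md5 digest and report the matches.
        match_metadata() {
                local store=$1
                local dir=$2
                local file digest
                find "$dir" -type f | while IFS= read -r file; do
                        digest=$(md5sum "$file" | cut -d ' ' -f 1)
                        if [ -f "$store/$digest" ]; then
                                printf '%s -> %s\n' "$file" "$store/$digest"
                        fi
                done
        }

        # Hypothetical store with one entry, and a media directory with two files.
        store=$(mktemp -d)
        media=$(mktemp -d)
        printf 'hello\n' > "$media/some-book"
        printf 'other\n' > "$media/unknown-file"
        printf '# metadata for some-book\n' > "$store/b1946ac92492d2347c6235b4d2611184"

        match_metadata "$store" "$media"

Only :literal:`some-book` is reported, because it is the only file whose digest has an entry in the store.
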

Let the right one in
--------------------

Conversely, the metadata can be re-imported directly from the extended attributes. This time, let's store it under both the :literal:`md5` and the :literal:`sha512` hash:

.. code:: console

        $ kitab_dir_new=`mktemp -d`
        $ kitab --store $kitab_dir_new import --digest md5 --digest sha512 .local/share/transmission/bcd99f1ab4155f2a2a362e5b7938a852
        $ find $kitab_dir_new -type f -exec cat {} \;
        /tmp/tmp.B6j41YMmEM/493f2a720d63156d77187bcd5f0715e4e765a38d616ef47f24e0df817ee6b4f601d47a06ffae10ef1f6ba60bb5d2e99a26318f035f9cd56e30bfe7bcdf64a792
        <URN:sha512:493f2a720d63156d77187bcd5f0715e4e765a38d616ef47f24e0df817ee6b4f601d47a06ffae10ef1f6ba60bb5d2e99a26318f035f9cd56e30bfe7bcdf64a792> <https://purl.org/dc/terms/title> "Bitcoin: A Peer-to-Peer Electronic Cash System" ;
                <https://purl.org/dc/terms/creator> "Satoshi Nakamoto" ;
                <https://purl.org/dc/terms/type> "book" ;
                <https://purl.org/dc/terms/MediaType> "application/epub+zip" .
        /tmp/tmp.B6j41YMmEM/bcd99f1ab4155f2a2a362e5b7938a852
        <URN:md5:bcd99f1ab4155f2a2a362e5b7938a852> <https://purl.org/dc/terms/title> "Bitcoin: A Peer-to-Peer Electronic Cash System" ;
                <https://purl.org/dc/terms/creator> "Satoshi Nakamoto" ;
                <https://purl.org/dc/terms/type> "book" ;
                <https://purl.org/dc/terms/MediaType> "application/epub+zip" .

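To make the two-digest layout concrete, here is a small sketch that writes one entry per digest type, each named after its own digest and keyed with the matching URN. It uses :code:`md5sum` and :code:`sha512sum` from coreutils. The helper :code:`write_entry`, the example title and the temporary paths are all assumptions, and the entry is reduced to a single title line.

.. code:: bash

        # Write a one-line turtle entry to <store>/<hex digest>.
        # Arguments: store directory, digest type, hex digest, title.
        write_entry() {
                printf '<URN:%s:%s> <https://purl.org/dc/terms/title> "%s" .\n' \
                        "$2" "$3" "$4" > "$1/$3"
        }

        store=$(mktemp -d)
        media=$(mktemp)
        printf 'hello\n' > "$media"

        # One entry per digest tool; "md5sum" becomes digest type "md5", and so on.
        for tool in md5sum sha512sum; do
                digest_type=${tool%sum}
                hex=$("$tool" "$media" | cut -d ' ' -f 1)
                write_entry "$store" "$digest_type" "$hex" "Example title"
        done

        ls "$store"

The store ends up with two files for the same media file: one with a 32-character name and one with a 128-character name.
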

Level up
========

Finally, here is an example bash script [3]_ that retrieves and applies metadata for a batch of files found in the directory given as the *first positional argument*.

This script even renames the files according to the metadata applied.

.. include:: code/portable-book-metadata/batch.sh
   :code: bash
   :number-lines: 0

This last example will result in:

- A media file named :literal:`$outdir/Bitcoin: A Peer-to-Peer Electronic Cash System.epub`
- ... with metadata applied as extended attributes
- An rdf-turtle metadata entry in :literal:`~/.local/share/kitab/idx/bcd99f1ab4155f2a2a362e5b7938a852`

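The renaming idea can be illustrated on its own: pull the title out of an rdf-turtle entry and use it to build the target file name. This sketch is not taken from the batch script. The helper :code:`entry_title`, the :code:`sed` expression, the temporary :code:`outdir` and the hard-coded :literal:`.epub` extension are assumptions chosen to reproduce the file name listed above.

.. code:: bash

        # Print the dcterms title found in an rdf-turtle metadata entry.
        entry_title() {
                sed -n 's|.*<https://purl.org/dc/terms/title> "\(.*\)".*|\1|p' "$1"
        }

        # Recreate the entry shown earlier in a temporary file.
        entry=$(mktemp)
        printf '%s\n' \
                '<URN:md5:bcd99f1ab4155f2a2a362e5b7938a852> <https://purl.org/dc/terms/title> "Bitcoin: A Peer-to-Peer Electronic Cash System" ;' \
                '<https://purl.org/dc/terms/creator> "Satoshi Nakamoto" ;' \
                '<https://purl.org/dc/terms/type> "book" .' > "$entry"

        title=$(entry_title "$entry")
        outdir=$(mktemp -d)   # hypothetical output directory
        target="$outdir/$title.epub"
        echo "$target"
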
..

        .. [1] The relevant documentation for :literal:`kitab` at the time of writing is `here <https://defalsify.org/doc/crates/kitab/0.0.2/kitab/>`_. To build kitab, simply *clone* the repository and build with :code:`cargo build --all-features`.

..

        .. [2] The :code:`kitab` commands in these examples assume you have built the *kitab binary* and made it available in your path.

..

        .. [3] The script uses :code:`xmllint`, which on Arch Linux is provided by the :literal:`libxml2` package.