manbytesgnu_site

Source files for manbytesgnu.org
git clone git://holbrook.no/manbytesgnu_site.git

commit fc4bee71d694b895f44671c6cabe88690f68d058
parent 750c260f8d01d499439d8b20191b4ded098ee8b8
Author: lash <dev@holbrook.no>
Date:   Sat,  1 Oct 2022 13:16:21 +0000

New post finall

Diffstat:
A content/20221001_kitab_libgen.rst            | 135 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
A content/code/portable-book-metadata/batch.sh |  56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 191 insertions(+), 0 deletions(-)

diff --git a/content/20221001_kitab_libgen.rst b/content/20221001_kitab_libgen.rst
@@ -0,0 +1,135 @@
+A portable book metadata exercise
+#################################
+
+:date: 2022-10-01 12:40
+:modified: 2022-10-01 12:40
+:category: Archiving
+:author: Louis Holbrook
+:tags: hash,kitab,literature,metadata,dublincore,libgen
+:slug: portable-book-metadata
+:summary: Structured approach to generate portable metadata files for bibliographies and literature files using cryptographic hash mapping.
+:lang: en
+:status: published
+
+One of the things I have been working on over the last few weeks is a Rust application I have dubbed `kitab <https://git.defalsify.net/kitab>`_.
+
+In short, the application makes it easy to extract literary metadata to a separate file structure.
+
+The metadata can in turn be applied as *extended attributes* recursively on a directory for files that match.
+
+The way it's accomplished is simple: the file name of the metadata file is the hex representation of the digest of the media file. The same digest is used to match files to metadata when applying it back to the file.
+
+There are two advantages to this:
+
+1. The digest of the media file is not affected by the metadata, as it would be if the metadata were embedded in the file itself.
+
+2. You do not need to use the file name to keep record of what a file is.
+
+
+Yarr, ye matey-data
+===================
+
+Let's demonstrate with an example.
+
+The fabulous `Library Genesis <https://libgen.rs>`_ project has made available an endpoint to retrieve :literal:`bibtex` entries based on the :literal:`md5` hash of the book media file.
+
+A version of the `Bitcoin White Paper <https://libgen.rs/book/index.php?md5=BCD99F1AB4155F2A2A362E5B7938A852>`_, under the :code:`md5` hash :code:`bcd99f1ab4155f2a2a362e5b7938a852`, can be found there.
+
+If you download this file using a synchronous download link, the browser will provide you with a filename to go with the download.
+
+However, if you use the torrent alternative, the filename will be the :literal:`md5` hash itself. If you are torrenting a bunch of those files, it quickly becomes a nuisance to distinguish them.
+
+And, of course, in either case there is no guarantee that any metadata comes with the file.
+
+
+Inside the book
+---------------
+
+Kitab (v0.0.2) is able to read metadata from both a bibtex source and xattr entries on a file, as well as its native `rdf-turtle <https://www.w3.org/TR/turtle/>`_ format.
+
+In kitab's data store, every media file entity in rdf-turtle is keyed with a `URN <https://www.rfc-editor.org/info/rfc8141>`_ specifying a digest for the file.
+
+To see exactly what that looks like, let's download and import the bibtex metadata for the paper [1]_:
+
+.. code:: bash
+
+    bibtex_file=`mktemp`
+    kitab_dir=`mktemp -d`
+    curl -s -X GET https://libgen.rs/book/bibtex.php?md5=BCD99F1AB4155F2A2A362E5B7938A852 -o $bibtex_file
+    kitab --store $kitab_dir import --digest md5:BCD99F1AB4155F2A2A362E5B7938A852 $bibtex_file
+    cat $kitab_dir/*
+
+The output of the above should be:
+
+.. code:: turtle
+
+    <URN:md5:bcd99f1ab4155f2a2a362e5b7938a852> <https://purl.org/dc/terms/title> "Bitcoin: A Peer-to-Peer Electronic Cash System" ;
+        <https://purl.org/dc/terms/creator> "Satoshi Nakamoto" ;
+        <https://purl.org/dc/terms/type> "book" .
+
+
+Now let's say the media file itself has been downloaded to :literal:`~/.local/share/transmission`. We can apply this metadata as extended attributes.
+
+This time we turn on logging to see what's going on:
+
+.. code:: console
+
+    $ RUST_LOG=info kitab --store $kitab_dir apply --digest md5 ~/.local/share/transmission
+    [2022-10-01T11:14:59Z INFO kitab] have index directory "/tmp/tmp.r0jBm6q4hW"
+    [2022-10-01T11:14:59Z INFO kitab] using digest type md5
+    [2022-10-01T11:14:59Z INFO kitab] apply from path "/home/lash/.local/share/transmission/"
+    [2022-10-01T11:14:59Z INFO kitab] apply DirEntry("/home/lash/.local/share/transmission/bcd99f1ab4155f2a2a362e5b7938a852") -> title "Bitcoin: A Peer-to-Peer Electronic Cash System" author "Satoshi Nakamoto" digest md5:bcd99f1ab4155f2a2a362e5b7938a852
+
+    $ find ~/.local/share/transmission -type f -regextype sed -regex ".*/[a-f0-9]\{32\}$" -exec getfattr -d {} \;
+    # file: .local/share/transmission/bcd99f1ab4155f2a2a362e5b7938a852
+    user.dcterms:creator="Satoshi Nakamoto"
+    user.dcterms:title="Bitcoin: A Peer-to-Peer Electronic Cash System"
+    user.dcterms:type="book"
+
+
+Let the right one in
+--------------------
+
+Conversely, the metadata can be re-imported directly from the extended attributes. This time, let's store it under both the :literal:`md5` and the :literal:`sha512` hash:
+
+.. code:: bash
+
+    $ kitab_dir_new=`mktemp -d`
+    $ kitab --store $kitab_dir_new import --digest md5 --digest sha512 .local/share/transmission/bcd99f1ab4155f2a2a362e5b7938a852
+    $ find $kitab_dir_new -type f -exec cat {} \;
+    /tmp/tmp.B6j41YMmEM/493f2a720d63156d77187bcd5f0715e4e765a38d616ef47f24e0df817ee6b4f601d47a06ffae10ef1f6ba60bb5d2e99a26318f035f9cd56e30bfe7bcdf64a792
+    <URN:sha512:493f2a720d63156d77187bcd5f0715e4e765a38d616ef47f24e0df817ee6b4f601d47a06ffae10ef1f6ba60bb5d2e99a26318f035f9cd56e30bfe7bcdf64a792> <https://purl.org/dc/terms/title> "Bitcoin: A Peer-to-Peer Electronic Cash System" ;
+        <https://purl.org/dc/terms/creator> "Satoshi Nakamoto" ;
+        <https://purl.org/dc/terms/type> "book" ;
+        <https://purl.org/dc/terms/MediaType> "application/epub+zip" .
+    /tmp/tmp.B6j41YMmEM/bcd99f1ab4155f2a2a362e5b7938a852
+    <URN:md5:bcd99f1ab4155f2a2a362e5b7938a852> <https://purl.org/dc/terms/title> "Bitcoin: A Peer-to-Peer Electronic Cash System" ;
+        <https://purl.org/dc/terms/creator> "Satoshi Nakamoto" ;
+        <https://purl.org/dc/terms/type> "book" ;
+        <https://purl.org/dc/terms/MediaType> "application/epub+zip" .
+
+
+Level up
+========
+
+Finally, here is an example bash script [2]_ that retrieves and applies metadata for a batch of files found in the directory given as the *first positional arg*.
+
+This script even renames the files according to the metadata applied.
+
+.. include:: code/portable-book-metadata/batch.sh
+    :code: bash
+    :number-lines: 0
+
+This last example will result in:
+
+- A media file named :literal:`$outdir/Bitcoin: A Peer-to-Peer Electronic Cash System.epub`
+- ... with metadata applied as extended attributes
+- An rdf-turtle metadata entry in :literal:`~/.local/share/kitab/idx/bcd99f1ab4155f2a2a362e5b7938a852`
+
+..
+
+    .. [1] The :code:`kitab` command in the script assumes you have built the *kitab binary* and made it available in your path.
+
+..
+
+    .. [2] The script uses :code:`xmllint`, which on Arch Linux is provided by the :literal:`libxml2` package.
diff --git a/content/code/portable-book-metadata/batch.sh b/content/code/portable-book-metadata/batch.sh
@@ -0,0 +1,56 @@
+# NOTE! this will only work if your fs supports xattr.
+# That's why we cannot use tmpfs (mktemp) here; tmpfs does not support xattr.
+
+# directory to copy media files to
+outdir=./$(uuidgen)
+mkdir -vp "$outdir"
+
+# Input dir is the first positional arg.
+indir=$1
+
+IFS=$'\n'
+
+# Retrieve metadata for each file and import it into the kitab store.
+# Also copy the media file to the separate output directory.
+for f in $(find "$indir" -type f); do
+    sum=$(md5sum "$f" | awk '{print $1;}')
+    echo "downloading metadata for $f"
+    srct=$(mktemp)
+    curl -s -X GET "https://libgen.rs/book/bibtex.php?md5=$sum" -o "$srct"
+    dstt=$(mktemp)
+    xmllint --html --xpath 'string(/html/body/textarea[@id="bibtext"])' "$srct" > "$dstt"
+    kitab import --digest "md5:$sum" "$dstt"
+    cp "$f" "$outdir/"
+done
+
+# Apply metadata imported from bibtex as xattr for the media files.
+RUST_LOG=info kitab apply --digest md5 "$outdir/"
+
+# Rename the files according to the metadata title and media type.
+for f in $(ls "$outdir"); do
+    title=$(getfattr --only-values -n user.dcterms:title "$outdir/$f")
+
+    f_typ=$(file -b --mime-type "$outdir/$f")
+    f_ext=""
+    case "$f_typ" in
+        "application/pdf")
+            f_ext=".pdf"
+            ;;
+        "application/epub+zip")
+            f_ext=".epub"
+            ;;
+        "application/x-mobipocket-ebook")
+            f_ext=".mobi"
+            ;;
+        "text/plain")
+            f_ext=".txt"
+            ;;
+        "text/html")
+            f_ext=".html"
+            ;;
+        *)
+            >&2 echo "unhandled mime type $f_typ"
+            exit 1
+    esac
+    mv -v "$outdir/$f" "$outdir/${title}${f_ext}"
+done
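
The post's central idea, that a metadata file is named after the hex digest of the media file so the two can be matched whatever the media file is called, can be sketched without kitab at all. This is a minimal illustration, not kitab's implementation: the helpers `digest_key`, `store_metadata` and `lookup_metadata` are hypothetical names, the stored text is a plain placeholder line rather than rdf-turtle, and GNU coreutils `md5sum` is assumed to be available.

```shell
#!/usr/bin/env bash

# Hex md5 digest of a file; this string doubles as the metadata file name.
# (Assumes GNU coreutils md5sum; hypothetical helper, not part of kitab.)
digest_key() {
    md5sum -- "$1" | awk '{print $1}'
}

# Write a metadata text to <store>/<digest of media file>.
store_metadata() {
    local store=$1 media=$2 text=$3
    mkdir -p -- "$store"
    printf '%s\n' "$text" > "$store/$(digest_key "$media")"
}

# Print the metadata that matches a media file, regardless of its name.
lookup_metadata() {
    local store=$1 media=$2
    cat -- "$store/$(digest_key "$media")"
}

# Demo: the metadata survives a rename because the lookup key is the
# content digest, not the file name.
work=$(mktemp -d)
printf 'not really a book\n' > "$work/original-name.txt"
store_metadata "$work/store" "$work/original-name.txt" 'title: Example'
mv -- "$work/original-name.txt" "$work/renamed.txt"
lookup_metadata "$work/store" "$work/renamed.txt"
rm -rf -- "$work"
```

Running the sketch prints the stored placeholder line even though the file was renamed between storing and looking up, which is the property the torrent example in the post relies on.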