manbytesgnu_site

Source files for manbytesgnu.org
git clone git://holbrook.no/manbytesgnu_site.git

commit 8e0d64c4eac862a3955906915ee07f1e2784fe80
parent 1598776431407220c6524bdac50ad499d6e30c3a
Author: nolash <dev@holbrook.no>
Date:   Mon,  3 May 2021 21:33:46 +0200

Published webshot article

Diffstat:
M content/20210419_docker_python.rst   |    4 ++++
M content/20210420_docker_offline.rst  |    4 ++++
M content/20210421_web_shapshot.rst    |  167 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
M content/code/web-snapshot/webshot.sh |   24 ++++++------------------
M lash/static/css/style.css            |   47 +++++++++++++++++++++++++++++++++++++++++++++++
M pelicanconf.py                       |    5 ++++-
6 files changed, 228 insertions(+), 23 deletions(-)

diff --git a/content/20210419_docker_python.rst b/content/20210419_docker_python.rst
@@ -14,6 +14,10 @@ Local python repository
 :status: published
 
+.. CAUTION::
+
+	This is a purely technical article. You should probably be `this geeky <https://g33k.holbrook.no/c18beedd>`_ to continue reading.
+
 In the previous part of this series we were able to connect a Docker network to a virtual interface on our host, neither of which have access to the internet. That means we are ready to host content for the container locally. And we will start out with creating a local Python repository.
 
 Packaging the packages
diff --git a/content/20210420_docker_offline.rst b/content/20210420_docker_offline.rst
@@ -14,6 +14,10 @@ The routing to freedom
 :status: published
 
+.. CAUTION::
+
+	This is a purely technical article. You should probably be `this geeky <https://g33k.holbrook.no/8319a926>`_ to continue reading.
+
 Five years ago I decided that I wanted to be able to work from anywhere, anytime. Four years ago, that promise was kept. In part. I do not need to go to an office somewhere. I can work outside in a park if I want to. I can ride on to a new town every day. I only ever need to bring my trusty old `Tuxedo Laptop`_ wherever I go.
diff --git a/content/20210421_web_shapshot.rst b/content/20210421_web_shapshot.rst
@@ -1,16 +1,175 @@
-Web snapshots with proof
+Proving what you link to
 ########################
-:date: 2021-04-21 09:37
-:modified: 2021-04-21 09:37
+:date: 2021-05-03 14:22
 :category: Archiving
 :author: Louis Holbrook
 :tags: web,hash,chromium
 :slug: web-snapshot
 :summary: Generating proof of a web resource when you read and share
 :lang: en
-:status: draft
+:status: published
+
+When we send a link to someone, we are trusting that whoever is behind that link will serve that someone the content we actually saw. Give or take an ad or two.
+
+The same goes when we bookmark the link for later retrieval. We are trusting that the entity will be serving that content at any time in the future that we may want it.
+
+This may not be a huge problem if the page is merely a list of `dead baby jokes`_. They are objectively funny, of course. But you can also get along without them.
+
+But what of the case of formal, scientific texts we may depend on, and that use citations from the web as part of their source material? Usually, they refer to a source by *link* and *date of retrieval*. This is not of much use unless the actual source and/or render they saw at that time is also available.
+
+That may not always be the case.
+
+
+Take care of your shelf
+=======================
+
+	"No worries, the `Wayback Machine`_ has me covered."
+
+Yes. But no. The Wayback Machine is a (thus far) centralized entity that depends on a few idealists and donations to keep going. If they cannot keep going, they depend on passing the buck to someone else. If that someone is evil, they may take it and rewrite history to suit themselves. If that someone else cannot be found, it becomes a garbage collection blip on Bezos' `infrastructure monopoly`_ dashboard. [1]_
+
+That aside, sources like the Wayback Machine are like libraries. Libraries are, of course, essential. Not only because they serve as one of the pillars of democracy, providing free access to knowledge for everyone. They are also essential because it's simply not very practical for you to pre-emptively own all the books that you may at some point want to read. Let alone ask around in your neighborhood if they happen to have a copy (although a crowd-sourced library app sounds like a fun decentralization project to explore).
+
+You may however want to keep a copy of the books you *depend* on, and the ones you *really like*. Just to make really sure you have them available when you want them. Then, if some New Public Management clowns get the chance to gut public infrastructure where you live, or someone starts a good old-fashioned fascist book burning, you have yourself covered.
+
+
+A lack of friction
+==================
+
+Yes, stuff may disappear on the web. Just as books may.
+
+On the web that stuff can get *rewritten*. Books may be rewritten, too. [2]_ The previous editions of the books will still exist as independent physical objects until they degrade, as long as something is not actively done to them. But so too with data on storage media. If it is not renewed or copied during that lifetime it may degrade until it becomes illegible. And of course, it may also simply be deleted. And without the smell of smoke at that.
+
+That's the difference that seems to leap out when using this imagery. How *easy* it is to change and destroy stuff at scale on the web compared with the real world of books. And how inconspicuously it can happen, without *anyone* noticing. And for those who notice, it is very hard to prove what has changed, unless you have a copy. [5]_
+
+So what can we do? Copies and proofs are definitely keywords here. Fortunately, copying is what computers are all about. And making a cryptographic proof of what you see is easy enough these days, too. The tricky bit is to build *credibility* around that proof. But let's stick our head in the sand and start with the easy part and see where it takes us.
+
+
+Look, no head
+=============
+
+.. WARNING::
+
+	Nerdy zone ahead
+
+A good start is to dump the document source to disk, then calculate and store the sum of it.
+
+In many cases this will be insufficient, though, as many sites populate the DOM through scripts, either in part or in full. As the use-case here is humans vouching that what they see is what they get, human aids will be needed. In other words, we need to render the page to store what we actually see.
+
+Printing to PDF from the browser is an option, but that is really difficult to automate. Fortunately, modern browser engines provide command line access to rendering. Since I mostly use `Brave Browser`_ these days, we'll use `headless Chromium`_ here.
+
+In addition to source, sum and rendering, we should also include a copy of the request headers for good measure.
+
+Thus, we end up with something like this:
+
+.. include:: code/web-snapshot/webshot.sh
+	:code: bash
+
+
+What does this mean
+===================
+
+Let's sum up what information we've managed to store with this operation.
+
+- We have a copy of the unrendered source. It may or may not include all the information we want to store.
+- We have a fingerprint of that source.
+- We have a copy of the headers we were served when retrieving the document. [3]_
+- We have an image copy of what we actually saw when visiting the page.
+- We have a date and time for retrieval (file attributes).
+
+To link the headers together with the visual copy, we could sum the header file and image file as well, put those sums together with the content sum in a deterministic order, and calculate the sum of those sums. E.g. [4]_
+
+.. code-block:: bash
+
+	$ awk '{ print $1; }' contents.txt.sha256 > sums.txt
+	$ sha256sum headers.txt | awk '{ print $1; }' >> sums.txt
+	$ sha256sum <pdf file> | awk '{ print $1; }' >> sums.txt
+	$ sha256sum sums.txt | awk '{ print $1; }' > topsum.txt
+
+If we now sign *this* sum, we are confirming that for this particular resource:
+
+	"This was the source. These were the headers for that source. This is how that source, served in that manner, looked for **ME** at the time."
+
+
+Proving links in this post
+==========================
+
+This post makes use of several external links to articles. So as a final step, let's eat our own dogfood and add proofs for them.
+
+.. list-table:: Article retrieval proofs
+   :widths: 34 22 22 22
+   :header-rows: 1
+
+   * - Link
+     - Content
+     - Header
+     - Image
+
+   * - https://gizmodo.com/i-tried-to-block-amazon-from-my-life-it-was-impossible-1830565336
+     - `0c657ec2 <{filename}misc/web_snapshot/0c657ec25e55e72702b0473f62e6d632dece992485f67c407ed1f748b3f40bc2.txt>`_
+     - `360125fa <{filename}misc/web_snapshot/360125fa513b8db8380eb2b62c4479164baa5b48d8544b2242bcc7305bad0de4.txt>`_
+     - `a2584ca0 <{filename}misc/web_snapshot/a2584ca07ba16b15b9fad315a767cbb0c4b7124abdfd578cab2afb4dce9d1971.pdf>`_
+
+   * - https://www.thelocal.se/20111109/37244/
+     - `be8741ff <{filename}misc/web_snapshot/be8741ff91c6e3eac760a3a4652b57f47bce92fa9b617662296c2ab5b3c5fe31.txt>`_
+     - `6c20e767 <{filename}misc/web_snapshot/6c20e7678a0235467687425b9818718ab143fd477afee2087d7f79c933abdc75.txt>`_
+     - `f1c557b9 <{filename}misc/web_snapshot/f1c557b9555149dde570976ed956ffb17d17b99cea5f2651020f66408dacf301.pdf>`_
+
+   * - https://www.statista.com/chart/18819/worldwide-market-share-of-leading-cloud-infrastructure-service-providers/
+     - `595a07b8 <{filename}misc/web_snapshot/595a07b85e6b75a41fe0880c8538f15c4b6da5770db230d986efa2e080ca479a.txt>`_
+     - `f2aa95b4 <{filename}misc/web_snapshot/f2aa95b42c5420cb11ccbf1cb084d5e5ccc3fc6cd575c51e9ee473b9df19c890.txt>`_
+     - `6a610fa6 <{filename}misc/web_snapshot/6a610fa69d4909cc9aa1dcb7105203cc50c1bcf26733411964f95cbcbdf37eb5.pdf>`_
+
+   * - https://web.archive.org/web/20170710185409/https://www.botkyrka.se/arkiv/nyhetsarkiv/nyheter-startsida/2017-07-10-angaende-uttalanden-av-journalisten-janne-josefsson-om-bibliotek-botkyrka.html
+     - `5a806be4 <{filename}misc/web_snapshot/5a806be410da82986a85f68658279234c1a5cf3bb6dc55da137d5274dc722f26.txt>`_
+     - `078d9fe6 <{filename}misc/web_snapshot/078d9fe6be070de0d378a6e903f8558a7da4917cba5c95ca453ae1936541e4f6.txt>`_
+     - `2d0f0e69 <{filename}misc/web_snapshot/2d0f0e69e7c8b6ffe9ff8ffc8702a78d6a2d46ab1edd4123e84eb211171d6cde.pdf>`_
+
+   * - https://www.breitbart.com/europe/2017/07/19/swedens-libraries-pulp-traditional-children-books-racist-phrases/
+     - `a8c6b61c <{filename}misc/web_snapshot/a8c6b61cec2cce1a58bcf7a65091a6c2e8510ca5fa17f2d242d286f087d95cd5.txt>`_
+     - `3ba9094b <{filename}misc/web_snapshot/3ba9094b295a898cfe8cba256f4ebf65ef98ff05bb3034e048e517d52fc13d33.txt>`_
+     - `eb76a87f <{filename}misc/web_snapshot/eb76a87f224b0e4f4e5d1673b1505aaeee28d8ad03ce544656b6f5ac7c4d9983.pdf>`_
+
+Clicking on the "image" links, we see that thanks to the recent ubiquity of cookie nag boxes screaming "accept all" at you, those very boxes are now blocking the content we want to get at. So we will need more work to get to where we want by automation. But it's a start.
+
+
+.. _dead baby jokes: https://dead-baby-joke.com/
+
+.. _Wayback Machine: https://archive.org
+
+.. _Brave Browser: https://brave.com/
+
+.. _headless Chromium: https://developers.google.com/web/updates/2017/04/headless-chrome
+
+.. _infrastructure monopoly: https://gizmodo.com/i-tried-to-block-amazon-from-my-life-it-was-impossible-1830565336
+
+..
+
+.. [1] Early 2021 survey puts Amazon at one-third of the global market share. https://www.statista.com/chart/18819/worldwide-market-share-of-leading-cloud-infrastructure-service-providers/
+
+..
+
+.. [2] 2011 sparked a controversy_ around Astrid Lindgren's Pippi Longstocking. Echoing those viewpoints, the books were edited in 2015, during which some alleged "racist" content was altered. The rabid right-wing media later spun a `false tale of mass purges`_ of Pippi books around one single Swedish library's decision to throw out copies of the original "racist" versions. Case in point, a public statement in which the library tries to justify its actions is no longer available on their website, and has to be retrieved by the Wayback Machine. https://web.archive.org/web/20170710185409/https://www.botkyrka.se/arkiv/nyhetsarkiv/nyheter-startsida/2017-07-10-angaende-uttalanden-av-journalisten-janne-josefsson-om-bibliotek-botkyrka.html (in Swedish)
+
+..
+
+.. [3] Well actually, not quite. We did the same request twice, but they were two separate requests. Using a single request would improve the script.
+
+..
+
+.. [4] We use the hex representation for clarity here. A proper tool would convert the hex values to bytes before calculating the sum over them.
+
+..
+
+.. [5] Another important difference is that the book does not need to be *interpreted* by a *machine* in order to make sense for a human. Availability of tooling is definitely an equally important topic in this discussion. However this post limits focus to the actual data itself.
+
+..
+	.. _data availability: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.433.3480
+
+.. _controversy: https://www.thelocal.se/20111109/37244/
+
+..
+	https://web.archive.org/web/20170710185409/https://www.botkyrka.se/arkiv/nyhetsarkiv/nyheter-startsida/2017-07-10-angaende-uttalanden-av-journalisten-janne-josefsson-om-bibliotek-botkyrka.html <- not available on link, no result on search https://www.mhpbooks.com/the-trouble-with-pippi/
+
+.. _false tale of mass purges: https://www.breitbart.com/europe/2017/07/19/swedens-libraries-pulp-traditional-children-books-racist-phrases/
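
Footnote [4] in the article notes that a proper tool would hash the raw sum bytes rather than their hex representation. A minimal sketch of that byte-wise variant plus the signing step, assuming the sums.txt and topsum.txt names from the article's example; the choice of gpg key is the user's own:

	# convert the hex sums back to raw bytes before hashing the concatenation
	xxd -r -p sums.txt | sha256sum | awk '{ print $1; }' > topsum.txt
	# a detached, armored signature over the top sum vouches for the whole bundle
	gpg --armor --detach-sign topsum.txt   # writes topsum.txt.asc
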
diff --git a/content/code/web-snapshot/webshot.sh b/content/code/web-snapshot/webshot.sh
@@ -1,11 +1,7 @@
 #!/bin/bash
 
-# possible regex for title
-# grep -e "<title>" | sed -e "s/^.*<title>\([^<]*\)<\/title>/\\1/g
-# should also convert xml entities, eg. &#8211 -> \u2013 (int -> hex) and render
-
 f=${WEBSHOT_OUTPUT_DIR:-/tmp}
-title_parser=${WEBSHOT_TITLE_PARSER} # script that takes contents.txt as input and outputs a single utf8 string
+url=$1
 title=$2
 
 >&2 echo using outdir $f
@@ -20,25 +16,17 @@ pushd $t
 echo $1 > url.txt
 curl -s -I $1 > headers.txt
 curl -s -X GET $1 > contents.txt
-sha256sum contents.txt > contents.txt.sha256
+z=`sha256sum contents.txt`
+echo $z > contents.txt.sha256
+h=`echo -n $z | awk '{ print $1; }'`
 
-# determine title to use and store it, too
-#TODO insert title name protection for mkdir
 if [ -z "$title" ]; then
-	if [ ! -z "$title_parser" ]; then
-		title=`$title_parser contents.txt`
-	fi
-fi
-
-if [ ! -z "$title" ]; then
-	echo $title > title.txt
+	title=$h
 fi
 
 >&2 echo using title $title
 
-# rendered snapshot
-h=`cat contents.txt.sha256 | awk '{ print $1; }'`
-chromium --headless --print-to-pdf $1
+chromium --headless --print-to-pdf $url
 
 n=${d}_${h}
 mv output.pdf $n.pdf
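
A usage sketch for the revised script, assuming it is saved as webshot.sh and made executable; the URL is hypothetical, and WEBSHOT_OUTPUT_DIR falls back to /tmp as in the script:

	# snapshot with an explicit title
	WEBSHOT_OUTPUT_DIR=$HOME/snapshots ./webshot.sh https://example.com/article "example article"
	# with no title argument, the content sum doubles as the title
	./webshot.sh https://example.com/article
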
diff --git a/lash/static/css/style.css b/lash/static/css/style.css
@@ -140,6 +140,53 @@ pre.code {
 }
 
+/* notice boxes */
+
+div.warning {
+	background-color: #faa;
+	border: 3px solid #f00;
+	padding: 0.3em 0.3em 0.4em 1.5em;
+}
+
+div.admonition p.admonition-title {
+	color: #fff;
+	font-weight: 900;
+	text-transform: uppercase;
+}
+
+
+div.warning p.admonition-title {
+	text-shadow: -1px 1px 0 #f66,
+		2px 2px 0 #f66,
+		1px -1px 0 #f66,
+		-1px -1px 0 #f66;
+}
+
+
+div.caution {
+	background-color: #fa0;
+	border: 3px solid #d80;
+	padding: 0.3em 0.3em 0.4em 1.5em;
+}
+
+div.caution p.admonition-title {
+	text-shadow: -1px 1px 0 #fa6,
+		2px 2px 0 #fa6,
+		1px -1px 0 #fa6,
+		-1px -1px 0 #fa6;
+}
+
+/* footnotes */
+
+table.footnote {
+	border-left: 3px solid #ccc;
+	margin-top: 2.0em;
+}
+
+table.footnote td.label {
+	padding-left: 1.2em;
+}
+
 /* custom: identities */
 div#keys {
 	font-size: 1.2em;
diff --git a/pelicanconf.py b/pelicanconf.py
@@ -35,7 +35,10 @@ RELATIVE_URLS = True
 DISPLAY_CATEGORIES_ON_MENU = True
 
-PLUGINS = ['pelican.plugins.neighbors']
+PLUGINS = ['pelican.plugins.neighbors', 'sign']
 
 MENUITEMS = [('tags', '/tags.html')]
+
+STATIC_PATHS = ['images', 'misc']
+
+PLUGIN_SIGN_GPGKEY = 'd1d0e001'
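
Since the proof files in the article's table are named after their own SHA256 sums, a reader can check a published snapshot against its filename. A minimal sketch, assuming the files are served under misc/web_snapshot/ on manbytesgnu.org:

	f=0c657ec25e55e72702b0473f62e6d632dece992485f67c407ed1f748b3f40bc2.txt
	curl -s -O https://manbytesgnu.org/misc/web_snapshot/$f
	# the name (minus extension) should equal the sum of the content
	test "$(sha256sum $f | awk '{ print $1; }')" = "${f%.txt}" && echo proof matches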