# Re: searching through browser history

Thu 09 Sep 2021

This is a reply to “Searching Through Browser History?” by ~ew0k.

ew0k discusses the desirability and feasibility of a local search engine that is integrated with your browser history (for the WWW and Gemini).

Back around 2001 or 2002, I wrote exactly such a thing for the WWW. It was written as an HTTP proxy, in Python, over the course of a couple of days when my sysadmin work at the time wasn't very demanding. It supported HTTP/1.0, plus a limited subset of HTTP/1.1: name-based virtual hosting, keep-alives, and, IIRC, gzip encoding, but not chunked transfers, byte-range requests, or request pipelining. The proxy downloaded the requested documents and passed them on unmodified to the client, but for text/* documents it also full-text indexed them in a pretty naive way (I was young, and full-text search libraries weren't really a thing yet): it stripped the HTML tags, if any, and then, for each word, added the document's URL to a list of URLs containing that word, kept in a dictionary (hash table) keyed by the word. There might have been some additional metadata associated with the URL, but I don't really remember. No stemming or anything, but I did throw out words that were too short or too common.
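
In rough outline, that naive indexing comes down to something like the sketch below (a reconstruction, not the original code; the stopword list, the length cutoff, and the use of a set rather than a list are placeholders):

```python
# Naive inverted index: map each word to the set of URLs whose text contains it.
import re
from html.parser import HTMLParser

STOPWORDS = {"the", "and", "for", "with", "that", "this"}  # placeholder list
MIN_LENGTH = 3                                             # placeholder cutoff


class TextExtractor(HTMLParser):
    """Strip HTML tags and keep only the text content."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def index_document(url, body, index):
    """Add url to the posting set of every interesting word in body."""
    extractor = TextExtractor()
    extractor.feed(body)
    text = " ".join(extractor.chunks)
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if len(word) < MIN_LENGTH or word in STOPWORDS:
            continue
        index.setdefault(word, set()).add(url)


# index = {}
# index_document("http://example.com/", "<html><body>searching history</body></html>", index)
# index["searching"] -> {"http://example.com/"}
```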

The proxy also provided a local HTTP server that served pages for searching and editing your history, as well as viewing statistics about it.
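
A minimal sketch of what such a search page could look like, built on Python's standard http.server and the in-memory index from the sketch above (the port, the single-word query, and the bare-bones HTML are illustrative, not what the original served):

```python
# Local search UI: GET /?q=word returns the URLs indexed under that word.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs
import html

index = {}  # word -> set of URLs, filled in by the proxy


class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query).get("q", [""])[0]
        hits = index.get(query.lower(), set())
        items = "".join(
            '<li><a href="{0}">{0}</a></li>'.format(html.escape(url, quote=True))
            for url in sorted(hits)
        )
        body = "<h1>History search</h1><ul>{}</ul>".format(items)
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.end_headers()
        self.wfile.write(body.encode("utf-8"))


# HTTPServer(("127.0.0.1", 8089), SearchHandler).serve_forever()
```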

How did it work? It was *okay*. It didn't index pages served over HTTPS, because that was a can of worms, but HTTPS pages were not all that common in those days, and generally not that interesting to index, either – they were usually the checkout pages of eCommerce sites (sometimes the whole site, but not always). The speed was mostly okay. The indexing was done on a separate thread from the fetching, but Python's Global Interpreter Lock limited concurrency. I played with different key-value stores to speed up the indexing, but none of them made much of a difference. Still, I found it fairly useful at the time, though at some point I lost interest and stopped maintaining it.

Today, I would implement both the indexing and the persistence very differently (use a library for indexing, with a job queue to store indexing requests and a dedicated indexer thread that consumes them), but it's all rather beside the point, because you couldn't implement it as a proxy anyway: most pages are served over HTTPS today. You'd have to implement it as a browser plugin, which means writing it in JavaScript (or something that compiles to it, or to WebAssembly), plus a lot of packaging work, and I just don't have the time or energy for that. Unless I get really curious about parenscript.
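
The job-queue arrangement would look roughly like this (a sketch; index_document stands in for whatever indexing library actually gets used, and the fetch path is only hinted at in the final comment):

```python
# Fetch threads enqueue (url, body) pairs; one dedicated thread consumes and indexes them.
import queue
import threading

jobs = queue.Queue()   # holds (url, body) tuples, or None as a shutdown sentinel
index = {}


def indexer():
    while True:
        job = jobs.get()
        if job is None:          # shutdown sentinel
            break
        url, body = job
        index_document(url, body, index)   # the naive indexer sketched earlier
        jobs.task_done()


worker = threading.Thread(target=indexer, daemon=True)
worker.start()

# In the fetch path, after a text/* response has been passed back to the client:
# jobs.put((url, body_text))
```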

For Gemini, it would be easy to add this to a client. I like the idea of a proxy better, but I don't think we have an established standard for proxies (I should search the Gitlab and the mailing list).
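
For what it's worth, the client-side hook might look something like this sketch: fetch a page over Gemini and, if it is text, hand it to the same indexer. The certificate handling here is deliberately simplified (a real client would do TOFU pinning rather than disabling verification):

```python
# Fetch a Gemini URL and index the body if the response is a textual success.
import socket
import ssl
from urllib.parse import urlparse


def fetch_and_index(url, index):
    host = urlparse(url).hostname
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE   # placeholder for proper TOFU handling
    with socket.create_connection((host, 1965)) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            tls.sendall((url + "\r\n").encode("utf-8"))
            raw = b""
            while True:
                chunk = tls.recv(4096)
                if not chunk:
                    break
                raw += chunk
    header, _, body = raw.partition(b"\r\n")
    status, _, meta = header.decode("utf-8").partition(" ")
    if status.startswith("2") and meta.startswith("text/"):
        index_document(url, body.decode("utf-8", "replace"), index)   # indexer from the earlier sketch
```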