siderea | [tech] Reference Document Management Software?

I want a thing that I think should surely exist, but I can't find it.

I want a LAMP-stack (or similar) web app that's open source, that I run on my own server,
that I can use as a proxy server in any browser,

that gives me some sort of "save this" button (via injection to the HTML or better a bookmarklet),

that tells the app to save a copy of what it just proxy-served me into my saved files database,

which can handle documents that are HTML pages and PDF (and PPT and DOC and JPG and PNG and GIF and little pink ponies, while asking for the moon),

and which has a web interface with a basic reference-style metadata thingy, such that I can capture (manually entered, if necessary!) intrinsic document metadata (URL where I found it, authors, date, publication/journal) and annotative metadata (my own tags, "folder" assignments, summary, other comments),

and supports full-text search of HTML and text-based PDFs.
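One way to picture the metadata split described above is a single record with an intrinsic half and an annotative half, stored in SQLite. This is only a sketch: the `Reference` class, the `refs` and `tags` tables, and every field name are made up for illustration and do not come from any existing reference manager.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import List


@dataclass
class Reference:
    # Intrinsic metadata: facts about the document itself.
    url: str
    title: str = ""
    authors: List[str] = field(default_factory=list)
    date: str = ""          # free-form text, entered by hand if necessary
    publication: str = ""
    # Annotative metadata: whatever the reader adds.
    tags: List[str] = field(default_factory=list)
    folder: str = ""
    summary: str = ""
    comments: str = ""


# Hypothetical schema: one row per saved document, one row per (document, tag).
SCHEMA = """
CREATE TABLE IF NOT EXISTS refs (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL,
    title TEXT, authors TEXT, date TEXT, publication TEXT,
    folder TEXT, summary TEXT, comments TEXT
);
CREATE TABLE IF NOT EXISTS tags (
    ref_id INTEGER NOT NULL REFERENCES refs(id),
    tag TEXT NOT NULL
);
"""


def save_reference(db: sqlite3.Connection, ref: Reference) -> int:
    """Insert one reference plus its tags and return the new row id."""
    cur = db.execute(
        "INSERT INTO refs (url, title, authors, date, publication,"
        " folder, summary, comments) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (ref.url, ref.title, "; ".join(ref.authors), ref.date,
         ref.publication, ref.folder, ref.summary, ref.comments),
    )
    ref_id = cur.lastrowid
    db.executemany(
        "INSERT INTO tags (ref_id, tag) VALUES (?, ?)",
        [(ref_id, tag) for tag in ref.tags],
    )
    db.commit()
    return ref_id


def refs_with_tag(db: sqlite3.Connection, tag: str) -> List[str]:
    """Return the URLs of every reference carrying the given tag."""
    rows = db.execute(
        "SELECT r.url FROM refs r JOIN tags t ON t.ref_id = r.id"
        " WHERE t.tag = ? ORDER BY r.id",
        (tag,),
    )
    return [row[0] for row in rows]
```

Keeping tags in their own table is what makes "list everything I classified under X" a single query.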

Put another way, I want a caching bookmark manager that supports a reference manager interface and functions as a proxy so it seamlessly has my authentication bits.

I have a caching bookmark manager, but it doesn't do PDF (or anything other than HTML) and doesn't manage references, and there's apparently reference manager software that does PDFs (but not HTML) and doesn't function as a proxy unless you go through a hosted solution on someone else's server.

And there are all these "reference manager" web applications which live on someone else's server, which is unacceptable. There are allegedly reference managers for the desktop that at least do PDF, but none of them work on an OS as old as the one I'm on (and your references are stuck on your computer).

These are the workflows I want it to support: (1) I roam around the internet, reading things, and clicking "save to my library", which captures what page or document I'm on and prompts me with a bunch of form fields to add additional info. (2) I can go to the main URL of my library, authenticate, and then access a web app that is basically a reference manager that supports projects and tagging, allowing me to call up a list of all the docs/urls/refs I classified to a certain thing. (3) extra credit: I can go to my library, authenticate, then do a full text search of my documents. Honestly, basic grep would be fine.
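For workflow (3), "basic grep" over the saved files could be roughly this small. It is a sketch that assumes the library is a directory of saved HTML and text files; the function name `grep_library` and the suffix list are invented for illustration. PDFs would need their text extracted first, which is not shown here.

```python
import re
from pathlib import Path


def grep_library(root, pattern, suffixes=(".html", ".htm", ".txt")):
    """Return (file name, line number, line) for each matching line.

    Only files whose suffix is in `suffixes` are read; anything else
    (PDFs, images) is skipped rather than searched as raw bytes.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((path.name, lineno, line.strip()))
    return hits
```

Calling `grep_library("/path/to/library", "proxy")` would list every saved line mentioning "proxy", case-insensitively. The path is a placeholder.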

Does anybody know if the thing I want (need) exists?

ETA: User review of the day:

ander1122 Posted 09/27/2012
★★★★★
I seldom had to create any academic literature in my career as a designer of miniature golf courses. Then I tried this app, and it was so fantastic that I changed my entire career course just so I could use it. That's how good it is. I suppose it'd be irresponsible of me not to warn you that, if you currently aren't called upon to prepare academic literature, and you're not prepared to change careers, you'd best avoid this app, despite its wonderfulness. Therefore, I shall: Please reread the preceding part of this paragraph, beginning with "...if you currently aren't called upon..." Anyway, this app is great. This concludes my review. Feel free to add your own citations.

Duly noted. Since I've already succumbed to a tawdry researcher lifestyle, supporting my academic journal article habit by selling my intellect on internet streetcorners, I suppose I have nothing left to lose.



squirrelitude

So, there's Zotero, which has a component that sits *in* your browser and can take snapshots of pages and extract metadata, and has full-text search and tagging and whatnot... so it's kind of like you want that + a web interface to your library. [ETA: Looks like they might have that too.]

How essential is the proxy, beyond "I want a copy of what was actually served, especially when paywalls/auth are involved"? Is it also important that it be usable from an arbitrary proxy-capable browser that can't have e.g. the Zotero connector?

(Zotero's collaboration service also has a web UI, but I'm not sure that's relevant here.)

Edited (found a web UI source repo) Date: 2018-03-13 09:25 pm (UTC)

siderea

By "in" you mean a plugin? The problems are:

(1) browser compatibility - I'm on a Mac and behind some version because it's a very old Mac, and

(2) browser bloat – part of why I want it on some other machine (i.e. a server) is to reduce the load on my device. Slinging PDFs taxes my machine already and trying to open up some additional specialty software – whether a desktop app or an extension in my browser – makes the problem worse not better.

Yes, it's a Firefox/Chrome extension for detecting/harvesting pages, and that shunts the data into a desktop app where you can actually curate your collection.

OK, so this clears up why a proxy is of interest -- you want to offload compute onto your server. I feel like that's going to be hard because the proxy is going to have a hard time knowing what network requests are relevant, unless of course it runs a headless browser. It would also need a TLS MITM mechanism...

No idea if Zotero would be too much bloat on your machine. I'm on a fairly old Thinkpad laptop, and running Debian stable, which isn't the freshest thing in the world itself. Zotero doesn't cause noticeable CPU load when I'm not dumping stuff into it. 170 MB of RAM if I'm understanding shared vs. resident memory correctly.

"I feel like that's going to be hard because the proxy is going to have a hard time knowing what network requests are relevant, unless of course it runs a headless browser"

? Are you seeing something I'm missing? I was thinking something like a regular proxy server, which caches everything temporarily, so if I go to foo.bar/bem/quux, my proxy server has a copy of foo.bar/bem/quux and all its related assets, and then if I click my bookmarklet while viewing foo.bar/bem/quux in my browser, the bookmarklet notifies my server, "save foo.bar/bem/quux and its assets into my library, and add this metadata". That sounds simple enough I'm tempted to write it myself.
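The "cache everything temporarily, keep it when the bookmarklet says so" idea can be sketched as below. Everything here is an assumption for illustration: the `ProxyCache` class, the `save_to_library` function, and especially the rule that an asset belongs to a page when its `Referer` header was that page. That rule is a heuristic. It misses anything requested with a different or missing referer, for instance images pulled in by a stylesheet.

```python
import time


class ProxyCache:
    """Short-lived record of everything the proxy has served, keyed by URL."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.entries = {}  # url -> (fetched_at, referer, body)

    def record(self, url, body, referer=None):
        """Called by the proxy for every response it passes through."""
        self.entries[url] = (time.time(), referer, body)

    def page_with_assets(self, page_url, now=None):
        """Return {url: body} for the page and anything requested from it."""
        now = time.time() if now is None else now
        found = {}
        for url, (fetched_at, referer, body) in self.entries.items():
            fresh = now - fetched_at <= self.ttl
            related = url == page_url or referer == page_url
            if fresh and related:
                found[url] = body
        return found


def save_to_library(cache, library, page_url, metadata):
    """What the bookmarklet's request would trigger on the server.

    `library` is a plain dict standing in for the real saved-files store.
    """
    files = cache.page_with_assets(page_url)
    if page_url not in files:
        raise KeyError("page is no longer in the proxy cache: " + page_url)
    library[page_url] = {"metadata": dict(metadata), "files": files}
    return sorted(files)
```

The bookmarklet's only job in this design is to send the current URL and the form fields; the bytes never leave the server.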

Were you thinking about the problem of related assets?

"It would also need a TLS MITM mechanism"

Yes, but I was under the impression that was a solved problem because proxies exist. Is it not? Or is it very hard?

If it's a static page with static assets (CSS, JS, images...) or just a PDF, then it's not hard. The bookmarklet can traverse the DOM and pick out all those URLs, then ask the server to retain the assets at those URLs.
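For the static case, the "pick out all those URLs" step looks roughly like this. A real bookmarklet would do it in the browser against the live DOM; this sketch does the equivalent on the server against the saved HTML, using only the standard library. The names `AssetCollector` and `static_asset_urls` are invented here, and only `img`, `script`, and stylesheet `link` tags are considered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collect absolute URLs of images, scripts, and stylesheets."""

    ASSET_ATTRS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        wanted = self.ASSET_ATTRS.get(tag)
        if wanted is None:
            return
        attrs = dict(attrs)
        # <link> is also used for non-assets (canonical, alternate, ...).
        if tag == "link":
            rel = (attrs.get("rel") or "").lower().split()
            if "stylesheet" not in rel:
                return
        value = attrs.get(wanted)
        if value:
            url = urljoin(self.base_url, value)  # resolve relative paths
            if url not in self.assets:
                self.assets.append(url)


def static_asset_urls(html, base_url):
    """Return asset URLs in document order, without duplicates."""
    collector = AssetCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.assets
```

Anything fetched later by script never appears in the markup, so this list is blind to it.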

It gets harder if there's content loaded from ajax requests and whatnot, which is distressingly common. Could even be from third-party hosts. At that point, the bookmarklet would not have sufficient information to provide the server. Even if the server grabbed all things requested after the initial page load, assuming they're related, this would not detect objects served from the browser cache. (e.g. the second time you viewed the same page.) Cross-domain ajax requests already in the cache would be lost. I don't know how common that would be, but it would certainly suck to find out the hard way.

Zotero saves the *rendered* page, which is an entirely different approach. Maybe the bookmarklet could do that too, but then I'm not sure how different it is from the Zotero connector, besides the option to have the receiving software running on a different computer. (Fun thing for me to maybe try later: See if I can use port-forwarding so the standalone Zotero is running on a different computer than my browser. It appears to listen on three ports.)

On proxies, I wasn't trying to imply a huge technical challenge. The issue is that you'd need to generate a root certificate, and then install it in your browser so that your proxy could sign on behalf of every website. I'm not sure if you'd find this as distasteful as I do, but it would be required for this type of proxy (unless you are just viewing raw HTTP sites I guess.)


mdlbear

Agreed. I actually built something like this about -- must have been well over a decade ago -- and it was getting difficult even then.

A bookmarklet or browser extension would be the way to go to make sure that you're capturing what you're seeing.

Wait, does Zotero even support running your own server? Those GitHub repositories seem all about building your own clients/connectors.

Using their server is a hard fail of my requirements.

ETA: AHAHAHAHA OMG No.

ETA2: Oh, wait: this interesting article pointed me at https://github.com/zotero/dataserver. So apparently maybe you can run your own Zotero server, but you have to edit your clients and compile them yourself, because the client software doesn't have any way to point at a non-Zotero.org Zotero server.

Or one can apparently do what that blogger does, and carefully copy SQLite files around the internet.

Edited Date: 2018-03-13 10:15 pm (UTC)

I *think* you can sync to any old WebDAV server, and those aren't awful to set up. And then you can sync your clients using that server. I've never explored the possibility of setting up my own web UI, since I just had the one laptop and the one client.

You shouldn't have to recompile the client; there's a config editor (standalone Zotero is built on XULRunner, so this is basically about:config) that has slots for entering URL, username, and password for sync, and it looks like I was using WebDAV to my own server once upon a time.

According to that blog post, which is out of date, WebDAV sync only does files (attachments), which is good and all, but doesn't do the metadata. Has that changed, do you know? That's why he was moving SQLite files around manually/via bash: to avoid his metadata going through Zotero.org.
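If one did go the "move the SQLite file around" route, copying the file while something has it open risks a torn copy. A hedged sketch of a safer snapshot step, using Python's standard `sqlite3` backup API, follows. The function name and both paths are placeholders. Nothing here knows anything about Zotero's actual database layout, and closing the client before copying is still the cautious choice.

```python
import sqlite3


def snapshot_sqlite(src_path, dest_path):
    """Write a consistent copy of the database at src_path to dest_path.

    Connection.backup copies page by page through SQLite itself, so the
    result is a valid database even if the source is written to
    during the copy.
    """
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        src.backup(dest)
    finally:
        dest.close()
        src.close()
```

The snapshot file, not the live one, is then what gets shipped to the other machine by whatever means (scp, rsync, WebDAV).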

Oh, hmm... I was sure it was syncing the metadata too, the same way it does with their main site. :-/ I don't know if it has changed, either direction. It has been a while since I looked into it.

alexxkay

That review is amazing!

