The situation is this: I have access to a site that holds a history of the work I have done. The current version of the site is due to be shut down in 12 months’ time, and the replacement won’t contain the legacy data. The site hosts a page for each of my tasks. Each page includes a large number (80+) of fields of useful information as well as large sections of technical text.
I’m copying this information into DEVONthink. Thanks to @cgrunenberg and @chrillek I have a script that opens each page in turn and saves it as a PDF record. As it saves each page, it extracts some fields from the HTML and saves them as custom metadata alongside the PDF. I need to save the PDF for regulatory reasons, so that I can demonstrate the content has not been tampered with (yes, I know PDFs can be tampered with, but this is the regulatory standard).
However, as time goes by I’ve realised that there are more of the 80+ fields that I would like to extract from the HTML and add to the custom metadata of the PDF records. I believe I have four options:
- Attempt to extract the fields from the PDF. I wouldn’t even know where to start with this.
- Rerun my script now, extract all 80+ fields, and store them as custom metadata on the PDF record, although I know this is overkill. Because some of the fields lack an "id", I could spend a lot of time working out how to extract fields that I never end up needing.
- In the short term, re-retrieve the source HTML from the site, extract the specific fields I require, and update the custom metadata of the existing PDF records. But with literally 10,000+ documents this is going to be a burden on the host, and I would prefer not to do this each time I decide I want to extract more data from the HTML.
- In an ideal world, do one more pass of the site, download each page’s HTML, and attach it to the existing PDF record. Then, if I later identify a specific field that I’d benefit from having as custom metadata, I can write a script that visits all of the records that contain the HTML source, extracts the fields I require, and adds them as custom metadata to the existing PDF records.
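To make the last option concrete, here is a minimal sketch of the later extraction pass, written in Python purely for illustration and using only the standard library. It assumes the stored HTML is already available as a string and that the wanted fields carry an `id` attribute; the ids `task-status` and `owner` are hypothetical, not taken from the real site. Writing the results back to the DEVONthink record is deliberately left out.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class FieldExtractor(HTMLParser):
    """Collect the visible text of elements whose id is in `wanted_ids`."""

    def __init__(self, wanted_ids):
        super().__init__()
        self.wanted_ids = set(wanted_ids)
        self.fields = {}
        self._current = None  # id of the element currently being captured
        self._depth = 0       # nesting depth inside that element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._current is not None:
            self._depth += 1
            return
        element_id = dict(attrs).get("id")
        if element_id in self.wanted_ids:
            self._current = element_id
            self._depth = 1
            self.fields[element_id] = ""

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace before storing the value.
            text = self.fields[self._current]
            self.fields[self._current] = " ".join(text.split())
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.fields[self._current] += data


def extract_fields(html_source, wanted_ids):
    """Return {id: text} for each wanted id found in the HTML."""
    parser = FieldExtractor(wanted_ids)
    parser.feed(html_source)
    parser.close()
    return parser.fields
```

Calling `extract_fields(html, ["task-status", "owner"])` returns a dictionary containing only the ids that were actually present in the page, so a missing field simply yields no entry. Fields without an `id` would need a different matching rule (for example, matching on a neighbouring label), which this sketch does not attempt.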
The DEVONthink documentation doesn’t make it clear that some fields such as “source” are only available to certain classes of record type, hence my misunderstanding.
I don’t want to create and manage duplicate records (one HTML record and one PDF record). Storing the entire HTML source as a custom metadata field appears to be a possibility, but I wouldn’t have thought it ideal, so I’m looking for better solutions.
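If the HTML did end up in a text-type custom metadata field, one way to keep it compact and free of line breaks would be to compress it and encode it as base64. The sketch below is in Python for illustration only; the function names `pack_html` and `unpack_html` are made up, and whether DEVONthink copes well with a value of this size in a custom metadata field is an assumption that has not been verified.

```python
import base64
import zlib


def pack_html(html_source: str) -> str:
    """Compress HTML and return it as a single-line ASCII string."""
    compressed = zlib.compress(html_source.encode("utf-8"), 9)
    return base64.b64encode(compressed).decode("ascii")


def unpack_html(packed: str) -> str:
    """Reverse pack_html and return the original HTML text."""
    return zlib.decompress(base64.b64decode(packed)).decode("utf-8")
```

The round trip is lossless, so a later script could call `unpack_html` on the stored value and parse the result exactly as if it had been fetched from the site. The packed value is not human-readable or searchable, which is one reason this is a fallback rather than a preferred solution.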
Is there a more appropriate field in a standard PDF record where I could store the HTML source? Do you have any suggestions?