Challenges involved in archiving a webpage
Last year, as I was picking up ideas around building a personal archival system, I put together a small utility that would download and archive a webpage. As I kept thinking on the subject, I realized it has some significant shortcomings:
- In the utility I parse the content of the page, find the different kinds of URLs (img, css, js), and recursively fetch those static resources so that in the end I have an archive of the page (a rough sketch of this extraction step appears after this list). But there is more to how a page gets rendered: the browser parses the HTML and all of its resources to finally render the page. The program we write has to be equally capable, or else the archive won't be complete.
- Whenever I have been behind a proxy on college campuses, I have noticed reCAPTCHA pop up saying something along the lines of suspicious activity being noticed from my connection. With this automated archival system, how do I avoid that? I have a feeling that if automated browsing of a page triggers reCAPTCHA, the system will be locked out and won't get any relevant content from the page. My concern is that I don't know enough about how and when CAPTCHAs trigger, or how to avoid or handle them and still have guaranteed access to the content of the page.
- Handling paywalls, limits on how many articles can be accessed in a certain time window, or banners that obscure the content behind a login screen.
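
To make the first point concrete, here is a minimal sketch of the extraction step, assuming a plain static-HTML page and using only the Python standard library. The names (`ResourceCollector`, `collect_resources`) and the example URL are placeholders, not the actual utility; and because this is a static parse, it will miss any resources that scripts load at runtime, which is exactly the gap described above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class ResourceCollector(HTMLParser):
    """Collect static resource URLs (img, script, stylesheet) referenced by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Images and scripts reference their resource via src.
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.add(urljoin(self.base_url, attrs["src"]))
        # Stylesheets are pulled in via <link rel="stylesheet" href="...">.
        elif tag == "link" and attrs.get("rel") == "stylesheet" and attrs.get("href"):
            self.resources.add(urljoin(self.base_url, attrs["href"]))


def collect_resources(url):
    # Fetch the page and feed it through the parser to gather resource URLs.
    html = urlopen(url).read().decode("utf-8", errors="replace")
    collector = ResourceCollector(url)
    collector.feed(html)
    return collector.resources


if __name__ == "__main__":
    # Placeholder URL; each collected resource would then be fetched and
    # stored alongside the page to form the archive.
    for resource in sorted(collect_resources("https://example.com/")):
        print(resource)
```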
I am really slow at executing and experimenting around these questions. And I feel that unless I start working on it, I will keep collecting questions and adding more inertia against taking a crack at the problem.