Cleaning Sensei's Library Webpages for Offline Storage
Posted: Sat May 01, 2010 11:45 pm
by RobertJasiek
If you want to remove potentially dangerous files, as well as potentially dangerous (JavaScript, forms) or superfluous (header, footer, left pane, TOC) source code, from Sensei's Library webpages that you store offline, do the following before viewing the webpages in your HTML viewer or browser:
Delete all JavaScript *.js files:
On Windows, put all the files and their subdirectories in a directory, open the command line, go to that directory and type:
del /S *.js
The /S parameter makes del also delete matching files in all subdirectories.
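For readers who are not on Windows, the same step can be sketched in Python. This is only a minimal sketch: the function name delete_js_files is made up for illustration, and unlike del on Windows the match is case-sensitive on most other systems.

```python
from pathlib import Path


def delete_js_files(root):
    """Delete every *.js file below root, including all subdirectories.

    Hypothetical helper, not part of any tool mentioned in this thread.
    Returns the list of deleted paths so the result can be reviewed.
    """
    deleted = []
    # sorted() collects all matches first, so nothing is deleted mid-scan.
    for path in sorted(Path(root).rglob("*.js")):
        if path.is_file():
            path.unlink()
            deleted.append(path)
    return deleted
```

Calling delete_js_files("C:/sl-offline") (an example path) would then remove the scripts in that directory tree and leave every other file alone.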
Edit the source code by means of (regular) expressions as follows:
Use a program that allows batch processing of files and lists of (regular) expressions. As of 2010-05-02, use these expressions, substituting your program's own syntax for the placeholders FROM, TO and REPLACEBY:
Deleted text:
FROM <!-- TO -->
FROM <script TO </script>
FROM <div id="pageheaders"> TO </div>
FROM <table id="toc" TO </table>
FROM <form TO </form>
FROM <div class="editsection"> TO </div>
FROM <div class='editsection'> TO </div>
Replaced text:
FROM <div id="pgfooter"> TO </body> REPLACEBY </body>
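If no FROM-TO editor is at hand, the list above can be approximated with Python's re module. The following is a sketch under assumptions, not the tool used for this post: the names RULES and clean are invented, each FROM-TO pair is read as "shortest match from FROM to the next TO", and matching is case-insensitive. Because the match stops at the first TO, a div that contains another div would be cut off too early.

```python
import re

# (FROM, TO, REPLACEBY) triples copied from the list above.
RULES = [
    ("<!--", "-->", ""),
    ("<script", "</script>", ""),
    ('<div id="pageheaders">', "</div>", ""),
    ('<table id="toc"', "</table>", ""),
    ("<form", "</form>", ""),
    ('<div class="editsection">', "</div>", ""),
    ("<div class='editsection'>", "</div>", ""),
    ('<div id="pgfooter">', "</body>", "</body>"),
]


def clean(html):
    """Apply every rule in order and return the cleaned source code."""
    for start, end, replacement in RULES:
        # ".*?" is non-greedy: it stops at the first occurrence of TO.
        # DOTALL lets a match span several lines.
        pattern = re.compile(
            re.escape(start) + r".*?" + re.escape(end),
            re.DOTALL | re.IGNORECASE,
        )
        # A function is used so the replacement is inserted literally.
        html = pattern.sub(lambda match, r=replacement: r, html)
    return html
```

For batch processing, one would read each stored *.html file, pass its text through clean and write the result back.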
Re: Cleaning Sensei's Library Webpages for Offline Storage
Posted: Sun May 02, 2010 4:23 am
by Phelan
I know you're not a fan of Java, if I remember correctly, but have you tried
http://senseis.xmp.net/?SenseisLibraryOnTour or
http://senseis.xmp.net/?SLSnapshot ?
I don't know how different these are from your method.
Re: Cleaning Sensei's Library Webpages for Offline Storage
Posted: Sun May 02, 2010 5:02 am
by kirkmc
Maybe not the ideal place to post this; this should be in off-topic or something. Robert, you _can_ post in forums other than the Rules forum.

Re: Cleaning Sensei's Library Webpages for Offline Storage
Posted: Sun May 02, 2010 6:31 am
by Harleqin
You should not try to parse HTML with regular expressions, because HTML is not a regular language (please note the very specific meaning of "regular" here). Every popular language has a proper HTML parsing library.
Browsers usually support the disabling of JavaScript anyway. For Firefox, the NoScript addon gives you the ability to disable/enable it for selected sites.
Re: Cleaning Sensei's Library Webpages for Offline Storage
Posted: Sun May 02, 2010 6:49 am
by RobertJasiek
Snapshots are not suitable for me. I do not want the entire SL as a copy but only the pages that interest me.
This topic is hard to put in the right forum; I find Go Rules to be the most fitting because it is about getting the expressions aka rules right.
Actually I do not use classical regular expressions for the purpose but others might because it is much easier to find an RE editor than a FROM-TO expressions editor.
Disabling JavaScript does not prevent it from being stored locally. NoScript does that, but not everybody uses NoScript (I do not either). If one uses it, it may solve the JavaScript problem, but it does not treat the other undesired parts of a webpage.
Since I do not use NoScript, editing expressions of the source code is the most fitting approach for me. Presumably not for everybody but everybody has to know his preferred way anyway.
I think my expressions list is not complete for SL yet. Does somebody have a more complete one?
Re: Cleaning Sensei's Library Webpages for Offline Storage
Posted: Sun May 02, 2010 7:37 am
by Harleqin
I would not use a blacklist of things I do not want from a page, but a whitelist of the things I do want.
In other words, parse the HTML into a tree data structure (this is what an HTML parser does), then select the nodes of interest.
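A rough sketch of that idea with Python's standard html.parser module, which reports tags as events rather than building a full tree: it keeps only the first element with a given id and drops everything else. The class and function names are invented for illustration, and the id "content" in the usage line is a placeholder; the id that SL actually uses for the article body would have to be looked up in the page source.

```python
from html.parser import HTMLParser

# Elements that never have an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class KeepElement(HTMLParser):
    """Collect the markup of the first element whose id is wanted_id."""

    def __init__(self, wanted_id):
        super().__init__(convert_charrefs=False)
        self.wanted_id = wanted_id
        self.stack = []    # open tags inside the wanted element
        self.done = False  # True once the wanted element is closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if not self.stack and (tag in VOID
                               or dict(attrs).get("id") != self.wanted_id):
            return
        self.parts.append(self.get_starttag_text())
        if tag not in VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied but never stacked.
        if self.stack and not self.done:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or tag not in self.stack:
            return
        # Drop unclosed children (e.g. <p> without </p>) up to the match.
        while self.stack.pop() != tag:
            pass
        self.parts.append(f"</{tag}>")
        if not self.stack:
            self.done = True

    def handle_data(self, data):
        if self.stack and not self.done:
            self.parts.append(data)

    def handle_entityref(self, name):
        if self.stack and not self.done:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if self.stack and not self.done:
            self.parts.append(f"&#{name};")


def extract_element(html, wanted_id):
    """Return the markup of the element with id wanted_id, or ''."""
    parser = KeepElement(wanted_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Usage would be something like extract_element(page_source, "content"). Comments are dropped as a side effect, but scripts or forms inside the kept element are not, so a whitelist of allowed tags would still be needed on top of this.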
Re: Cleaning Sensei's Library Webpages for Offline Storage
Posted: Sun May 02, 2010 9:42 am
by RobertJasiek
Which short whitelist would work for all webpages? I do not know. Therefore I use a substitute for whitelisting: I look through the edited HTML source code in a plain text editor to check whether it still contains dubious tags.
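This manual check could be partly automated. The sketch below counts what is left over in an edited page, again with Python's html.parser. Which tags count as dubious is an assumption here: script and form come from the first post, while the other tag names and the on... event attributes are additions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed list of dubious tags; adjust to taste.
DUBIOUS_TAGS = {"script", "form", "iframe", "object", "embed", "applet"}


class DubiousTagFinder(HTMLParser):
    """Count dubious tags and inline event handlers such as onclick."""

    def __init__(self):
        super().__init__()
        self.found = Counter()

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive in lower case.
        if tag in DUBIOUS_TAGS:
            self.found[tag] += 1
        for name, _value in attrs:
            if name.startswith("on"):
                self.found[f"{name} attribute"] += 1


def find_dubious_tags(html):
    """Return a Counter of dubious items; it is empty for a clean page."""
    finder = DubiousTagFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```

An empty result does not prove that a page is harmless; it only means that nothing from the list was found.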
Re: Cleaning Sensei's Library Webpages for Offline Storage
Posted: Fri May 14, 2010 4:44 pm
by willemien
Maybe the easiest way is to start with the SL snapshot and copy from there everything you want to keep.