Re: httrack
wetroof said:
I'm just creating this thread, but I will probably post more if I get to exploring the program.
httrack is a command-line program that runs on Mac / Linux computers. It crawls/trawls (lol) the internet, certain websites etcetera, and downloads the pages into a browser-viewable format. I figure if parts of the internet go down in the future, this might be a useful tool. The main motivation I had for downloading the program and learning how to use it, which should be fairly simple, is that I had the idea to download the entire cassiopaea.org website along with the forums here. This raises some ethical questions, however. The program does allow bandwidth limits, which is a good thing. Anyway, that's all; I'll probably get back with some more information later.
About two months ago I used WinHTTrack to download the forum, the Cass site, and the Cass glossary for the cases where I won't have internet access. I don't know if the Mac/Linux version has the same settings, but it is advisable to set bandwidth limits, both as a courtesy and because you don't know whether it will eat all the bandwidth or overload the server. It will probably also depend on how many people are copying the site at the same time. I even had a scare at one point when the forum was inaccessible due to a server problem, and I thought that perhaps it was my fault.
But apparently it was not.
But this taught me that it is best to always ask first, especially if there are several people doing the download.
If time is not an issue for you, you can keep the program gradually downloading in the background with the limits set to the minimum. Perhaps I was overcautious and it would work just as well with more connections, but what I did was click on preferences and mirror options, click on "limits", and set max transfer rate to 5000 B/s and Max connections / seconds to 1. Then I clicked on "flow control" and set the number of persistent connections per second to 1 too. I didn't touch the size or time limits because I wanted all of it downloaded fully.
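For anyone on the command-line version, here is a rough sketch of how the same limits might be passed to httrack. It is only an illustration: the function name, the example URL, and the output folder are made up, the option names are the ones listed in the httrack manual as far as I know, and mapping the "flow control" setting to `--sockets` is my assumption, not something I have verified against WinHTTrack.

```python
import shlex


def build_httrack_command(url, out_dir, max_rate=5000, conn_per_sec=1, sockets=1):
    """Return an httrack argument list for a gently throttled mirror.

    max_rate     -- maximum transfer rate in bytes per second
    conn_per_sec -- maximum number of new connections per second
    sockets      -- number of simultaneous connections (assumed equivalent
                    of the "flow control" setting in WinHTTrack)
    """
    return [
        "httrack", url,
        "-O", out_dir,                                # where the mirror is written
        f"--max-rate={max_rate}",
        f"--connection-per-second={conn_per_sec}",
        f"--sockets={sockets}",
    ]


if __name__ == "__main__":
    # Placeholder URL and folder; print the command instead of running it.
    cmd = build_httrack_command("http://www.example.com/", "./mirror")
    print(shlex.join(cmd))
```

Printing the command first lets you check the limits before anything touches the server; you can then paste it into a terminal or hand the list to `subprocess.run`.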
One thing to remember is that if you don't set the mirroring depth, it will download the whole site, including images and the pages that the site links to. That is cool if you are downloading the forum and want all the images to be displayed, or want to be able to see the content of the added links, but it will increase the overall download size considerably (by gigabytes). I personally didn't mind, so it depends on your own preferences.
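To get a feel for what "gigabytes" means at a 5000 B/s limit, here is a small back-of-the-envelope calculation. The sizes are made-up examples, not the actual size of the forum, and it ignores pauses and protocol overhead. At that rate 1 GB (taken as 10^9 bytes) works out to roughly 2.3 days.

```python
SECONDS_PER_DAY = 86_400


def mirror_duration_days(total_bytes, rate_bytes_per_sec=5000):
    """Estimate how many days a download takes at a fixed transfer rate."""
    seconds = total_bytes / rate_bytes_per_sec
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical mirror sizes, with 1 GB taken as 10**9 bytes.
    for gigabytes in (1, 2, 5):
        days = mirror_duration_days(gigabytes * 10**9)
        print(f"{gigabytes} GB at 5000 B/s: about {days:.1f} days")
```

So leaving it running in the background really does mean days or weeks if the mirror pulls in a lot of linked content.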
Also, note that if you download the Cass site, it will download the forum too.
And here is a link to the list that explains what not to do. The HTTrack site also has detailed descriptions with screenshots of how to set everything. Hope it helps.
_http://www.httrack.com/html/abuse.html
Advice & what not to do
Please follow these common sense rules to avoid any network abuse
* Do not overload the websites!
Downloading a site can overload it, if you have a fast pipe, or if you capture too many simultaneous cgi (dynamically generated pages).
o Do not download too large websites: use filters
o Do not use too many simultaneous connections
o Use bandwidth limits
o Use connection limits
o Use size limits
o Use time limits
o Only disable robots.txt rules with great care
o Try not to download during working hours
o Check your mirror transfer rate/size
o For large mirrors, first ask the webmaster of the site
* Ensure that you can copy the website
o Are the pages copyrighted?
o Can you copy them only for private purpose?
o Do not make online mirrors unless you are authorized to do so
* Do not overload your network
o Is your (corporate, private..) network connected through dialup ISP?
o Is your network bandwidth limited (and expensive)?
o Are you slowing down the traffic?
* Do not steal private information
o Do not grab emails
o Do not grab private information
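As a rough illustration of the size, time, and depth limits from the list above, the sketch below turns them into httrack command-line options. The helper name and the example values are hypothetical; the option names (`--max-size`, `--max-time`, `--depth`) are the ones given in the httrack manual as far as I know, so check `httrack --help` on your own version before relying on them.

```python
def limit_flags(max_size_bytes=None, max_time_sec=None, depth=None):
    """Build optional httrack limit flags; any limit left as None is skipped."""
    flags = []
    if max_size_bytes is not None:
        flags.append(f"--max-size={max_size_bytes}")   # overall mirror size cap
    if max_time_sec is not None:
        flags.append(f"--max-time={max_time_sec}")     # overall mirror time cap
    if depth is not None:
        flags.append(f"--depth={depth}")               # how many links deep to follow
    return flags


if __name__ == "__main__":
    # Example values only: stop at 500 MB or one hour, and follow links 3 levels deep.
    print(limit_flags(max_size_bytes=500_000_000, max_time_sec=3600, depth=3))
```

These flags would simply be appended to whatever httrack command you already use.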