Download posts/comments from PL with this program!
Hi all,
I am putting this on hold for now, as it has been claimed that my program took down the site.
1. I am incredibly sorry to all of you, Michael and Jon included, if this is indeed the case.
2. I have reasons to believe that it is not the case, as I explain in the comment below (posted under My View).
Be 2) as it may, if I am wrong about this, my apology stands.
OK - My View
I had hoped that this program would be a godsend to those who wished to save their posts (like Kathleen Gee, who simply wished to get her posts so she could re-post them on her blog).
However, Jon is blaming my program for the downing of PL, and I feel the need to communicate my view here. Let me state up front that I have never had a problem with Jon in any way, and found him to be one of the best mods on the site over the years.
First, I would like to point out that there were, in total, 10 downloads of my program (the total is actually 13 at the time of this message, but 3 of them were me, testing on other computers).
The way the software worked was that you would login, then type in the username of the user whose posts you wanted to download.
The software would then issue 1 webpage request to PL, for the user's profile page.
If the page existed (and thus the user existed also), then the program would send 1 more webpage request to PL, asking for their user/posts page.
It would then use the returned data to detect how many pages of posts existed for that user, and then place, on a request queue, a request for each page of posts that existed. 24 pages of posts, 24 requests. On a queue, though. Only 6 were able to run at one time. This was not my decision, although I was comfortable with it, as I didn't want the site to go down due to my program flooding the server, similar to a DDoS attack. Wow, if this is true, I've discovered a way to take down websites with only 10 copies of a program... Hmmmm.
The real reason for this limit of 6 requests at a time is that I used the Qt application framework to build my software. Qt, I assume to avoid crashing websites with DDoS-style floods of simultaneous HTTP requests, has a hard limit (6) on how many HTTP requests can run simultaneously against one host. The rest are queued.
"Note: QNetworkAccessManager queues the requests it receives. The number of requests executed in parallel is dependent on the protocol. Currently, for the HTTP protocol on desktop platforms, 6 requests are executed in parallel for one host/port combination." - http://doc.qt.io/qt-5/qnetworkaccessmanager.html#details
(Now, there is of course a way around that, but I didn't do that, i.e. I used only one QNetworkAccessManager.) So, please, anyone who codes, my code is still there at https://github.com/team2e16/PostRetrievePL. Download it, check it, and either tell me I'm wrong, or...
After it retrieved each page (6 at a time, and always waiting for the Website to reply before issuing new requests), it would use the returned webpage data to find a link to each post on said page, and make a list of each post's links, to later download. I was so worried about crashing the site that I purposely avoided immediately downloading each post, and instead made a list of links, as I wanted the previous requests to be finished with by the PL website, before I issued more.
To that end, at each stage, I had the next part of the software running a 5 second timer, which would then check if all requests had been replied to by the website. If they had, the program would continue to the next stage. If they hadn't, the program would wait another 5 seconds for the requests to finish, and check again.
The next stage sent the webpage requests from the posts list to the website, again six at a time, and again waiting for the website to reply. If there were 24 pages of posts, the total number of requests would be roughly 240 (10 posts per page). But again, it's not 240 at once. It's 6, then wait for the reply (just as you do in your browser), then 6, then wait, then 6, then wait. That's why it took me about 15 minutes to download Emalvini's entire posts/comments. Because the program doesn't, and in fact can't, generate more than 6 concurrent http requests.
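For anyone who wants to see what a cap of 6 in-flight requests looks like in practice, here is a minimal sketch. It is a Python analogue, not my actual Qt code: `asyncio.Semaphore` stands in for the internal queue of QNetworkAccessManager, `fake_request` is a hypothetical placeholder that sleeps instead of touching the network, and the 240 request names are made up to match the 24-page example above.

```python
import asyncio

MAX_PARALLEL = 6  # the per-host cap quoted from the Qt documentation


async def fake_request(name, gate, stats):
    """Hypothetical stand-in for one HTTP request; no network traffic is sent."""
    async with gate:  # waits here while 6 requests are already in flight
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.001)  # pretend delay before the server replies
        stats["in_flight"] -= 1
    return name


async def fetch_all(names):
    gate = asyncio.Semaphore(MAX_PARALLEL)
    stats = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(
        *(fake_request(n, gate, stats) for n in names)
    )
    return results, stats["peak"]


def run_demo(n_requests=240):
    # 24 pages of posts at 10 posts per page gives roughly 240 queued requests.
    names = [f"post-{i}" for i in range(n_requests)]
    return asyncio.run(fetch_all(names))


if __name__ == "__main__":
    results, peak = run_demo()
    print(len(results), "requests completed; peak in flight:", peak)
```

All 240 requests are queued immediately, but the recorded peak never goes above 6, because each new request has to wait for an earlier one to finish.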
Then the program would move on to comments.
a) request the user's comments page and check the total pages of comments
b) The comments were stored directly on each comments page, so if you had 10 pages of comments, then my program only made 10 more requests. The highest number I saw when downloading was approximately 100 pages, so about 1000 comments. But only 100 HTTP requests.
So, a practical example.
I downloaded Emalvini's entire posting and commenting history. 1 http request for his profile page, 1 for his posts page, and around 34 requests for his posts pages, 340 requests for his actual posts, and about 100 requests for his 1000 or so comments (10 per page).
So that's 476 requests. It took around 15 minutes, from memory. I went away and made coffee, talked to my kids, and came back. 15 minutes is 900 seconds.
476 requests in 900 seconds is: 0.53 requests per second. So about 1 request every two seconds.
The main reason for this is not that my program couldn't send requests faster (at least in blocks of 6). The massive limitation on how many requests I could make was how fast the PL website replied, because only 6 requests would be issued at a time, and it would take between 0.5 and 2 seconds to receive a reply from PL.
Now let's, for the sake of argument, assume that my program was issuing requests three times as fast as this, i.e. that it took only 5 minutes to download Emalvini's stuff.
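The tally and rate above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the variable names are mine, and the inputs are the approximate figures from the Emalvini example (34 pages of posts, 10 posts per page, about 100 pages of comments, about 15 minutes).

```python
# Approximate figures from the Emalvini download described above.
profile_requests = 1        # user's profile page
posts_index_requests = 1    # user/posts page
post_list_pages = 34        # pages listing his posts
posts_per_page = 10
comment_pages = 100         # comments are read straight off these pages

post_requests = post_list_pages * posts_per_page
total_requests = (
    profile_requests
    + posts_index_requests
    + post_list_pages
    + post_requests
    + comment_pages
)

elapsed_seconds = 15 * 60   # "around 15 minutes, from memory"
requests_per_second = total_requests / elapsed_seconds

print(total_requests)                  # 476
print(round(requests_per_second, 2))   # 0.53
```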
So now there would be 1.59 http requests per second. Scary stuff, indeed. Pretty sure I can beat that by splitting chrome into two separate windows and clicking refresh every second.....
And now add in the other 10 copies of the program that people downloaded.
Let's assume:
1) They all got it to work (doubtful, because shortly before the site went down, I posted links to an updated version of the software for Windows 7, because it wasn't working for Windows 7 users)
2) They were on their computers from the moment they got the software until the moment of the crash, constantly feeding in new usernames to download posts/comments from, without any breaks whatsoever.
3) They also managed 1.6 requests per second.
And we now have a combined total of 17.6 requests per second. Very, very, worst case scenario. What was it that Jon said in his email to Michael?
"Multiply this by a few people and there are hundreds of heavy requests a second, causing all issues being logged on the server."
If hundreds of people had downloaded my software, then yes. This could indeed be the case. But only 10 did......
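The worst-case estimate above can be reproduced as follows. Again this is just a sketch of the arithmetic with my own variable names, and it bakes in the same generous assumptions: every copy works, runs non-stop, and manages three times the speed I actually observed.

```python
total_requests = 476            # Emalvini example
assumed_seconds = 5 * 60        # pretend it took 5 minutes instead of 15

faster_rate = total_requests / assumed_seconds
per_copy_rate = 1.6             # the figure above, rounded up from ~1.59

other_copies = 10               # downloads by other people
my_copy = 1
combined_rate = (other_copies + my_copy) * per_copy_rate

print(round(faster_rate, 2))    # 1.59
print(round(combined_rate, 1))  # 17.6
```

Even with every assumption stacked in favour of a high number, the result stays in the tens of requests per second, not the hundreds.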
Finally, if we assume that each error message is around 300 bytes long, then at the calculated rate, 11 users including myself going continuously could rack up around 5KB per second of error messages (or 18MB per hour, or 432MB per day, or around 0.9GB of error messages between the time my program was released and the time the site went down), assuming that every single HTTP request was logged as an error. I don't see how that could happen, unless my program wasn't actually working, which it was, and I have 40 users' posts and comments to prove it. Total size of all of these posts and comments? 30MB...
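Here is the same log-volume estimate as a small Python sketch, so the rounding is visible. The 300 bytes per error message is my guess from above, I am assuming decimal units (1KB = 1000 bytes), and the "implied days" figure is only what falls out of dividing 0.9GB by the daily figure, not a measured uptime.

```python
BYTES_PER_ERROR = 300          # assumed size of one logged error message
REQUESTS_PER_SECOND = 17.6     # worst-case combined rate from above

exact_bytes_per_second = BYTES_PER_ERROR * REQUESTS_PER_SECOND
rounded_kb_per_second = 5      # about 5280 bytes/s, rounded down to 5KB/s

mb_per_hour = rounded_kb_per_second * 3600 / 1000
mb_per_day = mb_per_hour * 24
implied_days = 900 / mb_per_day   # days needed to reach roughly 0.9GB

print(round(exact_bytes_per_second))  # 5280
print(mb_per_hour)                    # 18.0
print(mb_per_day)                     # 432.0
print(round(implied_days, 1))         # 2.1
```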
Second Issue:
Jon 'quoted' me in his email. This is what he sent to Michael:
'They even stated "you're going to have to wait, this slows things down..."'
This is completely untrue. I have a copy of my original post (thanks to my program lol). This is what I said.
"Press the 'Gather Posts and Comments' button. Patience is required here. The program now sifts through all of the users posts and comments pages, and then downloads and extracts the information from each one."
Now, I'll give Jon the benefit of the doubt and assume he is just stressed, fixing a server on Xmas eve, and didn't get what I meant.
I didn't mean that the site would slow down. I meant that it takes time to download all of the relevant pages from PL, precisely because, instead of requesting every single page at once, my program grabs them 6 at a time and then has to wait for PL to reply.
However:
One thing that is possible is that, when I was working on my program, I may have been generating many error messages during this process. Also, after I finished my first program, I was rushing out a program that could download entire threads (not by user, but by downloading the post and all conversation in the comments below, and stitching it all into one monolithic HTML page).
It is possible that this is the case; that my testing of my software has caused the problem. But again, I can't see any conceivable way, whether via ten users plus me, or in testing, that my software generated 'hundreds of requests per second'.
So in order for this to be true, 1) the available hard drive space must have been already very low, and 2) the software (Drupal, in the case of PL) must not have sent a warning email to the responsible parties to warn of impending problems (hard drive space nearly full, lots of errors).
My theory:
I assume that I am correct on the above, and haven't overlooked something. (Possible, but I don't think so)
I assume that there is no funny business going on with Michael/Jon. I want to make this clear. I don't think they're pulling the plug early for some unknown reason.
I assume that, given the site is to be shut down within 1 more week, no further hard drive space was to be supplied to the site, and that Michael may have requested the site be 'drawn down' slowly, to save costs, or whatever. That would make it possible for my program, and perhaps the influx of members we haven't seen in quite some time (some names I've never seen) after the shutdown announcement, to push it over the brink.
Again, if my software or testing of the software caused the site to go down, and I am solely responsible for downing a well-equipped (sufficient hard drive space) and much-loved website prematurely at the cusp of its imminent shutdown, I am sincerely sorry to Michael, and Jon, and the entire DP/PL community.