Download posts/comments from PL with this program!

enemyofthestate1776 Tue, 12/22/2015 - 19:13

Hi all,
I am putting this on hold for now, as it has been claimed that my program took down the site.

1. I am incredibly sorry to all of you, Michael and Jon included, if this is indeed the case.

2. I have reasons to believe that it is not the case, as I explain in the comment below (posted under My View).

Be 2) as it may, if I am wrong about this, my apology stands.

OK - My View

I had hoped that this program would be a godsend to those who wished to save their posts (like Kathleen Gee, who simply wished to get her posts so she could re-post them on her blog).

However, Jon is blaming my program for the downing of PL, and I feel the need to communicate my view here. First I would like to state that I have never had a problem with Jon in any way, and found him to be one of the best mods on the site over the years.

First, I would like to point out that there were, in total, 10 downloads of my program (the total is actually 13 at the time of this message, but 3 of them were me, testing on other computers). 

The way the software worked was that you would log in, then type in the username of the user whose posts you wanted to download.

The software would then issue 1 webpage request to PL, for the user's profile page.

If the page existed (and thus the user existed also), then the program would send 1 more webpage request to PL, asking for their user/posts page. 

It would then use the returned data to detect how many pages of posts existed for that user, and then place, on a request queue, a request for each page of posts that existed. 24 pages of posts, 24 requests. On a queue, though. Only 6 were able to run at one time. This was not my decision, although I was comfortable with it, as I didn't want the site to go down due to my program flooding the server, similar to DDOS. Wow, if this is true, I've discovered a way to take down websites with only 10 copies of a program..... Hmmmm.

The real reason for this limitation to 6 requests at a time is that I used the Qt application framework to build my software. Qt, I assume to avoid crashing websites via DDOS attacks or too many simultaneous HTTP requests, has a hard limit (6) on how many HTTP requests can run simultaneously against one host. The rest are queued.

"Note: QNetworkAccessManager queues the requests it receives. The number of requests executed in parallel is dependent on the protocol. Currently, for the HTTP protocol on desktop platforms, 6 requests are executed in parallel for one host/port combination." - http://doc.qt.io/qt-5/qnetworkaccessmanager.html#details

(Now, there is of course a way around that, but I didn't do that; i.e. I used only one QNetworkAccessManager. So, please, anyone who codes, my code is still there at https://github.com/team2e16/PostRetrievePL. Download it, check it, and either tell me I'm wrong, or...)
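As a rough illustration of what a cap of 6 parallel requests means in practice, here is a minimal Python sketch. It is not the program's code (that uses Qt's QNetworkAccessManager); `fake_fetch`, `fetch_all`, `MAX_PARALLEL` and the 10 ms delay are made-up names and values, and nothing is sent over the network.

```python
import asyncio

MAX_PARALLEL = 6  # mirrors the per-host limit quoted from the Qt docs


async def fake_fetch(page, gate, stats):
    """Pretend to request one page; the semaphore admits at most MAX_PARALLEL at once."""
    async with gate:
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        await asyncio.sleep(0.01)  # stand-in for the server's reply time
        stats["running"] -= 1
    return page


async def fetch_all(n_pages):
    """Queue one simulated request per page and report the peak concurrency seen."""
    gate = asyncio.Semaphore(MAX_PARALLEL)
    stats = {"running": 0, "peak": 0}
    pages = await asyncio.gather(
        *(fake_fetch(p, gate, stats) for p in range(1, n_pages + 1))
    )
    return pages, stats["peak"]


pages, peak = asyncio.run(fetch_all(24))
print(len(pages), "pages fetched, peak concurrency", peak)
```

All 24 requests are queued up front, but the peak never rises above 6; the rest wait their turn.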

After it retrieved each page (6 at a time, and always waiting for the Website to reply before issuing new requests), it would use the returned webpage data to find a link to each post on said page, and make a list of each post's links, to later download. I was so worried about crashing the site that I purposely avoided immediately downloading each post, and instead made a list of links, as I wanted the previous requests to be finished with by the PL website, before I issued more.
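A minimal sketch of that link-gathering step, in Python for brevity rather than the Qt code the program actually uses. The class name `PostLinkCollector`, the `/node/` link prefix and the sample HTML are assumptions for illustration only; PL's real markup may differ.

```python
from html.parser import HTMLParser


class PostLinkCollector(HTMLParser):
    """Collect unique hrefs that start with a given prefix, in page order."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith(self.prefix) and href not in self.links:
            self.links.append(href)


# Hypothetical listing page: two post links and one unrelated link.
sample = (
    '<a href="/node/101">First</a> '
    '<a href="/user/5">Me</a> '
    '<a href="/node/102">Second</a>'
)
collector = PostLinkCollector("/node/")
collector.feed(sample)
print(collector.links)
```

Nothing is downloaded at this point; the list is only built up so that the posts can be requested in a later stage.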

To that end, at each stage, I had the next part of the software running a 5 second timer, which would then check if all requests had been replied to by the website. If they had, the program would continue to the next stage. If they hadn't, the program would wait another 5 seconds for the requests to finish, and check again.
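The timer logic can be sketched roughly as follows. This is a simplified Python stand-in, not the Qt timer code itself; `wait_until_done`, the `pending` callback and the fake counts are hypothetical, and the demo swaps the real sleep for a no-op so it finishes instantly.

```python
import time


def wait_until_done(pending, poll_seconds=5.0, sleep=time.sleep):
    """Wait poll_seconds, then check; repeat until no requests are outstanding.

    Returns how many waits were needed.
    """
    waits = 0
    while True:
        sleep(poll_seconds)
        waits += 1
        if pending() == 0:
            return waits


# Demo: pretend 12, then 6, then 0 requests are still unanswered at each check.
pending_counts = iter([12, 6, 0])
checks = wait_until_done(lambda: next(pending_counts), sleep=lambda seconds: None)
print(checks)
```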

The next stage sent the webpage requests from the posts list to the website, again six at a time, and again waiting for the website to reply. If there were 24 pages of posts, the total number of requests would be roughly 240 (10 posts per page). But again, it's not 240 at once. It's 6, then wait for the reply (just as you do in your browser), then 6, then wait, then 6, then wait. That's why it took me about 15 minutes to download Emalvini's entire posts/comments. Because the program doesn't, and in fact can't, generate more than 6 concurrent http requests.

Then the program would move on to comments.

a) request the user's comments page and check the total pages of comments

b) The comments were stored directly on each comments page, so if you had 10 pages of comments, then my program only made 10 more requests. The highest number I saw when downloading, was 100 pages approx, so about 1000 comments. But only 100 http requests.

So practical example. 

I downloaded Emalvini's entire posting and commenting history. 1 http request for his profile page, 1 for his posts page, and around 34 requests for his posts pages, 340 requests for his actual posts, and about 100 requests for his 1000 or so comments (10 per page). 

So that's 476 requests. It took around 15 minutes, from memory. I went away and made coffee, talked to my kids, and came back. 15 minutes is 900 seconds.

476 requests in 900 seconds is: 0.53 requests per second. So about 1 request every two seconds.
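For anyone who wants to check the arithmetic, here it is as a few lines of Python. The counts are the approximate ones given above and the 15 minutes is from memory, so treat the result as a ballpark figure.

```python
# Approximate request counts for the Emalvini download, as listed above.
requests_made = {
    "profile page": 1,
    "posts index page": 1,
    "pages of posts": 34,
    "individual posts": 340,
    "pages of comments": 100,
}

total = sum(requests_made.values())
elapsed_seconds = 15 * 60
rate = total / elapsed_seconds

print(total, "requests,", round(rate, 2), "per second")
```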

The main reason for this is not that my program couldn't send requests faster (at least in blocks of 6). The real limitation on how many requests I could make was how fast the PL website replied, because only 6 requests would be outstanding at once, and it would take between 0.5 and 2 seconds to receive a reply from PL.

Now let's, for the sake of argument, assume that my program was issuing requests three times as fast as this, i.e. that it took only 5 minutes to download Emalvini's stuff.

So now there would be 1.59 http requests per second. Scary stuff, indeed. Pretty sure I can beat that by splitting chrome into two separate windows and clicking refresh every second.....

And now add in the other 10 copies of the program that people downloaded.

Let's assume:

1) They all got it to work (doubtful, because shortly before the site went down, I posted links to an updated version of the software for Windows 7, because it wasn't working for Windows 7 users)

2) They were on their computers from the moment they got the software until the moment of the crash, constantly feeding in new usernames to download posts/comments from, without any breaks whatsoever.

3) They also managed 1.6 requests per second.

And we now have a combined total of 17.6 requests per second. Very, very, worst case scenario. What was it that Jon said in his email to Michael?

"Multiply this by a few people and there are hundreds of heavy requests a second, causing all issues being logged on the server."

If hundreds of people had downloaded my software, then yes. This could indeed be the case. But only 10 did......
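The worst-case numbers can be reproduced with a short Python sketch. Reading "hundreds of requests a second" as at least 200 per second is an assumption made here purely to put a figure on it; the other values come from the scenario above.

```python
import math

single_user_rate = 476 / (5 * 60)   # the "three times as fast" case
assumed_rate = 1.6                  # rounded up, as in assumption 3
copies = 11                         # the 10 downloads plus the author's own copy
combined_rate = copies * assumed_rate

# Hypothetical threshold: treat "hundreds" as at least 200 requests per second.
copies_for_200 = math.ceil(200 / assumed_rate)

print(round(single_user_rate, 2), round(combined_rate, 1), copies_for_200)
```

On these assumptions it would take about 125 copies running flat out, not 11, to reach even 200 requests a second.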

Finally, if we assume that each error message is around 300 bytes long, then at the calculated rate, 11 users including myself going continuously could rack up around 5KB per second of error messages (or 18MB per hour, or 432MB per day, or around 0.9GB of error messages between the time my program was released and the time the site went down), assuming that every single http request was logged as an error. I don't see how that could happen, unless my program wasn't actually working, which it was, and I have 40 users' posts and comments to prove it. Total size of all of these posts and comments? 30MB...
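The same estimate as a small Python calculation. The 300 bytes per error and the 17.6 requests per second are the assumptions stated above; the 48-hour window is a hypothetical stand-in for "between release and outage". The figures in the text round to 5KB per second first, which is why they come out slightly lower.

```python
BYTES_PER_ERROR = 300        # assumed average size of one logged error
REQUESTS_PER_SECOND = 17.6   # worst-case combined rate
HOURS_ONLINE = 48            # hypothetical: roughly two days from release to outage

bytes_per_second = round(BYTES_PER_ERROR * REQUESTS_PER_SECOND)
mb_per_hour = bytes_per_second * 3600 / 1_000_000
mb_per_day = mb_per_hour * 24
gb_total = mb_per_hour * HOURS_ONLINE / 1000

print(bytes_per_second, round(mb_per_hour, 1), round(mb_per_day), round(gb_total, 2))
```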

Second Issue:

Jon 'quoted' me in his email. This is what he sent to Michael:

'They even stated "you're going to have to wait, this slows things down..."'
This is completely untrue. I have a copy of my original post (thanks to my program lol). This is what I said.

"Press the 'Gather Posts and Comments' button. Patience is required here. The program now sifts through all of the users posts and comments pages, and then downloads and extracts the information from each one."

Now, I'll give Jon the benefit of the doubt and assume he just is stressed, fixing a server on Xmas eve, and didn't get what I meant. 

I didn't mean that the site would slow down. I meant that it takes time to download all of the relevant pages from PL, actually because instead of requesting every single page at once, my program grabs them 6 at a time, and then has to wait for PL's delay in reply.

However:

One thing that is possible is that, when I was working on my program, I may have been generating many error messages during this process. Also, after I finished my first program, I was rushing out a program that could download entire threads (not by user, but by downloading the post and all conversation in the comments below, and stitching it all into one monolithic HTML page).

It is possible that this is the case; that my testing of my software has caused the problem. But again, I can't see any conceivable way, whether via ten users plus me, or in testing, that my software generated 'hundreds of requests per second'. 

So in order for this to be true, 1) the available hard drive space must have been already very low, and 2) the software (Drupal, in the case of PL) must have not sent a warning email to the responsible parties to warn of impending problems (hard drive space nearly full, lots of errors).
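The kind of early warning meant in point 2 could look something like this minimal Python sketch. It is only an illustration, not how Drupal or the PL server was set up; `low_on_space` and the 10% threshold are hypothetical, while `shutil.disk_usage` is a standard-library call.

```python
import shutil
from collections import namedtuple


def low_on_space(path, min_free_fraction=0.10, usage=shutil.disk_usage):
    """Return True when the free share of the disk holding `path` is below the threshold."""
    stats = usage(path)
    return stats.free / stats.total < min_free_fraction


# Demo with made-up numbers so the result does not depend on the local disk.
FakeUsage = namedtuple("FakeUsage", "total used free")
nearly_full = lambda path: FakeUsage(total=100, used=95, free=5)
half_full = lambda path: FakeUsage(total=100, used=50, free=50)

print(low_on_space("/", usage=nearly_full))  # 5% free
print(low_on_space("/", usage=half_full))    # 50% free
```

A check like this run on a schedule, with an email sent when it returns True, would give the people responsible some notice before the disk actually fills.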

My theory:

I assume that I am correct on the above, and haven't overlooked something. (Possible, but I don't think so)

I assume that there is no funny business going on with Michael/Jon. I want to make this clear. I don't think they're pulling the plug early for some unknown reason.

I assume that, given the site is to be shut down within 1 more week, no further hard drive space was to be supplied to the site, and that Michael may have requested the site be 'drawn down' slowly, to save costs, or whatever. That would make it possible for my program (and perhaps the influx, after the shutdown announcement, of members we haven't seen in quite some time, including some names I've never seen) to push it over the brink.

Again, if my software or testing of the software caused the site to go down, and I am solely responsible for downing a well-equipped (sufficient hard drive space) and much-loved website prematurely at the cusp of its imminent shutdown, I am sincerely sorry to Michael, and Jon, and the entire DP/PL community.

"I know not what course others may take; but as for me, give me liberty or give me death!" - Patrick Henry
Tbone's picture

Letter on PL by M. Nystrom

=================================

P O P U L A R - L I B E R T Y

December 24, 2015

Hey friends, Merry Christmas!

I woke up this morning to find an email from Jon that read:

So...update:

Someone posted a program for people to download & archive the site themselves, by requesting pages they want directly. They even stated "you're going to have to wait, this slows things down..." Multiply this by a few people and there are hundreds of heavy requests a second, causing all issues being logged on the server.

That's happened, and it filled up the disk with error messages, basically. I can't even get in.

I thought a simple fix would involve restarting to get a connection in and clear up the disk, or enlarge the disk, but neither of those worked, and it's not up now...

I'm sorry the site is effectively down. Maybe you'd want to point it somewhere else for the time being.

Well, to quote a famous internet meme, this is why we can't have nice things. Ha ha ha.

We'll try to get the archive back up, or at least that page of new projects that everyone is working on, sometime after the Christmas break.

If you want to be notified, enter your email here, or just check back.

Sorry this happened, but well, things happen.

Merry Christmas everyone and thank you for everything! I'll be off now to enjoy a nice glass of egg nog...

Michael Nystrom
 

M E R R Y - C H R I S T M A S !

john2k's picture

Seems to be in line with the error messages that I saw. One error message stated error 28 from the database, which means it's a space issue. Another error message stated that the watchdog table was full. Watchdog is a logging feature of Drupal that helps website admins see what's going on with their site. So it seems that it could definitely be related.
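For what it's worth, 28 is the operating system's error number for a full disk on Linux and similar systems, which a couple of lines of Python can confirm. That the database was passing this OS error through is an assumption based on the message described above.

```python
import errno
import os

# errno 28 is ENOSPC; os.strerror gives the operating system's own wording for it.
print(errno.errorcode[28], "-", os.strerror(28))
```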

On another note, I wonder if the server hard drive or database hard drive partition was already close to being full with maybe enough space to get the site through Dec 31st... but this issue pushed it to the limit sooner than expected.

@dpc_network  DP Community Network on Twitter
View the DP/PL member directory and connect with others.

VR's picture

you nailed it

All our knowledge begins with the senses, proceeds then to the understanding, and ends with reason. There is nothing higher than reason.
Immanuel Kant

enemyofthestate1776's picture

bump

"I know not what course others may take; but as for me, give me liberty or give me death!" - Patrick Henry

ecard71's picture

I do not blame you in any way shape or form. Your intentions were noble and kind in trying to help others when it was rudely denied elsewhere (There's nothing wrong with saying "NO", but it doesn't have to be said in such a mean spirited way).

Secondly, it appears you were responsible in taking the precautions necessary before sharing it. It's funny you mentioned emalvini, because I was going to jokingly suggest it were his posts alone that brought the entire site down (thanks for ruining it enemy). LOL

Now onto my questions;

I was one of the users that downloaded your program (Windows 7), and though I was successful in downloading my content, it's pretty unreadable. You mentioned you later put up a version for Windows 7 users. Did you take those same precautions, or were they not necessary? Is it possible that the Windows 7 users alone infinitely multiplied the errors? I ask this only because I remember about 10 years ago I was having errors (don't think it was a virus) that caused my system to open up a new window continuously about every second on its own until it crashed (I had XP at the time), or until I used "Ctrl+Alt+Del".

Is it possible that someone could have tried to modify your program be it because they had a Windows 7 program & thought they could fix it, because they thought that maybe there was a way to download EVERYTHING & not just THEIR posts/comments (maybe they were unaware of web crawlers), or (and I hate to say it, but possible) someone maliciously altered your program? I ask this because I know people download fully licensed programs all the time, and then change & tweak them leaving them with a "free" fully licensed program.

The "Hundreds of users" also puzzles me, because if I were to judge from both "goodbye" posts that Michael put up, most of the people on those and other threads were saying things like "it's just a forum", and "what's the big deal, they're only words, just move on". Aside from archiving the site, I only saw one or two members asking if there was a way to save/download their posts. Keeping you at your word that there were only 10 downloads, where did the "Hundreds" of users come from? My guess is that's very possibly just an assumption on their part, barring of course someone was able to manipulate your program and maybe hack into the site (people have claimed their passwords have been hacked before).

In conclusion, though it did not personally work for me as I'd hoped, I do want to thank you for your kind intentions. It was very nice of you to take time to help others out.

 

I STILL STAND WITH RAND!

enemyofthestate1776's picture

They are much appreciated.

Answers to questions:

"and though I was successful in downloading my content, it's pretty unreadable" - Are you referring to the CSV files? (like ecard71-posts.csv) If you are, there was a drop-down box to choose to save as TXT files, which were very human-readable. Also, if you are talking about the CSV files, did you keep them? If you want simple TXT copies of them, you could upload them somewhere, heck email them to me, and I'll write a small utility to convert them into the TXT format, convert them and send them back, so you can read them. If you downloaded multiple user's posts/comments, let me know the usernames, as I may already have them, and can email you back with the list of files I need you to send me for conversion.

"Windows 7 users. Did you take those same precautions" - The program was not modified in any way, shape or form for Windows 7; all of the versions were compiled directly from the same source code. The only difference is that the code needed to be compiled on Windows 7 to run on Windows 7 (and 8 on 8, Linux on Linux etc), and each different version needed the correct copy of the stock system/Qt libraries that had been built for the particular platform (those are all the other .dll files you saw in the folder you unzipped).

"Is it possible that someone could have tried to modify your program" - Yes, it would have been possible, as I released the code (to ensure that people had access to it, so they could make sure I wasn't emailing myself all of your passwords). However, my github account shows that there was only one downloader of the code: me. It's extremely unlikely, but possible, that they could 'reverse engineer' the code, but what would be the point when I had already posted the link to the source code? Also, if an interested outside party wished to take down the site on purpose, they could just use normal methods of DDOS to bring it down; no need to try to hide behind my program. I would think them using my program would have been just an annoyance, really. But again, never say never, I guess. Every assassination needs a patsy :)

""Hundreds of users"" - To be fair, 'Hundreds of users' was my phrase, not Jon's. He asserted that the program, used by a 'few' users was making 'hundreds of heavy requests a second', which is also something I find hard to believe. The core part of my skepticism comes from not seeing how 11 people running a program could bring a website down, when every HTTP reply was 200 (after testing: as I said, with poor enough hard drive space, it's possible that my errors caused during testing could have at least contributed to the downing of PL), which means OK, which means No error... Every ten or so attempts to get the very first page from PL (and this program would do it once per run), the website wouldn't reply with anything. Assuming that logged 1 error each time, then my total errors in my use of the program should have amounted to less than 40 (the number of users I downloaded posts/comments from). But the testing? Who knows? Personally, though, I do think that a website should be able to handle malicious attacks that cause errors on purpose. For example, from 

https://www.owasp.org/index.php/Error_Handling,_Auditing_and_Logging

"How to protect yourself

Simply be aware of this type of attack, take every security violation seriously, always get to the bottom of the cause event log errors rather, and don't just dismiss errors unless you can be completely sure that you know it to be a technical problem.

Denial of Service

By repeatedly hitting an application with requests that cause log entries, multiply this by ten thousand, and the result is that you have a large log file and a possible headache for the security administrator. Where log files are configured with a fixed allocation size, then once full, all logging will stop and an attacker has effectively denied service to your logging mechanism. Worse still, if there is no maximum log file size, then an attacker has the ability to completely fill the hard drive partition and potentially deny service to the entire system. This is becoming more of a rarity though with the increasing size of today's hard disks.

How to protect yourself

The main defense against this type of attack are to increase the maximum log file size to a value that is unlikely to be reached, place the log file on a separate partition to that of the operating system or other critical applications and best of all, try to deploy some kind of system monitoring application that can set a threshold against your log file size and/or activity and issue an alert if an attack of this nature is underway."

Note that the article makes a distinction between a situation where:

1) The logging simply stops working, or

2) The system stops working because there is no maximum logfile size, and it resides on the same hard drive as the operating system, and the website itself, causing the whole thing to freeze up.

Also, that large enough hard drives will help prevent this.
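To make the "maximum log file size" idea concrete, here is a minimal Python sketch using the standard library's `RotatingFileHandler`. It only illustrates the general technique; Drupal's watchdog logging is configured differently, and the 10,000-byte cap, the file name and the logger name are arbitrary choices for the demo.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "site.log")

# Cap each log file at about 10,000 bytes and keep only 2 older files.
handler = RotatingFileHandler(log_path, maxBytes=10_000, backupCount=2)
logger = logging.getLogger("capped_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output off the console

# Simulate a flood of errors far larger than the cap.
for i in range(5_000):
    logger.error("simulated error number %d", i)
handler.close()

files = sorted(os.listdir(log_dir))
total_bytes = sum(os.path.getsize(os.path.join(log_dir, name)) for name in files)
print(files, total_bytes)
```

However many errors are thrown at it, the space used stays bounded by the cap times the number of files kept, at the cost of losing the oldest entries.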

So I guess I'm still somewhat skeptical that I was able to cause so many errors, and take down a website run by a seasoned website owner/maintainer with only 11 users using the program. Guess we may find out later; I dunno.

"because they thought that maybe there was a way to download EVERYTHING & not just THEIR posts/comments" - This would have involved rewriting my code so extensively that they would have been better off starting from scratch. In fact, if I was to write a program for that purpose, I would start from scratch and then only copy relevant parts of the code, because modifying my code to do this would mean a complete restructuring of the logic. So if they did, it would have been so far from my code, that I will happily claim no responsibility for the downing. If they could modify my code that well, they could have written their own program in a very similar amount of time. As a final note to this: because their program would have been indiscriminate, their job would have been easier; my program had to be specific, as I wanted people to only have access to one user at a time (for the very reason that it might overwhelm the site). But there already exist tools to achieve this. See http://www.httrack.com/

Now, HTTrack copies entire websites. What are the chances that someone was running this prior to my software's release, for up to a week or two, once the closedown announcement came? Now that could indeed log a lot of errors.

(Note, too, that in Jon's message he states that "I can't even get in". How, then, would he know what caused the error messages? It's easy, of course, to tell that the hard drive is full: anybody who tried to log in got the "Error 28" message, which anyone can Google to find out the drive is full. So I'm sincerely wondering whether Jon is away from the server machine, perhaps for the holidays, and therefore logging in to it remotely from another computer; whether he saw my post releasing my program and worried, then saw the site go down with this error message and put '2 & 2 together'. It would indeed be a logical assumption. But without reading those error messages (which requires access to the system), how could he know when they began? Someone using HTTrack could have been generating the errors for a week before, at a slightly higher rate than usual, and caused this. So it could be an assumption. Who knows?)

Still, I will not discount the possibility that I was at fault. As I've said before, I could indeed be missing something.

Anyways, thanks again for your kind words, and your thanks. 

"I know not what course others may take; but as for me, give me liberty or give me death!" - Patrick Henry

ecard71's picture

Thanks for taking the time to reply to all those questions in detail (feel guilty now). I did keep the csv files, but I don't want to bother/impose you. Is there any program readily available that you'd recommend that I can download to convert them? If not, I'll definitely take you up on your offer to upload them. Just tell me where.

I didn't know we had the option of downloading other user's posts without the passwords!??

"To be fair, 'Hundreds of users' was my phrase, not Jon's."

Yes, thanks. Deacon pointed that out to me.

I STILL STAND WITH RAND!

enemyofthestate1776's picture

"I didn't know we had the option of downloading other user's posts without the passwords!??" - Yeah, the program was designed to log you in as you, because you couldn't access someone's post history unless you were a logged-in member. (IE the public couldn't see post and comment histories). So you logged in as yourself, then entered their username, and you could download their posts/comments.

"(feel guilty now)" - No, don't! I'm kinda perfectionist and detailed, and so I tend to overdo my responses to people, so there is no confusion about what I mean.

"Just tell me where" - I'll PM you my email address. It's only two files, so feel free to send 'em to me. I'll get them converted, no imposition at all. (I actually tried other programs, and there's some kind of inconsistency in the file that wouldn't gel with other converters I tried)

"I know not what course others may take; but as for me, give me liberty or give me death!" - Patrick Henry

ATruepatriot's picture

To make this effort and help everyone out is huge and deserves the utmost appreciation and respect. :)

"Jack of all Trades...Master of None" But forever learning more!

enemyofthestate1776's picture

for your kind message :)

"I know not what course others may take; but as for me, give me liberty or give me death!" - Patrick Henry

Flex's picture

Nice work!
