Re: Mousing over diagrams
Posted: Sun Apr 25, 2010 6:02 pm
by schultz
Kirby wrote:daal wrote:
Doesn't seem to have done the trick, but thanks for the effort.
It looks like your browser shows the URL regardless of the ALT text, then. I'm thinking that we're going to have to save the images on the server to adjust this, rather than generating them dynamically. I don't see how using an MD5 for a dynamically generated filename would allow us to get around saving the images to disk, since MD5 can have collisions. Perhaps we could save the files to the server using an MD5 hash for the filename, and if a duplicate filename existed, we could increment a file count (e.g. <md5-hash>1.gif, <md5-hash>2.gif, etc.). This would be inefficient for the cases where you have two of the same image (because we would generate separate but identical images).
On the other hand, perhaps the chances of having an MD5 collision are low enough that we could use this filenaming scheme.
The chances of an MD5 collision are relatively low and I would think we wouldn't really need to worry about this. See:
http://www.miketaylor.org.uk/tech/law.html for a simplified description of this.

And if a collision ever did occur, I don't think we'd have to wait long to hear about it! Plenty of us follow all the Malkovich games as it is, and that's where the vast majority of the diagram images are used.

Also, I know Adrian wrote the wordpress plugin creating the diagram codes that did the above. Simply did an md5 hash of what was sent in to create a filename that could be checked so we didn't create duplicate images.
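
A minimal sketch of the naming scheme described above, written in Python rather than the plugin's PHP. The names `diagram_filename` and `needs_render`, and the `.gif` extension, are assumptions for illustration and are not taken from Adrian's code.

```python
import hashlib
from pathlib import Path


def diagram_filename(diagram_text: str, ext: str = "gif") -> str:
    """Derive a stable file name from the diagram source text."""
    digest = hashlib.md5(diagram_text.encode("utf-8")).hexdigest()
    return f"{digest}.{ext}"


def needs_render(cache_dir: Path, diagram_text: str) -> bool:
    """True when no image for this exact diagram text is on disk yet."""
    return not (cache_dir / diagram_filename(diagram_text)).exists()
```

Because identical diagram text always maps to the same name, two posts containing the same diagram would share one file instead of producing duplicates.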
Re: Mousing over diagrams
Posted: Sun Apr 25, 2010 7:14 pm
by apetresc
schultz wrote:Also, I know Adrian wrote the wordpress plugin creating the diagram codes that did the above. Simply did an md5 hash of what was sent in to create a filename that could be checked so we didn't create duplicate images.
I'm flattered that you remembered the details of that code from so long ago

I don't think we have to worry about collisions. The chances of a collision in a data set as small as the collection of all diagrams on this board are vanishingly small; it was a significant academic event when someone was able to produce any two strings that had the same MD5 hash, and as far as I know, it's still an open question how to generate a collision for a given hash. The fact that the strings we'd be hashing are, by their nature, almost identical, doesn't matter. md5 is very "discontinuous" in that sense.
The reason I think it may be important to implement this at all is that we may soon run into Internet Explorer's URL length limit (about 2,083 characters) if we make GET requests with the entire diagram, escape characters included, in the URL.
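
To illustrate the "discontinuous" point, here is a small sketch that counts how many MD5 output bits change between two nearly identical inputs. The helper `differing_bits` and the two sample diagram strings are made up for this example.

```python
import hashlib


def differing_bits(a: str, b: str) -> int:
    """Count how many of the 128 MD5 output bits differ between two inputs."""
    da = int.from_bytes(hashlib.md5(a.encode("utf-8")).digest(), "big")
    db = int.from_bytes(hashlib.md5(b.encode("utf-8")).digest(), "big")
    return bin(da ^ db).count("1")


# Two hypothetical diagrams that differ by a single stone.
first = "$$B\n$$ . X O .\n$$ . . . ."
second = "$$B\n$$ . X O .\n$$ . . X ."
print(differing_bits(first, second))  # typically about half of the 128 bits
```

A one-character change flips roughly half of the output bits, so similar diagrams are no more likely to collide than unrelated ones.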
Re: Mousing over diagrams
Posted: Sun Apr 25, 2010 7:27 pm
by Kirby
schultz wrote:Kirby wrote:daal wrote:
Doesn't seem to have done the trick, but thanks for the effort.
It looks like your browser shows the URL regardless of the ALT text, then. I'm thinking that we're going to have to save the images on the server to adjust this, rather than generating them dynamically. I don't see how using an MD5 for a dynamically generated filename would allow us to get around saving the images to disk, since MD5 can have collisions. Perhaps we could save the files to the server using an MD5 hash for the filename, and if a duplicate filename existed, we could increment a file count (e.g. <md5-hash>1.gif, <md5-hash>2.gif, etc.). This would be inefficient for the cases where you have two of the same image (because we would generate separate but identical images).
On the other hand, perhaps the chances of having an MD5 collision are low enough that we could use this filenaming scheme.
The chances of an MD5 collision are relatively low and I would think we wouldn't really need to worry about this. See:
http://www.miketaylor.org.uk/tech/law.html for a simplified description of this.

And if a collision ever did occur, I don't think we'd have to wait long to hear about it! Plenty of us follow all the Malkovich games as it is, and that's where the vast majority of the diagram images are used.

Also, I know Adrian wrote the wordpress plugin creating the diagram codes that did the above. Simply did an md5 hash of what was sent in to create a filename that could be checked so we didn't create duplicate images.
Personally, I don't really buy the argument about using MD5 without any worries at all. There have been enough exploitations of MD5 to create a concern when you're using it for an important application. A popular example is this one:
http://www.win.tue.nl/hashclash/rogue-ca/. However, it is true that these people were actively trying to break the system. It's also against my philosophy to develop something with a known problem if it can be avoided (like in this case by not generating filenames at all)...
But you guys are right: there is a problem with the current implementation if the characters in the URL get too long. Also as has been said, if a collision did occur, we could address it at that time. And the probability of a collision is still low, especially if nobody is trying to attack the system. Another point is that it seems that diagrams have already been implemented in this manner without problems, so it might be OK to follow suit...
It's not as straightforward as copying what Adrian has done verbatim because of the limitations in the BBCode, but we should probably go this route. I think that the possibility of MD5 collisions is low enough that it is a much lesser issue to worry about than passing things through the URL like this.
So if we do go that route, what about using php's sha1 function? It's unlikely that we'll have an MD5 collision, and probably even more unlikely that we'll have a sha1 collision. It probably doesn't make a difference, since we'll probably get no collisions at all, but it might make me feel a little happier inside.
Re: Mousing over diagrams
Posted: Sun Apr 25, 2010 10:09 pm
by daal
I've gotten rid of it. I asked the question in the Opera forums, and
Tools>Preferences>Advanced>Browsing> disable tooltips.
That did the trick.
Re: Mousing over diagrams
Posted: Sun Apr 25, 2010 10:59 pm
by schultz
Kirby wrote:Personally, I don't really buy the argument about using MD5 without any worries at all. There have been enough exploitations of MD5 to create a concern when you're using it for an important application. A popular example is this one:
http://www.win.tue.nl/hashclash/rogue-ca/. However, it is true that these people were actively trying to break the system. It's also against my philosophy to develop something with a known problem if it can be avoided (like in this case by not generating filenames at all)...
But you guys are right: there is a problem with the current implementation if the characters in the URL get too long. Also as has been said, if a collision did occur, we could address it at that time. And the probability of a collision is still low, especially if nobody is trying to attack the system. Another point is that it seems that diagrams have already been implemented in this manner without problems, so it might be OK to follow suit...
It's not as straightforward as copying what Adrian has done verbatim because of the limitations in the BBCode, but we should probably go this route. I think that the possibility of MD5 collisions is low enough that it is a much lesser issue to worry about than passing things through the URL like this.
So if we do go that route, what about using php's sha1 function? It's unlikely that we'll have an MD5 collision, and probably even more unlikely that we'll have a sha1 collision. It probably doesn't make a difference, since we'll probably get no collisions at all, but it might make me feel a little happier inside.
I know it's not as straightforward as what Adrian did, unfortunately. I attempted to essentially modify it to work with the BBCode stuff a while ago, but kind of hit my personal limit on javascript and php coding. Always planned on going back and figuring it out, but then my club's forums kind of dried up and died so all my incentive disappeared.
And I see no problem with using php's sha-1 function. It makes sense, and should (theoretically) make collisions even less likely. And if it makes you a little happier, then why not.

Adrian Petrescu wrote:schultz wrote:Also, I know Adrian wrote the wordpress plugin creating the diagram codes that did the above. Simply did an md5 hash of what was sent in to create a filename that could be checked so we didn't create duplicate images.
I'm flattered that you remembered the details of that code from so long ago

Got a little lucky.
Just happened to be looking through that code because of what was brought up in this thread, and realized the name looked very familiar. Never made the connection in the past (also would make sense, though, since you were using a different username).

Re: Mousing over diagrams
Posted: Mon Apr 26, 2010 12:52 am
by HermanHiddema
Kirby wrote:Personally, I don't really buy the argument about using MD5 without any worries at all. There have been enough exploitations of MD5 to create a concern when you're using it for an important application. A popular example is this one:
http://www.win.tue.nl/hashclash/rogue-ca/. However, it is true that these people were actively trying to break the system. It's also against my philosophy to develop something with a known problem if it can be avoided (like in this case by not generating filenames at all)...
But you guys are right: there is a problem with the current implementation if the characters in the URL get too long. Also as has been said, if a collision did occur, we could address it at that time. And the probability of a collision is still low, especially if nobody is trying to attack the system. Another point is that it seems that diagrams have already been implemented in this manner without problems, so it might be OK to follow suit...
It's not as straightforward as copying what Adrian has done verbatim because of the limitations in the BBCode, but we should probably go this route. I think that the possibility of MD5 collisions is low enough that it is a much lesser issue to worry about than passing things through the URL like this.
So if we do go that route, what about using php's sha1 function? It's unlikely that we'll have an MD5 collision, and probably even more unlikely that we'll have a sha1 collision. It probably doesn't make a difference, since we'll probably get no collisions at all, but it might make me feel a little happier inside.
I think that a far greater problem with the current implementation is the huge CPU load that you're generating. Redrawing every image from scratch for each request is an enormous waste of processor cycles. A file cache with hash based file names is definitely the way to go, IMO. SHA1 is a little longer than MD5, but also a little slower. Neither of them will generate a collision with any likelihood at all.
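
A rough sketch of the render-once file cache suggested here, assuming Python and SHA-1 file names. The function `cached_image` and the `render` callback are hypothetical stand-ins for whatever actually draws the diagram.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_image(cache_dir: Path, diagram_text: str,
                 render: Callable[[str], bytes]) -> Path:
    """Return the path to the image, rendering it only on a cache miss."""
    digest = hashlib.sha1(diagram_text.encode("utf-8")).hexdigest()
    path = cache_dir / f"{digest}.gif"
    if not path.exists():
        # First request for this diagram: draw it once and keep the result.
        path.write_bytes(render(diagram_text))
    return path
```

Every later request for the same diagram text only costs a hash and a file existence check. The SHA-1 hex name is 40 characters, against 32 for MD5.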
Re: Mousing over diagrams
Posted: Mon Apr 26, 2010 1:13 am
by CarlJung
Edit: Note to self: Thinking before posting saves embarrassment.
Re: Mousing over diagrams
Posted: Mon Apr 26, 2010 6:13 am
by HermanHiddema
HermanHiddema wrote:SHA1 is a little longer than MD5, but also a little slower. Neither of them will generate a collision with any likelihood at all.
To put this in perspective: If you post 1 diagram every second for 1 billion years, continuously, then at the end of that billion years you will have roughly a 1 in a million chance that two of those diagrams have the same MD5 hash (you will have posted over 30 million billion diagrams at that point). I think we have more important things to worry about.
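
The estimate above can be checked with the usual birthday-bound approximation, p ≈ 1 - exp(-n² / 2^(b+1)) for n items and a b-bit hash. This is a sketch under that approximation; the function name `collision_probability` is made up for the example.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 3600


def collision_probability(n_items: float, hash_bits: int) -> float:
    """Birthday-bound approximation: p ~= 1 - exp(-n^2 / 2^(bits + 1))."""
    return -math.expm1(-(n_items ** 2) / 2.0 ** (hash_bits + 1))


# One diagram per second for a billion years.
diagrams = 1e9 * SECONDS_PER_YEAR
print(f"{diagrams:.2e} diagrams")
print(f"MD5 (128 bits):  {collision_probability(diagrams, 128):.1e}")
print(f"SHA1 (160 bits): {collision_probability(diagrams, 160):.1e}")
```

With about 3.2e16 diagrams this works out to roughly 1.5 in a million for MD5, in line with the figure quoted, and many orders of magnitude less for SHA-1.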

Re: Mousing over diagrams
Posted: Mon Apr 26, 2010 7:30 am
by Kirby
HermanHiddema wrote:HermanHiddema wrote:SHA1 is a little longer than MD5, but also a little slower. Neither of them will generate a collision with any likelihood at all.
To put this in perspective: If you post 1 diagram every second for 1 billion years, continuously, then at the end of that billion years you will have roughly a 1 in a million chance that two of those diagrams have the same MD5 hash (you will have posted over 30 million billion diagrams at that point). I think we have more important things to worry about.

I already know that the chances are very low - I just don't feel good about it philosophically. Even if the URL problem were the only issue with the current implementation, I think it's enough of a reason to go with the file cache, though. Mainly, the current implementation is just what we got to work with bbcode first.
I wasn't aware of the amount of time it takes to generate a particular image for a single request. Do you have any idea of the magnitude we're talking about?
Re: Mousing over diagrams
Posted: Tue Apr 27, 2010 1:10 am
by HermanHiddema
Kirby wrote:I already know that the chances are very low - I just don't feel good about it philosophically. Even if the URL problem were the only issue with the current implementation, I think it's enough of a reason to go with the file cache, though. Mainly, the current implementation is just what we got to work with bbcode first.
I wasn't aware of the amount of time it takes to generate a particular image for a single request. Do you have any idea of the magnitude we're talking about?
I have no idea. You could benchmark it if you want, but the exact numbers aren't really relevant. The point is that you're doing the same thing hundreds of times when you really only need to do it once.

Re: Mousing over diagrams
Posted: Tue Apr 27, 2010 7:09 am
by Kirby
HermanHiddema wrote:Kirby wrote:I already know that the chances are very low - I just don't feel good about it philosophically. Even if the URL problem were the only issue with the current implementation, I think it's enough of a reason to go with the file cache, though. Mainly, the current implementation is just what we got to work with bbcode first.
I wasn't aware of the amount of time it takes to generate a particular image for a single request. Do you have any idea of the magnitude we're talking about?
I have no idea. You could benchmark it if you want, but the exact numbers aren't really relevant. The point is that you're doing the same thing hundreds of times when you really only need to do it once.

That's true. I guess in that sense, it's a tradeoff between extra work in creating an image, and extra space being taken up on the server by saving the image. Considering the points that have been brought up in this thread, though, I'm still in agreement that we should probably save the images to the server.
Re: Mousing over diagrams
Posted: Tue Apr 27, 2010 7:32 am
by ross
Kirby wrote:That's true. I guess in that sense, it's a tradeoff between extra work in creating an image, and extra space being taken up on the server by saving the image. Considering the points that have been brought up in this thread, though, I'm still in agreement that we should probably save the images to the server.
A quick note on this subject: the way GoDiscussions operated was to take an md5 hash of the SGF or Go Diagram text, see if a file existed on the server with that md5 hash, create the file if not, and then use that url in the generated html. The first server "crash" where I started getting involved (I think a couple of years ago now) was because there was a single directory with bazillions of these md5 files, and the server's filesystem couldn't handle it. If you go this route, I'd recommend partitioning the files somehow (e.g. by the first character or two of the md5 hash) to make it easier on the filesystem. (That's how I fixed GoDiscussions that time around.)
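
One way the partitioning could look, sketched in Python. The two-character prefix follows the suggestion above, while the function name `sharded_path` and the `.gif` extension are assumptions for illustration, not the actual GoDiscussions fix.

```python
import hashlib
from pathlib import Path


def sharded_path(cache_dir: Path, diagram_text: str, prefix_len: int = 2) -> Path:
    """Place each image in a subdirectory named after the start of its hash."""
    digest = hashlib.md5(diagram_text.encode("utf-8")).hexdigest()
    return cache_dir / digest[:prefix_len] / f"{digest}.gif"
```

With a two-character hex prefix the files spread over at most 256 subdirectories. The caller would create the subdirectory before writing, for example with `path.parent.mkdir(parents=True, exist_ok=True)`.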
Re: Mousing over diagrams
Posted: Tue Apr 27, 2010 7:42 am
by Kirby
ross wrote:Kirby wrote:That's true. I guess in that sense, it's a tradeoff between extra work in creating an image, and extra space being taken up on the server by saving the image. Considering the points that have been brought up in this thread, though, I'm still in agreement that we should probably save the images to the server.
A quick note on this subject: the way GoDiscussions operated was to take an md5 hash of the SGF or Go Diagram text, see if a file existed on the server with that md5 hash, create the file if not, and then use that url in the generated html. The first server "crash" where I started getting involved (I think a couple of years ago now) was because there was a single directory with bazillions of these md5 files, and the server's filesystem couldn't handle it. If you go this route, I'd recommend partitioning the files somehow (e.g. by the first character or two of the md5 hash) to make it easier on the filesystem. (That's how I fixed GoDiscussions that time around.)
Thanks for the tip, Ross. All things taken into consideration, is this the route you would take if you did it yourself? Since you said "If you go this route", I'm wondering if you have some other ideas that we haven't considered, yet.
Re: Mousing over diagrams
Posted: Tue Apr 27, 2010 8:00 am
by ross
Kirby wrote:Thanks for the tip, Ross. All things taken into consideration, is this the route you would take if you did it yourself? Since you said "If you go this route", I'm wondering if you have some other ideas that we haven't considered, yet.
I had actually never considered generating the image on every page load (like you do now) when Adrian and I were working on hacking the phpbb3 code to save the files on the server GD-style. It's very creative, and after seeing it work fairly well, I'm not convinced that the extra server load is significant. However, I think the annoyances of e.g. filenames when you download and other small things make it worth it to switch to the md5sum route.
Oh, another small warning: GoDiscussions had some code that attempted to "phase out" older diagrams (i.e. enabling them to be removed from the server once they were X many months old). The code never worked and actually ended up making multiple unnecessary copies of the entire md5 collection of images and sgfs (another reason why the server tanked), but I just wanted to caution you that this is probably not only unnecessary but possibly counterproductive. Old threads are going to be looked at all the time, both by humans and by bots (e.g. search engines), so trying to save space by deleting old images isn't going to work very well. You'll probably have to keep them all around until the end of time (which is why the partitioning approach you take is so important). Just another friendly hint.

Re: Mousing over diagrams
Posted: Tue Apr 27, 2010 8:17 am
by fwiffo
Something I've done in the past with generated images is to just treat it as a cache and delete older items automatically to keep it to a reasonable size. I'd have a cron job or something go through and just delete images that haven't been touched in 3 months or something. If somebody goes and reads a really old thread or something, the images would just be regenerated if necessary. And that also makes sure that it doesn't keep around junk images if people edit posts to fix diagrams or whatever.
You might still have to partition it, but it's still a good idea to delete older images, IMO. Might have to make some sort of countermeasure if google is constantly regenerating the images with its crawl.
Caching the images server-side is a good idea for performance reasons even if you continue to use the javascript method.
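
A sketch of the kind of cleanup job described above, assuming Python and assuming "touched" means the file's last access time. The function name `evict_stale` and the 90-day default are illustrative, and access times may not be updated on filesystems mounted with `noatime`.

```python
import time
from pathlib import Path
from typing import Optional


def evict_stale(cache_dir: Path, max_age_days: float = 90,
                now: Optional[float] = None) -> int:
    """Delete cached images not accessed within max_age_days; return the count."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = 0
    # Materialize the listing first so deletions do not disturb the walk.
    for path in list(cache_dir.rglob("*.gif")):
        if path.is_file() and path.stat().st_atime < cutoff:
            path.unlink()
            removed += 1
    return removed
```

Run from a cron job, this keeps the cache bounded; anything deleted would simply be regenerated the next time its thread is viewed.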