Mousing over diagrams

Is something wrong? Do you have any suggestions? Let us know.
schultz
Lives in gote
Posts: 505
Joined: Tue Apr 20, 2010 5:31 pm
GD Posts: 0
Location: Montana
Has thanked: 80 times
Been thanked: 62 times

Re: Mousing over diagrams

Post by schultz »

Kirby wrote:
daal wrote:
Doesn't seem to have done the trick, but thanks for the effort.


It looks like your browser shows the URL regardless of the ALT text, then. I'm thinking that we're going to have to save the images on the server to adjust this, rather than generating them dynamically. I don't see how using an MD5 for a dynamically generated filename would allow for us to get around saving the images to disk since MD5 can have collisions. Perhaps we could save the files to the server using an MD5 hash for the filename, and if a duplicate filename existed, we could increment the filecount (eg. <md5-hash>1.gif, <md5-hash>2.gif, etc.). This would be inefficient for the cases where you have two of the same images (because we would generate separate, but identical images).

On the other hand, perhaps the chances of having an MD5 collision are low enough that we could use this filenaming scheme.

The chances of an MD5 collision are relatively low and I would think we wouldn't really need to worry about this. See: http://www.miketaylor.org.uk/tech/law.html for a simplified description of this. :) And if a collision ever did occur, I don't think we'd have to wait long to hear about it! Plenty of us follow all the Malkovich games as it is, and that's where the vast majority of the diagram images are used. :P

Also, I know Adrian wrote the wordpress plugin creating the diagram codes that did the above. Simply did an md5 hash of what was sent in to create a filename that could be checked so we didn't create duplicate images.
KGS: schultz [?].
apetresc
Lives with ko
Posts: 256
Joined: Wed Apr 21, 2010 3:42 pm
Rank: AGA 1k
GD Posts: 1190
KGS: apetresc
IGS: apetresc
OGS: apetresc
Universal go server handle: apetresc
Location: Waterloo, Ontario (Canada)
Has thanked: 110 times
Been thanked: 146 times

Re: Mousing over diagrams

Post by apetresc »

schultz wrote:Also, I know Adrian wrote the wordpress plugin creating the diagram codes that did the above. Simply did an md5 hash of what was sent in to create a filename that could be checked so we didn't create duplicate images.


I'm flattered that you remembered the details of that code from so long ago :)

I don't think we have to worry about collisions. The chances of a collision in a data set as small as the collection of all diagrams on this board are vanishingly small; it was a significant academic event when someone was able to produce any two strings that had the same MD5 hash, and as far as I know, it's still an open question how to generate a collision for a given hash. The fact that the strings we'd be hashing are, by their nature, almost identical, doesn't matter. md5 is very "discontinuous" in that sense.

The reason I think it may be important to implement this at all is that we may soon run into Internet Explorer's URL length limit (2,083 characters) if we make GET requests that carry the entire diagram, escape characters included.
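To get a feel for how quickly an escaped diagram grows inside a GET request, here is a minimal Python sketch. The diagram markup and the endpoint name are made up for illustration and are not taken from the board's code; only the percent-encoding behaviour of `urllib.parse.quote` is real.

```python
from urllib.parse import quote

# Hypothetical diagram markup for an empty 19x19 board; the real syntax may differ.
diagram = "\n".join(
    ["$$B Example diagram", "$$ " + "-" * 21]
    + ["$$ |" + " ." * 19 + " |" for _ in range(19)]
    + ["$$ " + "-" * 21]
)

# Hypothetical endpoint, used only to build an example URL.
url = "http://example.invalid/diagram.php?d=" + quote(diagram, safe="")

# Spaces, "$", "|" and newlines each become three characters once escaped.
print("raw diagram length:", len(diagram))
print("URL length:", len(url))
```

Because most characters in a diagram need escaping, the URL ends up roughly twice as long as the raw text, which is why passing a hash instead of the diagram itself keeps the URL short.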
The road to wisdom? Well, it's plain, and simple to express: Err, and err, and err again; but less, and less, and less!
Kirby
Honinbo
Posts: 9553
Joined: Wed Feb 24, 2010 6:04 pm
GD Posts: 0
KGS: Kirby
Tygem: 커비라고해
Has thanked: 1583 times
Been thanked: 1707 times

Re: Mousing over diagrams

Post by Kirby »

schultz wrote:
Kirby wrote:
daal wrote:
Doesn't seem to have done the trick, but thanks for the effort.


It looks like your browser shows the URL regardless of the ALT text, then. I'm thinking that we're going to have to save the images on the server to adjust this, rather than generating them dynamically. I don't see how using an MD5 for a dynamically generated filename would allow for us to get around saving the images to disk since MD5 can have collisions. Perhaps we could save the files to the server using an MD5 hash for the filename, and if a duplicate filename existed, we could increment the filecount (eg. <md5-hash>1.gif, <md5-hash>2.gif, etc.). This would be inefficient for the cases where you have two of the same images (because we would generate separate, but identical images).

On the other hand, perhaps the chances of having an MD5 collision are low enough that we could use this filenaming scheme.

The chances of an MD5 collision are relatively low and I would think we wouldn't really need to worry about this. See: http://www.miketaylor.org.uk/tech/law.html for a simplified description of this. :) And if a collision ever did occur, I don't think we'd have to wait long to hear about it! Plenty of us follow all the Malkovich games as it is, and that's where the vast majority of the diagram images are used. :P

Also, I know Adrian wrote the wordpress plugin creating the diagram codes that did the above. Simply did an md5 hash of what was sent in to create a filename that could be checked so we didn't create duplicate images.


Personally, I don't really buy the argument about using MD5 without any worries at all. There have been enough exploitations of MD5 to create a concern when you're using it for an important application. A popular example is this one: http://www.win.tue.nl/hashclash/rogue-ca/. However, it is true that these people were actively trying to break the system. It's also against my philosophy to develop something with a known problem if it can be avoided (like in this case by not generating filenames at all)...

But you guys are right: there is a problem with the current implementation if the characters in the URL get too long. Also as has been said, if a collision did occur, we could address it at that time. And the probability of a collision is still low, especially if nobody is trying to attack the system. Another point is that it seems that diagrams have already been implemented in this manner without problems, so it might be OK to follow suit...

It's not as straightforward as copying what Adrian has done verbatim because of the limitations in the BBCode, but we should probably go this route. I think that the possibility of MD5 collisions is low enough that it is a much lesser issue to worry about than passing things through the URL like this.

So if we do go that route, what about using php's sha1 function? It's unlikely that we'll have an MD5 collision, and probably even more unlikely that we'll have a sha1 collision. It probably doesn't make a difference, since we'll probably get no collisions at all, but it might make me feel a little happier inside.
be immersed
daal
Oza
Posts: 2508
Joined: Wed Apr 21, 2010 1:30 am
GD Posts: 0
Has thanked: 1304 times
Been thanked: 1128 times

Re: Mousing over diagrams

Post by daal »

I've gotten rid of it. I asked the question in the Opera forums, and

Tools>Preferences>Advanced>Browsing> disable tooltips.

That did the trick.
Patience, grasshopper.
schultz

Re: Mousing over diagrams

Post by schultz »

Kirby wrote:Personally, I don't really buy the argument about using MD5 without any worries at all. There have been enough exploitations of MD5 to create a concern when you're using it for an important application. A popular example is this one: http://www.win.tue.nl/hashclash/rogue-ca/. However, it is true that these people were actively trying to break the system. It's also against my philosophy to develop something with a known problem if it can be avoided (like in this case by not generating filenames at all)...

But you guys are right: there is a problem with the current implementation if the characters in the URL get too long. Also as has been said, if a collision did occur, we could address it at that time. And the probability of a collision is still low, especially if nobody is trying to attack the system. Another point is that it seems that diagrams have already been implemented in this manner without problems, so it might be OK to follow suit...

It's not as straightforward as copying what Adrian has done verbatim because of the limitations in the BBCode, but we should probably go this route. I think that the possibility of MD5 collisions is low enough that it is a much lesser issue to worry about than passing things through the URL like this.

So if we do go that route, what about using php's sha1 function? It's unlikely that we'll have an MD5 collision, and probably even more unlikely that we'll have a sha1 collision. It probably doesn't make a difference, since we'll probably get no collisions at all, but it might make me feel a little happier inside.

I know it's not as straightforward as what Adrian did, unfortunately. I attempted to essentially modify it to work with the BBCode stuff a while ago, but kind of hit my personal limit on javascript and php coding. Always planned on going back and figuring it out, but then my club's forums kind of dried up and died so all my incentive disappeared.

And I see no problem with using php's sha-1 function. It makes sense, and should (theoretically) make collisions even less likely. And if it makes you a little happier, then why not. :)

Adrian Petrescu wrote:
schultz wrote:Also, I know Adrian wrote the wordpress plugin creating the diagram codes that did the above. Simply did an md5 hash of what was sent in to create a filename that could be checked so we didn't create duplicate images.


I'm flattered that you remembered the details of that code from so long ago :)

Got a little lucky. ;)

Just happened to be looking through that code because of what was brought up in this thread, and realized the name looked very familiar. Never made the connection in the past (also would make sense, though, since you were using a different username). :P
KGS: schultz [?].
HermanHiddema
Gosei
Posts: 2011
Joined: Tue Apr 20, 2010 10:08 am
Rank: Dutch 4D
GD Posts: 645
Universal go server handle: herminator
Location: Groningen, NL
Has thanked: 202 times
Been thanked: 1086 times

Re: Mousing over diagrams

Post by HermanHiddema »

Kirby wrote:Personally, I don't really buy the argument about using MD5 without any worries at all. There have been enough exploitations of MD5 to create a concern when you're using it for an important application. A popular example is this one: http://www.win.tue.nl/hashclash/rogue-ca/. However, it is true that these people were actively trying to break the system. It's also against my philosophy to develop something with a known problem if it can be avoided (like in this case by not generating filenames at all)...

But you guys are right: there is a problem with the current implementation if the characters in the URL get too long. Also as has been said, if a collision did occur, we could address it at that time. And the probability of a collision is still low, especially if nobody is trying to attack the system. Another point is that it seems that diagrams have already been implemented in this manner without problems, so it might be OK to follow suit...

It's not as straightforward as copying what Adrian has done verbatim because of the limitations in the BBCode, but we should probably go this route. I think that the possibility of MD5 collisions is low enough that it is a much lesser issue to worry about than passing things through the URL like this.

So if we do go that route, what about using php's sha1 function? It's unlikely that we'll have an MD5 collision, and probably even more unlikely that we'll have a sha1 collision. It probably doesn't make a difference, since we'll probably get no collisions at all, but it might make me feel a little happier inside.


I think that a far greater problem with the current implementation is the huge CPU load that you're generating. Redrawing every image from scratch for each request is an enormous waste of processor cycles. A file cache with hash based file names is definitely the way to go, IMO. SHA1 is a little longer than MD5, but also a little slower. Neither of them will generate a collision with any likelihood at all.
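For a concrete sense of the difference in filename length, the following Python sketch hashes the same text with both functions using the standard `hashlib` module. The diagram strings are invented for illustration.

```python
import hashlib

# Hypothetical diagram markup; two inputs that differ by a single stone.
diagram_a = "$$B Example diagram\n$$ . . X O . ."
diagram_b = "$$B Example diagram\n$$ . . X X . ."

md5_name = hashlib.md5(diagram_a.encode("utf-8")).hexdigest()
sha1_name = hashlib.sha1(diagram_a.encode("utf-8")).hexdigest()
md5_other = hashlib.md5(diagram_b.encode("utf-8")).hexdigest()

# MD5 gives 32 hex characters, SHA-1 gives 40.
print(len(md5_name), md5_name)
print(len(sha1_name), sha1_name)

# Near-identical inputs still produce unrelated-looking digests.
print(md5_other)
```

Either digest works as a cache filename; the only practical difference here is eight extra characters per name.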
CarlJung
Lives in gote
Posts: 429
Joined: Wed Apr 21, 2010 1:10 pm
Rank: SDK
GD Posts: 0
KGS: CarlJung
Location: Sweden
Has thanked: 101 times
Been thanked: 73 times

Re: Mousing over diagrams

Post by CarlJung »

Edit: Note to self: Thinking before posting saves embarrassment.
HermanHiddema

Re: Mousing over diagrams

Post by HermanHiddema »

HermanHiddema wrote:SHA1 is a little longer than MD5, but also a little slower. Neither of them will generate a collision with any likelihood at all.


To put this in perspective: If you post 1 diagram every second for 1 billion years, continuously, then at the end of that billion years you will have roughly a 1 in a million chance that two of those diagrams have the same MD5 hash (you will have posted over 30 million billion diagrams at that point). I think we have more important things to worry about ;)
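The estimate above can be checked with the usual birthday approximation, P ≈ 1 - exp(-n² / 2N), where n is the number of diagrams and N = 2^128 is the number of possible MD5 values. This is a back-of-the-envelope sketch that assumes MD5 outputs behave like uniformly random values and that nobody is deliberately constructing collisions.

```python
import math

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

# One diagram per second for a billion years.
n = SECONDS_PER_YEAR * 1e9
# Number of distinct 128-bit MD5 values.
space = 2.0 ** 128

# Birthday approximation; expm1 keeps precision for very small probabilities.
p = -math.expm1(-(n * n) / (2.0 * space))

print(f"diagrams posted: {n:.2e}")
print(f"collision probability: {p:.2e}")
```

The result comes out on the order of one in a million, in line with the figure quoted in the post.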
Kirby

Re: Mousing over diagrams

Post by Kirby »

HermanHiddema wrote:
HermanHiddema wrote:SHA1 is a little longer than MD5, but also a little slower. Neither of them will generate a collision with any likelihood at all.


To put this in perspective: If you post 1 diagram every second for 1 billion years, continuously, then at the end of that billion years you will have roughly a 1 in a million chance that two of those diagrams have the same MD5 hash (you will have posted over 30 million billion diagrams at that point). I think we have more important things to worry about ;)


I already know that the chances are very low - I just don't feel good about it philosophically. Even if the URL problem were the only issue with the current implementation, I think it's enough of a reason to go with the file cache, though. Mainly, the current implementation is just what we got to work with bbcode first.

I wasn't aware of the amount of time it takes to generate a particular image for a single request. Do you have any idea on the magnitude we're talking about?
be immersed
HermanHiddema

Re: Mousing over diagrams

Post by HermanHiddema »

Kirby wrote:I already know that the chances are very low - I just don't feel good about it philosophically. Even if the URL problem were the only issue with the current implementation, I think it's enough of a reason to go with the file cache, though. Mainly, the current implementation is just what we got to work with bbcode first.

I wasn't aware of the amount of time it takes to generate a particular image for a single request. Do you have any idea on the magnitude we're talking about?


I have no idea. You could benchmark it if you want, but the exact numbers aren't really relevant. The point is that you're doing the same thing hundreds of times when you really only need to do it once. :)
Kirby

Re: Mousing over diagrams

Post by Kirby »

HermanHiddema wrote:
Kirby wrote:I already know that the chances are very low - I just don't feel good about it philosophically. Even if the URL problem were the only issue with the current implementation, I think it's enough of a reason to go with the file cache, though. Mainly, the current implementation is just what we got to work with bbcode first.

I wasn't aware of the amount of time it takes to generate a particular image for a single request. Do you have any idea on the magnitude we're talking about?


I have no idea. You could benchmark it if you want, but the exact numbers aren't really relevant. The point is that you're doing the same thing hundreds of times when you really only need to do it once. :)


That's true. I guess in that sense, it's a tradeoff between extra work in creating an image, and extra space being taken up on the server by saving the image. Considering the points that have been brought up in this thread, though, I'm still in agreement that we should probably save the images to the server.
be immersed
ross
Dies with sente
Posts: 92
Joined: Wed Apr 21, 2010 4:40 pm
Rank: DGS 9k
GD Posts: 1315
Location: シアトル
Has thanked: 24 times
Been thanked: 36 times

Re: Mousing over diagrams

Post by ross »

Kirby wrote:That's true. I guess in that sense, it's a tradeoff between extra work in creating an image, and extra space being taken up on the server by saving the image. Considering the points that have been brought up in this thread, though, I'm still in agreement that we should probably save the images to the server.

A quick note on this subject: the way GoDiscussions operated was to take an md5 hash of the SGF or Go Diagram text, see if a file existed on the server with that md5 hash, create the file if not, and then use that url in the generated html. The first server "crash" where I started getting involved (I think a couple of years ago now) was because there was a single directory with bazillions of these md5 files, and the server's filesystem couldn't handle it. If you go this route, I'd recommend partitioning the files somehow (e.g. by the first character or two of the md5 hash) to make it easier on the filesystem. (That's how I fixed GoDiscussions that time around.)
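The flow described above (hash the text, look for a file with that name, create it only if missing, partition by a hash prefix) could be sketched roughly as follows. This is an illustration in Python rather than the actual GoDiscussions or phpBB code; `cached_image_path` and `fake_render` are hypothetical names, and the renderer is a stand-in that just returns placeholder bytes.

```python
import hashlib
import os
import tempfile


def cached_image_path(cache_root, diagram_text, render):
    """Return the path of the cached image, rendering it only on a cache miss."""
    digest = hashlib.md5(diagram_text.encode("utf-8")).hexdigest()
    # Partition by the first two hex characters: at most 256 subdirectories,
    # so no single directory has to hold every file.
    subdir = os.path.join(cache_root, digest[:2])
    path = os.path.join(subdir, digest + ".gif")
    if not os.path.exists(path):
        os.makedirs(subdir, exist_ok=True)
        with open(path, "wb") as handle:
            handle.write(render(diagram_text))
    return path


calls = []


def fake_render(text):
    # Stand-in for the real image generator, which is not shown in the thread.
    calls.append(text)
    return b"GIF89a"


root = tempfile.mkdtemp()
first = cached_image_path(root, "$$B example", fake_render)
second = cached_image_path(root, "$$B example", fake_render)
print(first == second, len(calls))
```

The second call finds the existing file and never invokes the renderer, which is the whole point of the cache. Using two hex characters for the prefix is one arbitrary choice; a deeper split (for example two levels of two characters) would spread files further if the collection grows very large.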
Kirby

Re: Mousing over diagrams

Post by Kirby »

ross wrote:
Kirby wrote:That's true. I guess in that sense, it's a tradeoff between extra work in creating an image, and extra space being taken up on the server by saving the image. Considering the points that have been brought up in this thread, though, I'm still in agreement that we should probably save the images to the server.

A quick note on this subject: the way GoDiscussions operated was to take an md5 hash of the SGF or Go Diagram text, see if a file existed on the server with that md5 hash, create the file if not, and then use that url in the generated html. The first server "crash" where I started getting involved (I think a couple of years ago now) was because there was a single directory with bazillions of these md5 files, and the server's filesystem couldn't handle it. If you go this route, I'd recommend partitioning the files somehow (e.g. by the first character or two of the md5 hash) to make it easier on the filesystem. (That's how I fixed GoDiscussions that time around.)


Thanks for the tip, Ross. All things taken into consideration, is this the route you would take if you did it yourself? Since you said "If you go this route", I'm wondering if you have some other ideas that we haven't considered, yet.
be immersed
ross

Re: Mousing over diagrams

Post by ross »

Kirby wrote:Thanks for the tip, Ross. All things taken into consideration, is this the route you would take if you did it yourself? Since you said "If you go this route", I'm wondering if you have some other ideas that we haven't considered, yet.

I had actually never considered generating the image on every page load (like you do now) when Adrian and I were working on hacking the phpbb3 code to save the files on the server GD-style. It's very creative, and after seeing it work fairly well, I'm not convinced that the extra server load is significant. However, I think the annoyances of e.g. filenames when you download and other small things make it worth it to switch to the md5sum route.

Oh, another small warning: GoDiscussions had some code that attempted to "phase out" older diagrams (i.e. enabling them to be removed from the server once they were X many months old). The code never worked and actually ended up making multiple unnecessary copies of the entire md5 collection of images and sgfs (another reason why the server tanked), but I just wanted to caution you that this is probably not only unnecessary but possibly counterproductive. Old threads are going to be looked at all the time, both by humans and (e.g. search engine) bots, so trying to save space by deleting old images isn't going to work very well. You'll probably have to keep them all around until the end of time (which is why the partitioning approach you take is so important). Just another friendly hint. :)
fwiffo
Gosei
Posts: 1435
Joined: Tue Apr 20, 2010 6:22 am
Rank: Out of practice
GD Posts: 1104
KGS: fwiffo
Location: California
Has thanked: 49 times
Been thanked: 168 times

Re: Mousing over diagrams

Post by fwiffo »

Something I've done in the past with generated images is to just treat it as a cache and delete older items automatically to keep it to a reasonable size. I'd have a cron job or something go through and just delete images that haven't been touched in 3 months or something. If somebody goes and reads a really old thread or something, the images would just be regenerated if necessary. And that also makes sure that it doesn't keep around junk images if people edit posts to fix diagrams or whatever.

You might still have to partition it, but it's still a good idea to delete older images, IMO. Might have to add some sort of countermeasure if Google is constantly regenerating the images with its crawl.
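The cleanup job described above might look something like this minimal Python sketch, which could be run from cron. It uses each file's modification time as the "last touched" marker, which is an assumption: the cache lookup would have to refresh that timestamp on every hit (for example with `os.utime`) for this to mean "not used recently" rather than "not created recently". The name `evict_stale` and the 90-day default are illustrative.

```python
import os
import tempfile
import time


def evict_stale(cache_root, max_age_days=90, now=None):
    """Delete files not modified within max_age_days; return the removed paths."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for dirpath, _dirnames, filenames in os.walk(cache_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
                removed.append(path)
    return removed


# Small demonstration with one fresh file and one artificially aged file.
root = tempfile.mkdtemp()
old_file = os.path.join(root, "old.gif")
new_file = os.path.join(root, "new.gif")
for path in (old_file, new_file):
    with open(path, "wb") as handle:
        handle.write(b"GIF89a")
long_ago = time.time() - 120 * 86400
os.utime(old_file, (long_ago, long_ago))

removed = evict_stale(root)
print(removed)
```

Since any deleted image is simply regenerated on the next request, the job only trades disk space for occasional rendering work; it walks subdirectories too, so it is compatible with a partitioned cache.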

Caching the images server-side is a good idea for performance reasons even if you continue to use the javascript method.