How to limit the size of n?
The following assumes that the whole string has been read and that an array or arrays or a dictionary is used.
Let Length be the length of the string. Length mod n = 0.
Let Count(char) be the number of occurrences of char in the string. For each character, Count(char) mod n = 0.
That means that we may be able to use the Greatest Common Denominator (GCD) to limit the size of n before matching substrings. We have
n(max) = GCD(Count(“a”), Count(“b”), . . . , Count(“z”))
Since Length = ∑ Count, we can replace one of the Counts with Length. For instance, if Length is a prime, we know that n(max) is Length or 1, but if we only look at the Counts, we may not realize that for some time. For instance, with Length = 7, the Counts (leaving out 0s) might be (2,2,3).
It is a good idea, I think, to sort the Counts and start with the minimum non-zero Count to find the GCD.
Let’s apply all this to the example strings.
String = “abcd”
Length = 4
Count(”a”) = 1
n(max) = GCD(1,4) = 1
n = 1
String = “aaaa”
Length = 4
Count(“a”) = 4
n(max) = GCD(4,4) = 4
String = “ababab”
Length = 6
Count(“a”) = 3
n(max) = GCD(3,6) = 3 ; Note that we do not need to check Count(“b”). It is redundant.
In the next example, let’s just use the numbers.
Length = 48
Count(“a”) = 18
Count(“b”) = 30
n(max) = GCD(18,48) = 6 ; Again, we do not have to make use of Count(“b”).
Can we also make use the number of occurrences of digrams? With a possible Length of 2,000,000 that might help. The answer is yes, with a twist. Let’s look at the last example:
String = “ababab”
Count(“ba”) = 2
Count(“ab”) = 3
n(max) = GCD(2,3) = 1
This is plainly wrong. We cannot use the GCD of the number of occurrences of “ab” and “ba”. The reason is that “ba” does not belong to the repeating substring. The “b” belongs to one substring while the “a” belongs to the next substring. In fact, in this case Count(“ba”) = n-1. It could also equal the count of “ba” in the repeating substring plus n-1. In either case, if we add 1 to it, we can use it, as adding n to a count does no harm. Let’s do that.
GCD(inc(2),3) = 3
So far, so good. But how do we do know whether to add 1 to Count(“ab”) or to Count(“ba”)? Easy. “a” is the first letter in String, and “b” is the last letter. That tells us that “ba” marks the boundaries between repeating substrings.
Now, when we use digrams, we add 1 to the Count of the digram whose first character is the last character in String and whose second character is the first character in String. When we do that, the sum of the counts = Length, as it should.
In this case the digrams give us no reduction of n(max) from the result we get using single letters.
But let’s look at this example:
String = “abbaab”
Using single characters:
Count(“a”) = 3
Count(“b”) = 3
n(max) = GCD(3,3) = 3
Using digrams:
The sorted Counts:
Count(“aa”) = 1
Count(“bb”) = 1
Count(“ba”) = 1
Count(“ab”) = 2
n(max) = GCD(1,1,2,2)) = 1 ; We add 1 to Count(“ba”)
n = 1
OC, in this case we do not have to make the full computation. In fact, since Count(“bb”) = 1, we don’t need to calculate any GCD at all.
Note that even if we get a lower limit for n with the digrams than we do with the single character counts, we may be able to use the single character count result to reduce the limit even further. It does not hurt to start with the single character counts.
String = “abaabababaabacbabaababababaacb”
Length = 30 ; Chosen because of its number of divisors
Count(“c”) = 2
Count(“b”) = 12
Count(“a”) = 16
n(max) = GCD(2,12,16) = 2
Count(“ac”) = 2
Count(“cb”) = 2
Count(“aa”) = 4
Count(“ab”) = 10
Count(“ba”) = 11 ; Add 1 to this
n(max) = GCD(2,2,4,10,12) = 2
This example is meant to illustrate the virtue (I hope) of this approach. The string is perhaps deceptive in part because of so many repetitions of the initial and final parts of the string (the “a”s and “b”s). That means that if we are working down the candidate substrings, we will get early matches, whether we work from the front or the back. There are also a lot of potential “ba” boundaries between substrings. But that means that there are a lot of “a”s and “b”s, but relatively few “c”s. It is the paucity of “c”s that puts a cap on n. With only 2 “c”s n(max) = 2, while using the counts of only “a” and “b” gives us an n(max) of 4.
OC, as always, the proof is in the pudding.
