experchange > comp.programming

Paul (10-21-18, 10:05 PM)
As a coding exercise, I am writing a collection of string-matching
algorithms -- for example Boyer Moore, Rabin-Karp etc.

Does anyone know a handy collection of string patterns and texts to facilitate
testing or do I need to generate the test data myself?

Many thanks,

Paul
Ben Bacarisse (10-21-18, 10:54 PM)
Paul <pepstein5> writes:

> As a coding exercise, I am writing a collection of string-matching
> algorithms -- for example Boyer Moore, Rabin-Karp etc.
> Does anyone know a handy collection of string patterns and texts to facilitate
> testing or do I need to generate the test data myself?


A handy source of large(ish) texts is Project Gutenberg. If you need
larger (which you might for speed tests on a modern machine) you can
concatenate several books or collected works together.

You'll still have to generate test cases yourself.
Paul (10-21-18, 11:17 PM)
On Sunday, October 21, 2018 at 9:55:00 PM UTC+1, Ben Bacarisse wrote:
> Paul <pepstein5> writes:
> A handy source of large(ish) texts is Project Gutenberg. If you need
> larger (which you might for speed tests on a modern machine) you can
> concatenate several books or collected works together.
> You'll still have to generate test cases yourself.
> --
> Ben.


I don't think that English-language texts make good testcases for the
algorithms because with English-language texts, the ultra-naive method of
very simply comparing letter by letter is not likely to be inefficient.
Better would be DNA-type strings.
I suppose I'll have to generate them myself using rand().
Paul
Sjouke Burry (10-21-18, 11:23 PM)
On 21-10-2018 23:17, Paul wrote:
> On Sunday, October 21, 2018 at 9:55:00 PM UTC+1, Ben Bacarisse wrote:
> I don't think that English-language texts make good testcases for the
> algorithms because with English-language texts, the ultra-naive method of
> very simply comparing letter by letter is not likely to be inefficient.
> Better would be DNA-type strings.
> I suppose I'll have to generate them myself using rand().
> Paul

Googling for:dna string data download
gives 26milion hits.
No need to invent your own.....
Ben Bacarisse (10-22-18, 12:16 AM)
Paul <pepstein5> writes:

> On Sunday, October 21, 2018 at 9:55:00 PM UTC+1, Ben Bacarisse wrote:
> I don't think that English-language texts make good testcases for the
> algorithms because with English-language texts, the ultra-naive method of
> very simply comparing letter by letter is not likely to be
> inefficient.
> Better would be DNA-type strings.


The asymptotic complexity of the algorithms won't change based on the
kind of text. You should be able to see a significant improvement in
performance between using a naive search and, say, Boyer Moore on
English text.

Obviously you will get different relative performance depending on the
actual string used and the frequency and distribution of the symbols, so
you might well want to have lots of different sources, but I'd include
English text as at least one.

> I suppose I'll have to generate them myself using rand().


And you might want to generate them not using rand()! I.e. you will
probably want some very special test cases that are not at all random.
Similar Threads