
seansan (01-05-04, 12:37 PM)
Hi,

I have been asked to investigate how to split a large file in Perl.
My question is as follows.

I have a large file that is built up of data chunks (record sets).
Every new record set starts with /^010/ and continues for some lines (the
number varies) until we find the next '010' record set. Finding these
records doesn't seem so difficult, but other subjects that I am not
familiar with cloud my mind.

I was thinking of opening 2 output files. I want to loop through the
file and, depending on the 3 characters at the 1235th column (or
text position) of the 010 line, print to either file A or file B.
How do I accomplish this?
- How do I read the 3-character identifier at column 1235?
- How do I switch between output files? (Remember, I have to write
  several lines to file A or B until the next 010 line is encountered.)
- What considerations should I make for working with 3-4 GB files?

Any help or examples will be appreciated.

Sean Heukels
gnari (01-05-04, 12:52 PM)
"seansan" <sean103> wrote in message
news:6f80

> I was thinking of opening 2 output files. I wanted to loop through the
> file and according on 3-characters on the 1235th column (or
> text-place) of the 010 line I have to print to either file A or file
> B. How do I accomplish this?


it would help if we knew exactly what your problem is.
what have you tried, how does it fail ?

> - How do I read the 1235th 3 character identifier?


many ways spring to mind:
substr()
a regex match (//)
split// and array manipulations

> and - How do I
> switch between OUTPUT files?


again many ways, among them plain old if/else

> What considerations should I make for working with 3-4 GB files?


depends on your OS, probably.
if there is a problem, just split the file

> Any help, or examples will be appreciated

again, what have you done (or planned) and
what exactly is your problem?

if you want us to do the program for you, just say so.

gnari
Walter Roberson (01-05-04, 01:03 PM)
In article <cd2f8856.0401050237.37686f80>,
seansan <sean103> wrote:
:I was thinking of opening 2 output files. I wanted to loop through the
:file and according on 3-characters on the 1235th column (or
:text-place) of the 010 line I have to print to either file A or file
:B. How do I accomplish this?
:- How do I read the 1235th 3 character identifier?

If you already have the line read into a string, then
use substr $string, 1234, 3
(substr offsets start at 0, so column 1235 is offset 1234).
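
A minimal sketch of that substr call with a length check added. The helper name tag_at_column and the sample line are made up for illustration; they are not from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: return $len characters starting at 1-based column $col,
# or undef if the line is too short to contain them.
sub tag_at_column {
    my ($line, $col, $len) = @_;
    return undef if length($line) < $col - 1 + $len;
    return substr($line, $col - 1, $len);
}

# Made-up sample line: "010", padding up to column 1234, then the tag "ABC".
my $line = '010' . ('x' x 1231) . 'ABC';
print tag_at_column($line, 1235, 3), "\n";
```

Checking the length first avoids the "substr outside of string" warning on short lines.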

:and - How do I
:switch between OUTPUT files? (remember I have to write several lines
:to file A or B, until the next 010 line is encountered) and last, -

Switching between output files:

$ perldoc -f print
=item print FILEHANDLE LIST

Prints a string or a comma-separated list of strings. Returns TRUE
if successful. FILEHANDLE may be a scalar variable name, in which case
the variable contains the name of or a reference to the filehandle, thus
introducing one level of indirection.
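
A rough sketch of that indirection, reading line by line and keeping the current output handle in a scalar until the next 010 line. The sub name split_records, the $choose callback and the sample data are assumptions for illustration. In-memory filehandles stand in for the real files so the sketch runs on its own.

```perl
use strict;
use warnings;

# Copy each record set to $out_a or $out_b. $choose is a hypothetical callback
# that receives the 3-character tag and returns true for file A.
sub split_records {
    my ($in, $out_a, $out_b, $choose) = @_;
    my $out;                                   # handle for the current record set
    while (my $line = <$in>) {
        if ($line =~ /^010/) {
            # New record set: choose the handle from the tag at column 1235.
            my $tag = length($line) >= 1237 ? substr($line, 1234, 3) : '';
            $out = $choose->($tag) ? $out_a : $out_b;
        }
        next unless defined $out;              # nothing seen before the first 010 line
        print {$out} $line or die "Write failed: $!";
    }
    return;
}

# Made-up sample data: two record sets with tags AAA and BBB at column 1235.
my $pad  = 'x' x 1231;
my $data = "010${pad}AAA\n020 detail\n010${pad}BBB\n030 detail\n";
open my $in, '<', \$data      or die "Can't open in-memory input: $!";
open my $oa, '>', \my $buf_a  or die "Can't open in-memory output: $!";
open my $ob, '>', \my $buf_b  or die "Can't open in-memory output: $!";
split_records($in, $oa, $ob, sub { $_[0] eq 'AAA' });
close $_ for $oa, $ob;
printf "A got %d lines, B got %d lines\n", ($buf_a =~ tr/\n//), ($buf_b =~ tr/\n//);
```

Only one line is held in memory at a time, so the size of the input file does not matter to this loop.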

:What considerations should I make for working with 3-4 GB files?

If you are just doing linear processing you should be okay, provided
your filesystem supports files that are large enough.

If, though, you need to skip around in the file, you need
to use 'seek' and 'tell' (or sysseek instead of either),
and that can be a problem because on many Unix systems the
file offsets used by the underlying system calls are *signed*
32-bit numbers -- which run out at 2 GB.
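
Whether a particular perl was built with large file support can be read from its build configuration. A small sketch using the standard Config module; it assumes the uselargefiles and lseeksize entries are set on your build and falls back to 'unknown' if not.

```perl
use strict;
use warnings;
use Config;

# uselargefiles is 'define' when perl was built with large file support;
# lseeksize is the size in bytes of the file offset type.
my $large = defined $Config{uselargefiles} && $Config{uselargefiles} eq 'define';
my $size  = defined $Config{lseeksize} ? $Config{lseeksize} : 'unknown';
printf "large file support: %s, offset size: %s bytes\n",
    ($large ? 'yes' : 'no'), $size;
```

The same value is available from the shell with perl -V:uselargefiles.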

Other than that... the usual tricks. e.g., if your filesystem
supports "holes" and you are writing bunches of binary zeroes,
use seek to position to the new location rather than
writing the zeroes: systems that support holes often do not
convert blocks of zeroes to holes, and instead require
repositioning to accomplish it. This isn't a trick specific
to very large files, but it's hard to put a large hole in a
small file ;-)
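
A sketch of the repositioning trick, assuming a temporary file is acceptable for the demonstration; the sub name write_after_gap is made up. Whether the skipped range actually becomes a hole depends on the filesystem.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);
use File::Temp qw(tempfile);

# Skip $gap bytes with seek instead of writing zeroes, then write $data.
# Returns the resulting file size as reported by -s.
sub write_after_gap {
    my ($path, $gap, $data) = @_;
    open my $fh, '>', $path or die "Can't create $path: $!";
    binmode $fh;
    seek($fh, $gap, SEEK_SET) or die "Can't seek in $path: $!";
    print {$fh} $data or die "Can't write to $path: $!";
    close $fh or die "Can't close $path: $!";
    return -s $path;
}

my (undef, $path) = tempfile(UNLINK => 1);
print write_after_gap($path, 1_000_000, "end"), "\n";
```

The reported size includes the skipped range; the space actually allocated on disk may be much smaller if the filesystem supports holes.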
Anno Siegel (01-05-04, 03:20 PM)
seansan <sean103> wrote in comp.lang.perl.misc:
[..]
> switch between OUTPUT files? (remember I have to write several lines
> to file A or B, until the next 010 line is encountered) and last, -
> What considerations should I make for working with 3-4 GB files?


First off, make "\n010" the input record separator. Then each "line"
will essentially contain one chunk of data.

Then loop over the chunks, determine the output file for each, and
print it out.

There will be a certain skew since each chunk contains the initial bit
of the *following* record (if any). There will also be a spurious record
before the first one. The code below tries to take that into account, but
these things are *never* correct on the first try, so get yourself a
smallish test file and debug it. Untested:

open my $in, '<', $infile or die "Can't read $infile: $!";
open my $out1, '>', $outfile1 or die "Can't create $outfile1: $!";
open my $out2, '>', $outfile2 or die "Can't create $outfile2: $!";

$/ = "\n010";
<$in>; # discard spurious "first" record
while ( <$in> ) {
    # there are length( $/) characters missing from the beginning
    my $tag = substr( $_, 1235 - length($/), 3);
    # decide which output file to use (pseudocode)
    my $out = $tag =~ /.../ ? $out1 : $out2;
    print {$out} $/, $_; # add missing record separator
}
# add final linefeeds
print $out1, "\n";
print $out2, "\n";

Anno
Anno Siegel (01-05-04, 03:33 PM)
seansan <sean103> wrote in comp.lang.perl.misc:
[..]
> switch between OUTPUT files? (remember I have to write several lines
> to file A or B, until the next 010 line is encountered) and last, -
> What considerations should I make for working with 3-4 GB files?


First off, make "\n010" the input record separator. Then each "line"
will essentially contain one chunk of data.

Then loop over the chunks, determine the output file for each, and
print it out.

There will be a certain skew since each chunk contains the initial bit
of the *following* record (if any). There will also be a spurious record
before the first one. The code below tries to take that into account, but
these things are *never* correct on the first try, so get yourself a
smallish test file and debug it. Untested:

open my $in, '<', $infile or die "Can't read $infile: $!";
open my $out1, '>', $outfile1 or die "Can't create $outfile1: $!";
open my $out2, '>', $outfile2 or die "Can't create $outfile2: $!";

$/ = "\n010";
<$in>; # discard spurious "first" record
while ( <$in> ) {
    chomp; # remove record separator
    # there are length( $/) characters missing from the beginning
    my $tag = substr( $_, 1235 - length($/), 3);
    # decide which output file to use (pseudocode)
    my $out = $tag =~ /.../ ? $out1 : $out2;
    print {$out} $/, $_; # add missing record separator to previous entry
}
# add final linefeeds
print $out1, "\n";
print $out2, "\n";
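
One possible variant of the record-separator approach, sketched on made-up data with in-memory filehandles. It differs from the code above in a few ways that are assumptions of this sketch, not part of the original: it keeps the first chunk (if the file begins directly with a 010 line, the first read returns that whole record), it puts the "010" back at the front of later chunks, and it restores the newline that chomp removes. The names split_chunks and $choose are hypothetical, and each 010 line is assumed to be at least 1237 characters long.

```perl
use strict;
use warnings;

# $choose is a hypothetical callback: true means "write to file A".
sub split_chunks {
    my ($in, $out_a, $out_b, $choose) = @_;
    local $/ = "\n010";                        # one read per record set
    my $first = 1;
    while (my $chunk = <$in>) {
        # chomp strips a trailing "\n010"; give the record its newline back.
        $chunk .= "\n" if chomp $chunk;
        # Every chunk but the first lost its leading "010" to the previous read.
        $chunk = "010$chunk" unless $first;
        $first = 0;
        next unless $chunk =~ /^010/;          # junk before the first record set
        my $tag = length($chunk) >= 1237 ? substr($chunk, 1234, 3) : '';
        my $out = $choose->($tag) ? $out_a : $out_b;
        print {$out} $chunk or die "Write failed: $!";
    }
    return;
}

# Made-up sample data: two record sets with tags AAA and BBB at column 1235.
my $pad  = 'x' x 1231;
my $data = "010${pad}AAA\n020 detail\n010${pad}BBB\n030 detail\n";
open my $in, '<', \$data      or die "Can't open in-memory input: $!";
open my $oa, '>', \my $buf_a  or die "Can't open in-memory output: $!";
open my $ob, '>', \my $buf_b  or die "Can't open in-memory output: $!";
split_chunks($in, $oa, $ob, sub { $_[0] eq 'AAA' });
close $_ for $oa, $ob;
printf "A got %d bytes, B got %d bytes\n", length $buf_a, length $buf_b;
```

Each read holds one whole record set in memory, which is fine as long as individual record sets stay small even when the file is several GB.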

Anno
gnari (01-05-04, 03:43 PM)
"Anno Siegel" <anno4000> wrote in message
news:n7v2
> seansan <sean103> wrote in comp.lang.perl.misc:


[snipped problem and proposed solution]

> # add final linefeeds
> print $out1, "\n";
> print $out2, "\n";


skip the commas
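
A small illustration of the difference, using an in-memory filehandle so it runs without touching any files (the variable names are made up):

```perl
use strict;
use warnings;

open my $out, '>', \my $buf or die "Can't open in-memory file: $!";

print $out "to the handle\n";          # no comma: $out is the filehandle
print {$out} "also to the handle\n";   # braces make the filehandle explicit

# print $out, "\n";
# With the comma, $out is just the first item of the list, so this would
# print something like "GLOB(0x...)" and a newline to STDOUT instead.

close $out or die "Can't close in-memory file: $!";
print $buf;
```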

gnari
Anno Siegel (01-05-04, 04:51 PM)
gnari <gnari> wrote in comp.lang.perl.misc:
> "Anno Siegel" <anno4000> wrote in message
> news:n7v2
> [snipped problem and proposed solution]


Ugh, yes. Thanks.

Anno