
Martijn Dekker (06-26-19, 02:39 AM)
Consider:

$ printf '%s\n' one two three | tac -b

According to 'man tac':
-b, --before: attach the separator before instead of after

So I would expect this output ('\n' is a real linefeed):

\nthree\ntwo\none

Instead, the output is:

\n\nthree\ntwoone

What use is that?

The behaviour seems to be as documented in "info coreutils 'tac
invocation'":

`-b'
`--before'
The separator is attached to the beginning of the record that it
precedes in the file.

But, as we see above, that creates a double separator at the beginning
and two unseparated records at the end.

Can anyone think of a valid use case for that, or is this a bug?

- M.
Janis Papanagnou (06-26-19, 09:09 AM)
On 26.06.2019 02:39, Martijn Dekker wrote:
[..]
> But, as we see above, that creates a double separator at the beginning and two
> unseparated records at the end.
> Can anyone think of a valid use case for that, or is this a bug?


Looks like a bug to me. (And I see no usecase for that.)

On my system I see another strange behaviour...

$ printf $'one\ntwo\nthree\n' | od -c
0000000 o n e \n t w o \n t h r e e \n
0000016
$ printf $'one\ntwo\nthree\n' | tac | od -c
0000000

Obviously tac creates no output if sent to a pipe. - What am I missing?

And...

$ printf $'one\ntwo\nthree\n' | tac -b

three
$ printf $'one\ntwo\nthree\n' | tac -b | od -c
0000000

Janis
Keith Thompson (06-26-19, 11:58 AM)
Janis Papanagnou <janis_papanagnou> writes:
[...]
> On my system I see another strange behaviour...
> $ printf $'one\ntwo\nthree\n' | od -c
> 0000000 o n e \n t w o \n t h r e e \n
> 0000016
> $ printf $'one\ntwo\nthree\n' | tac | od -c
> 0000000


I don't see that on my system (Ubuntu 18.04, coreutils 8.28):

$ printf $'one\ntwo\nthree\n' | od -c
0000000 o n e \n t w o \n t h r e e \n
0000016
$ printf $'one\ntwo\nthree\n' | tac | od -c
0000000 t h r e e \n t w o \n o n e \n
0000016
$

> Obviously tac creates no output if sent to a pipe. - What am I missing?


I don't know.

> And...
> $ printf $'one\ntwo\nthree\n' | tac -b
> three
> $ printf $'one\ntwo\nthree\n' | tac -b | od -c
> 0000000


$ printf $'one\ntwo\nthree\n' | tac -b

three
twoone$ printf $'one\ntwo\nthree\n' | tac -b | od -c
0000000 \n \n t h r e e \n t w o o n e
0000016

In the "printf ... | tac -b" case, I wonder if some of the output
is being clobbered by your shell prompt. Try adding " ; sleep 3"
to delay the prompt.
John-Paul Stewart (06-26-19, 06:24 PM)
On 2019-06-25 8:39 p.m., Martijn Dekker wrote:
> Consider:
> $ printf '%s\n' one two three | tac -b
> According to 'man tac':
> -b, --before: attach the separator before instead of after
> So I would expect this output ('\n' is a real linefeed):
>     \nthree\ntwo\none
> Instead, the output is:
>     \n\nthree\ntwoone
> What use is that? [snip]
> Can anyone think of a valid use case for that, or is this a bug?


I think it is just that the documentation is unclear that the -b option
affects how tac splits its input into records. Also note that tac
doesn't add or remove separators, it strictly re-orders input.

The original input was "one\ntwo\nthree\n", which 'tac -b' sees as four
input records: (no separator) "one", "\ntwo", "\nthree", and "\n" (just
a separator). Reverse the order of those, and you get exactly the
output you saw. If you change the input to "\none\ntwo\nthree" with a
leading newline instead, then you'll get the expected output.

To really see what's going on, try all of these strings:

"one\ntwo\nthree\n"
"one\ntwo\nthree"
"\none\ntwo\nthree"

Run each through both 'tac -b' and plain 'tac' and you'll see what I
mean about it not adding/removing separators and that the -b affects how
the input is broken into records, not the output.
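
A minimal sketch of that experiment, assuming GNU tac and od are on the PATH. The helper name show_tac is made up for illustration; it dumps each string byte by byte so that the linefeeds are visible:

```shell
# show_tac is a hypothetical helper; its argument may contain \n escapes,
# which printf '%b' expands into real linefeeds.
show_tac() {
    echo "== input"
    printf '%b' "$1" | od -c
    echo "== tac"
    printf '%b' "$1" | tac | od -c
    echo "== tac -b"
    printf '%b' "$1" | tac -b | od -c
}

# The three strings suggested above.
for s in 'one\ntwo\nthree\n' 'one\ntwo\nthree' '\none\ntwo\nthree'; do
    show_tac "$s"
done
```

In every dump the same bytes come out that went in; only the order of the records changes.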
Janis Papanagnou (06-26-19, 08:54 PM)
On 26.06.2019 09:09, Janis Papanagnou wrote:
> On 26.06.2019 02:39, Martijn Dekker wrote:
> Looks like a bug to me. (And I see no usecase for that.)


Thinking about it again, one thought came to my mind (I'm not sure it
makes sense, technically or otherwise): what about processing languages
that are written from right to left? Would they have (in computer
processing) their line terminators on the left side of each line too?
If so, then that could be an application case.

$ printf $'\nthree\ntwo\none' | tac -b

would create \none, \ntwo, and \nthree, and adding another tac -b

$ printf $'\nthree\ntwo\none' | tac -b | tac -b

would re-create the original sequence.
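
A quick way to check that round trip, as a sketch assuming GNU tac. The helper name roundtrip_ok is hypothetical; it compares od dumps so that leading and trailing linefeeds are not lost in command substitution:

```shell
# roundtrip_ok is a made-up helper: succeeds if running the input through
# 'tac -b' twice reproduces it byte for byte.
roundtrip_ok() {
    orig=$(printf '%b' "$1" | od -c)
    twice=$(printf '%b' "$1" | tac -b | tac -b | od -c)
    [ "$orig" = "$twice" ]
}

roundtrip_ok '\nthree\ntwo\none' && echo "round trip preserved"
```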

Janis
Stephane Chazelas (06-28-19, 12:00 PM)
2019-06-26 02:39:11 +0200, Martijn Dekker:
[...]
> Can anyone think of a valid use case for that, or is this a bug? [...]


Can be useful in things like:

~$ printf 'header: %s\n' foo bar | tac -s 'header: '
bar
foo
header: header: <no-eol>
~$ printf 'header: %s\n' foo bar | tac -s 'header: ' -b
header: bar
header: foo
Grant Taylor (06-29-19, 02:00 AM)
On 6/28/19 4:00 AM, Stephane Chazelas wrote:
> Can be useful in things like:
> ~$ printf 'header: %s\n' foo bar | tac -s 'header: '
> bar
> foo
> header: header: <no-eol>
> ~$ printf 'header: %s\n' foo bar | tac -s 'header: ' -b
> header: bar
> header: foo


Intriguing.

Thank you for sharing Stephane.
Kaz Kylheku (06-29-19, 02:52 AM)
On 2019-06-28, Stephane Chazelas <stephane.chazelas> wrote:
[..]
> ~$ printf 'header: %s\n' foo bar | tac -s 'header: ' -b
> header: bar
> header: foo


You've cracked the nut.

But your code doesn't serve as a good motivating example; it is
confounded by the fact that for the given line-oriented data sample, we
would be better off working with the default newline separator as a
record terminator:

$ printf 'header: %s\n' foo bar | tac
header: bar
header: foo

Perhaps the following is a clearer demonstration of your finding:

$ echo -n 'a:b:c:' | tac -s ':' -b ; echo
::c:ba

Huh? That looks like a bug?! But, what if we have record start
symbols rather than terminator symbols:

$ echo -n ':a:b:c' | tac -s ':' -b ; echo
:c:b:a

Aha!

Thus the ::c:ba output is explained like this: in a:b:c: the colon is
treated as leader, and so the data is considered to be:

<missing-leader> a <leader> b <leader> c <leader> <empty-record>

Thus we get the output:

<leader> <empty-record> <leader> c <leader> b <missing-leader> a

This could benefit from being better documented. The "separator"
terminology is flawed, for starters.
Martijn Dekker (07-02-19, 02:03 AM)
Thanks to everyone who responded in this thread, particularly John-Paul
Stewart:

> I think it is just that the documentation is unclear that the -b option
> affects how tac splits its input into records. Also note that tac
> doesn't add or remove separators, it strictly re-orders input.


That's the essential bit of insight I needed in a nutshell. Stéphane and
Kaz also gave very useful examples.

With the help of all your responses I've created a new cross-platform
'tac' implementation, in shell and awk, as a module for the modernish
shell library so it can be used on any POSIX system, not just where GNU
coreutils are available (note that installing modernish does not require
a compiler and you can install it in your home directory on multi-user
systems). Modernish 'tac' acts identically to GNU 'tac' with the
examples you all gave in this thread, as long as the input is text.

Brief documentation (hopefully better than the GNU one):


View the code (the awk code is the interesting bit):


I've added a couple of new options to this 'tac':
* -B is like -b except it acts like I originally expected -b to act:
it expects separators to follow records in the input, but makes them
precede records in the output. I can't quite figure out a concrete
use case for that, perhaps one of you can think of some. :)
* -P is for paragraph mode: output a text last paragraph first, with
paragraphs separated by at least two linefeeds. (This is easy to do
in awk by using an empty record separator.)

One other important difference is that the '-r' option causes it to
accept an extended (as opposed to basic) regex as the separator; this is
inevitable as awk can't parse basic regexes.

It would be fun if some of you could try to break this and post your
findings.

To get started quickly, get the current modernish development code and
make yourself a little wrapper script:

git clone
modernish/install.sh -n
cat > ~/bin/mtac <<'EOF'
#! /usr/bin/env modernish
#! use sys/base/tac
tac "$@"
EOF
chmod +x ~/bin/mtac

Now you have an 'mtac' command in ~/bin that makes it easy to test this
implementation.

To get rid of modernish again:

modernish/uninstall.sh -fn

Thanks,

- M.
Stephane Chazelas (07-02-19, 06:40 PM)
2019-07-02 02:03:35 +0200, Martijn Dekker:
[...]
> With the help of all your responses I've created a new cross-platform 'tac'
> implementation, in shell and awk, as a module for the modernish shell
> library so it can be used on any POSIX system, not just where GNU coreutils
> are available (note that installing modernish does not require a compiler
> and you can install it in your home directory on multi-user systems).
> Modernish 'tac' acts identically to GNU 'tac' with the examples you all gave
> in this thread, as long as the input is text. [...]


Note that while "tac" is a GNU-specific command, most other
systems have "tail -r" to achieve the same effect (though some
have limitations for non-seekable input IIRC).
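
A rough sketch of a wrapper built on that idea. The function name reverse_lines is made up here, and it assumes that a system without tac has a tail that accepts -r:

```shell
# reverse_lines is a hypothetical wrapper: prefer tac where it exists,
# otherwise fall back to 'tail -r' (assumed to be available on such systems).
reverse_lines() {
    if command -v tac >/dev/null 2>&1; then
        tac "$@"
    else
        tail -r "$@"
    fi
}

printf '1\n2\n3\n' | reverse_lines
```

This only covers plain line reversal; options such as -b, -r and -s have no tail -r equivalent.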
Ed Morton (07-02-19, 08:08 PM)
On 7/1/2019 7:03 PM, Martijn Dekker wrote:
[..]
>
> View the code (the awk code is the interesting bit):
>


To be able to handle large files you'd be better off using:

cat -n | sort -rn | cut -f2-

to reverse the line order than what you're currently doing, which is
reading the whole input into a string and then splitting that string
into an array, thereby requiring more than twice the memory of the input
file size.

For example you can do this to reverse a file:

$ seq 5 | cat -n | sort -rn | cut -f2-
5
4
3
2
1

so I'd build a solution around that process of adding line numbers then
using sort (to do the heavy lifting since it's designed to handle huge
files and can do paging, etc. as necessary to handle them) to reorder
them, then cut to remove the line numbers you added at the start.

It's easy with awk to convert paragraphs to/from lines when necessary if
that's a concern, e.g.:

#####
$ cat file
Wee, sleekit, cowrin, tim'rous beastie,
O, what a panic's in thy breastie!
Thou need na start awa sae hasty,
Wi' bickering brattle!
I wad be laith to rin an' chase thee,
Wi' murd'ring pattle!

I'm truly sorry man's dominion,
Has broken nature's social union,
An' justifies that ill opinion,
Which makes thee startle
At me, thy poor, earth-born companion,
An' fellow-mortal!

#####

$ awk -v RS= '{gsub(/@/,"@A"); gsub(/#/,"@B"); gsub(ORS,"#")}1' file |
cat -n | sort -rn | cut -f2- |
awk -v ORS='\n\n' '{gsub(/#/,RS); gsub(/@B/,"#"); gsub(/@A/,"@")}1'
I'm truly sorry man's dominion,
Has broken nature's social union,
An' justifies that ill opinion,
Which makes thee startle
At me, thy poor, earth-born companion,
An' fellow-mortal!

Wee, sleekit, cowrin, tim'rous beastie,
O, what a panic's in thy breastie!
Thou need na start awa sae hasty,
Wi' bickering brattle!
I wad be laith to rin an' chase thee,
Wi' murd'ring pattle!

#####

Obviously you could add the NR with awk but I left the cat -n in as I
like the consistency of always handling `cat -n | sort -rn | cut -f2-`
the same way regardless of other things you're doing. You can also do
whatever else you need to do wrt other tac options using awk outside of
that main line-sorting pipeline.

Regards,

Ed.
[..]
Chris Elvidge (07-02-19, 08:13 PM)
On 02/07/2019 19:08, Ed Morton wrote:
[..]
> that main line-sorting pipeline.
> Regards,
> Ed.


How many UUOC can you get in one post?
Ed Morton (07-02-19, 10:26 PM)
On 7/2/2019 1:13 PM, Chris Elvidge wrote:
> On 02/07/2019 19:08, Ed Morton wrote:
> How many UUOC can you get in one post?


You didn't see any in my post but I'd expect someone could create a post
with many of them if they wanted to. If you think any of the `cat`s in
my post are useless then you misunderstood what they're doing and why -
feel free to ask if you have questions.

Ed.
Martijn Dekker (07-03-19, 04:46 PM)
Op 02-07-19 om 20:08 schreef Ed Morton:
> To be able to handle large files you'd be better off using:
>     cat -n | sort -rn | cut -f2-
> to reverse the line order than what you're currently doing which is
> reading the whole input into a string then splitting that string into an
> array thereby requiring more than twice the memory of the input file size.


Thanks. I did think about that. Unfortunately I can't think of another
way of handling arbitrary (and arbitrarily varying) separators that are
matched by a regular expression, while remembering each individual
separator, as GNU tac does. Your idea only works for lines of text.

(Also, 'cat -n' (count lines) is not POSIX -- though its functionality
can easily be duplicated in awk.)

If only POSIX awk supported a regex RS, like GNU awk does, then this
wouldn't be a problem... Instead, I have to abuse FS to be able to match
regex separators, which involves reading the entire document and then
splitting it as one 'field'.
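
For comparison, a sketch of what a regex RS makes possible. It works only with GNU awk, relies on its RT extension (the text that matched RS for the current record), and is not the modernish code; gawk_tac is a made-up name:

```shell
# gawk_tac REGEX: hypothetical helper, GNU awk only.
# Stores each record together with the separator that terminated it,
# then prints the pairs in reverse order.
gawk_tac() {
    gawk -v RS="$1" '
        { rec[NR] = $0; sep[NR] = RT }
        END {
            for (i = NR; i >= 1; i--)
                printf "%s%s", rec[i], sep[i]
        }'
}

# Example (prints oneXtwoXXthreeXXXfourXXXX, without a trailing newline):
#   printf 'fourXXXXthreeXXXtwoXXoneX' | gawk_tac 'XX*'
```

Note that gawk treats a single-character RS as a literal character, not a regex, and that this still holds all records in memory.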

Maybe it would be better to add another awk script for use with a
single-character, non-regex separator. This case can be handled
straightforwardly with RS and one array, so there's no need for two
copies of the document to exist in memory.
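
Something along these lines, as an untested-in-the-field sketch rather than the eventual modernish code. The name char_tac is hypothetical, and it assumes every record, including the last, is followed by the separator:

```shell
# char_tac SEP: hypothetical helper for one literal separator character.
# Each record is stored once in an array; ORS puts the separator back
# after every record on output.
char_tac() {
    awk -v RS="$1" -v ORS="$1" '
        { rec[NR] = $0 }
        END { for (i = NR; i >= 1; i--) print rec[i] }'
}

printf 'a:b:c:' | char_tac ':'    # prints c:b:a:
```

If the input does not end with the separator, this adds one after the last record, so it is not byte-identical to GNU tac in that case. Characters that are special in an awk -v assignment (such as a backslash) would need extra care.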

[re: paragraph mode]
> It's easy with awk to convert paragraphs to/from lines when necessary
> if that's a concern, e.g.: [...]
> $ awk -v RS= '{gsub(/@/,"@A"); gsub(/#/,"@B"); gsub(ORS,"#")}1' file |
> cat -n | sort -rn | cut -f2- |
> awk -v ORS='\n\n' '{gsub(/#/,RS); gsub(/@B/,"#"); gsub(/@A/,"@")}1'


I'm guessing the point of this is that none of the utilities in this
pipeline need to hold a large file entirely in working memory (with
'sort' doing its own paging where needed, as you pointed out).

On the other hand, for smaller files, it comes at the cost of invoking
multiple processes instead of just the one. Although I could reduce it
to three by eliminating 'cat -n' and making awk do that work.
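
A sketch of that three-process variant, with awk doing the numbering in place of 'cat -n'. The name rev_lines is made up for illustration:

```shell
# rev_lines: hypothetical helper. awk prefixes each line with its number and
# a tab, sort orders by that number descending, cut strips the prefix again.
rev_lines() {
    awk '{ printf "%d\t%s\n", NR, $0 }' |
        sort -k1,1nr |
        cut -f2-
}

printf 'a\nb\nc\n' | rev_lines
```

Tabs inside the lines survive, since cut -f2- keeps everything after the first tab.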

Thanks for the ideas -- they did give me stuff to think about.

- M.
Martijn Dekker (07-03-19, 05:20 PM)
Below, gtac = GNU tac, mtac = modernish tac.

$ printf 'fourXXXXthreeXXXtwoXXoneX' | gtac -r -s 'XX*'; echo
oneXXtwoXXXthreeXXXXfourX
$ printf 'fourXXXXthreeXXXtwoXXoneX' | mtac -r -s 'XX*'; echo
oneXtwoXXthreeXXXfourXXXX

With -b (separator precedes record in both input and output):

$ printf 'XXXXfourXXXthreeXXtwoXone' | gtac -b -r -s 'XX*'; echo
XoneXtwoXXthreeXXXfourXXX
$ printf 'XXXXfourXXXthreeXXtwoXone' | mtac -b -r -s 'XX*'; echo
XoneXXtwoXXXthreeXXXXfour

The regex 'XX*' means: match one or more Xes, in both basic and extended
regular expressions, right?

So, mtac does what I would expect, and the gtac output smells like some
off-by-one bug in GNU tac to me.

Am I missing something?

_________
As an aside, here's something GNU tac can't do: separator follows record
in input, precedes record in output.

$ printf 'fourXXXXthreeXXXtwoXXoneX' | mtac -B -r -s 'XX*'; echo
XoneXXtwoXXXthreeXXXXfour

- M.