vfhhutt45hghjjdewrghb (12-24-18, 06:39 PM)
I'm not sure whether this is a bug, a feature, or whatever, or whether this is the correct group. Just for the archive:

In my script I create a file and offer it for download. But the downloaded file had the UTF-8 byte-order mark (BOM) twice at the beginning:
> EF BB BF EF BB BF ...


The code was:
<?php
include 'dbconfig.php';
//... no output etc.
header('Content-type: text/xml; charset=utf-8');
header('Content-Disposition: attachment; filename="' . $titel . '.gpx"');
print "Hello";
exit();
?>

dbconfig.php does not output any blanks etc.; it contains only:
<?php
$db_server = "xx";
$db_user = "xx";
?>

I figured out the problem: dbconfig.php was saved as UTF-8 with one BOM at the beginning. It seems that this BOM was output by the include. After I deleted the first 3 bytes from dbconfig.php, everything works fine and only one BOM is in the downloaded file.
Tim Streater (12-24-18, 06:49 PM)
In article <82a7cd23-82be-4611-a691-e0edcf8c30d6>,
<vfhhutt45hghjjdewrghb> wrote:

[..]
>beginning. It seems that this BOM was output by the include. After I deleted the
>first 3 bytes from dbconfig.php, everything works fine and only one BOM is in
>the downloaded file.


If it's UTF-8 you don't need *any* BOM in the file.
Luuk (12-24-18, 08:07 PM)
On 24-12-2018 17:49, Tim Streater wrote:
> In article <82a7cd23-82be-4611-a691-e0edcf8c30d6>,
> <vfhhutt45hghjjdewrghb> wrote:
> If it's UTF-8 you don't need *any* BOM in the file.


Also, having two BOMs should not make the file invalid, unless
there's some CRC code to check if it's valid (probably only in executable
files?)

Back to the subject:
I do not think this is a bug in PHP, but a 'bug' in your application.
Everything PHP has done is done according to definition.

BTW: The only thing PHP can do is add this to their docs (if it's not
already in the docs): YOU should make sure that none of your PHP files
starts with a BOM
J.O. Aho (12-24-18, 09:54 PM)
On 12/24/18 7:07 PM, Luuk wrote:
> On 24-12-2018 17:49, Tim Streater wrote:
> Also, having two BOMs should not make the file invalid, unless
> there's some CRC code to check if it's valid (probably only in executable
> files?)
> Back to the subject:
> I do not think this is a bug in PHP, but a 'bug' in your application.
> Everything PHP has done is done according to definition.
> BTW: The only thing PHP can do is add this to their docs (if it's not
> already in the docs): YOU should make sure that none of your PHP files
> starts with a BOM

I would suggest that the BOM is the bug and shouldn't have been accepted
as part of the UTF-8 standard.
Luuk (12-24-18, 10:15 PM)
On 24-12-2018 20:54, J.O. Aho wrote:
> On 12/24/18 7:07 PM, Luuk wrote:
> I would suggest that the BOM is the bug and shouldn't have been accepted
> as part of the UTF-8 standard.


But even then it's still not a bug in PHP,
just a nasty side effect
Arno Welzel (12-26-18, 10:13 AM)
Luuk:

[...]
> Back to the subject:
> I do not think this is a bug in PHP, but a 'bug' in your application.
> Everything PHP has done is done according to definition.


I think it's a bug.

PHP should NOT output anything unless told so. If there are two files
which only contain "<?php" as the very first content, there should not
be any output at all. And even an echo statement should not output any
BOM, just because the file itself contains it.
Luuk (12-26-18, 01:41 PM)
On 26-12-2018 09:13, Arno Welzel wrote:
> Luuk:
> [...]
> I think it's a bug.
> PHP should NOT output anything unless told so. If there are two files
> which only contain "<?php" as the very first content, there should not
> be any output at all. And even an echo statement should not output any
> BOM, just because the file itself contains it.


PHP is not outputting anything...

It's because you can combine HTML and PHP in one file like:
<?php
echo "2";
?>
+3
<?php
echo "=5";
?>

which outputs:
2+3
=5

If you create a PHP file with only these contents:
<?php

Then nothing will be output, unless....
There's a BOM before the '<?php'

You can test this using this:
$ echo -e '\xef\xbb\xbf<?php' >bom.php
$ file bom.php
bom.php: PHP script, UTF-8 Unicode (with BOM) text
$ php bom.php | hexdump -C
00000000 ef bb bf |...|
00000003

again, it's not PHP

also read: 'Potential issues with the UTF-8 BOM' on this page:


which states: "You should ensure that the included files do not start
with a BOM."
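
The same check can be reproduced from within PHP itself. What follows is only a sketch, assuming a default PHP configuration; the temporary file and the variable names are made up for illustration. It writes a script whose first three bytes are the BOM, includes it under output buffering, and prints whatever leaked out as hex:

```php
<?php
// Write a throwaway script whose first three bytes are the UTF-8 BOM.
$path = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($path, "\xEF\xBB\xBF<?php\n\$db_server = 'xx';\n");

// Capture everything the include sends to the output stream.
ob_start();
include $path;
$leaked = ob_get_clean();
unlink($path);

// The BOM sits before the opening tag, so it is passed through as inline text.
echo bin2hex($leaked), "\n"; // efbbbf
```

The included file contains no echo or print at all; the three bytes come purely from the text in front of the opening tag.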
Arno Welzel (12-26-18, 04:32 PM)
Luuk:

[...]
> If you create an empty php-file with these contents:
> <?php
> Then nothing will be output, unless....
> There's a BOM before the '<?php' [...]
> also read: 'Potential issues with the UTF-8 BOM' on this page:
>
> which states: "You should ensure that the included files do not start
> with a BOM."


Thanks for the clarification.
J.O. Aho (12-26-18, 04:41 PM)
On 12/26/18 9:13 AM, Arno Welzel wrote:
> Luuk:
> [...]
> I think it's a bug.
> PHP should NOT output anything unless told so. If there are two files
> which only contain "<?php" as the very first content, there should not
> be any output at all. And even an echo statement should not output any
> BOM, just because the file itself contains it.


As long as it begins with <?php it will not output anything, but the BOM
is put before the <?php and so you will have output which is outside the
PHP engine's control.
vfhhutt45hghjjdewrghb (01-04-19, 06:29 PM)
On Wednesday, 26 December 2018 at 15:41:06 UTC+1, J.O. Aho wrote:
> As long as it begins with <?php it will not output anything, but the BOM
> is put before the <?php and so you will have output which is outside the
> PHP engine's control.


I understand the modus operandi but I think it's not a good way. UTF-8 should be used as a norm. And the BOM is part of a UTF-8 file. In other cases (without BOM) it is an ISO 8859-1 file or anything else.

I'm surprised that not everybody has this problem. Including the DB connection data is a normal procedure for security reasons and for usability.

Maybe php should filter the BOM if "include" is used?!
Richard Damon (01-05-19, 04:08 AM)
On 1/4/19 11:29 AM, vfhhutt45hghjjdewrghb wrote:
> On Wednesday, 26 December 2018 at 15:41:06 UTC+1, J.O. Aho wrote:
> I understand the modus operandi but I think it's not a good way. UTF-8 should be used as a norm. And the BOM is part of a UTF-8 file. In other cases (without BOM) it is an ISO 8859-1 file or anything else.
> I'm surprised that not everybody has this problem. Including the DB connection data is a normal procedure for security reasons and for usability.
> Maybe php should filter the BOM if "include" is used?!


The BOM is NOT 'part of a UTF-8 File' in the sense that it is needed to
say that the file is encoded in UTF-8. The Unicode Standard permits the
BOM at the beginning of the file, but does not encourage it.

PHP, by its definition, says that the content of a file outside the
<?php and ?> tags is immediately output to the output stream, so that
includes any whitespace like the BOM.

This is one reason some coding guidelines say to not place the closing
?> on the file if it is all PHP code so you can't have any invisible
white space (like a new line) after it that might get output before the
script gets done with the things that have to be done before anything is
output.
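
The effect of the closing tag can be checked with a small experiment. This is a sketch only: capture_include_output() is a hypothetical helper written for this illustration, not part of PHP. Note that PHP swallows a single newline directly after ?>, so the sketch puts a blank line after the tag to make the leak visible:

```php
<?php
/** Include a throwaway script with the given source and return what it output. */
function capture_include_output(string $source): string
{
    $path = tempnam(sys_get_temp_dir(), 'inc');
    file_put_contents($path, $source);
    ob_start();
    include $path;
    $output = ob_get_clean();
    unlink($path);
    return $output;
}

// Closing tag followed by a blank line: the second newline is inline text.
$withTag = capture_include_output("<?php \$x = 1; ?>\n\n");

// No closing tag: everything up to the end of the file is code, nothing leaks.
$withoutTag = capture_include_output("<?php \$x = 1;\n\n");

echo strlen($withTag), ' ', strlen($withoutTag), "\n"; // 1 0
```

Leaving out the closing tag protects only the end of the file; anything in front of the opening <?php, such as a BOM, is still output.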
J.O. Aho (01-05-19, 12:04 PM)
On 1/4/19 5:29 PM, vfhhutt45hghjjdewrghb wrote:
> On Wednesday, 26 December 2018 at 15:41:06 UTC+1, J.O. Aho wrote:
>> As long as it begins with <?php it will not output anything, but the BOM
>> is put before the <?php and so you will have output which is outside the
>> PHP engine's control.

> I understand the modus operandi but I think it's not a good way. UTF-8 should be used as a norm. And the BOM is part of a UTF-8 file. In other cases (without BOM) it is an ISO 8859-1 file or anything else.


The BOM was just an addition for issues in some lesser operating
systems; in newer versions like UTF-16, the BOM has been removed from the
standard.

> I'm surprised that not everybody has this problem. Including the DB connection data is a normal procedure for security reasons and for usability.


Not everyone uses a lesser OS and has BOM enabled in text editors.

> Maybe php should filter the BOM if "include" is used?!


PHP does not know what your intention is with the file; say you are
generating some data inside a UTF-8 BOM document, then you don't want it
to be stripped.

Just do the sensible thing and say no to BOM as there are other ways to
identify your file as UTF-8, and if you don't use UTF-8 characters in
your text, then it's just plain ASCII and has no use for a BOM anyway.
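
If the editor cannot be told to stop writing a BOM, the three bytes can also be removed after the fact. The following is a minimal sketch, not a definitive tool; UTF8_BOM, strip_utf8_bom() and remove_bom_from_file() are names invented for this example:

```php
<?php
// The three bytes that form the UTF-8 encoding of U+FEFF.
const UTF8_BOM = "\xEF\xBB\xBF";

/** Return $text without a leading UTF-8 BOM, if one is present. */
function strip_utf8_bom(string $text): string
{
    return strncmp($text, UTF8_BOM, 3) === 0 ? substr($text, 3) : $text;
}

/** Rewrite the file in place if it starts with a BOM; report whether it changed. */
function remove_bom_from_file(string $path): bool
{
    $contents = file_get_contents($path);
    if ($contents === false || strncmp($contents, UTF8_BOM, 3) !== 0) {
        return false; // unreadable, or no BOM to remove
    }
    return file_put_contents($path, substr($contents, 3)) !== false;
}
```

Calling remove_bom_from_file('dbconfig.php') once would have the same effect as deleting the first three bytes by hand. Since it overwrites the file, try it on a copy first.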
Richard Damon (01-05-19, 06:26 PM)
On 1/5/19 5:04 AM, J.O. Aho wrote:
> On 1/4/19 5:29 PM, vfhhutt45hghjjdewrghb wrote:
> The BOM was just an addition for issues in some lesser operating
> systems; in newer versions like UTF-16, the BOM has been removed from the
> standard.


To my knowledge, the Unicode Standard still defines the BOM, and it is
actually very important for formats like UTF-16 (but not needed for
UTF-16LE or UTF-16BE, where the endianness is provided by the encoding
definition).
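
To illustrate what the BOM tells a reader of the file, here is a rough sketch of BOM sniffing. The function name encoding_from_bom() and its return labels are made up for this example. The UTF-32 patterns are tested first because the little-endian UTF-32 BOM starts with the same two bytes as the little-endian UTF-16 BOM:

```php
<?php
/**
 * Guess the encoding from a leading byte-order mark.
 * Returns null when the data does not start with a known BOM.
 */
function encoding_from_bom(string $bytes): ?string
{
    // Longest patterns first: FF FE 00 00 would otherwise match UTF-16 LE.
    $signatures = [
        "\xFF\xFE\x00\x00" => 'UTF-32, little-endian',
        "\x00\x00\xFE\xFF" => 'UTF-32, big-endian',
        "\xEF\xBB\xBF"     => 'UTF-8',
        "\xFF\xFE"         => 'UTF-16, little-endian',
        "\xFE\xFF"         => 'UTF-16, big-endian',
    ];
    foreach ($signatures as $bom => $label) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $label;
        }
    }
    return null;
}

echo encoding_from_bom("\xFF\xFEh\x00"), "\n"; // UTF-16, little-endian
```

For UTF-8 the result carries no byte-order information at all; there it only acts as a signature.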
Tim Streater (01-05-19, 06:33 PM)
In article <2V4YD.191363$lx7.100121>, Richard Damon
<Richard> wrote:

>On 1/5/19 5:04 AM, J.O. Aho wrote:
>To my knowledge, the Unicode Standard still defines the BOM, and it is
>actually very important for formats like UTF-16 (but not needed for
>UTF-16LE or UTF-16BE, where the endianness is provided by the encoding
>definition).


Or UTF-8.
Luuk (01-05-19, 08:34 PM)
On 5-1-2019 17:33, Tim Streater wrote:
> In article <2V4YD.191363$lx7.100121>, Richard Damon
> <Richard> wrote:


Wikipedia is pretty clear about this:

Wikipedia:
In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file
or character stream to indicate the endianness (byte order) of all the
16-bit code units of the file or stream.

> Or UTF-8.


Wikipedia:
The Unicode Standard permits the BOM in UTF-8,[3] but does not require
or recommend its use.[4] Byte order has no meaning in UTF-8,[5] so its
only use in UTF-8 is to signal at the start that the text stream is
encoded in UTF-8, or that it was converted to UTF-8 from a stream that
contained an optional BOM. The standard also does not recommend removing
a BOM when it is there, so that round-tripping between encodings does
not lose information, and so that code that relies on it continues to
work.[6][7] The IETF recommends that if a protocol either (a) always
uses UTF-8, or (b) has some other way to indicate what encoding is being
used, then it "SHOULD forbid use of U+FEFF as a signature."[8]
