Basics of Unicode in Perl

Dave Rolsky

Character Sets

  • Mapping of numbers to characters
  • ASCII - 0-127
  • ISO-8859-1 (aka Latin-1) - 0-255
  • Unicode - 0-0x10FFFF (0-1,114,111)
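
A minimal sketch of the number-to-character mapping, using Perl's built-in ord and chr. The variable names are made up for this illustration.

```perl
use strict;
use warnings;

# ord() gives the code point of a character; chr() goes the other way.
my $ascii_code   = ord('A');       # 65, inside the ASCII range
my $latin1_char  = chr(0xE9);      # U+00E9, inside the Latin-1 range
my $unicode_char = chr(0x1F638);   # U+1F638, only available in Unicode

printf "%d U+%04X U+%04X\n",
    $ascii_code, ord($latin1_char), ord($unicode_char);
```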

Character Encoding

  • Mapping of byte patterns to characters
  • ASCII & ISO-8859-1 use a single byte per character
  • Unicode is not a character encoding!
  • UTF-8 and UTF-16 are variable-width encodings for the Unicode set; UTF-32 always uses 4 bytes per character
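
To make the set/encoding split concrete, here is a small sketch that encodes the same two characters three ways with the core Encode module and counts the bytes. The choice of sample characters and the %length_for hash are just for illustration.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Two sample characters: U+00E9 (in Latin-1) and U+1F638 (outside it).
my @chars = ( "\x{e9}", "\x{1f638}" );

# Byte length of each character under each encoding.
my %length_for;
for my $char (@chars) {
    for my $encoding (qw( UTF-8 UTF-16BE UTF-32BE )) {
        my $bytes = encode( $encoding, $char );
        $length_for{ ord($char) }{$encoding} = length($bytes);
        printf "U+%04X as %-8s = %d byte(s)\n",
            ord($char), $encoding, length($bytes);
    }
}
```

The code point stays the same in every row; only the byte pattern and its length change.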

Encoding vs Set Confusion

  • Often used interchangeably
  • Set is abstract
  • Encoding defines a concrete representation

Perl's Internals

  • Scalar contains bytes (0-255)
  • Bytes can be interpreted as UTF-8 characters
  • The "UTF-8 flag"
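
The flag can be inspected with the core utf8::is_utf8 function. This is only a sketch for seeing the internals; the flag is an implementation detail, and application code should not normally branch on it.

```perl
use strict;
use warnings;
use Encode qw( decode );

my $bytes = "\xF0\x9F\x98\xB8";          # four raw bytes
my $chars = decode( 'UTF-8', $bytes );   # the same data as one character

# utf8::is_utf8() reports whether the scalar's UTF-8 flag is set.
my $bytes_flag = utf8::is_utf8($bytes) ? 1 : 0;
my $chars_flag = utf8::is_utf8($chars) ? 1 : 0;

print "bytes: flag=$bytes_flag length=", length($bytes), "\n";
print "chars: flag=$chars_flag length=", length($chars), "\n";
```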

Bytes vs Characters


use strict;
use warnings;
use v5.16;
use Encode qw( decode );

my $bytes = join q{}, map { chr($_) } 240, 159, 152, 184;
say length $bytes; # 4

my $utf8 = decode('UTF-8', $bytes);
say length $utf8; # 1

binmode STDOUT, ':encoding(UTF-8)';
say $utf8;

Bytes vs Characters Output


$ perl code/bytes-vs-utf8
4
1
😸
        

decode and encode

  • decode - from any encoding to Perl's internal representation
  • encode - from Perl's internal representation to any encoding

When to decode and encode

  • Decode all incoming data
  • Encode all outgoing data
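
A minimal sketch of the decode-early, encode-late pattern. The $incoming value is a hypothetical stand-in for bytes read from a socket, file, or request body.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical incoming data: the UTF-8 bytes for "café".
my $incoming = "caf\xC3\xA9";

my $text     = decode( 'UTF-8', $incoming );   # bytes -> characters
my $upper    = uc $text;                       # work on characters
my $outgoing = encode( 'UTF-8', $upper );      # characters -> bytes

printf "%d bytes in, %d characters, %d bytes out\n",
    length($incoming), length($text), length($outgoing);
```

Everything between the decode and the encode sees characters, so functions such as uc and length behave as expected.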

Handle (File) I/O


open my $fh, '<:encoding(UTF-8)', $file or die "Cannot open $file: $!";
# Or, with File::Slurp:
my $content = read_file( $file, binmode => ':encoding(UTF-8)' );
        

use open ':encoding(UTF-8)';
        

use open ':std', ':encoding(UTF-8)';
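
A sketch of a full round trip through a file with explicit encoding layers. The temporary file from the core File::Temp module is there only to keep the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# A throwaway file, used only for this sketch.
my ( $tmp_fh, $file ) = tempfile( UNLINK => 1 );
close $tmp_fh;

# The output layer encodes characters to UTF-8 bytes on write.
open my $out, '>:encoding(UTF-8)', $file or die "Cannot write $file: $!";
print {$out} "\x{1F638}\n";
close $out or die "Cannot close $file: $!";

# The input layer decodes UTF-8 bytes back to characters on read.
open my $in, '<:encoding(UTF-8)', $file or die "Cannot read $file: $!";
my $line = <$in>;
close $in;
chomp $line;

printf "%d character(s) read from a %d-byte file\n", length($line), -s $file;
```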
        

Web Pages & Services


use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
my $response = $ua->get($url);
my $content = $response->decoded_content;
        

use JSON;
my $content = JSON->new->utf8->decode($json);    # expects UTF-8 bytes, returns characters
        
  • Caveat: decoded_content does not apply the charset decoding to every content type, so it may not decode the content the way you'd expect.
  • It does for anything matching m{^text/}, but most other types (application/json, for example) come back as undecoded bytes.
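
When the body does come back as bytes, the JSON decoder can do the decoding itself. This sketch swaps in the core JSON::PP module, which offers the same utf8 and decode methods as JSON. The $json_bytes value is a hypothetical response body.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical response body: JSON as UTF-8 bytes, as it arrives off the wire.
my $json_bytes = qq({"face":"\xF0\x9F\x98\xB8"});

# ->utf8 tells the decoder that its input is UTF-8 encoded bytes.
my $data = JSON::PP->new->utf8->decode($json_bytes);

printf "%d character(s), code point U+%04X\n",
    length( $data->{face} ), ord( $data->{face} );
```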

Databases


use DBI;
use DBD::Pg 3.0;
my $dbh = DBI->connect(...);
        

Unicode Characters in Your Code


use strict;
use warnings;
use v5.16;
my $bytes = "😸";
say length $bytes; # 4

use utf8;
my $utf8 = "😸";
say length $utf8; # 1
        

Unicode Characters in Your Code (Take Two)


use strict;
use warnings;
use v5.16;

my $utf8_by_code = "\x{1f638}";
say length $utf8_by_code; # 1

use charnames ':full';
my $utf8_by_name = "\N{GRINNING CAT FACE WITH SMILING EYES}";
say length $utf8_by_name; # 1
        

Regex Character Classes


use strict;
use warnings;
use v5.16;
use open ':std', ':encoding(UTF-8)';


my @strings = ( '12', "\x{ff11}\x{ff12}" );
for my $string (@strings) {
    if ( $string =~ /^\p{N}+$/ ) {
        say "Unicode Number $string";
    }

    if ( $string =~ /^\d+$/a ) {
        say "ASCII Number $string";
    }
}
        

Regex Character Classes Output


$ perl code/regex
Unicode Number 12
ASCII Number 12
Unicode Number １２
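
For comparison, a short sketch of what \d does with and without the /a modifier on the fullwidth digits. The variable names are made up for this illustration.

```perl
use strict;
use warnings;

my $fullwidth = "\x{ff11}\x{ff12}";   # FULLWIDTH DIGIT ONE, FULLWIDTH DIGIT TWO

# Without /a, \d follows Unicode rules and accepts any decimal digit.
my $plain_d = ( $fullwidth =~ /^\d+$/ ) ? 1 : 0;

# With /a, \d is restricted to the ASCII digits 0-9.
my $ascii_d = ( $fullwidth =~ /^\d+$/a ) ? 1 : 0;

print "plain \\d: $plain_d, \\d with /a: $ascii_d\n";
```

This is why validating input with a bare \d can let through digits that later code does not treat as numbers.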
        

Advanced Topics

  • Composing characters & normal forms
  • Sorting
  • Character properties
  • Unicode and fonts
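
As a taste of the first topic, here is a sketch using the core Unicode::Normalize module. It shows two spellings of the same visible character that compare as different until both are normalized. The variable names are only for this example.

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC NFD );

my $composed   = "\x{e9}";     # LATIN SMALL LETTER E WITH ACUTE
my $decomposed = "e\x{301}";   # "e" followed by COMBINING ACUTE ACCENT

# The raw strings differ; their normalized forms do not.
my $equal_raw = ( $composed eq $decomposed ) ? 1 : 0;
my $equal_nfc = ( NFC($composed) eq NFC($decomposed) ) ? 1 : 0;

printf "raw equal: %d, NFC equal: %d, NFD length: %d\n",
    $equal_raw, $equal_nfc, length( NFD($composed) );
```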