Jcode - Japanese Charset Handler
use Jcode;
#
# traditional
Jcode::convert(\$str, $ocode, $icode, "z");
# or OOP!
print Jcode->new($str)->h2z->tr($from, $to)->utf8;
Japanese documentation is now available as the Jcode::Nihongo manpage.
Jcode.pm supports both the object-oriented and the traditional approach. With the object approach, you can write:
$iso_2022_jp = Jcode->new($str)->h2z->jis;
Which is more elegant than:
$iso_2022_jp = $str;
&jcode::convert(\$iso_2022_jp, 'jis', &jcode::getcode(\$str), "z");
For those unfamiliar with objects, Jcode.pm still supports getcode() and convert().
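As a rough illustration of the two styles, the sketch below converts the same EUC-JP bytes to JIS both ways and compares the results. The sample string (日本語 written out as EUC-JP bytes) and the variable names are assumptions made for this example, and the input code 'euc' is passed explicitly only to skip auto-detection.

```perl
use strict;
use warnings;
use Jcode;

# Sample input (an assumption for this sketch): "日本語" as EUC-JP bytes.
my $euc = "\xC6\xFC\xCB\xDC\xB8\xEC";

# Object approach: build the object, then ask for the encoding you want.
my $jis_oop = Jcode->new($euc, 'euc')->jis;

# Traditional approach: convert a copy in place through a reference.
my $jis_trad = $euc;
Jcode::convert(\$jis_trad, 'jis', 'euc');

print $jis_oop eq $jis_trad ? "same result\n" : "results differ\n";
```

The traditional call works on a copy here because convert() changes the string it is handed a reference to.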
If the perl version is 5.8.1 or later, Jcode acts as a wrapper to Encode, the standard charset handler module for Perl 5.8 or later.
All methods mentioned here return the Jcode object unless otherwise noted.
For perl 5.8.1 or better, $icode can be any encoding name that Encode understands.
$j = Jcode->new($european, 'iso-latin1');
When the object is stringified, it returns the EUC-converted string, so you can write "print $j" instead of "print $j->euc".
Jcode->new(\$str);
This saves a little time, at the cost of the value of $str being converted in place. (In a way, $str is now ``tied'' to the Jcode object.)
# converts mailbox to SJIS format
my $jconv = new Jcode;
$/ = 00;
while(<>){
    print $jconv->set(\$_)->mime_decode->sjis;
}
new()
so you can go like;
In general, you can retrieve the string in a given encoding as $j->encoded, where "encoded" stands for the name of that encoding.
$j->h2z->jis
.
Hankaku Kanas are forcibly converted to Zenkaku.
For perl 5.8.1 and better, you can also use any encoding name or alias that Encode supports. For example:
$european = $j->iso_latin1; # replace '-' with '_' for names.
FYI: the Encode::Encoder manpage uses a similar trick.
fallback($fallback)
my $unistr = "\x{262f}"; # YIN YANG
my $j = jcode($unistr);  # $j->euc is '?'
You can change this behavior by specifying fallback like Encode.
Values are the same as Encode. Jcode::FB_PERLQQ, Jcode::FB_XMLCREF, and Jcode::FB_HTMLCREF are aliased to those of Encode for convenience.
print $j->fallback(Jcode::FB_PERLQQ)->euc;   # '\x{262f}'
print $j->fallback(Jcode::FB_XMLCREF)->euc;  # '&#x262f;'
print $j->fallback(Jcode::FB_HTMLCREF)->euc; # '&#9775;'
The global variable $Jcode::FALLBACK stores the default fallback, so you can override it by assigning a value.
$Jcode::FALLBACK = Jcode::FB_PERLQQ; # set default fallback scheme
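Since the fallback values are the same as Encode's, the three schemes can be tried without Jcode at all. The following is a minimal sketch using only Encode; the helper name encode_with is hypothetical, and whether Jcode adds any handling of its own on top of this is not covered here.

```perl
use strict;
use warnings;
use Encode ();

my $unistr = "\x{262f}";    # YIN YANG, which EUC-JP cannot represent

# Hypothetical helper: encode a fresh copy of $unistr to EUC-JP
# using the given Encode fallback (check) value.
sub encode_with {
    my ($fallback) = @_;
    my $copy = $unistr;
    return Encode::encode('euc-jp', $copy, $fallback);
}

print encode_with(Encode::FB_PERLQQ),   "\n";   # Perl-style escape
print encode_with(Encode::FB_XMLCREF),  "\n";   # hexadecimal character reference
print encode_with(Encode::FB_HTMLCREF), "\n";   # decimal character reference
```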
with a newline string specified by $newline_str (default: ``\n'').
Rudimentary kinsoku support is now available for Perl 5.8.1 and better.
To use the methods below, you need the MIME::Base64 manpage. To install it, simply run:
perl -MCPAN -e 'CPAN::Shell->install("MIME::Base64")'
If your perl is 5.6 or better, this is unnecessary, since MIME::Base64 is bundled.
For Perl 5.8.1 or better, you can also encode a MIME header as:
$mime_header = $j->MIME_Header;
In this case the resulting $mime_header is MIME-B-encoded UTF-8, whereas $j->mime_encode() returns MIME-B-encoded ISO-2022-JP. Most modern MUAs support both.
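The difference between the two header forms can be seen with Encode alone. This is a sketch under assumptions: it uses Encode's MIME-Header and MIME-Header-ISO_2022_JP encodings directly, the sample subject 日本語 is made up, and it does not claim that $j->mime_encode() produces byte-for-byte the same output as the second form.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Sample subject (an assumption): "日本語" as a character string.
my $subject = "\x{65e5}\x{672c}\x{8a9e}";

my $as_utf8 = encode('MIME-Header', $subject);              # UTF-8 based header
my $as_jis  = encode('MIME-Header-ISO_2022_JP', $subject);  # ISO-2022-JP based header

print "$as_utf8\n$as_jis\n";

# Either form decodes back to the same characters.
my $back_utf8 = decode('MIME-Header', $as_utf8);
my $back_jis  = decode('MIME-Header', $as_jis);
print "round trip ok\n" if $back_utf8 eq $subject && $back_jis eq $subject;
```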
Jcode->new($str, 'MIME-Header')
h2z([$keep_dakuten])
You can retrieve the number of matches via $j->nmatch;
To use ->m() and ->s(), you need perl 5.8.1 or better.
Applies tr/$from/$to/ to the Jcode object, where $from and $to are EUC-JP strings. On perl 5.8.1 or better, $from and $to can also be flagged UTF-8 strings.
If $opt is set, tr/$from/$to/$opt is applied. $opt must be 'c', 'd', or a combination thereof.
You can retrieve the number of matches via $j->nmatch;
The following methods are available only for perl 5.8.1 or better.
Applies s/$pattern/$replace/$opt. $pattern and $replace must be in EUC-JP or flagged UTF-8. $opt is the same as the regexp options; see perlre for details.
Like $j->tr(), $j->s() returns the object itself, so you can chain the operations as follows:
$j->tr("a-z", "A-Z")->s("foo", "bar");
Applies m/$pattern/$opt. Note that this method DOES NOT RETURN AN OBJECT, so you can't chain it the way you can $j->s().
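The chaining rules can be sketched as follows, assuming perl 5.8.1 or better. The ASCII sample text and the variable names are made up for the example; tr() and s() appear in the middle of the chain, while m() is called last because its return value is not an object.

```perl
use strict;
use warnings;
use Jcode;

# Sample text (an assumption for this sketch); plain ASCII keeps it readable.
my $j = Jcode->new("hello foo");

# tr() and s() both return the object, so they can be chained,
# with a conversion method such as euc() at the end.
my $result = $j->tr("a-z", "A-Z")->s("FOO", "bar")->euc;
print "$result\n";

# m() returns the match result instead, so it has to end the chain.
my $matched = $j->m("bar") ? 1 : 0;
print "matched\n" if $matched;
```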
If you need to access the instance variables of a Jcode object, use the access methods below instead of accessing them directly. (That's what OOP is all about.)
FYI, Jcode uses a reference to an array rather than the more common reference to a hash, to optimize speed. (You don't actually need to know this as long as you use the access methods; once again, that's OOP.)
getcode($str)
ascii   Ascii (Contains no Japanese Code)
binary  Binary (Not Text File)
euc     EUC-JP
sjis    SHIFT_JIS
jis     JIS (ISO-2022-JP)
ucs2    UCS2 (Raw Unicode)
utf8    UTF8
When called in array context instead of scalar context, it also returns how many character codes were found. As mentioned above, $str can be \$str instead.
jcode.pl Users: This function is 100% upward-compatible with jcode::getcode() -- well, almost:
* When its return value is an array, the order is the opposite; jcode::getcode() returns $nmatch first.
* jcode::getcode() returns 'undef' when the number of EUC characters is equal to that of SJIS. Jcode::getcode() returns EUC. For Jcode.pm there are no in-betweens.
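A short sketch of the calling conventions described above. The sample bytes (日本語 in EUC-JP) and the variable names are assumptions for this example; the functions are called by their full names so that nothing depends on what Jcode exports.

```perl
use strict;
use warnings;
use Jcode;

# Sample input (an assumption for this sketch): "日本語" as EUC-JP bytes.
my $euc = "\xC6\xFC\xCB\xDC\xB8\xEC";

# Scalar context: only the code name is returned.
my $code = Jcode::getcode($euc);

# Array context: the code name comes first, then the match count
# (the reverse of jcode::getcode() in jcode.pl).
my ($code_again, $nmatch) = Jcode::getcode($euc);

# A reference to the string is accepted as well.
my $code_by_ref = Jcode::getcode(\$euc);

print "$code $code_again $code_by_ref\n";
```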
jcode.pl Users: This function is 100% upward-compatible with jcode::convert()!
For perl 5.8.1 or later, Jcode acts as a wrapper to Encode, meaning Jcode is subject to any bugs therein.
This package owes a lot in motivation, design, and code, to the jcode.pl for Perl4 by Kazumasa Utashiro <[email protected]>.
Hiroki Ohzaki <[email protected]> has helped me polish regexp from the very first stage of development.
JEncode by [email protected] has inspired me to integrate Encode to Jcode. He has also contributed Japanese POD.
And folks at Jcode Mailing list <[email protected]>. Without them, I couldn't have coded this far.
Encode
http://www.iana.org/assignments/character-sets
Copyright 1999-2005 Dan Kogai <[email protected]>
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.