Unicode encodings and endianness — writing libuninum bindings

The past few days I've been learning how to write bindings for Perl using XS so that I can use the many great libraries out there that I normally use in C or C++. Native bindings are very magical things because they glue together different languages that often don't have a direct mapping of semantics with respect to each other. XS is a bit quirky in that, while most language binding APIs require writing calls directly in C or C++, it is actually its own DSL for making bindings. A preprocessor called xsubpp generates the actual API calls that glue the Perl interpreter to the native code.

I actually wanted to start learning XS a few months back. In the past, I would put together rudimentary bindings using SWIG, but the results weren't very pleasant to use. It ends up creating bindings that look very much like calling C code and force you to deal with pointers and context directly. That pretty much defeats the purpose of creating a binding! So now that I have a bit more tuits, I started looking around for documentation on using XS. Coincidentally, I found a project that gathered many of the same notes I was using. Seems that I timed my learning process just right and I've been learning a great deal about Perl internals from the newly relaunched #xs channel on irc.perl.org.

As I usually do when I'm learning something new, I jumped right into making something as I was picking things up. I chose to work on something that was simple but non-trivial. Years ago on Freshmeat, I came across a project called libuninum that converts strings written in different number systems into integers. Once you have these integers, you can use them in operations such as arithmetic and sorting. Pretty useful if you have to deal with data in different languages.
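For a feel of what such a conversion involves, here is a minimal sketch in C for a single number system, Lao, whose digits occupy U+0ED0 through U+0ED9. This is not libuninum's code: the function name `lao_to_uint` and its return convention are made up for illustration, and it assumes the input is a 0-terminated array of 32-bit code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LAO_DIGIT_ZERO 0x0ED0u
#define LAO_DIGIT_NINE 0x0ED9u

/* Hypothetical helper, not part of libuninum.
 * Converts a 0-terminated UTF-32 string of Lao digits to an integer.
 * Returns 0 on success, -1 if the string is empty or contains a
 * character that is not a Lao digit. Overflow is not checked. */
int lao_to_uint(const uint32_t *s, unsigned long *result)
{
    unsigned long value = 0;

    assert(s != NULL && result != NULL);
    if (*s == 0)
        return -1;

    for (; *s != 0; s++) {
        if (*s < LAO_DIGIT_ZERO || *s > LAO_DIGIT_NINE)
            return -1;
        /* Positional decimal system: shift left one place, add the digit. */
        value = value * 10 + (*s - LAO_DIGIT_ZERO);
    }
    *result = value;
    return 0;
}
```

Passing the two code points U+0ED4 and U+0ED2 followed by a 0 terminator yields 42, which can then be compared or sorted like any other integer.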

Before I actually hack on the bindings, I need to think about how I'm going to distribute this code. Most people's systems aren't going to have the libuninum source code needed to build these bindings, so I'll need to somehow fetch the source code and build it on those systems. That's where Alien::Base comes in. It's a neat module that will download a tarball, extract it, build it, and place the dynamic library and headers where other modules can find them. I made a subclass of Alien::Base called Alien::Uninum that does just that for libuninum. I even got a small patch into Alien::Base to fix some issues I had. All I needed now to start hacking on the XS code was a way to tell the compiler where all the libuninum files are. With Alien::Base, I just pass those to the package build process using the cflags and libs methods, which is pretty much like using pkg-config (code).

I got to hacking and started on the simplest task: getting the list of all the number systems. I first approached this by just making a list of hashes that contained the name and ID of each number system (code). Not too bad. I then added caching of that list by storing that as a private attribute of my Unicode::Number class (code). Then I built on that and created a Unicode::Number::System class to store the number system name and ID so that I could return instances of that instead (code).

I then moved on to the actual main function of the library: converting a Unicode number to an integer. This was a bit tricky because Unicode comes in many different encodings (e.g. UTF-8, UTF-16, UTF-32) and these encodings can also have different endianness. Since the libuninum library expects all strings to be in UTF-32, I converted Perl strings from UTF-8 to UTF-32 and sent them to the XS code, but the library was giving me an "illegal character" error. To debug this, I grabbed some of the data from an example file that came with libuninum and put it in my XS. Still not working. This didn't make sense because I could get it working in plain C, but not in the XS. So I put together a small script using Inline::C that let me call the libuninum function directly.

Posted Sun Sep 28 18:55:29 2014

It still wasn't working. So, as you can see above, I grabbed a function from uninum.c, renamed it to MyLaoToInt, and called it directly. Still wasn't working. Only when I started to print out the contents of each character did I realise what was happening. In libuninum's unicode.h, the UTF32 typedef is defined as an unsigned long; however, sizeof(unsigned long) is 8 (64 bits) on my system, not 4 (32 bits).
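To make the width mismatch concrete, here is a minimal sketch that reads the same packed UTF-32 buffer with a 4-byte stride and with an 8-byte stride. The helper `read_unit` is hypothetical and exists only for this illustration; it uses memcpy so that the wide read stays well-defined C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fetch the index-th unit from buf, treating the
 * buffer as an array of units that are `width` bytes wide (4 or 8).
 * The caller must make sure the read stays inside the buffer. */
uint64_t read_unit(const void *buf, size_t index, size_t width)
{
    const unsigned char *p = (const unsigned char *)buf + index * width;

    assert(width == 4 || width == 8);
    if (width == 4) {
        uint32_t v;
        memcpy(&v, p, sizeof v);   /* one code point, as intended */
        return v;
    } else {
        uint64_t v;
        memcpy(&v, p, sizeof v);   /* swallows two code points at once */
        return v;
    }
}
```

With a buffer holding U+0ED1, U+0ED2 and two zero terminators, `read_unit(buf, 0, 4)` returns 0x0ED1, while `read_unit(buf, 0, 8)` returns a value that merges both code points and so matches no digit at all. That is what a UTF32 typedef of unsigned long does on a system where that type is 8 bytes wide.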

Posted Sun Sep 28 19:05:51 2014

That means that as the library iterates over each character, it is actually looking at two characters instead of one, so of course none of the comparisons were working. What it actually needed was a uint32_t from stdint.h. However, even though this typedef is in the C99 standard, there are some portability issues with using it. Instead, I used the integer type that Perl detected to be 32 bits wide and patched the code when I built it using Alien::Uninum (code). Now the file looked like this:

Yay! Now the XS code was working on the test data. All I had to do now was get my string to libuninum and pass the result back. I tried that and libuninum was giving me errors again. Now what?! I decided I needed to look at the bytes the C code was actually reading, so I grabbed a hex dump routine from here and looked at them:

00 00 fe ff ...
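A dump in that style can be produced with a routine along the following lines. This is only a sketch, not the routine used here; the name `hex_dump` and its buffer-based interface are assumptions chosen so that the result is easy to print or inspect.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format `len` bytes of `data` as space-separated
 * lowercase hex pairs into `out`, which has room for `cap` chars.
 * Stops early if `out` would overflow. Returns the number of bytes
 * that were actually formatted. */
size_t hex_dump(const void *data, size_t len, char *out, size_t cap)
{
    const unsigned char *bytes = data;
    size_t i, used = 0;

    assert(out != NULL && cap > 0);
    out[0] = '\0';
    for (i = 0; i < len; i++) {
        /* Worst case per byte: a space, two hex digits, and the NUL. */
        if (used + 4 > cap)
            break;
        used += (size_t)snprintf(out + used, cap - used,
                                 i ? " %02x" : "%02x", bytes[i]);
    }
    return i;
}
```

Calling it on the first four bytes of the string and printing the result with puts gives a line like the one shown.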

As soon as I saw the first character, I knew what was going on. What I was looking at was the byte-order mark, or BOM. Remember, I had converted the UTF-8 string to UTF-32 in Perl before sending it to C, but I never specified the endianness, so Perl defaulted to big-endian and put a BOM at the front. Since the C code reads each character in the native endianness of the machine, I needed to find the machine's endianness and encode either a little-endian or big-endian version of UTF-32, which also leaves out the BOM. I just had to ask Perl for the byte order it detected at compile time and use that (code).
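The same idea can be sketched on the C side. The three helpers below are hypothetical and only illustrate the byte-order reasoning: one probes the host's byte order at run time (rather than asking Perl), one serialises a code point as UTF-32 in a chosen order, and one reads four bytes back the way C does when it indexes an array of 32-bit integers. Mixed-endian hosts are ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 on a little-endian host, 0 on a big-endian one. */
int host_is_little_endian(void)
{
    const uint32_t probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);   /* look at the lowest-addressed byte */
    return first == 1;
}

/* Write one code point as 4 UTF-32 bytes in the requested byte order. */
void utf32_put(uint32_t cp, int little_endian, unsigned char out[4])
{
    int i;

    assert(out != NULL);
    for (i = 0; i < 4; i++) {
        int shift = little_endian ? 8 * i : 8 * (3 - i);
        out[i] = (unsigned char)((cp >> shift) & 0xFF);
    }
}

/* Read 4 bytes back as the host sees them through a uint32_t. */
uint32_t utf32_read_native(const unsigned char in[4])
{
    uint32_t cp;

    assert(in != NULL);
    memcpy(&cp, in, sizeof cp);
    return cp;
}
```

Encoding U+0ED1 with `utf32_put(0x0ED1, host_is_little_endian(), buf)` round-trips through `utf32_read_native`, whereas encoding it in the opposite order makes the native read come back byte-swapped as 0xD10E0000. Encoding U+FEFF big-endian gives the bytes 00 00 fe ff seen in the dump.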

Once I did that, my code was working and all my tests passed! There are still a couple of things I need to do in order to clean it up, but it's mostly done for now.