Assume that I need to insert the following document:

    title: 'Péter'

(note the é)

It gives me an error when I use the following PHP-code ... :

$db->collection->insert(array("title" => "Péter"));

... because it needs to be utf-8.

So I should use this line of code:

$db->collection->insert(array("title" => utf8_encode("Péter")));

Now, when I request the document, I still have to decode it ... :

$document = $db->collection->findOne(array("_id" => new MongoId("__someID__")));
$title = utf8_decode($document['title']);

Is there some way to automate this process? Can I change the character encoding of MongoDB? (I'm migrating a MySQL database that uses cp1252 West European (latin1).)

I already considered changing the Content-Type header; the problem is that all static (hardcoded) strings aren't UTF-8...

Thanks in advance! Tim

Solution 1

JSON and BSON can only encode / decode valid UTF-8 strings. If your data (user input included) is not UTF-8, you need to convert it before passing it to any JSON-dependent system, like this:

$string = iconv('UTF-8', 'UTF-8//IGNORE', $string); // or
$string = iconv('UTF-8', 'UTF-8//TRANSLIT', $string); // or even
$string = iconv('UTF-8', 'UTF-8//TRANSLIT//IGNORE', $string); // not sure how this behaves

Personally I prefer the first option, see the iconv() manual page. Other alternatives include:

You should always make sure your strings are UTF-8 encoded, even the user-submitted ones. However, since you mentioned that you're migrating from MySQL to MongoDB: have you tried exporting your current database to CSV and using the import scripts that come with Mongo? They should handle this...
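To check whether a string is already well-formed UTF-8 before handing it to the driver, something along these lines might help. This is only a sketch: `is_valid_utf8()` is a hypothetical helper name, it assumes the mbstring extension is loaded, and the byte escapes are used so the result does not depend on how the script file itself is saved.

```php
<?php
// Hypothetical helper: true only if $value is a string of well-formed UTF-8.
function is_valid_utf8($value)
{
    return is_string($value) && mb_check_encoding($value, 'UTF-8');
}

$utf8   = "P\xC3\xA9ter"; // "Péter" encoded as UTF-8 (é is two bytes)
$latin1 = "P\xE9ter";     // "Péter" encoded as ISO-8859-1 / cp1252 (é is one byte)

var_dump(is_valid_utf8($utf8));   // bool(true)
var_dump(is_valid_utf8($latin1)); // bool(false)
```

A check like this only says whether the bytes are valid UTF-8; it cannot tell you which encoding an invalid string was originally in.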

EDIT: I mentioned that BSON can only handle UTF-8, but I'm not sure this is exactly true. I have a vague idea that BSON uses UTF-16 or UTF-32 to encode / decode data, but I can't check now.

Solution 2

As @gates said, all string data in BSON is encoded as UTF-8. MongoDB assumes this.

Another key point which neither answer addresses: PHP is not Unicode-aware (as of 5.3, anyway). PHP 6 will supposedly be Unicode-aware. What this means is that you have to know which encoding your operating system uses by default and which encoding PHP is using.
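A small illustration of what "not Unicode-aware" means in practice: the core string functions work on bytes, while the mbstring functions work on characters in whatever encoding you tell them. This sketch assumes the mbstring extension is available.

```php
<?php
$title = "P\xC3\xA9ter"; // "Péter" as UTF-8 bytes: the é takes two bytes

echo strlen($title), "\n";             // 6, because strlen() counts bytes
echo mb_strlen($title, 'UTF-8'), "\n"; // 5, because mb_strlen() counts characters

// The encoding mbstring assumes when none is passed; the value depends on
// your PHP version and configuration.
echo mb_internal_encoding(), "\n";
```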

Let's get back to your original question: "Is there some way to automate this process?" ... my suggestion is to make sure you are always using UTF-8 throughout your application. Configuration, input, data storage, presentation, everything. Then the "automated" part is that most of your PHP code will be simpler since it always assumes UTF-8. No conversions necessary. Heck, nobody said automation was cheap. :)

Here's kind of an aside. If you created a little PHP script to test that insert() code, figure out what encoding your file is saved in, then convert to UTF-8 before inserting. For example, if you know the file is ISO-8859-1, try this:

$title = mb_convert_encoding("Péter", "UTF-8", "ISO-8859-1");
$db->collection->insert(array("title" => $title));
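If the same conversion is needed for every document during a migration, it could be wrapped in one place. The following is a rough sketch, not a driver feature: `to_utf8_document()` is a hypothetical helper, and the default source encoding of Windows-1252 is an assumption based on the cp1252 database mentioned in the question.

```php
<?php
// Hypothetical helper: returns a copy of $doc in which every string value
// that is not valid UTF-8 has been converted from $sourceEncoding.
// Array keys are left untouched.
function to_utf8_document(array $doc, $sourceEncoding = 'Windows-1252')
{
    array_walk_recursive($doc, function (&$value) use ($sourceEncoding) {
        if (is_string($value) && !mb_check_encoding($value, 'UTF-8')) {
            $value = mb_convert_encoding($value, 'UTF-8', $sourceEncoding);
        }
    });
    return $doc;
}

$row = array(
    'title' => "P\xE9ter",                 // cp1252 bytes, as read from the old database
    'tags'  => array("caf\xE9", 'plain'),  // nested values are converted too
    'views' => 3,                          // non-strings pass through unchanged
);

$doc = to_utf8_document($row);
echo bin2hex($doc['title']), "\n"; // 50c3a9746572
```

The result could then be passed to insert() as in the snippet above. Note the caveat: a legacy string whose bytes happen to form valid UTF-8 would be left as-is, so this is safest when you know all input comes from one source encoding.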

Solution 3

"Can I change the character-encoding of MongoDB..."

No. Data is stored in BSON. According to the BSON spec, all strings are UTF-8.

"Now, when I request the document, I still have to decode it ... : Is there some way to automate this process?"

It sounds like you are trying to output the data to a web page. Needing to "decode" text that was already encoded seems incorrect.

Could this output problem be a configuration issue with Apache + PHP? UTF-8 support in PHP is not automatic; a quick online search brings up several tutorials on this topic.
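As one possible starting point on the PHP side, the page can declare UTF-8 explicitly and escape the stored string as UTF-8, so that no utf8_decode() is needed on output. This is a sketch under assumptions: `html_utf8()` is a hypothetical helper, and `$title` stands in for a value read from MongoDB.

```php
<?php
// Hypothetical helper: escape a UTF-8 string for safe use in HTML.
function html_utf8($text)
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Tell the browser the response body is UTF-8. This must happen before any output.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}

$title = "P\xC3\xA9ter"; // stands in for $document['title'], stored as UTF-8
echo '<h1>' . html_utf8($title) . '</h1>', "\n";
```

Any hardcoded strings in the script still need to be UTF-8 themselves, which usually means saving the source files as UTF-8.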