I have a collection whose documents have two fields:

{
name : 'text English',
descr: 'Texto largo en español'
}

I would like to create a multi-language search that gives more weight to the name field. Up till now I was doing something like this:

db.items.ensureIndex({
        name : "text",
        descr : "text"
    },{
        default_language: "spanish",
        name : "searchIndex",
        weights : {
            name : 3,
            descr: 1
        }
    }
)

The problem is that it treats everything as Spanish. Looking at the documentation, I found that it uses a completely different schema. Is there a way to achieve what I want?

Solution 1

The links in both the question and the original answer are dead, but there is a way to define a schema for this that is supported in modern versions.

The recommended way would be to include a "language" property in the document or embedded documents next to the property being used for the text index. The term "next to" means at the "same level" and not specifically adjacent to the property in the index.

Something common would look like:

{
  "description": "Texto largo en español",
  "language": "spanish",
  "translation": [
    {
      "description": "Large text in Spanish",
      "language": "english"
    },
    {
      "description": "Grand texte en espagnol",
      "language": "french"
    }
  ]
},
{
  "description": "The quick brown fox",
  "translation": [
    {
      "description": "Le renard brun rapide",
      "language": "french"
    }
  ]
}

And then presuming that we use the "default" text index language of "english" we can simply index with:

db.collection.createIndex({ "description": "text", "translation.description": "text" })

MongoDB will then use the "language" property as either shown in the document "root" or from "embedded documents" in the array, and where omitted it will simply use the default defined for the index. For instance the second document here has no language property on the "root" so "english" is presumed since it is the default on the index.

The items indexed need not be in any order, as also demonstrated by the first sample document having the "english" entry inside the "translation" array of embedded documents. The rules for embedded items differ slightly in that we must include the "language" property on the embedded documents, or the actual language used will be that from the document "root". In this example any embedded document in the array without the "language" property would be considered to be using "spanish" since that is what is defined in the "root".
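As a rough illustration of the fallback order described above, here is a small plain JavaScript sketch. The function effectiveLanguage is a hypothetical helper written only for this explanation; it mimics the rule and is not MongoDB's implementation.

```javascript
// Illustrative only: mimics the language fallback rule described above.
// effectiveLanguage is a hypothetical helper, not part of MongoDB.
function effectiveLanguage(root, embedded, indexDefault = "english") {
  // 1. An embedded document's own "language" wins.
  if (embedded && embedded.language) return embedded.language;
  // 2. Otherwise fall back to the "language" on the document root.
  if (root.language) return root.language;
  // 3. Otherwise use the default language defined on the index.
  return indexDefault;
}

const first = {
  description: "Texto largo en español",
  language: "spanish",
  translation: [
    { description: "Large text in Spanish", language: "english" },
    { description: "Entry without a language property" }
  ]
};

const second = {
  description: "The quick brown fox",
  translation: [{ description: "Le renard brun rapide", language: "french" }]
};

console.log(effectiveLanguage(first, first.translation[0]));   // english
console.log(effectiveLanguage(first, first.translation[1]));   // spanish
console.log(effectiveLanguage(second));                        // english
console.log(effectiveLanguage(second, second.translation[0])); // french
```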

Searches always consider every path present in the index, so here both the "description" and the embedded "translation.description" properties. The "search language" is still the one specified with the $language option to the $text operator, or the default index language set upon index creation when that option is omitted, since "stop words" and "stemming" for the search terms are applied according to it.
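A minimal sketch of such a query follows. The helper textFilter is a hypothetical name used only here; $text, $search and $language are the actual query operators, and the resulting object is what you would pass to find() in the shell.

```javascript
// Hypothetical helper: builds a $text filter, optionally with a search language.
function textFilter(search, language) {
  const text = { $search: search };
  // When $language is omitted, the index default language is used for the search.
  if (language) text.$language = language;
  return { $text: text };
}

const frenchFilter = textFilter("renard", "french");
const defaultFilter = textFilter("fox");

// In the mongo shell, against the index created above:
//   db.collection.find(frenchFilter)
//   db.collection.find(defaultFilter)
console.log(JSON.stringify(frenchFilter));
console.log(JSON.stringify(defaultFilter));
```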

The embedded format also gives you an easy point from which to retrieve the language information for "translating" between two languages where you have the content defined for both languages in question, so its practicality is "two fold" in this case.

The specific documentation is now located at Create a text Index for a Collection in Multiple Languages as a section within the wider topic of Specify a Language for Text Index which includes links to all the other details, including specifying a different default language on the index.

Solution 2

You specifically meant: http://docs.mongodb.org/manual/tutorial/create-text-index-on-multi-language-collection/#use-any-field-to-specify-the-language-for-a-document I suppose, which allows you to override the language for a whole document with a specific field's value.
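For reference, a sketch of what that per-document override could look like with the index from the question. The field name "idioma" is an assumption chosen for this example; language_override, default_language and weights are the index options involved.

```javascript
// Index keys and options, written as plain objects so they can be inspected.
// "idioma" is a hypothetical field name used only in this sketch.
const indexKeys = { name: "text", descr: "text" };
const indexOptions = {
  name: "searchIndex",
  default_language: "spanish",   // used when a document has no "idioma" field
  language_override: "idioma",   // field whose value sets the document's language
  weights: { name: 3, descr: 1 }
};

// In the mongo shell:
//   db.items.createIndex(indexKeys, indexOptions)
//
// A document like this one would then be indexed as English as a whole,
// including its Spanish "descr" field:
const sampleDoc = {
  name: "text English",
  descr: "Texto largo en español",
  idioma: "english"
};
console.log(indexOptions.language_override in sampleDoc); // true
```

Note that the override still applies to the whole document, not to individual fields.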

What you are asking for in your question cannot be done in MongoDB yet, but the feature is planned for an upcoming version. You can track the ticket at https://jira.mongodb.org/browse/SERVER-9390