I have a collection with 9 million records. I am currently using the following script to update the entire collection:

simple_update.js

db.mydata.find().forEach(function(data) {
  db.mydata.update({_id:data._id},{$set:{pid:(2571 - data.Y + (data.X * 2572))}});
});

This is run from the command line as follows:

mongo my_test simple_update.js

So all I am doing is adding a new field pid based upon a simple calculation.

Is there a faster way? This takes a significant amount of time.

Solution 1

There are two things that you can do.

  1. Send an update with the 'multi' flag set to true.
  2. Store the function server-side and try using server-side code execution.

That link also contains the following advice:

This is a good technique for performing batch administrative work. Run mongo on the server, connecting via the localhost interface. The connection is then very fast and low latency. This is friendlier than db.eval() as db.eval() blocks other operations.

This is probably the fastest you'll get. You have to realize that issuing 9M updates on a single server is going to be a heavy operation. Let's say that you could get 3k updates / second, you're still talking about running for nearly an hour.

And that's not really a "mongo problem", that's going to be a hardware limitation.

Solution 2

I am using the: db.collection.update method

// db.collection.update( criteria, objNew, upsert, multi ) // --> for reference
db.collection.update( { "_id" : { $exists : true } }, objNew, upsert, true);

Solution 3

I won't recommend using {multi: true} for a larger data set, because it is less configurable.

A better way using bulk insert.

Bulk operation is really helpful for scheduler tasks. Say you have to delete data older that 6 months daily. Use bulk operation. Its fast and won't slow down server. The CPU, memory usage is not noticeable when you do insert, delete or update over a billion documents. I found {multi:true} slowing down the server when you are dealing with million+ documents(require more research in this.)

See a sample below. It's a js shell script, can run it in server as a node program as well.(use npm module shelljs or similar to achieve this)

update mongo to 3.2+

The normal way of updating multiple unique document is

let counter = 0;
db.myCol.find({}).sort({$natural:1}).limit(1000000).forEach(function(document){
    counter++;
    document.test_value = "just testing" + counter
    db.myCol.save(document)
});

It took 310-315 seconds when I tried. That's more than 5 minutes for updating a million documents.

My collection includes 100 million+ documents, so speed may differ for others.

The same using bulk insert is

    let counter = 0;
// magic no.- depends on your hardware and document size. - my document size is around 1.5kb-2kb
// performance reduces when this limit is not in 1500-2500 range.
// try different range and find fastest bulk limit for your document size or take an average.
let limitNo = 2222; 
let bulk = db.myCol.initializeUnorderedBulkOp();
let noOfDocsToProcess = 1000000;
db.myCol.find({}).sort({$natural:1}).limit(noOfDocsToProcess).forEach(function(document){
    counter++;
    noOfDocsToProcess --;
    limitNo--;
    bulk.find({_id:document._id}).update({$set:{test_value : "just testing .. " + counter}});
    if(limitNo === 0 || noOfDocsToProcess === 0){
        bulk.execute();
        bulk = db.myCol.initializeUnorderedBulkOp();
        limitNo = 2222;
    }
});

The best time was 8972 millis. So in average it took only 10 seconds to update a million documents. 30 times faster than old way.

Put the code in a .js file and execute as mongo shell script.

If someone found a better way, please update. Lets use mongo in a faster way.

Solution 4

Starting Mongo 4.2, db.collection.update() can accept an aggregation pipeline, finally allowing the update/creation of a field based on another field; and thus allowing us to fully apply this kind of query server-side:

// { Y: 456,  X: 3 }
// { Y: 3452, X: 2 }
db.collection.update(
  {},
  [{ $set: { pid: {
    $sum: [ 2571, { $multiply: [ -1, "$Y" ] }, { $multiply: [ 2572, "$X" ] } ]
  }}}],
  { multi: true }
)
// { Y: 456,  X: 3, pid: 9831 }
// { Y: 3452, X: 2, pid: 4263 }
  • The first part {} is the match query, filtering which documents to update (all documents in this case).

  • The second part [{ $set: { pid: ... } }] is the update aggregation pipeline (note the squared brackets signifying the use of an aggregation pipeline). $set is a new aggregation operator and an alias of $addFields. Note how pid is created directly based on the values of X ($X) and Y ($Y) from the same document.

  • Don't forget { multi: true }, otherwise only the first matching document will be updated.

Solution 5

Not sure if it will be any faster but you could do a multi-update. Just say update where _id > 0 (this will be true for every object) and then set the 'multi' flag to true and it should do the same without having to iterate through the entire collection.

Check this out: MongoDB - Server Side Code Execution