I have a very large collection in MongoDB and I want to remove the duplicate records from it. My first thought was to drop the index and rebuild it with dropDups. However, there are too many duplicates for MongoDB to handle that way.

So I turned to MapReduce for help. Here is my current progress.

m = function () { 
    emit(this.myid, 1); 
}

r = function (k, vals) { 
    return Array.sum(vals); 
} 

res = db.userList.mapReduce(m,r, { out : "myoutput" });

Now the "myid" of every duplicated record is stored in the "myoutput" collection. However, I don't know how to remove the records from userList by referencing myoutput.myid. It should be something like this:

db.myoutput.find({value: {$gt: 1}}).forEach(
    function(obj) {
        db.userList.remove(xxxxxxxxx) // I don't know how to do so
})

By the way, using forEach like this seems like it would wipe all records with the same myid, but I only want to remove the duplicates. For example:

{ "_id" : ObjectId("4edc6773e206a55d1c0000d8"), "myid" : 0 }
{ "_id" : ObjectId("4edc6780e206a55e6100011a"), "myid" : 0 }

{ "_id" : ObjectId("4edc6784e206a55ed30000c1"), "myid" : 0 }

The final result should preserve only one record. Can someone give me some help on this?

Thank you. :)

Solution 1

The cleanest approach is probably to write a client-side script that deletes the records:

db.myoutput.find({value: {$gt: 1}}).forEach(
    function(obj) {
        var cur = db.userList.find({ myid: obj._id }, {_id: 1});
        var first = true;
        while (cur.hasNext()) {
            var doc = cur.next();
            if (first) {first = false; continue;}
            db.userList.remove({ _id: doc._id });
        }
    }
)

I have not tested this code, so always double-check it before running it against production data.
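To make the keep-one behaviour concrete, here is a minimal plain-JavaScript trace of the same logic on an in-memory array. It is only a sketch: `planRemovals` is a hypothetical helper name, the array stands in for the userList collection, and nothing here talks to MongoDB. The sample records are the ones from the question.

```javascript
// Plain-JavaScript trace of "keep the first match, remove the rest".
// `docs` stands in for the userList collection (assumption: in-memory array).
function planRemovals(docs, keyField) {
    // Step 1: count documents per key (the role the "myoutput" collection plays).
    var counts = new Map();
    docs.forEach(function (doc) {
        counts.set(doc[keyField], (counts.get(doc[keyField]) || 0) + 1);
    });

    // Step 2: for every key seen more than once, skip the first
    // matching document and mark all later ones for removal.
    var toRemove = [];
    counts.forEach(function (count, key) {
        if (count <= 1) { return; }
        var first = true;
        docs.forEach(function (doc) {
            if (doc[keyField] !== key) { return; }
            if (first) { first = false; return; }
            toRemove.push(doc._id);
        });
    });
    return toRemove;
}

var sample = [
    { _id: "4edc6773e206a55d1c0000d8", myid: 0 },
    { _id: "4edc6780e206a55e6100011a", myid: 0 },
    { _id: "4edc6784e206a55ed30000c1", myid: 0 }
];

// Logs the last two _id values; the first record is the one that is kept.
console.log(planRemovals(sample, "myid"));
```

Running a dry run like this first (or replacing the remove call with a print in the shell script) is a cheap way to check which documents would be deleted.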

Solution 2

While the above answer is quite effective, it becomes extremely slow if you have 900K or 3M records in your database/collection.

If dealing with large amounts of data, I suggest taking the long road:

  • Select items using a GROUP BY analog - db.collection.group()
  • Store this data using the reduce function in an array
  • Save exported data as JSON
  • Import it again using mongoimport into a clean database.

For 900K entries, this took around 35s (group query).

Implementation in PHP:

$mongo_client = new MongoClient();
$collection = $mongo_client->selectCollection("main", "settings");

//Group by the field "code"
$keys = array("code" => 1);
//You must create objects for every field you wish to transfer (except the one grouped by - that gets auto-transferred)
$initial = array("location" => "", "name" => "", "score" => 0, "type" => "");
//The reduce function will set the grouped properties
$reduce = "function (obj, prev) { prev.location = obj.location; prev.name = obj.name;  prev.score = obj.score; prev.type = obj.type; }";

$fh = fopen("Export.json", "w");
$unique_set = $collection->group($keys, $initial, $reduce);
fwrite($fh, json_encode($unique_set['retval']));
fclose($fh);

If you have very few duplicates, running it on PHP might not be the best option, but my set had a huge number of duplicates, so the final dataset was easy to handle. Perhaps someone will find this useful for speed. (and transferring to mongo shell should be fairly easy.)

Remember, however, that you will have to reformat the final file to have one document per line for it to work with mongoimport (a search/replace-all should be fine here), or pass mongoimport's --jsonArray option so that it accepts the exported array as it is.
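If a search/replace feels too fragile, the reformatting step can also be done by parsing the array and writing each document on its own line. The following is a rough sketch in plain JavaScript; `jsonArrayToLines` is a hypothetical helper, the sample string is made up for illustration, and reading Export.json from disk and writing the result back are left out.

```javascript
// Convert the text of a JSON array into one compact JSON document per line.
function jsonArrayToLines(jsonText) {
    var docs = JSON.parse(jsonText);
    if (!Array.isArray(docs)) {
        throw new Error("expected a JSON array of documents");
    }
    var lines = docs.map(function (doc) {
        // Without an indent argument, JSON.stringify emits no raw newlines,
        // so each document stays on a single line.
        return JSON.stringify(doc);
    });
    return lines.length ? lines.join("\n") + "\n" : "";
}

// Hypothetical sample standing in for the contents of Export.json.
var exported = '[{"code":"A1","name":"first"},{"code":"B2","name":"second"}]';
console.log(jsonArrayToLines(exported));
```

Parsing the whole file at once assumes the deduplicated export fits comfortably in memory, which matches the situation described above where the final dataset was small.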

Solution 3

/*
 * This map reduce will output a new collection: "duplicateinvoices"
 * { "_id" : "12345", "value" : 2 }
 * { "_id" : "23456", "value" : 2 }
 * ...
**/
m = function () { 
    // emit the same field that the removal queries below match on
    emit(this.media_hash, 1); 
}

r = function (k, vals) { 
    return Array.sum(vals); 
} 

res = db.invoices.mapReduce(m,r, { out : "duplicateinvoices" });

/*
 * We have two approaches (we should test which is faster/more reliable; I didn't)
**/

/* OPTION 1 */
// We iterate over duplicateinvoices and take the media_hash values
// with value > 1 (the duplicates)
db.duplicateinvoices.find({value: {$gt: 1}}).forEach(
    function(invoice) {
        // temporarily save one of these documents in a variable
        var obj = db.invoices.findOne({ media_hash: invoice._id });
        // remove every invoice with this media_hash from the collection
        db.invoices.remove({ media_hash: invoice._id });
        // re-insert the saved document (not atomic: if the script stops
        // between the remove and the insert, this document is lost)
        db.invoices.insert(obj);
    }
)

/* OPTION 2 */
// We iterate over duplicateinvoices and take the media_hash values
// with value > 1 (the duplicates)
db.duplicateinvoices.find({value: {$gt: 1}}).forEach(
    function(invoice) {
        // Cursor over all invoices that share this media_hash
        var cur = db.invoices.find({ media_hash: invoice._id });
        var first = true;
        while (cur.hasNext()) {
            var doc = cur.next();
            // Skip the first one
            if (first) {first = false; continue;}
            // Delete the other matched documents (from invoices, not userList)
            db.invoices.remove({ _id: doc._id });
        }
    }
)

Sources:

How to remove duplicate record in MongoDB by MapReduce?
http://openmymind.net/2011/1/20/Understanding-Map-Reduce/
http://docs.mongodb.org/manual/tutorial/map-reduce-examples/

Solution 4

Actually, there is no need for MapReduce here. What about this? Paste the code into the mongo shell:

function removeDupls(collectionName, keyField, reportEvery) {
    if (reportEvery === undefined) { reportEvery = 10; }
    var sort = {};
    sort[keyField] = 1;
    var lastKey;       // key of the previous document in sort order
    var first = true;  // the very first document is never a duplicate
    var res = { docsCnt: 0, docsRemoved: 0 };
    db[collectionName].find().sort(sort).forEach(
        function (doc) {
            res.docsCnt += 1;
            // same key as the previous document: this one is a duplicate
            if (!first && doc[keyField] == lastKey) {
                db[collectionName].remove({ _id: doc._id });
                res.docsRemoved += 1;
            } else {
                lastKey = doc[keyField];
                first = false;
            }
            if (res.docsCnt % reportEvery === 0) { print(JSON.stringify(res)); }
        }
    );
    return res;
}

then call it:

removeDupls('users','myid',1000)

This will work, and it will probably be faster than any MapReduce-then-remove job (depending on how many duplicated documents you have). If you want to make it really fast, store the _ids of the documents to be removed in a temporary array and then remove them in batches.
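The batching idea could look roughly like the sketch below. It is written as plain JavaScript over an in-memory array so that it can be run without a database; `collectDuplicateIds` and `inBatches` are hypothetical helper names, the batch size is arbitrary, and the shell call shown in the comment is what the loop body would presumably become against a real collection.

```javascript
// Walk documents that are already sorted by keyField and collect the _id
// of every document whose key equals the previous document's key.
function collectDuplicateIds(sortedDocs, keyField) {
    var ids = [];
    var haveLast = false;
    var lastKey;
    sortedDocs.forEach(function (doc) {
        if (haveLast && doc[keyField] === lastKey) {
            ids.push(doc._id);
        } else {
            lastKey = doc[keyField];
            haveLast = true;
        }
    });
    return ids;
}

// Split the collected ids into fixed-size batches.
function inBatches(ids, batchSize) {
    var batches = [];
    for (var i = 0; i < ids.length; i += batchSize) {
        batches.push(ids.slice(i, i + batchSize));
    }
    return batches;
}

// Hypothetical sample data, already sorted by myid.
var docs = [
    { _id: 1, myid: 0 }, { _id: 2, myid: 0 }, { _id: 3, myid: 0 },
    { _id: 4, myid: 1 },
    { _id: 5, myid: 2 }, { _id: 6, myid: 2 }
];

inBatches(collectDuplicateIds(docs, "myid"), 2).forEach(function (batch) {
    // In the mongo shell this would be something like:
    //   db.users.remove({ _id: { $in: batch } });
    console.log("would remove", batch);
});
```

One remove call per batch means far fewer round trips than one call per duplicate, at the cost of holding the collected _ids in memory.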