mongodb

aggregation-framework

performance

I'm trying to use the aggregation framework with $match and $group stages. Does the $group stage use index data? I'm using the latest available MongoDB version, 2.5.4.

Solution 1

$group does not use index data.

From the MongoDB docs:

The $match and $sort pipeline operators can take advantage of an index when they occur at the beginning of the pipeline.

The $geoNear pipeline operator takes advantage of a geospatial index. When using $geoNear, the $geoNear pipeline operation must appear as the first stage in an aggregation pipeline.
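As a rough sketch of what the quoted passage implies, the pipeline below puts $match first so that it is eligible to use an index, and leaves $group to work on whatever $match returns. It is written in the style of PyMongo; the field names `status`, `customer_id` and `amount`, and the helper names `build_pipeline` and `run`, are hypothetical and only there for illustration.

```python
# Hypothetical fields: status, customer_id, amount.
def build_pipeline(status):
    """Return a pipeline whose first stage is eligible to use an index."""
    return [
        # First stage: can use an index on "status".
        {"$match": {"status": status}},
        # $group does not read the index; it consumes $match's output.
        {"$group": {"_id": "$customer_id", "total": {"$sum": "$amount"}}},
    ]


def run(collection, status):
    """Create the supporting index and run the pipeline.

    `collection` is assumed to be a pymongo.collection.Collection.
    """
    collection.create_index([("status", 1)])
    return list(collection.aggregate(build_pipeline(status)))
```

If the $match were moved after the $group, it would filter the grouped output instead and could no longer be answered from the index on `status`.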

Solution 2

@ArthurTacca, as of Mongo 4.0, a $sort preceding the $group will speed things up significantly. See https://stackoverflow.com/a/56427875/92049.

Solution 3

As 4J41's answer says, $group does not (directly) use an index, although $sort does if it is the first stage in the pipeline. However, it seems possible that $group could, in principle, have an optimised implementation when it immediately follows a $sort, in which case you could let it benefit from an index indirectly by putting a $sort beforehand.

The docs do not seem to give a straight answer either way about whether $group has this optimisation (although I bet they would mention it if it did, which suggests it doesn't). The answer is in MongoDB bug 4507: currently $group does NOT have this optimisation, so the top line of 4J41's answer is right after all. If you really need efficiency, depending on the application it may be quickest to use a regular query and do the grouping in your client code.
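For the client-side route, a minimal sketch might look like the following. It assumes the documents arrive already sorted by the grouping field (which a regular query can do from an index), so each group is complete as soon as the key changes. The field names `customer_id` and `amount` and both function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def group_totals(sorted_docs, key="customer_id", value="amount"):
    """Sum `value` per `key` over documents already sorted by `key`.

    Because the input is sorted, groups are emitted one at a time while
    the documents stream in, instead of collecting everything first.
    """
    for k, docs in groupby(sorted_docs, key=itemgetter(key)):
        yield k, sum(d[value] for d in docs)


def totals_from_collection(collection):
    """Hypothetical usage with a pymongo Collection.

    The sort on "customer_id" can be served by an index on that field.
    """
    cursor = collection.find(
        {}, {"_id": 0, "customer_id": 1, "amount": 1}
    ).sort("customer_id", 1)
    return dict(group_totals(cursor))
```

Note that `groupby` only merges adjacent documents, so the sort is required for correct results, not just for speed.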

Edit: As sebastian's answer says, it seems that in practice putting a $sort (one that can take advantage of an index) before a $group can give a very large speed improvement. The bug above is still open, so it seems that $group is not yet making the best possible use of the index (that is, starting to group items as they are loaded, rather than loading them all into memory first). But it is still certainly worth doing.
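A minimal sketch of that arrangement, in PyMongo style, is shown here. The field name `customer_id` and the function names are hypothetical, and how much faster this is on a given server version is something to measure, not something the code guarantees.

```python
def sort_then_group_pipeline(field="customer_id"):
    """Return a pipeline with an index-eligible $sort ahead of $group."""
    return [
        # First stage: can be served by an index on `field`.
        {"$sort": {field: 1}},
        # $group then receives its input in sorted order.
        {"$group": {"_id": "$" + field, "count": {"$sum": 1}}},
    ]


def run_sorted_group(collection, field="customer_id"):
    """`collection` is assumed to be a pymongo.collection.Collection."""
    collection.create_index([(field, 1)])
    return list(collection.aggregate(sort_then_group_pipeline(field)))
```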

Solution 4

Per Mongo's 4.2 $group documentation, there is a special optimization for $first:

Optimization to Return the First Document of Each Group

If a pipeline sorts and groups by the same field and the $group stage only uses the $first accumulator operator, consider adding an index on the grouped field which matches the sort order. In some cases, the $group stage can use the index to quickly find the first document of each group.

It makes sense, since only the first entry in an ordered index should be needed for each bin in the $group stage. Unfortunately, in my 3.6 testing, I haven't been able to get nearly the performance I would expect if the index were really being used. I've posted about that problem in detail in another question.

EDIT 2020-04-23

I confirmed with Atlas's MongoDB Support that this $first optimization was added in Mongo 4.2, hence my trouble getting it to work with 3.6. There is also a bug preventing it from working with a composite $group _id at the moment. Further details are available in the post that I linked above.
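To make the shape of an eligible pipeline concrete, here is a small PyMongo-style sketch: the sort and the group use the same leading field, the $group uses only $first, and the index matches the sort order. The field names `x` and `y` and the function names are hypothetical. The group `_id` is deliberately a single field, given the composite `_id` issue mentioned above, and the optimization should only be expected on 4.2 or later.

```python
def first_per_group_pipeline(group_field="x", order_field="y"):
    """Return a $sort + $group pipeline that only uses $first."""
    return [
        # Sort by the grouped field, then by the field that decides "first".
        {"$sort": {group_field: 1, order_field: 1}},
        # Only $first is used, and _id is a single (non-composite) field.
        {"$group": {
            "_id": "$" + group_field,
            "first": {"$first": "$" + order_field},
        }},
    ]


def run_first_per_group(collection, group_field="x", order_field="y"):
    """`collection` is assumed to be a pymongo.collection.Collection."""
    # Index on the grouped field, matching the sort order.
    collection.create_index([(group_field, 1), (order_field, 1)])
    pipeline = first_per_group_pipeline(group_field, order_field)
    return list(collection.aggregate(pipeline))
```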

Solution 5

Changed in version 3.2: Starting in MongoDB 3.2, indexes can cover an aggregation pipeline. In MongoDB 2.6 and 3.0, indexes could not cover an aggregation pipeline since even when the pipeline uses an index, aggregation still requires access to the actual documents.

https://docs.mongodb.com/master/core/aggregation-pipeline/#pipeline-operators-and-indexes
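As an illustration of what a coverable pipeline might look like, the sketch below only ever touches fields that are in a single compound index, and asks the server to explain the plan. It is PyMongo-style; the field names `status` and `amount` and the function names are hypothetical. Whether a particular server version actually reports a covered plan (an index scan with no document fetch) should be checked in the explain output; this sketch does not guarantee it.

```python
def coverable_pipeline(status):
    """Return a pipeline that references only `status` and `amount`."""
    return [
        {"$match": {"status": status}},
        {"$group": {"_id": "$status", "total": {"$sum": "$amount"}}},
    ]


def explain_pipeline(collection, pipeline):
    """Ask the server for the query plan of `pipeline`.

    `collection` is assumed to be a pymongo.collection.Collection; its
    `database` and `name` attributes are used to build the command.
    """
    # Compound index holding every field the pipeline needs.
    collection.create_index([("status", 1), ("amount", 1)])
    return collection.database.command(
        "explain",
        {"aggregate": collection.name, "pipeline": pipeline, "cursor": {}},
        verbosity="queryPlanner",
    )
```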