The MongoDB docs mention that it's a good idea to shorten field names:

Use shorter field names.

and an old blog post from How To Node (offline as of the April 2022 edit) says:

....oft-reported issue with mongoDB is the size of the data on the disk... each and every record stores all the field-names .... This means that it can often be more space-efficient to have properties such as 't', or 'b' rather than 'title' or 'body', however for fear of confusion I would avoid this unless truly required!

I am aware of how to do it. I am more interested in when this is truly required.

Solution 1

To quote Donald Knuth:

Premature optimization is the root of all evil (or at least most of it) in programming.

Build your application however seems most sensible, maintainable and logical. Then, if you have performance or storage issues, deal with those that have the greatest impact until either performance is satisfactory or the law of diminishing returns means there's no point in optimising further.

If you are uncertain of the impact of particular design decisions (like long property names), create a prototype to test various hypotheses (like "will shorter property names save much space"). Don't expect the outcome of testing to be conclusive; however, it may teach you things you didn't expect to learn.
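Before building a full prototype, a back-of-the-envelope estimate can help decide whether the hypothesis is worth testing at all. The sketch below relies on one fact about BSON: each element stores its field name as a NUL-terminated UTF-8 string, so a name costs its length plus one byte in every document. The function name and the sample documents are made up for illustration, and field names inside arrays are ignored.

```python
# Rough estimate of how many bytes field names add to a BSON document.
# Each BSON element stores its name as a NUL-terminated UTF-8 string,
# so a name costs len(name) + 1 bytes per document, per field.

def field_name_bytes(doc):
    """Bytes spent on field names in one document (nested dicts included)."""
    total = 0
    for key, value in doc.items():
        total += len(key.encode("utf-8")) + 1  # name + NUL terminator
        if isinstance(value, dict):
            total += field_name_bytes(value)
    return total

# Hypothetical documents using the names from the quoted blog post.
long_doc = {"title": "Hello", "body": "Some text"}
short_doc = {"t": "Hello", "b": "Some text"}

saved_per_doc = field_name_bytes(long_doc) - field_name_bytes(short_doc)
print(saved_per_doc)              # uncompressed bytes saved per document
print(saved_per_doc * 1_000_000)  # the same over a hypothetical 1M documents
```

This only covers the uncompressed document size; what ends up on disk also depends on the storage engine and its compression, so it is a starting point rather than an answer.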

Solution 2

Keep the priority for meaningful names above the priority for short names unless your own situation and testing provides a specific reason to alter those priorities.

As mentioned in the comments of SERVER-863, if you're using MongoDB 3.0+ with the WiredTiger storage option with snappy compression enabled, long field names become even less of an issue as the compression effectively takes care of the shortening for you.
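To get a feel for why compression blunts the cost of long field names, here is a small sketch. It is only an analogy: it uses zlib from the Python standard library as a stand-in for snappy, and newline-delimited JSON as a stand-in for BSON pages, so the numbers say nothing precise about WiredTiger. The field names are borrowed from the benchmark elsewhere on this page; the values are generated.

```python
import json
import zlib

def make_payload(name_key, pop_key, count=1000):
    """Serialize `count` small records as newline-delimited JSON bytes."""
    rows = [{name_key: f"country-{i}", pop_key: i * 37} for i in range(count)]
    return "\n".join(json.dumps(row) for row in rows).encode("utf-8")

long_raw = make_payload("countryNameMaster", "countryPopulationNumber")
short_raw = make_payload("name", "pop")

# zlib is an assumption here; WiredTiger's default block compressor is snappy.
long_zip = zlib.compress(long_raw)
short_zip = zlib.compress(short_raw)

raw_gap = len(long_raw) - len(short_raw)  # extra bytes caused by long names
zip_gap = len(long_zip) - len(short_zip)  # what is left after compression
print("raw gap:", raw_gap, "compressed gap:", zip_gap)
```

Because the same names repeat in every record, the compressor can replace most repeats with short back-references, so the gap after compression should be much smaller than the raw gap.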

Solution 3

Bottom line up front: keep names as compact as they can be while still staying meaningful.

I don't think it is ever truly required to shorten names to a single letter. Still, shorten them as much as you can while remaining comfortable with the result. Say you have a user's name: {FirstName, MiddleName, LastName}; you may be good to go with name: {first, middle, last}. If you feel comfortable, you may even be fine with name: {f, m, l}.
You should use short names because long ones consume disk space and memory and may thus slow down your application somewhat (fewer objects fit in memory, lookups are slower due to the bigger size, and queries take longer because seeking over the data takes longer).
Good schema documentation can tell the developer that t stands for town and not for title. Depending on your stack, you may even be able to shield developers from these shortcuts through helper utilities that map them.
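One way such a helper could look is sketched below. Everything in it is hypothetical: the mapping table, the function names, and the choice to translate only top-level keys. An ODM or driver-level hook may offer a cleaner way to do the same thing in your stack.

```python
# Hypothetical mapping between the readable names used in application code
# and the short names stored in the database.
LONG_TO_SHORT = {"town": "t", "firstName": "f", "lastName": "l"}
SHORT_TO_LONG = {short: long for long, short in LONG_TO_SHORT.items()}

def to_storage(doc):
    """Rename top-level keys to their short form; unknown keys pass through."""
    return {LONG_TO_SHORT.get(key, key): value for key, value in doc.items()}

def from_storage(doc):
    """Rename top-level keys back to their readable form."""
    return {SHORT_TO_LONG.get(key, key): value for key, value in doc.items()}

stored = to_storage({"_id": 1, "town": "Vienna", "firstName": "Ada"})
print(stored)
print(from_storage(stored))
```

Keeping the mapping in a single table also gives the schema documentation one place to point at.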

Finally, I would say that there is no guideline for when and how much you should shorten your schema names. It depends heavily on your environment and requirements. But you are fine keeping names compact if you can supply good documentation that explains everything and/or offer utilities that ease the life of developers and admins. Admins are likely to interact with MongoDB directly anyway, so good documentation should not be skipped.

Solution 4

I performed a little benchmark: I uploaded 252 rows of data from an Excel file into two collections, testShortNames and testLongNames, as follows:

Long Names:

{
    "_id": ObjectId("6007a81ea42c4818e5408e9c"),
    "countryNameMaster": "Andorra",
    "countryCapitalNameMaster": "Andorra la Vella",
    "areaInSquareKilometers": 468,
    "countryPopulationNumber": NumberInt("77006"),
    "continentAbbreviationCode": "EU",
    "currencyNameMaster": "Euro"
}

Short Names:

{
    "_id": ObjectId("6007a81fa42c4818e5408e9d"),
    "name": "Andorra",
    "capital": "Andorra la Vella",
    "area": 468,
    "pop": NumberInt("77006"),
    "continent": "EU",
    "currency": "Euro"
}

I then got the stats for each collection, saved them to files on disk, and ran a diff on the two files:

pprint.pprint(db.command("collstats", dbCollectionNameLongNames))

Two variables in the stats are of interest: size and storageSize. From my reading, storageSize is the amount of disk space used after compression, and size is basically the uncompressed size. The storageSize was identical for both collections. Apparently the WiredTiger engine compresses field names quite well.
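Instead of diffing files by hand, the two stats dictionaries can be compared directly. The helper below is a sketch: compare_stats is a made-up name, it assumes both dictionaries come from db.command("collstats", ...) as above, and the sample numbers are placeholders for illustration, not the results of this benchmark.

```python
def compare_stats(long_stats, short_stats, keys=("size", "storageSize")):
    """Return {key: (long_value, short_value, difference)} for selected stats."""
    return {
        key: (long_stats[key], short_stats[key], long_stats[key] - short_stats[key])
        for key in keys
    }

# Placeholder numbers for illustration only, not the benchmark's results.
example_long = {"size": 2000, "storageSize": 500}
example_short = {"size": 1200, "storageSize": 500}

report = compare_stats(example_long, example_short)
for key, (long_value, short_value, diff) in report.items():
    print(f"{key}: long={long_value} short={short_value} diff={diff}")
```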

I then ran a program to retrieve all data from each collection, and checked the response time.

Even though it was a sub-second query, the long names consistently took about 7 times longer. Sending the longer names from the database server to the client program will of course take longer.

-------LongNames-------
Server Start DateTime=2021-01-20 08:44:38
Server End   DateTime=2021-01-20 08:44:39
StartTimeMs= 606964546  EndTimeM= 606965328
ElapsedTime MilliSeconds= 782
-------ShortNames-------
Server Start DateTime=2021-01-20 08:44:39
Server End   DateTime=2021-01-20 08:44:39
StartTimeMs= 606965328  EndTimeM= 606965421
ElapsedTime MilliSeconds= 93

In Python, I just did the following (I had to actually loop through the items to force the reads, otherwise the query returns only the cursor):

results = dbCollectionLongNames.find(query)
for result in results:
    pass
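For anyone repeating the measurement, the timing can be wrapped in a small helper like the sketch below. time_full_read is a made-up name, and iter(range(252)) merely stands in for a real cursor so that the snippet runs without a database; with pymongo you would pass something like lambda: dbCollectionLongNames.find(query) instead.

```python
import time

def time_full_read(make_cursor):
    """Create a cursor, read every item, and return (count, elapsed_ms)."""
    start = time.perf_counter()
    count = 0
    for _ in make_cursor():  # iterating forces the actual reads
        count += 1
    elapsed_ms = (time.perf_counter() - start) * 1000
    return count, elapsed_ms

# Stand-in cursor; replace with a real find() call against each collection.
count, elapsed_ms = time_full_read(lambda: iter(range(252)))
print(count, round(elapsed_ms, 3))
```

Running each collection several times and alternating the order may help separate the effect of name length from one-off costs such as connection setup or a cold cache.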

Solution 5

Adding my 2 cents on this.

Long attribute names (or "AbnormallyLongNameAttributes") can be avoided when designing the data model. At my previous organisation we tested a strategy of short attribute names, using organisation-defined 4-5 letter codes, e.g.:

  1. First Name = FSTNM
  2. Last Name = LSTNM
  3. Monthly Profit Loss Percentage = MTPCT
  4. Year on Year Sales Projection = YOYSP, and so on.

While we observed an improvement in query performance, largely due to the reduced size of the data transferred over the network or (since we used Java with MongoDB) the shorter "keys" in the MongoDB document/Java Map heap space, the overall improvement in performance was less than 15%.

In my personal opinion, this was a micro-optimization that came at the additional cost (and huge headache) of designing and maintaining a separate system to manage a Data Attribute Dictionary for each data model. That system was needed for organisation-wide transparency when debugging the application or answering client queries.

If you find yourself in a position where a performance increase of up to 20% from this strategy looks attractive, maybe it is time to scale up your MongoDB servers, choose a different data modelling or querying strategy, or choose a different database altogether.

Solution 6

If you are storing verbose XML, ameliorating that with custom names could be very important. A user comment on the SERVER-863 ticket said that in his case: 'I'm storing externally-defined XML objects, with verbose naming: the fieldnames are, perhaps, 70% of the total record size. So fieldname tokenization could be a giant win, both in terms of I/O and memory efficiency.'

Solution 7

Collection with smaller names: InsertCompress
Collection with bigger names: InsertNormal

I performed this on our Mongo sharded cluster, and the analysis shows:

  1. There is around a 10-15% gain with shorter names when saving, which seems to be purely due to network latency. I used bulk inserts from multiple threads, so with single inserts the saving could be larger.

  2. The average document size is 280 B for InsertCompress and 350 B for InsertNormal, and I inserted 25 million records. InsertNormal therefore shows 8.1 GB and InsertCompress 6.6 GB. This is the data size.

  3. Surprisingly, the index size shows as 2.2 GB for the InsertCompress collection and 2 GB for the InsertNormal collection.

  4. Again, the storage size is 2.2 GB for the InsertCompress collection, while for InsertNormal it is around 1.6 GB.

Overall, apart from network latency, nothing is gained in storage, so it is not worth putting effort in this direction to save storage. Consider it only if you have much bigger documents where shorter field names save a lot of data.
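As a rough sanity check of the data-size figures in point 2, the totals can be recomputed from the average document sizes. The sketch assumes the reported "GB" values are binary gigabytes (2^30 bytes), which is a guess rather than something stated above.

```python
RECORDS = 25_000_000
GIB = 2 ** 30  # assumption: the reported GB figures are binary gigabytes

def total_gib(avg_doc_bytes, records=RECORDS):
    """Total uncompressed data size in GiB for a given average document size."""
    return avg_doc_bytes * records / GIB

normal = total_gib(350)    # InsertNormal, average 350 B per document
compress = total_gib(280)  # InsertCompress, average 280 B per document
saving = 1 - 280 / 350     # fraction of data size saved by shorter names

print(f"InsertNormal ~{normal:.2f} GiB, InsertCompress ~{compress:.2f} GiB, "
      f"saving {saving:.0%}")
```

The recomputed totals land close to the reported 8.1 GB and 6.6 GB, so the roughly 20% difference in uncompressed data size follows from the field names alone; the storage size after compression is a separate matter, as points 3 and 4 show.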