Preface
This post shares some common data cleaning methods for MongoDB.
Note: "Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in." causes and solutions:
- Reason
During large aggregation and statistics jobs, each computation task (job or task) runs with its own limited memory. MongoDB does not provide a complex memory allocation model (task scheduling algorithm); it simply caps each pipeline stage at 100 MB of memory, and if a stage exceeds that limit the computation is aborted and an error is returned. To handle large data sets, specify the allowDiskUse option, which lets a stage write the "overflow" data to local temporary files (a temporary collection); we usually set it to true (a minimal example follows the solution below).
- Solution:
{allowDiskUse:true}
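In the mongo shell the option is passed as the options document (the second argument to aggregate()), as in the examples below; from pymongo it is a keyword argument. A minimal pymongo sketch, assuming a local mongod and placeholder database/collection names:

```python
import pymongo

# Placeholder connection details and names; adjust to your environment.
client = pymongo.MongoClient('mongodb://localhost:27017')
coll = client['deepnet_test']['feedImg_all']

pipeline = [
    {'$group': {'_id': '$feedId', 'count': {'$sum': 1}}},
]

# allowDiskUse=True lets memory-hungry stages such as $group spill to
# temporary files on disk instead of aborting at the 100 MB stage limit.
for doc in coll.aggregate(pipeline, allowDiskUse=True):
    print(doc)
```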
Query duplicate data
```js
db.feedImg_all.aggregate([
    {
        $group: {
            _id: {mvid: '$mvid', feed_id: '$feed_id'},
            count: {$sum: 1},
            dups: {$addToSet: '$_id'}
        }
    },
    {$match: {count: {$gt: 1}}}
], {allowDiskUse: true})
```
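Each document in the result describes one duplicate group: _id holds the duplicated key (mvid, feed_id), count the number of copies, and dups the _id values of all copies in that group.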
Delete duplicate data
```js
db.xiuxiu_all.aggregate([
    {
        $group: {
            _id: {mvid: '$mvid', feed_id: '$feed_id'},
            count: {$sum: 1},
            dups: {$addToSet: '$_id'}
        }
    },
    {$match: {count: {$gt: 1}}}
], {allowDiskUse: true}).forEach(function(doc) {
    doc.dups.shift();                              // drop the first _id of the group so that one copy is kept
    db.xiuxiu_all.remove({_id: {$in: doc.dups}});  // remove() deletes the remaining duplicates
});
```
Delete duplicate data (Python version)
```python
# -*- coding:utf-8 -*-
import pymongo
from pymongo import DeleteOne

'''
@author: luochenxi
@time: 2018/11/15
@desc:
'''

pipeline = [
    {
        '$group': {
            '_id': {'mvid': '$mvid', 'feed_id': '$feed_id'},
            'count': {'$sum': 1},
            'dups': {'$addToSet': '$_id'}
        },
    },
    {
        '$match': {
            'count': {'$gt': 1}
        }
    }
]

myclient = pymongo.MongoClient(host='m3005.test.com', port=3005, connect=False)
db = myclient.deepnet_test

if __name__ == '__main__':
    # Keep the first _id of each duplicate group and collect the rest for deletion.
    map_id = map(lambda doc: doc['dups'][1:],
                 db['xiuxiu_all'].aggregate(pipeline=pipeline, allowDiskUse=True))
    list_id = [item for sublist in map_id for item in sublist]
    # bulk_write requires a non-empty list of operations.
    if list_id:
        print(db['xiuxiu_all']
              .bulk_write(list(map(lambda _id: DeleteOne({'_id': _id}), list_id)))
              .bulk_api_result)
```
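Collecting all duplicate _ids first and deleting them with a single bulk_write of DeleteOne operations sends the deletes to the server in far fewer round trips than issuing one remove() per duplicate group, which matters when there are many duplicates.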
Copy data from one collection to another
```js
db.xiuxiu_all.find().forEach(function(x) {
    db.xiuxiu_all_bak.insert(x);
});
```
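For large collections, the per-document forEach/insert loop above can be slow. An aggregation with a $out stage copies the data entirely on the server side; note that $out replaces the target collection if it already exists. A minimal pymongo sketch with the same placeholder names:

```python
import pymongo

# Placeholder connection details; adjust to your environment.
client = pymongo.MongoClient('mongodb://localhost:27017')
db = client['deepnet_test']

# An empty $match passes every document through; $out writes the result set
# to xiuxiu_all_bak (replacing it if it exists), all on the server side.
db['xiuxiu_all'].aggregate([
    {'$match': {}},
    {'$out': 'xiuxiu_all_bak'},
])
```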
Deduplicate a field and count the distinct records
```js
db.feedImg_all.aggregate([
    {$match: {"createTime": {"$gte": 1541606400, "$lt": 1541692800}}},  // Add filter condition
    {$project: {"feedId": true}},
    {$group: {_id: "$feedId"}},
    {$group: {_id: null, count: {$sum: 1}}}
], {allowDiskUse: true})
```
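The same deduplicated count from Python, a minimal pymongo sketch (connection details are placeholders; the timestamp bounds are the ones used above):

```python
import pymongo

# Placeholder connection details; adjust to your environment.
client = pymongo.MongoClient('mongodb://localhost:27017')
coll = client['deepnet_test']['feedImg_all']

pipeline = [
    # Keep only documents created inside the chosen time window.
    {'$match': {'createTime': {'$gte': 1541606400, '$lt': 1541692800}}},
    {'$project': {'feedId': True}},
    # The first $group deduplicates on feedId, the second counts the distinct values.
    {'$group': {'_id': '$feedId'}},
    {'$group': {'_id': None, 'count': {'$sum': 1}}},
]

result = list(coll.aggregate(pipeline, allowDiskUse=True))
print(result[0]['count'] if result else 0)
```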