Common data cleaning operations in MongoDB

Keywords: Python, Lambda, Big Data, MongoDB

Preface

This post shares some common data cleaning methods for MongoDB.

Note: "Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in." causes and solutions:

  • Reason
  During big data calculations and statistics, each computation task (job/task) gets its own limited memory space. MongoDB does not provide a complex memory allocation model (task scheduling algorithm); it simply limits each aggregation stage to at most 100 MB of memory. If a stage exceeds this limit, the computation is terminated and the error above is returned. To support processing large data sets, the allowDiskUse parameter lets the "overflow" data be written to local temporary files (a temporary collection), so we usually set it to true.
  • Solution: pass the option as the second argument to aggregate() (a pymongo sketch follows below):
{allowDiskUse: true}
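
For completeness, here is a minimal sketch of passing the same option from Python with pymongo; the connection string and the $group stage below are illustrative placeholders, not part of the original examples:

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017')   # placeholder connection string
coll = client['deepnet_test']['feedImg_all']

# allowDiskUse=True lets stages that exceed the 100 MB in-memory limit spill to temporary files
pipeline = [{'$group': {'_id': '$mvid', 'count': {'$sum': 1}}}]
for doc in coll.aggregate(pipeline, allowDiskUse=True):
    print(doc)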

Query duplicate data

db.feedImg_all.aggregate([
    {
        // Group by the (mvid, feed_id) pair, count occurrences and collect the _id of every member
        $group: {
            _id: {'mvid': '$mvid', 'feed_id': '$feed_id'},
            count: {$sum: 1},
            dups: {$addToSet: '$_id'}
        }
    },
    {
        // Keep only the groups that occur more than once, i.e. the duplicates
        $match: {count: {$gt: 1}}
    }
], {allowDiskUse: true})

Delete duplicate data

db.xiuxiu_all.aggregate([
    {
        // Group by (mvid, feed_id) and collect the _id of every member
        $group: {
            _id: {'mvid': '$mvid', 'feed_id': '$feed_id'},
            count: {$sum: 1},
            dups: {$addToSet: '$_id'}
        }
    },
    {
        // Keep only the groups with duplicates
        $match: {count: {$gt: 1}}
    }
], {allowDiskUse: true}).forEach(function(doc){
    doc.dups.shift();                             // keep one _id per group ($addToSet has no guaranteed order, so which copy survives is arbitrary)
    db.xiuxiu_all.remove({_id: {$in: doc.dups}}); // remove() deletes the remaining duplicates
})

Delete duplicate data (Python version)

# -*- coding:utf-8 -*-
import pymongo
from pymongo import DeleteOne
'''
@author: luochenxi
@time: 2018/11/15
@desc: remove duplicate (mvid, feed_id) documents from the xiuxiu_all collection
'''

pipeline = [
    {
        '$group': {
            '_id': {'mvid': '$mvid', 'feed_id': '$feed_id'},
            'count': {'$sum': 1},
            'dups': {
                '$addToSet': '$_id'
            }
        },
    },
    {
        '$match': {
            'count': {
                '$gt': 1
            }
        }
    }
]

myclient = pymongo.MongoClient(host='m3005.test.com', port=3005, connect=False)
db = myclient.deepnet_test

if __name__ == '__main__':
    # For every duplicate group keep the first _id and collect the rest for deletion
    map_id = map(lambda doc: doc['dups'][1:], db['xiuxiu_all'].aggregate(pipeline=pipeline, allowDiskUse=True))
    list_id = [item for sublist in map_id for item in sublist]
    # Delete all collected _ids in one bulk operation; bulk_api_result reports the outcome (e.g. nRemoved)
    print(db['xiuxiu_all']
          .bulk_write(list(map(lambda _id: DeleteOne({'_id': _id}), list_id)))
          .bulk_api_result)

Copying data from one collection to another

db.xiuxiu_all.find().forEach(function(x){
    db.xiuxiu_all_bak.insert(x);   // insert each document into the backup collection
})
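
The forEach/insert loop above copies one document at a time through the client. As an alternative sketch (not from the original post), MongoDB's $out aggregation stage can materialize the copy entirely on the server; note that $out overwrites the target collection if it already exists. The connection string below is a placeholder:

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017')   # placeholder connection string
db = client['deepnet_test']

# Copy every document of xiuxiu_all into xiuxiu_all_bak on the server side.
# $out must be the last stage and replaces xiuxiu_all_bak if it already exists.
db['xiuxiu_all'].aggregate([
    {'$match': {}},
    {'$out': 'xiuxiu_all_bak'}
])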

Deduplicate on a field and count the distinct records

db.feedImg_all.aggregate([
    {$match: {"createTime": {"$gte": 1541606400, "$lt": 1541692800}}}, // filter to a one-day createTime window
    {$project: {"feedId": true}},                                      // keep only the feedId field
    {$group: {_id: "$feedId"}},                                        // collapse duplicate feedId values
    {$group: {_id: null, count: {$sum: 1}}}                            // count the distinct feedIds
], {allowDiskUse: true})
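
The same distinct-feedId count can be obtained from Python. This is a sketch using pymongo with a placeholder connection string; the field names and time window come from the shell example above:

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017')   # placeholder connection string
coll = client['deepnet_test']['feedImg_all']

count_pipeline = [
    {'$match': {'createTime': {'$gte': 1541606400, '$lt': 1541692800}}},  # one-day createTime window
    {'$project': {'feedId': True}},
    {'$group': {'_id': '$feedId'}},                                       # one document per distinct feedId
    {'$group': {'_id': None, 'count': {'$sum': 1}}}                       # count the distinct values
]

result = list(coll.aggregate(count_pipeline, allowDiskUse=True))
print(result[0]['count'] if result else 0)   # 0 when nothing matches the window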
