Solution to Python Memory Error

Keywords: Python Data Analysis

   When using Python to process data, you may sometimes encounter large data sets that raise a Memory Error. After some experimentation, I have summarized the following approaches:

1. Reduce the size of the data types

  Downcast each numeric column to the smallest data type that can hold its values, compressing the data in memory and reducing memory usage.

import time
import numpy as np
import pandas as pd

# Downcast each numeric column to the smallest dtype that can hold its values
def reduce_mem_usage(df):
    starttime = time.time()
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2

    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            # Skip columns whose range cannot be determined (e.g. all NaN)
            if pd.isnull(c_min) or pd.isnull(c_max):
                continue
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage before optimization is: {:.2f} MB'.format(start_mem))
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}% in {:.2f} s'.format(
        100 * (start_mem - end_mem) / start_mem, time.time() - starttime))

    return df
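
  For reference, a minimal usage sketch, assuming a CSV file at the hypothetical path ./data.csv whose numeric columns fit into the smaller types:

import pandas as pd

df = pd.read_csv('./data.csv')   # hypothetical input file
df = reduce_mem_usage(df)        # downcast the numeric columns
print(df.dtypes)                 # inspect the new, smaller dtypes

  Note that float16 keeps only about three significant decimal digits, so columns downcast to float16 may lose precision; if that matters for your data, remove the float16 branch from the function.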

2. Divide and conquer

  When reading a file, you can read it line by line or chunk by chunk instead of loading all the data into memory at once, which can crash the program.

  • Read line by line:
data = []
# path is a placeholder for the input file; read it one line at a time
with open(path, 'r', encoding='gbk', errors='ignore') as f:
    for line in f:
        # Strip the trailing newline before splitting on commas
        data.append(line.rstrip('\n').split(','))
  • Read in chunks:
       Sometimes, even when all the data can be read at once, operating on it (for example, running reduce_mem_usage) may still fail. In that case, process the data in chunks and then concatenate the results.
user_log_file = r'./file.csv'
# Read the CSV in chunks of 500,000 rows
read_chunks = pd.read_csv(user_log_file, iterator=True, chunksize=500000)
chunks = []
for chunk in read_chunks:
    chunk = reduce_mem_usage(chunk)
    chunks.append(chunk)
# DataFrame.append was removed in pandas 2.0; concatenate the chunks instead
user_log = pd.concat(chunks)
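
  If even the concatenated result does not fit in memory, it can help to aggregate each chunk as it is read instead of keeping all the chunks. A minimal sketch, assuming only per-column sums are needed (the file name is the same placeholder as above):

import pandas as pd

totals = None
for chunk in pd.read_csv(r'./file.csv', chunksize=500000):
    part = chunk.sum(numeric_only=True)   # aggregate this chunk only
    totals = part if totals is None else totals.add(part, fill_value=0)
print(totals)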

3. Modify disk virtual memory

Expand the disk virtual memory (page file) as follows:
1. Open the Control Panel;
2. Find the System item;
3. Open Advanced system settings;
4. Click the Settings button in the Performance section;
5. Select the Advanced tab and click Change in the Virtual memory section;
6. Select the disk on which your program and files run and click Custom size. Remember to uncheck "Automatically manage paging file size for all drives" and enter the initial size and maximum size by hand. Do not make them too large; check the disk usage afterwards so you do not give up too much free space.
7. After everything is set, remember to click "Set" and then OK, otherwise the change will not take effect. Finally, restart the computer. (At first I did not restart, the program still would not run, and I even suspected a hardware problem.)
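
  To confirm from Python that the larger page file is actually available, one option is the third-party psutil package (an assumption here; it is not used anywhere else in this post):

import psutil   # assumed third-party dependency: pip install psutil

print(psutil.virtual_memory())   # physical RAM: total, available, percent, ...
print(psutil.swap_memory())      # swap / page file: total, used, free, percent, ...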

  If you know of other ways to solve memory problems, please leave a comment~

Posted by hairyjim on Sun, 31 Oct 2021 05:48:37 -0700