Kernel density estimation and visualization of data: a Python implementation

Keywords: Python data visualization


The main references of this paper are as follows:

  1. Silverman B.W. Density Estimation. Chapman and Hall, London, 1986
  2. Standard ISO 13528

Example data introduction

The sample data used in this blog are tables 2.1 and 2.2 in reference 1 of this blog, as follows:

Mental illness treatment duration data sheet

Geyser burst duration data sheet

Introduction to file placement

We put the above data table in another folder named "data" in the code folder:

code:
------->Xxx.py (code)
Data:
------->Geyser outbreak duration schedule
------->Suicidal tendency treatment time

Frequency distribution histogram method

Principle discussion and analysis

Set dataset as X = ( X 1 , X 2 , ⋯   , X n ) X=(X_1, X_2, \cdots, X_n) X=(X1​,X2​,⋯,Xn​):

The frequency distribution histogram requires us to provide the drawing origin x 0 x_0 x0 and a bandwidth h h h. Will fall on [ x 0 + ( m − 1 ) ⋅ h , x 0 + m ⋅ h ] [x_0 + (m-1)\cdot h, x_0 + m \cdot h] Count the number of points on [x0 + (m − 1) ⋅ h,x0 + m ⋅ H] and divide it by the bandwidth length:
f ^ ( x ) = 1 n h ( Fall on and x On the same bandwidth X i Number of ) \hat{f}(x) = \frac{1}{nh} (\text {number of} X_i \text {falling on the same bandwidth as} x \text}) f ^ (x)=nh1 (number of Xi falling on the same bandwidth as x)

Generally, when analyzing the data, we often use the frequency distribution histogram to preliminarily visualize the data. We draw the frequency distribution histogram between the mental illness treatment duration data table and the geyser outbreak duration data table, as shown below:



As can be seen from the above figure, even if the bandwidth is the same, different origins of histograms will lead to different visual effects of histograms. For the histogram of outbreak duration of indirect spring, although the first and second diagrams show obvious bimodal distribution, it is difficult to see the interval of bimodal for Figure 2.

For the histogram of psychiatric treatment duration, the second figure is in the area [ 150 , 250 ] [150, 250] At [150250], a secondary peak can be seen. For non statisticians, it is likely to cause the misunderstanding that the data show bimodal distribution. After all, from Figure 1, this is likely to be caused by randomness.

Python code implementation

# -*- coding: utf-8 -*-
"""
Created on Sun Sep 19 10:11:15 2021

@author: zhuozb

This code mainly shows the relevant information in Section II of 13528 reference 30 histogram Use case discussion.

This code mainly discusses the duration of geyser explosion and the frequency distribution of suicide psychotherapy time.
"""

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def extract_data(path):
    # Reading Excel datasheets into Python
    table = pd.read_excel(path, header=None)
    # Convert data table to 2D sequence
    data = table.values
    # Eliminate non digital data and convert one-dimensional sequence to two-dimensional sequence
    data = data[~np.isnan(data)]
    return data




def plot_histogram(data, bins='auto', title=''):
    
    # 
    fig, ax = plt.subplots()
    
    # Draw directly with plt.hist
    freq, bins, patches = plt.hist(data, 
                                   bins=bins,
                                   density=True,
                                   edgecolor='black')
    # Label position of frequency distribution histogram
    bin_centers = np.diff(bins)*0.5 + bins[:-1]
    

    for fr, x, patch in zip(freq, bin_centers, patches):
      height = round(fr, 4)
      plt.annotate("{}".format(height),
                   xy = (x, height),             # top left corner of the histogram bar
                   xytext = (0,0.2),             # offsetting label position above its bar
                   textcoords = "offset points", # Offset (in points) from the *xy* value
                   ha = 'center', va = 'bottom',
                   fontfamily='arial')  # The label is also displayed in a font
    
    # Automatically adjust the display range of X-axis and Y-axis, although it is of no use
    ax.set_xlim(auto=True)
    ax.set_ylim(auto=True)
    
    # Label the X and Y axes
    if title != '':
        # If no title is given, the X-axis label is not displayed
        plt.xlabel(title, fontfamily='simhei')
    plt.ylabel('frequency', fontfamily='simhei')
    
    
    plt.grid(axis='y')

if __name__ == '__main__':
    eruption_lenght_path = r'../data/Geyser outbreak duration schedule.xlsx'
    treatment_spell_path = r'../data/Suicidal tendency treatment time.xlsx'
    # Read dataset
    # Read data table 2.1
    treatment_spell_data = extract_data(treatment_spell_path)
    # Draw table 2.2
    eruption_lenght_data = extract_data(eruption_lenght_path)
    
    # Draw figure 2.1
    plot_histogram(eruption_lenght_data, bins=np.arange(1.25, 5.5, 0.5), title='Burst duration')
    plot_histogram(eruption_lenght_data, bins=np.arange(1.5, 5.5, 0.5), title='Burst duration')
    
    # Draw Figure 2.2
    plot_histogram(treatment_spell_data, bins=np.arange(0, 1000, 50), title='Duration of suicide psychotherapy')
    plot_histogram(treatment_spell_data, bins=np.arange(-25, 1000, 50), title='Duration of suicide psychotherapy')

Simple kernel density estimation

Principle discussion and analysis

The principle of the simple kernel density estimation method is to set the bandwidth of the frequency distribution histogram close to 0. For a probability density function:
f ( x ) = lim ⁡ h → 0 1 2 h P ( x − h < X < x + h ) f(x) = \lim_{h\to0} \frac{1}{2h} P(x-h < X < x+h) f(x)=h→0lim​2h1​P(x−h<X<x+h)
Therefore, we can use the frequency distribution map with infinitesimal bandwidth to estimate the probability density:
f ^ ( x ) = 1 2 n h ( Fall in bandwidth ( x − h , x + h ) Upper X i Number of ) \hat{f}(x) = \frac{1}{2nh} (\text {number of} X_i \text {falling on bandwidth} (x-h, x+h) \text {) f ^ (x)=2nh1 (the number of Xi falling on the bandwidth (x − h,x+h))
For convenience of expression, we define a kernel density function as follows:
w ( x ) = { 1 2  if  ∣ x ∣ < 1 0  otherwise.  w(x)= \begin{cases}\frac{1}{2} & \text { if }|x|<1 \\ 0 & \text { otherwise. }\end{cases} w(x)={21​0​ if ∣x∣<1 otherwise. ​
therefore, f ^ ( x ) \hat{f}(x) f ^ (x) can be expressed as:
f ^ ( x ) = 1 h n ∑ i = 1 n w ( x − X i h ) \hat{f}(x) = \frac{1}{hn} \sum_{i=1}^{n} w ( \frac{x-X_i}{h} ) f^​(x)=hn1​i=1∑n​w(hx−Xi​​)
In practical application, of course, the bandwidth does not need to be infinitely close to 0. Taking the duration of geyser outbreak as an example, we can set the bandwidth to 0.25, and the simple kernel density diagram is as follows:

Due to our empirical density function f ^ ( x ) \hat{f}(x) f ^ (x) is not a continuous function and is non differentiable at the break point. It can also be seen intuitively from the above figure that the step part of the image may be confusing to scholars without learning.

Python code implementation

# -*- coding: utf-8 -*-
"""
Created on Sun Sep 19 11:37:22 2021

@author: zhuozb

This is ISO13528 References in, 2 in 30.3 Case drawing of section
"""


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from histogram_example_showing import extract_data


def digits_after_decimal(num):
    '''
    Return variable num Number of decimal places
    '''
    num = str(num)
    # Position of decimal point
    digit_loc = num.find('.')

    # Number of digits after decimal point
    digits_decimal = len(num) - digit_loc - 1
    
    return digits_decimal

def plot_naive_estimator(data, bandwidth=0.25, title=''):
    '''
     data : Data to be counted
     bandwidth: Drawing bandwidth
     title: Draw the title of the image. If not, the title will not be written by default
    '''
    fig, ax = plt.subplots()
    
    # n is the sample size
    n = len(data)
    
    # Returns the smallest number in the sequence and generates (limit) interval ranges based on this number
    min_num = min(data)
    # Find the number of digits after the decimal point
    digits_decimal = digits_after_decimal(min_num)
    
    # Drawing point interval
    data_interval = 0.1**(digits_decimal)   

    # Find the maximum number in the sequence to generate the drawing range
    max_num = max(data)
    
    # Generate drawing points of X axis
    xtick_array = np.arange(0.9*min_num if (min_num>0) else 1.1*min_num,
                            1.1*max_num, data_interval)
    
    # Generate drawing points of Y axis
    ytick_array = []
    
    def calculate_y(x, x_data, bandwidth):
        # Solve according to formula 2.1, X corresponds to Y
        y_sum = 0
        for x_i in data:
            # Generate a weight sequence from the data sequence according to (x-X)/h
            weight_array = np.abs((x - x_i)/bandwidth)
            # According to formula 2.1, sum the weight sequence according to conditions
            sum_weight = len(weight_array[(weight_array < 1)])/2
            y_sum = y_sum + sum_weight
        # 
        y = y_sum/(n*bandwidth)
        return y    
        
    ytick_array = [calculate_y(x, data, bandwidth) for x in xtick_array]
        
    # Label the X and Y axes
    if title != '':
        # If no title is given, the X-axis label is not displayed
        plt.xlabel(title, fontfamily='simhei')
    plt.ylabel('Probability density estimation', fontfamily='simhei')        
    plt.plot(xtick_array, ytick_array, linewidth=1)
    # Set gridlines
    plt.grid(axis='both')

if __name__ == '__main__':
    eruption_lenght_path = r'../data/Geyser outbreak duration schedule.xlsx'
    treatment_spell_path = r'../data/Suicidal tendency treatment time.xlsx'
    # Read dataset
    # Read data table 2.1
    treatment_spell_data = extract_data(treatment_spell_path)
    # Draw table 2.2
    eruption_lenght_data = extract_data(eruption_lenght_path)        
    plot_naive_estimator(eruption_lenght_data, title=f'Burst duration,h = 0.25')

Kernel density estimation

Principle discussion and analysis

As simple as the above, when kernel density estimation is generally used, a kernel density function needs to be defined first K ( x ) K(x) K(x), the kernel density function is a probability density, which requires that the integral area on the coordinate axis is the unit area. Generally, the standard normal distribution can be taken as the kernel density function:
K ( x ) = 1 2 π exp ⁡ ( − x 2 2 ) K(x) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{x^2}{2}) K(x)=2π ​1​exp(−2x2​)
The empirical density function is defined f ^ ( x ) \hat{f}(x) f ^ (x) is as follows:
f ^ ( x ) = 1 n h ∑ i = 1 n K ( x − X i h ) \hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K ( \frac{x-X_i}{h} ) f^​(x)=nh1​i=1∑n​K(hx−Xi​​)
In kernel density estimation, parameter h is generally called window width, and the selection of window width will affect the visualization effect of data. We take 7 data as examples, although a small amount of data is not suitable for visualization by kernel density method in practical application:

The example data is: example_data = [-1.95, -1.5, -0.7, -0.65, -0.62, 0.1, 0.9], the effect when the window width is 0.2, 0.4 and 0.8 respectively can be drawn as follows:


It can be seen that when the selection of window length is small, some details of the data will be enlarged. For example, when the window length is 0.2, the whole data presents an effect similar to Dirac distribution. When the window length is wider, some details of the data will be smoothed. For example, when the window length is 0.8, some sub peak distributions of the data will be covered.

Let's use two distributions as N ( − 1 , 0.65 ) N(-1, 0.65) N(− 1,0.65) and N ( 1 , 0.65 ) N(1, 0.65) The cumulative distribution of N(1,0.65) produces 200 random data. Obviously, the cumulative distribution of the two normal distributions is a bimodal distribution, as shown below:

However, when we use random data and draw nuclear density according to different window lengths, the visualization effects are as follows:

From the above images, we also verified the previous conclusion and that if the window length is too small, some details may be enlarged, while if the window length is too large, some details may be covered and cannot be seen.

Next, we will use the data of geyser burst duration and the method of nuclear density to draw the picture. The visual effect is as follows:

One disadvantage of drawing with kernel density method is that in addition to the above selection of window length, some unidirectional trailing data may also cause errors in the expression of some details due to the selection of kernel density.

For example, in our data of psychiatric treatment duration, the value of the data must be greater than 0, while the kernel density function stipulates that the value range of the data is [ − ∞ , ∞ ] [-\infty, \infty] [−∞,∞]. Taking the example data as an example, we draw the visualization effect when the window length is 20 and 60:

We can see from Figure 1 that when the window length is selected to be small, certain noise will appear in the trailing part on the right. When the wound length is selected to be large, although these noises can be smoothly transited, the details of the main part cannot be fully displayed.

Python code implementation

# -*- coding: utf-8 -*-
"""
Created on Sun Sep 19 22:05:00 2021

@author: zhuozb

according to ISO13528 References, 2 of 30.4 Section and density estimation drawing
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm
from histogram_example_showing import extract_data
from naive_estimator_example_showing import digits_after_decimal
from scipy.stats import binom
import random

def kernel_function(x):
    '''
    Density function
    '''
    return norm.pdf(x)


def plot_kernel_estimate(data, fn, window_width=0.4, size=None,
                         title='', interval=None):
    
    # sample size
    n = len(data)
    # Returns the smallest number in the sequence,
    min_num = min(data)
    
    # Find the maximum number in the sequence to generate the drawing range
    max_num = max(data)
    
    if interval == None:
        # If the drawing interval is not given, it is automatically generated according to the digits after the decimal point of the minimum value in the data
        # Find the number of digits after the decimal point of the minimum value
        digits_decimal = digits_after_decimal(min_num)
        # Generate drawing range
        interval = 0.1**(digits_decimal)
    
    if size == None:
        # Generate drawing and drawing range
        xtick_array = np.arange(0.9*min_num if (min_num>0) else 1.1*min_num, 
                                1.1*max_num, interval)
    else:
        xtick_array = np.arange(size[0], size[1], interval)
    
    ytick_array = []
    for x in xtick_array:
        x_trans = (x - data)/window_width
        y_norm = fn(x_trans)
        y = np.sum(y_norm)/(n*window_width)
        ytick_array.append(y)
    
    plt.plot(xtick_array, ytick_array, linewidth=1)
    plt.ylim((0, 1.1*max(ytick_array)))
    # Label the X and Y axes

    plt.xlabel(title+f'h={window_width}', fontfamily='simhei')
    plt.ylabel('probability density', fontfamily='simhei')        
    # Set gridlines
    plt.grid(axis='both')

def example_showing(window_width=0.4, title=''):
    '''
    This function is mainly used to draw an example diagram
    '''
    example_data = [-1.95, -1.5, -0.7, -0.65, -0.62, 0.1, 0.9]
    plot_kernel_estimate(example_data, kernel_function, title=title, 
                         window_width=window_width, size=(-4, 3))

    # sample size
    n = len(example_data)

    # Returns the smallest number in the sequence,
    min_num = min(example_data)
    
    # Find the maximum number in the sequence to generate the drawing range
    max_num = max(example_data)
    
    size = (-4, 3)
    
    if size == None:
        # Generate drawing and drawing range
        xtick_array = np.arange(0.9*min_num if (min_num>0) else 1.1*min_num, 
                                1.1*max_num, 0.1)
    else:
        xtick_array = np.arange(size[0], size[1], 0.1)
        
    def plot_individual(x):
        # Draw the normal distribution of each point of the original data
        # Find the Y corresponding to the X axis, which has been used for drawing
        ytick_array = (xtick_array - x)/window_width
        ytick_array = norm.pdf(ytick_array)/(n*window_width)
        # Draw the normal distribution of the original data
        plt.plot(xtick_array, ytick_array)
        # Mark the original data points and draw a straight line parallel to the Y axis
        plt.plot(x, 0, marker='x', color='black', markersize=10)
        plt.vlines(x, 0, norm.pdf(0)/(n*window_width), color='black')
        
    for x in example_data:
        # Draw the normal distribution of each original data
        plot_individual(x)
        
def bimodal_example_showing(data, kernel_function, 
                            window_width, interval=0.01, title=''):
    plot_kernel_estimate(data, kernel_function, window_width,
                         size=(-3, 3), title=title, interval=0.01)
            

if __name__ == '__main__':
    # Draw the figure in the reference, 2.42.5
    for window_width in (0.2, 0.4, 0.8):
        fig = plt.figure()
        example_showing(window_width, title='Simple case')
        plt.show()
    
    # A bimodal distribution of data is generated and the amount of data is 200
    # The peak values of the data are - 1 and 1 respectively. The data uses normal distribution, and the variance of the normal distribution is 0.65
    np.random.seed(seed=3)
    def generate_bimodal(low1=-2.5, high1=0.5, mode1=-1,
                          low2=-0.5, high2=2.5, mode2=1):
        data = [norm.rvs(-1, 0.65)  if random.choice((0,1))
                else norm.rvs(1, 0.65) for i in range(200)]
    
        return data
    
    data = generate_bimodal()
    
    # Draw figure 2.6 in the references
    for window_width in (0.2, 0.4, 0.8):
        fig = plt.figure()
        bimodal_example_showing(data, kernel_function, 
                                window_width, title='Bimodal distribution')
        
    # Draw the original probability density distribution of bimodal distribution
    def plot_bimodal_kernal():
        data = [norm.rvs(-1, 0.65)  if random.choice((0,1))
                else norm.rvs(1, 0.65) for x in range(20000)]
        
        fig = plt.figure()
        plot_kernel_estimate(data, kernel_function, window_width=0.4, size=(-3,3),
                              interval=0.1)
        plt.ylabel('probability density', fontfamily='simhei')
        plt.xlabel('Original bimodal distribution', fontfamily='simhei')
        plt.show()
        
    plot_bimodal_kernal()
    
    # Read the sample data of the two tables and draw their and density diagrams
    
    # Read the geyser burst duration data sheet
    eruption_lenght_path = r'../data/Geyser outbreak duration schedule.xlsx'
    eruption_lenght_data = extract_data(eruption_lenght_path)
    # Draw figure 2.8
    fig = plt.figure()
    plot_kernel_estimate(eruption_lenght_data, kernel_function, window_width=0.25,
                         size=(0,6), title='Geyser burst duration ')
    plt.show()
    
    # Read suicide treatment duration data
    treatment_spell_path = r'../data/Suicidal tendency treatment time.xlsx'
    treatment_spell_data = extract_data(treatment_spell_path)
    # Draw figures 2.9a and 2.9b
    for h in (20, 60):
        fig = plt.figure()
        plot_kernel_estimate(treatment_spell_data, kernel_function, window_width=h,
                             size=(-200, 1000), title='Geyser burst duration ')
        plt.show()

Kernel density estimation of minimum nearest neighbor method

Principle discussion and analysis

From the discussion and analysis of the above methods, it can be seen that a fixed window width or bandwidth width will lead to either noise or covering the details of the main part. Therefore, can we dynamically determine the window length according to the distance between data? The minimum nearest neighbor method solves this problem:

Define current data x x x and raw data set X i , i ∈ ( 1 , 2 , ⋯   , n ) X_i, i\in(1,2,\cdots, n) The distance of Xi, i ∈ (1,2,..., n) is d i = ∣ x − X i ∣ d_i = |x- X_i| di = ∣ x − Xi ∣, sort the data in ascending order and find the k-th minimum distance d k d_k dk​. Where K is the number of nearest neighbors and is an integer greater than 0, so the empirical density function can be obtained as follows:
f ^ ( x ) = k 2 n d k \hat{f}(x) = \frac{k}{2n d_k} f^​(x)=2ndk​k​
This formula is actually very easy to understand, assuming that the probability density of data points is f ( x ) f(x) f(x), then when the bandwidth r r When r is small, we expect to have 2 r n f ( x ) 2 r n f(x) 2rnf(x) data falling in bandwidth [ x − r , x + r ] [x-r, x+r] [x−r,x+r]. Let's say, if there are k data falling in the interval [ x − d k , x + d k ] [x-d_k, x+d_k] In [x − dk, x+dk], i.e k = 2 d k n f ( x ) k = 2d_k n f(x) k=2dk nf(x), then the probability density function can be estimated as: f ^ ( x ) = k 2 n d k \hat{f}(x) = \frac{k}{2n d_k} f^​(x)=2ndk​k​.

about x < X m i n x < X_{min} For the point x < xmin , we need to ensure that our window width is not too large, so we need to d k = X ( k ) − x d_k = X_{(k)} - x dk​=X(k)​−x; Similarly, for x > X m a x x > X_{max} x> Xmax, you need to d k = x − X ( n − k + 1 ) d_k = x - X_{(n-k+1)} dk = x − X(n − k+1), where X ( i ) , i ∈ ( 1 , 2 , ⋯   , n ) X_{(i)}, i\in(1,2,\cdots,n) X(i), i ∈ (1,2,..., n) is the original data X i , i ∈ ( 1 , 2 , ⋯   , n ) X_i, i\in(1,2,\cdots,n) Xi, i ∈ (1,2,..., n).

One disadvantage of using the minimum nearest neighbor method for density estimation is that the kernel density function of the minimum nearest neighbor method is not a probability density function, and it is not a unit area for the integral area on the X-axis. Therefore, when using this method for visualization, the data tailing will be very serious.

In addition, when using this method for density estimation, the kernel density d k d_k At the break point of dk , the kernel density is non differentiable, which will lead to more burrs in the visualization effect. We have seen the example data of star weight burst duration as an example. The minimum nearest neighbor method is used for density estimation, and the visualization effect is as follows:

In order to alleviate this problem, we can combine the kernel density function of normal distribution with the minimum nearest neighbor method. The estimation formula of the empirical density function is as follows:

f ^ ( x ) = 1 n d k ∑ i = 1 n K ( x − X i d k ) \hat{f}(x)=\frac{1}{n d_{k}} \sum_{i=1}^{n} K\left(\frac{x-X_{i}}{d_{k}}\right) f^​(x)=ndk​1​i=1∑n​K(dk​x−Xi​​)

The visualization effect of the example data is as follows. It can be seen that although the problem is alleviated, the essential problem is still not solved, that is:

  1. The kernel density function is not a probability density function
  2. Non differentiable at some points

Python code implementation

# -*- coding: utf-8 -*-
"""
Created on Mon Sep 20 09:12:34 2021

@author: zhuozb

This code mainly shows the standards, ISO13528 References, 30 of 2.5 The minimum adjacent value method in the section draws the composite density diagram
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from histogram_example_showing import extract_data
from naive_estimator_example_showing import digits_after_decimal
from kernel_estimate_example_showing import kernel_function


def nearest_neighbour_method(data, k=20, fn='uniform', title='', size=None,
                             interval=None):

    # sample size
    n = len(data)
    # Returns the smallest number in the sequence,
    min_num = min(data)
    
    # Find the maximum number in the sequence to generate the drawing range
    max_num = max(data)
    
    if interval == None:
        # If the drawing interval is not given, it is automatically generated according to the digits after the decimal point of the minimum value in the data
        # Find the number of digits after the decimal point of the minimum value
        digits_decimal = digits_after_decimal(min_num)
        # Generate drawing range
        interval = 0.1**(digits_decimal)
    
    if size == None:
        # Generate drawing and drawing range
        xtick_array = np.arange(0.9*min_num if (min_num>0) else 1.1*min_num, 
                                1.1*max_num, interval)
    else:
        xtick_array = np.arange(size[0], size[1], interval)
        
    ytick_array = []
    # Sort raw data
    data_sort = np.sort(data)
    data_k = data_sort[0]
    data_n = data_sort[-1]
    
    for x in xtick_array:
        # If the current X point is less than a K minimum value of the original data or greater than the maximum value of the original data, d use another method to solve it
        if x < data_k:
            diff_k = data_sort[k-1] - x
        elif x > data_n:
            diff_k = x - data_sort[n-k]
        else:
            # Find the distance between the current X and the data point
            diff = np.abs(x - data)
            # Find the K-th minimum distance
            diff_sort = np.sort(diff)
            diff_k = diff_sort[k-1]
        
        if fn == 'uniform':
            # Tool 2.3, find F
            y = k/(2*n*diff_k)
        else:
            y_fn = fn((x-data)/diff_k)
            y = np.sum(y_fn)/(n*diff_k)
        
        ytick_array.append(y)
        
    plt.plot(xtick_array, ytick_array, linewidth=1)
    plt.ylim((0, 1.1*max(ytick_array)))
    # Label the X and Y axes

    plt.xlabel(title+f'k={k}', fontfamily='simhei')
    plt.ylabel('probability density', fontfamily='simhei')        
    # Set gridlines
    plt.grid(axis='both')
    

if __name__ == '__main__':
    # Read the geyser burst duration data sheet
    eruption_lenght_path = r'../data/Geyser outbreak duration schedule.xlsx'
    eruption_lenght_data = extract_data(eruption_lenght_path)
    # Draw figure 2.10
    
    nearest_neighbour_method(eruption_lenght_data, k=20,
                          size=(0,6), title='Geyser burst duration ')
    
    # Combining kernel density method and minimum nearest neighbor method
    nearest_neighbour_method(eruption_lenght_data, k=20, fn=kernel_function, 
                          size=(0,6), title='Geyser burst duration ')

Dynamic kernel density method

Principle discussion and analysis

Inspired by the minimum nearest neighbor method, we can dynamically adjust the width of the window in combination with the spacing between the data in the original data d j k d_jk dj k is the data set X i X_i Xi # and remaining data X j X_j Xj [ k / 2 ] [k/2] [k/2] minimum distances so that there is an empirical density function as follows:
f ^ ( t ) = 1 n ∑ j = 1 n 1 h d j , k K ( t − X j h d j , k ) \hat{f}(t)=\frac{1}{n} \sum_{j=1}^{n} \frac{1}{h d_{j, k}} K\left(\frac{t-X_{j}}{h d_{j, k}}\right) f^​(t)=n1​j=1∑n​hdj,k​1​K(hdj,k​t−Xj​​)
Using this method, we need to select the minimum number of nearest neighbors k and window length (I really think it solves one problem and another problem). Taking the data of psychotherapy duration as an example, we use dynamic and density methods k = 8 , h = 5 k=8, h=5 k=8,h=5, the visualization effect is as follows:

It can be seen from the above figure that the dynamic kernel density method can smoothly transition the noise of the trailing part on the right and reflect the specific details of the main part. Compared with the kernel density of the minimum net spirit method, it also solves the problem that the kernel density is probability density and differentiable, so it is a good method. But one disadvantage is that two parameters need to be selected, which is more subjective.

Python code implementation

# -*- coding: utf-8 -*-
"""
Created on Mon Sep 20 10:02:00 2021

@author: zhuozb

This code implements the standard ISO13528 The publicity in reference 30, 2.6 Variable kernel density estimation
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from histogram_example_showing import extract_data
from naive_estimator_example_showing import digits_after_decimal
from kernel_estimate_example_showing import kernel_function

def variable_kernel_estimate(data, fn=kernel_function, window_width=5, k=8, title='',
                             size=None, interval=None):

    # sample size
    n = len(data)
    # Returns the smallest number in the sequence,
    min_num = min(data)
    
    # Find the maximum number in the sequence to generate the drawing range
    max_num = max(data)
    
    if interval == None:
        # If the drawing interval is not given, it is automatically generated according to the digits after the decimal point of the minimum value in the data
        # Find the number of digits after the decimal point of the minimum value
        digits_decimal = digits_after_decimal(min_num)
        # Generate drawing range
        interval = 0.1**(digits_decimal)
    
    if size == None:
        # Generate drawing and drawing range
        xtick_array = np.arange(0.9*min_num if (min_num>0) else 1.1*min_num, 
                                1.1*max_num, interval)
    else:
        xtick_array = np.arange(size[0], size[1], interval)
    
    ytick_array = []
    for x in xtick_array:
        # For the data on the most beautiful coordinate axis X, calculate the data on the Y axis according to formula 2.6
        y_sum = 0
        for x_data in data:
            # For each point of the original data, find out the k-th minimum nearest neighbor distance between it and other N-1 points
            diff = np.abs(data-x_data)
            diff_sort = np.sort(diff)
            diff_jk = diff_sort[round(k/2)]
            
            # Calculate the density estimate according to formula 2.6
            x_trans = (x-x_data)/(window_width*diff_jk)
            y_fn = fn(x_trans)/(window_width*diff_jk)
            y_sum = y_sum + y_fn
            
        y = y_sum/n
        ytick_array.append(y)
    
    plt.plot(xtick_array, ytick_array, linewidth=1)
    plt.ylim((0, 1.1*max(ytick_array)))
    # Label the X and Y axes

    plt.xlabel(title+f'window_width={window_width}, k={k}', fontfamily='simhei')
    plt.ylabel('probability density', fontfamily='simhei')        
    # Set gridlines
    plt.grid(axis='both')
    
if __name__ == '__main__':
    # Read suicide treatment duration data
    treatment_spell_path = r'../data/Suicidal tendency treatment time.xlsx'
    treatment_spell_data = extract_data(treatment_spell_path)
    # Draw figure 2.11
    variable_kernel_estimate(treatment_spell_data, window_width=5, 
                             k=8, fn=kernel_function, 
                             size=(-200,1000), title='Suicidal tendency treatment time ',
                             interval=1)

Posted by Darkness Soul on Mon, 20 Sep 2021 04:22:14 -0700