Databend built-in scalar function development guide

Keywords: Rust

Original text: https://databend.rs/development/how-to-write-scalar-functions/

What is a scalar function?

Scalar functions (sometimes referred to as user-defined functions, UDFs) return a single value for each input row, rather than a result set, and can be used in most places within a query or SET statement, except for the FROM clause.

One-to-one mapping execution

┌─────┐                    ┌──────┐
│  a  │                    │   x  │
├─────┤                    ├──────┤
│  b  │                    │   y  │
├─────┤    ScalarFunction  ├──────┤
│  c  │                    │   z  │
├─────┼────────────────────►──────┤
│  d  │     Exec           │   u  │
├─────┤                    ├──────┤
│  e  │                    │   v  │
├─────┤                    ├──────┤
│  f  │                    │   w  │
└─────┘                    └──────┘

Introduction to the Function trait

All scalar functions implement the Function trait. We register these functions in a global static function factory, which is essentially an index map keyed by the scalar function name.

⚠️  Function names in Databend are case-insensitive.
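
As a rough mental model, the factory can be pictured as a map from lower-cased names to entries, which is what makes lookups case-insensitive. The sketch below is a self-contained toy, not Databend's actual factory code; FunctionCreator and make_sign are invented stand-ins:

use std::collections::HashMap;

// Toy stand-in for illustration only; the real factory holds entries for
// types implementing the Function trait shown below.
type FunctionCreator = fn() -> String;

fn make_sign() -> String {
    "sign".to_string()
}

struct FunctionFactory {
    creators: HashMap<String, FunctionCreator>,
}

impl FunctionFactory {
    fn new() -> Self {
        Self { creators: HashMap::new() }
    }

    // Keys are stored lower-cased, so SIGN, Sign and sign all resolve to the
    // same entry: this is what makes name lookup case-insensitive.
    fn register(&mut self, name: &str, creator: FunctionCreator) {
        self.creators.insert(name.to_lowercase(), creator);
    }

    fn get(&self, name: &str) -> Option<&FunctionCreator> {
        self.creators.get(&name.to_lowercase())
    }
}

fn main() {
    let mut factory = FunctionFactory::new();
    factory.register("sign", make_sign);
    assert!(factory.get("SIGN").is_some());
    assert!(factory.get("Sign").is_some());
}

The Function trait that every scalar function implements looks like this: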

pub trait Function: fmt::Display + Sync + Send + DynClone {
    fn name(&self) -> &str;

    fn num_arguments(&self) -> usize {
        0
    }

    // (1, 2) means we only accept [1, 2] arguments
    // None means it's not a variadic function
    fn variadic_arguments(&self) -> Option<(usize, usize)> {
        None
    }

    // return monotonicity node, should always return MonotonicityNode::Function
    fn get_monotonicity(&self, _args: &[MonotonicityNode]) -> Result<MonotonicityNode> {
        Ok(MonotonicityNode::Function(Monotonicity::default(), None))
    }

    fn return_type(&self, args: &[DataType]) -> Result<DataType>;
    fn nullable(&self, _input_schema: &DataSchema) -> Result<bool>;
    fn eval(&self, columns: &DataColumnsWithField, _input_rows: usize) -> Result<DataColumn>;
}

How to understand?

Let's analyze the meaning of the methods in the trait above:

  • name ⇒ the name of this function, such as log and sign. Sometimes we need to store the name inside the function, because different names may share the same implementation, such as pow and power: power is registered as an alias (synonym) of pow.
  • num_arguments ⇒ indicates how many arguments the scalar function accepts.
  • variadic_arguments ⇒ marks that the function accepts a variable number of arguments. For example, round() accepts one or two arguments, so its range is [1, 2]; the interval is closed on both ends.
  • get_monotonicity ⇒ describes the monotonicity of this function, which can be used to optimize execution.
  • return_type ⇒ indicates the return type of the function; we can also validate the arguments here.
  • nullable ⇒ indicates whether the function may return a column with a nullable field (at present it is fine to simply return true/false).
  • eval ⇒ the main method that actually executes the scalar function:

    1. columns → the input columns
    2. input_rows → the number of input rows

We will explain how to write the eval function below.
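
Before that, here is a deliberately simplified, self-contained sketch of how these methods fit together for a sign-like function. The ToyFunction trait, ToyDataType and ToyColumn below are toy stand-ins invented for illustration; they are not Databend's real Function, DataType or DataColumn types:

use std::fmt;

// Toy stand-ins invented for this sketch; not Databend's real types.
#[derive(Debug)]
enum ToyDataType {
    Int32,
    Float64,
}

#[derive(Debug)]
enum ToyColumn {
    Float64(Vec<f64>),
}

trait ToyFunction: fmt::Display {
    fn name(&self) -> &str;
    fn num_arguments(&self) -> usize;
    fn return_type(&self, args: &[ToyDataType]) -> Result<ToyDataType, String>;
    fn eval(&self, columns: &[ToyColumn], _input_rows: usize) -> Result<ToyColumn, String>;
}

struct SignFunction;

impl fmt::Display for SignFunction {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "sign")
    }
}

impl ToyFunction for SignFunction {
    fn name(&self) -> &str {
        "sign"
    }

    // sign takes exactly one argument.
    fn num_arguments(&self) -> usize {
        1
    }

    // Validate the argument types and report the output type.
    fn return_type(&self, args: &[ToyDataType]) -> Result<ToyDataType, String> {
        match args {
            [ToyDataType::Int32] | [ToyDataType::Float64] => Ok(ToyDataType::Float64),
            other => Err(format!("sign does not accept {:?}", other)),
        }
    }

    // One-to-one mapping: every input row produces exactly one output row.
    fn eval(&self, columns: &[ToyColumn], _input_rows: usize) -> Result<ToyColumn, String> {
        let ToyColumn::Float64(values) = &columns[0];
        let out = values
            .iter()
            .map(|v| if *v == 0.0 { 0.0 } else { v.signum() })
            .collect();
        Ok(ToyColumn::Float64(out))
    }
}

fn main() {
    let func = SignFunction;
    let result = func.eval(&[ToyColumn::Float64(vec![-3.0, 0.0, 2.5])], 3).unwrap();
    println!("{} -> {:?}", func, result); // sign -> Float64([-1.0, 0.0, 1.0])
}

A real implementation would cast the input DataColumn with the helpers described below and build a Series, but the division of responsibilities between the methods is the same.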

Prerequisite knowledge

Before writing eval functions, you may need the following knowledge.

Data types

In Databend, data types come in two kinds: logical types and physical types.

The logical data type is the type we use in Databend, while the physical data type is the type used in the execution/computing engine. For example, Date32 is a logical data type, but its physical type is Int32, so its column is represented by DFInt32Array.

There are several internal ways to obtain these two kinds of data types:

  1. DataField's data_type() returns the logical data type
  2. DataColumn's data_type() returns the physical data type
  3. DataColumnsWithField's data_type() returns the logical data type
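
To make the split concrete, here is a toy sketch (the enums below are invented for illustration, not Databend's real DataType definitions): a logical Date32 maps onto the physical Int32 that the engine actually computes on.

// Toy illustration of the logical/physical split.
#[derive(Debug)]
enum LogicalType {
    Int32,
    Date32, // what users see: a date value
}

#[derive(Debug)]
enum PhysicalType {
    Int32,
}

// Map a logical type to the physical type the compute engine works on.
fn physical_type(logical: &LogicalType) -> PhysicalType {
    match logical {
        // Date32 is purely logical: the engine computes on plain Int32,
        // which is why its column is a DFInt32Array.
        LogicalType::Int32 | LogicalType::Date32 => PhysicalType::Int32,
    }
}

fn main() {
    println!(
        "{:?} is stored as {:?}",
        LogicalType::Date32,
        physical_type(&LogicalType::Date32)
    );
}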

Memory layout

The memory layout of Databend is based on Arrow. For the details of the Arrow memory layout, you can refer to the Arrow documentation.

Take the array of primitive type int32 as an example. [1, null, 2, 4, 8] looks like this:

* Length: 5, Null count: 1
* Validity bitmap buffer:

  |Byte 0 (validity bitmap) | Bytes 1-63            |
  |-------------------------|-----------------------|
  | 00011101                | 0 (padding)           |

* Value Buffer:

  |Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 |
  |------------|-------------|-------------|-------------|-------------|-------------|
  | 1          | unspecified | 2           | 4           | 8           | unspecified |
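
The validity byte above can be reproduced directly: bit i is set when element i is non-null, least-significant bit first. A small self-contained check in plain Rust (not Arrow code):

// Build an Arrow-style validity bitmap for [1, null, 2, 4, 8].
fn main() {
    let values: [Option<i32>; 5] = [Some(1), None, Some(2), Some(4), Some(8)];

    let mut validity_byte: u8 = 0;
    for (i, v) in values.iter().enumerate() {
        if v.is_some() {
            // LSB-first: element i sets bit i.
            validity_byte |= 1 << i;
        }
    }

    // Elements 0, 2, 3, 4 are valid -> bits 0, 2, 3, 4 set -> 0b0001_1101.
    assert_eq!(validity_byte, 0b0001_1101);
    println!("validity bitmap byte 0 = {:08b}", validity_byte);

    // The null count is the number of unset bits among the first 5.
    let null_count = values.iter().filter(|v| v.is_none()).count();
    assert_eq!(null_count, 1);
}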

Constant column

Sometimes a column is a constant within a block. For example, in select 3 from table, the column 3 is always 3, so we can use a constant column to represent it. This is useful for saving memory during calculations.

Therefore, Databend's DataColumn is represented as:

pub enum DataColumn {
    // An array of values. Series is a wrapper around Arrow's array
    Array(Series),
    // A single value, together with the row count.
    Constant(DataValue, usize),
}
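
The benefit is that a Constant column stores a single value plus a row count instead of materializing the value once per row. Below is a rough standalone sketch of how an evaluation path can branch on the two cases; ToyColumn and add_one are toy stand-ins, not the real Series/DataValue machinery:

// Toy model of the Array / Constant split.
#[derive(Debug)]
enum ToyColumn {
    Array(Vec<i64>),       // one value per row
    Constant(i64, usize),  // a single value plus the row count
}

// Example scalar op: add 1 to every row, staying constant when the input is constant.
fn add_one(column: &ToyColumn) -> ToyColumn {
    match column {
        // General path: touch every value.
        ToyColumn::Array(values) => ToyColumn::Array(values.iter().map(|v| v + 1).collect()),
        // Constant path: compute once, keep the compact representation.
        ToyColumn::Constant(value, rows) => ToyColumn::Constant(value + 1, *rows),
    }
}

fn main() {
    // `select 3 from t` over a million-row block stays a single value.
    let constant = ToyColumn::Constant(3, 1_000_000);
    println!("{:?}", add_one(&constant)); // Constant(4, 1000000)
}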

Some guidelines

  1. Column conversion

To execute a scalar function, we certainly need to traverse the input columns of the arguments. Since the data types have already been checked in the return_type function, we can use cast helpers such as i32() to convert the input column into a column of a specific type, such as DFInt32Array.

  2. Constant column

As mentioned earlier, remember to handle the constant-column case to improve memory usage.

  3. Combining column iteration with the validity bitmap

When we need to iterate over a column, we can use column.iter() to generate an iterator whose items are Option<T>, where None indicates null. However, this method is inefficient, because we have to check for null on every element inside the loop, which pollutes the CPU cache.

According to the memory layout of Arrow, we can use the validity bitmap of the original column directly to represent null values. The ArrayApply trait helps you iterate over columns this way. If we need to zip two iterators, we can use the binary function to merge the validity bitmaps of the two columns (a standalone sketch of this idea follows the guidelines below):

  • ArrayApply
let array: DFUInt8Array = self.apply_cast_numeric(|a| {
    AsPrimitive::<u8>::as_(a - (a / rhs) * rhs)
});
  • binary
binary(x_series.f64()?, y_series.f64()?, |x, y| x.pow(y))
  4. Nullable check

Nullable handling can sometimes be annoying, but in most cases we can accept a DataType::Null argument.

  5. Implicit conversion

Databend accepts implicit conversions, for example pow('3', 2) and sign('1232'); we can use cast_with_type to convert the arguments into columns of the specific type.
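
To see why merging validity bitmaps beats per-element Option checks (guideline 3 above), here is a standalone sketch of the idea behind a binary-style kernel: apply the operation to every value slot unconditionally and combine the two validity bitmaps with a bitwise AND. ToyNullableF64 and binary_kernel are toy stand-ins invented for this sketch, not Databend's actual binary helper:

// Toy nullable column: values plus a validity bitmap (true = not null).
struct ToyNullableF64 {
    values: Vec<f64>,
    validity: Vec<bool>,
}

// Sketch of a binary-style kernel: run `op` over all value slots (null slots
// included, their results are simply ignored) and merge validity with AND.
fn binary_kernel<F: Fn(f64, f64) -> f64>(
    lhs: &ToyNullableF64,
    rhs: &ToyNullableF64,
    op: F,
) -> ToyNullableF64 {
    let values = lhs
        .values
        .iter()
        .zip(&rhs.values)
        .map(|(x, y)| op(*x, *y))
        .collect();
    let validity = lhs
        .validity
        .iter()
        .zip(&rhs.validity)
        .map(|(a, b)| *a & *b)
        .collect();
    ToyNullableF64 { values, validity }
}

fn main() {
    let x = ToyNullableF64 { values: vec![2.0, 0.0, 3.0], validity: vec![true, false, true] };
    let y = ToyNullableF64 { values: vec![3.0, 4.0, 2.0], validity: vec![true, true, true] };
    let out = binary_kernel(&x, &y, |x, y| x.powf(y));
    // Row 1 stays null because one of its inputs was null; no branch was needed in the loop.
    assert_eq!(out.validity, vec![true, false, true]);
    println!("values = {:?}", out.values);
}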

References

As you can see, adding a new scalar function to Databend is not as difficult as you might think. Before you start, you can also refer to existing scalar function implementations, such as sign, expr, tan and atan.

Testing

To be a good engineer, don't forget to test your code. Please add unit tests and stateless tests after you complete a new scalar function.

Summary

We welcome all community users to contribute more powerful functions to Databend. If you run into any problems, please feel free to open an issue on GitHub, and we will do our best to help you.
