In the wake of ARROW-8792, this issue serves as an umbrella issue for follow-up work and the associated "buildout", which includes things like:
- Implementation of many new function types and adding new kernel cases to existing functions
- Adding implicit casting functionality to function execution
- Creation of "bound" physical array expressions and execution thereof
- Pipeline execution (executing multiple kernels while eliminating temporary allocation)
- Parallel execution of scalar and aggregate kernels (including parallel execution of pipelined kernels)
There are quite a few existing JIRAs in the project that I'll attach to this issue, and I'll open plenty more issues as things occur to me to help organize the work.
Reporter: Wes McKinney / @wesm
Related issues:
- [C++] String algorithm library for StringArray/BinaryArray (is a parent of)
- [C++] Add casting option to set unsafe casts to null rather than some garbage value (is a parent of)
- [C++] Implement kernel function that converts a dense array to dictionary given known dictionary (is a parent of)
- [C++] Parallelize execution of ScalarAggregateFunction (is a parent of)
- [C++] Implement hashing, dictionary-encoding for StructArray (is a parent of)
- [C++] Add function to "conform" a dictionary array to a target new dictionary (is a parent of)
- [C++] Support temporal arithmetic ({time,date}{32,64}, timestamp, interval) (is a parent of)
- [C++] Kernel functions for determining monotonicity (ascending or descending) for well-ordered types (is a parent of)
- [C++] Implement "fill null" kernels that replace null values with some scalar replacement value (is a parent of)
- [C++] Implement "drop null" kernels that return array without nulls (is a parent of)
- [C++] Forward, backward fill kernel functions (is a parent of)
- [C++] Implement "any" reduction kernel for boolean data (is a parent of)
- [C++] Implement casts from one struct type to another (with same field names and number of fields) (is a parent of)
- [R] Add bindings for sum and mean compute kernels (is a parent of)
- [C++] Support lossy casts from decimal128 to float32 and float64/double (is a parent of)
- [C++] Implement casts from float/double to decimal128 (is a parent of)
- [C++] Implement example string scalar kernel function to assist with string kernels buildout per ARROW-555 (is a parent of)
- [C++] Add timestamp subtract kernel aliased to int64 subtract implementation (is a parent of)
- [C++] Add "parse_strptime" function for string to timestamp conversions using the kernels framework (is a parent of)
- [R] Provide binding for arrow::compute::CallFunction (is a parent of)
- [C++][Compute] Add strftime kernel (is a parent of)
- [C++] Refactor filter/take kernels to use Datum instead of overloads (is a parent of)
- [C++] Incremental Variance, Standard Deviation aggregators (is a parent of)
- [C++] Cast to/from halffloat not implemented (is a parent of)
- [C++] Implement support for using selection vectors in scalar aggregate function kernels (is a parent of)
- [C++] Add options to ValueCount/Unique/DictEncode kernel to toggle null behavior (is a parent of)
- [C++][Python] Support ExtensionType arrays in more kernels (is a parent of)
- [C++] Allow automatic String -> LargeString promotions when concatenating tables (is a parent of)
- [C++] Determine strategy for propagating failures in initializing built-in function registry in arrow/compute (is a parent of)
- [C++] Determine desirable maximum length for ExecBatch in pipelined and parallel execution of kernels (is a parent of)
- [C++] Add "TypeResolver" class interface to replace current OutputType::Resolver pattern (is a parent of)
- [C++] Parallelize execution of arrow::compute::ScalarFunction (is a parent of)
- [C++] Add VectorFunction wrapping arrow::Concatenate (is a parent of)
- [C++] Deprecate or remove Scalar::Parse and Scalar::CastTo (is a parent of)
- [C++] Arithmetic kernels for numeric arrays (is a parent of)
- [C++][Compute] Extract preallocation logic from KernelExecutor (is a parent of)
- [C++][Compute] Dispatch* should examine options as well as input types (is a parent of)
- [C++] Implement hash_aggregate kernels (umbrella issue) (is a parent of)
- [C++/Python] Implement Array.isvalid/notnull/isnull as scalar functions (is a parent of)
- [Python/C++] Add index() method to find first occurrence of Python scalar (is a parent of)
- [Python] Expose compare kernels on Array class (is a parent of)
- [C++] Possible to reduce object code generated in compute/kernels/take.cc? (is a parent of)
- [R] Add bindings for compare and boolean kernels (is a parent of)
- [C++] Refactor AddKernel to support other operations and types (is a parent of)
- [C++][Compute] Consolidate fill_null and coalesce (is a parent of)
- [C++] Implement cast to Binary and FixedSizeBinary (is a parent of)
- [C++] Use selection vectors in Filter implementation for record batches, tables (is a parent of)
- [C++] Implement casts from date types to Timestamp (is a parent of)
- [C++] Add C++ unit tests for filter and take functions on temporal type inputs, including timestamps (is a parent of)
- [C++] Reimplement dictionary unpacking in Cast kernels using Take (is a parent of)
- [C++] Reduce number of take kernels (is a parent of)
- [C++] Implement optimized "unsafe take" for use with selection vectors for kernel execution (is a parent of)
- [C++][Compute] Formalize "metafunction" concept (is a parent of)
- [C++] Add cast "metafunction" to FunctionRegistry that addresses dispatching to appropriate type-specific CastFunction (is a parent of)
- [C++] Add "DispatchBest" APIs to compute::Function that selects a kernel that may require implicit casts to invoke (is a parent of)
- [C++] Improve usability of arrow::compute::CallFunction by moving ExecContext* argument to end and adding default (is a parent of)
- [C++] Improve docstrings in new public APIs in arrow/compute and fix miscellaneous typos (is a parent of)
- [C++] Measure microperformance associated with ExecBatchIterator (is a parent of)
- [C++] Change compute::Arity::VarArgs min_args default to 0 (is a parent of)
- [C++] Reduce generated code in vector_hash.cc (is a parent of)
- [C++] Reduce generated code in compute/kernels/scalar_compare.cc (is a parent of)
- [C++] compute::CallFunction can't Filter/Take with ChunkedArray (is a parent of)
- [C++] Implement BitBlockCounter interface for blockwise popcounts of validity bitmaps (is a parent of)
- [C++] Improve and expand Take/Filter benchmarks (is a parent of)
- [C++] Add sum/mean kernels for Boolean type (is a parent of)
- [C++] Support scalar aggregation over scalars (is a parent of)
- [C++][Compute] Add ExecNode hierarchy (is a parent of)
- [C++][Compute] Promote Expression to the compute namespace (is a parent of)
- [C++][Dataset][Compute] Refactor Dataset scans to use an ExecNode graph (is a parent of)
- [C++][Compute][R] Add ScalarAggregateOptions to Any and All kernels (is a parent of)
- [C++] Kernels to extract datetime components should be timezone aware (is a parent of)
- [C++] Support filter/take for union data type. (is a parent of)
- [C++][Compute] Enhance FunctionOptions with equality, debug representability, and serializability (is a parent of)
- [C++] Kernel to localize naive timestamps to a timezone (preserving clock-time) (is a parent of)
- [C++] Add option to specify the first day of the week for the "day_of_week" temporal kernel (is a parent of)
- [C++] Add a general "if, ifelse, ..., else" kernel ("CASE WHEN") (is a parent of)
- [C++] Add a 'choose' kernel/scalar compute function (is a parent of)
- [C++] Support variable-width types in case_when function (is a parent of)
- [C++] Implement datediff kernel (is a parent of)
- [C++] Implement timestamp to date/time cast that extracts value (is a parent of)
- [C++] Implement casting Binary <-> LargeBinary (is a parent of)
- [C++] Implement casting List <-> LargeList (is a parent of)
- [C++] SortToIndices kernel must support FixedSizeBinary (is a parent of)
- [C++] ArgSort kernel should not materialize the output internal (is a parent of)
- [C++] Option for Filter kernel how to handle nulls in the selection vector (is a parent of)
- [C++] Determine the feasibility and build a prototype to replace compute/kernels with gandiva kernels (is a parent of)
- [Python] Add relevant glue for implementing each kind of FunctionOptions (is a parent of)
- [C++] Implement aggregate compute functions for decimal datatypes (is a parent of)
- [C++] Refactor temporal casts to work with Scalar inputs (is a parent of)
- [C++] Improved declarative compute function / kernel development framework, normalize calling conventions (relates to)
- [C++] Add Subtract and Multiply arithmetic kernels with wrap-around behavior (is related to)
- [C++] Arrow-native C++ Data Frame-style programming interface for analytics (umbrella issue) (is related to)
- [C++] Implement List Flatten kernel (is related to)
- [C++] Sketch out design for kernels and "query" execution in compute layer (is related to)
- [C++] Optimize Take implementation (is related to)
- [C++] Add utf8proc library to toolchain (is related to)
- [C++] Split non-cast compute kernels into a separate shared library (is related to)
- [Python] Expose more compute kernels (is related to)
- [C++] Move arrow::ArrayData to a separate header file (is related to)
- [C++] Document available functions in compute::FunctionRegistry (is related to)
- [C++] Optimize Filter implementation (is related to)
- [C++] Expand SumKernel benchmark to more types (is related to)
- [C++] Support casting between decimal types with compatible precision/scales (is related to)
- [C++] Flatbuffers based serialization protocol for Expressions (is related to)
- [C++] Collapse Take APIs from 8 to 1 or 2 (is related to)
- [C++] [Python] Proposal for several Array utility functions (is related to)
Note: This issue was originally created as ARROW-8894. Please see the migration documentation for further details.