Skip to content

Performance Improvement: Replace Py4J-based Implementation with Native PyArrow #49

@chenghuichen

Description

@chenghuichen

Current Implementation and Issues

Currently, paimon-python leverages Py4J to reuse Java's read/write capabilities, with data serialization between Java and Python processes handled through ArrowUtils.serializeToIpc. This implementation has several performance bottlenecks:

  1. Process Communication Overhead: The Py4J bridge requires inter-process communication (IPC) between Java and Python processes, introducing significant latency.
  2. Serialization/Deserialization Cost: Each data transfer requires serialization to Arrow IPC format and subsequent deserialization, which is computationally expensive.
  3. Memory Management Complexity: The current implementation requires careful management of memory allocators and resources across process boundaries.

Proposed Solution

We propose to refactor paimon-python to use native PyArrow implementations for read/write operations. This would:

  1. Eliminate Process Communication: Remove the need for Py4J bridge and IPC, allowing direct memory access.
  2. Reduce Serialization Overhead: Enable zero-copy data transfer between Python and native code.
  3. Simplify Memory Management: Leverage PyArrow's built-in memory management capabilities.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions