
Improve Apache Beam Performance with Grouping Records

When we need to call an external API, we should do so by batching multiple records

Suraj Mishra
3 min read · Aug 27, 2022

Originally Published @ https://asyncq.com/

Introduction

  • Apache Beam provides the immutable PCollection to hold the input data we read. If we read an input file as strings, the data is stored in a
    PCollection<String>.
  • It's not uncommon to interact with an external API through a REST interface, where we have to send a bunch of records to a service and expect some output.
  • If we read n records in parallel and call the external API in parallel, one request per record, we can overwhelm the external service and hit its API rate limit. This is an important use case where we need to batch input records.
  • In this article, we will see how to batch input records so that we send batches of records to the REST endpoint and respect the API limit.
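To make the idea concrete, here is a minimal sketch in plain Python of what batching buys us. It is not Apache Beam code and not necessarily the approach this article goes on to use; the names `batches`, `send_in_batches`, and `call_api` are hypothetical and exist only for illustration. It shows that grouping records before the call cuts the number of requests from n to roughly n divided by the batch size.

```python
from typing import Callable, Iterable, Iterator, List


def batches(records: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[str] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def send_in_batches(
    records: Iterable[str],
    batch_size: int,
    call_api: Callable[[List[str]], None],
) -> int:
    """Call call_api once per batch and return how many calls were made.

    call_api is a hypothetical stand-in for the real REST client.
    """
    calls = 0
    for batch in batches(records, batch_size):
        call_api(batch)
        calls += 1
    return calls
```

With 10 records and a batch size of 4, the service receives 3 requests (of 4, 4, and 2 records) instead of 10.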

Use Case

  • A very generic use case would be to read sensitive data from an input file and send it to a DLP (data loss prevention) service to mask it before we write it to our data warehouse.
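As a rough illustration of this use case, the sketch below masks lines in batches so that each batch costs a single request. Everything in it is an assumption made for the example: `fake_dlp_mask` is a hypothetical stand-in for a real DLP service (it only masks things that look like email addresses), and `mask_in_batches` is an illustrative helper, not part of Apache Beam.

```python
import re
from typing import Iterable, Iterator, List

# Simplified pattern for the demo; a real DLP service detects far more.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def fake_dlp_mask(batch: List[str]) -> List[str]:
    """Hypothetical DLP call: one request masks a whole batch of lines."""
    return [EMAIL.sub("[MASKED]", line) for line in batch]


def mask_in_batches(lines: Iterable[str], batch_size: int) -> Iterator[str]:
    """Mask lines batch by batch, preserving the input order."""
    batch: List[str] = []
    for line in lines:
        batch.append(line)
        if len(batch) == batch_size:
            yield from fake_dlp_mask(batch)
            batch = []
    if batch:
        # Send the last, possibly smaller, batch.
        yield from fake_dlp_mask(batch)
```

The masked lines come back in their original order, ready to be written to the warehouse.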

Solution

  • If we go through the Apache Beam documentation, we will come across a function called…


Written by Suraj Mishra

Staff Software Engineer @PayPal (all opinions are my own and not those of my employer)
