One of the most important factors to consider at this stage is the type of data ingestion. In terms of ingestion, data falls into two types: bounded and unbounded.
Bounded data has a clear temporal boundary at ingestion. For example, a daily batch job loads the data accumulated over the previous 24 hours. Every load therefore covers a fixed 24-hour window, hence the name bounded data. Because bounded data is the result of accumulation over a given period, batch processing is an appropriate choice, and ETL (Extract, Transform, Load) is the best-known approach to it.
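The daily load described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the record fields (ts, user, amount), the helper names extract, transform and load, and the dict standing in for a warehouse table are all hypothetical and not taken from any particular ETL tool.

```python
from datetime import datetime, timedelta

def extract(records, window_end, window=timedelta(hours=24)):
    """Select only records inside the bounded window [window_end - window, window_end)."""
    window_start = window_end - window
    return [r for r in records if window_start <= r["ts"] < window_end]

def transform(records):
    """Aggregate the bounded set: total amount per user."""
    totals = {}
    for r in records:
        totals[r["user"]] = totals.get(r["user"], 0) + r["amount"]
    return totals

def load(totals, warehouse, batch_date):
    """Store the result under the batch date (a dict stands in for a warehouse table)."""
    warehouse[batch_date] = totals

# Hypothetical source data; the last record arrives after the cut-off.
source = [
    {"ts": datetime(2024, 1, 1, 9, 0), "user": "a", "amount": 10},
    {"ts": datetime(2024, 1, 1, 18, 30), "user": "b", "amount": 5},
    {"ts": datetime(2024, 1, 1, 23, 59), "user": "a", "amount": 7},
    {"ts": datetime(2024, 1, 2, 0, 5), "user": "a", "amount": 99},
]

warehouse = {}
run_at = datetime(2024, 1, 2, 0, 0)  # nightly job running at midnight
batch_date = run_at.date() - timedelta(days=1)
load(transform(extract(source, run_at)), warehouse, batch_date)
```

The record stamped 00:05 on the second day falls outside the window, so it has to wait for the next run. That is the trade-off of bounded data: each batch is complete and consistent, but nothing newer than the cut-off is visible.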
Benefits of Batch Processing
1. Completeness and consistency in data
2. Relatively simple and reliable
Drawbacks of Batch Processing
1. New data is not available until the next batch runs.
2. Requires a regular dedicated time slot, usually at night, for batch loading.
Unbounded data, in contrast, has no temporal boundary, or only a very short one. Because data flows in continuously, it needs to be processed as it arrives. This method is called streaming processing, and it is essential for real-time or near-real-time data analysis. Data is inherently unbounded, because it can be created at any moment at the source. Depending on the analysis requirements, one must decide whether to impose a short-term boundary on the incoming data flow or to process the flowing data as it is.
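The two options in the last sentence can be contrasted with a small Python sketch. It is an illustration only, assuming events are (timestamp in seconds, value) pairs that arrive in timestamp order; the function names process_as_is and tumbling_windows are hypothetical.

```python
def process_as_is(events):
    """Handle each event the moment it arrives: one updated result per event."""
    running_total = 0
    for ts, value in events:
        running_total += value
        yield ts, running_total

def tumbling_windows(events, size):
    """Impose a short boundary: sum values per fixed, non-overlapping window of `size` seconds."""
    current_window, total = None, 0
    for ts, value in events:
        window = ts - ts % size  # start time of the window this event belongs to
        if current_window is not None and window != current_window:
            yield current_window, total  # previous window is closed, emit it
            total = 0
        current_window = window
        total += value
    if current_window is not None:
        # Flush the last open window; only reached because this demo input is finite.
        yield current_window, total

events = [(0, 1), (3, 2), (5, 4), (11, 3), (12, 1)]
per_event = list(process_as_is(events))
per_window = list(tumbling_windows(events, 5))
```

Processing as-is gives the lowest latency, since a result is available after every event. Windowing delays results until a window closes, but each result then describes a small, complete unit of data, much like a very short batch.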
Micro-batching, which is adopted by Spark Streaming, is one of the well-known approaches to streaming processing: it groups incoming data into small batches and processes each batch in turn. CEP (Complex Event Processing) is another widely known term; it refers to aggregating and categorizing the incoming data flow into meaningful units.
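As a rough illustration of the CEP idea, the sketch below turns a flow of low-level events into a single higher-level event when a pattern is matched. The rule itself (three failed logins by the same user within 60 seconds), the event layout (timestamp, user, kind) and the function name detect_bursts are all assumptions made for this example and do not come from any CEP product.

```python
from collections import defaultdict, deque

def detect_bursts(events, threshold=3, within=60):
    """Yield (user, ts) when `threshold` failed logins by one user fall within `within` seconds."""
    recent = defaultdict(deque)  # user -> timestamps of that user's recent failures
    for ts, user, kind in events:
        if kind != "login_failed":
            continue  # other event kinds are not part of this pattern
        window = recent[user]
        window.append(ts)
        # Drop failures that are too old to count toward the pattern.
        while ts - window[0] >= within:
            window.popleft()
        if len(window) >= threshold:
            yield user, ts
            window.clear()  # start over so one burst is reported once

events = [
    (0, "ann", "login_failed"),
    (10, "bob", "login_failed"),
    (20, "ann", "login_failed"),
    (30, "ann", "login_ok"),
    (45, "ann", "login_failed"),
    (200, "bob", "login_failed"),
    (210, "bob", "login_failed"),
]
alerts = list(detect_bursts(events))
```

Here seven raw events collapse into one meaningful unit: an alert for "ann" at second 45. The failures for "bob" are too far apart to match the pattern, so they produce nothing.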
Benefits of Streaming Processing
1. Low latency between data ingestion and analysis.
2. When properly designed, it can remove the need for separate batch processing.
Drawbacks of Streaming Processing
1. Limited market awareness and technical understanding.
2. Far more complex than batch processing.