- Hadoop MapReduce was developed at a time when hardware failure was very common, so it was designed around that assumption: it wrote everything back to disk after every operation, and disk IO is slow. When Spark came out, it easily outperformed Hadoop MapReduce because it kept intermediate data in memory. By then hardware was more reliable and failed less often, so it was acceptable to recompute lost work on failure, and you still end up with better overall throughput. See the PySpark sketch below.
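A rough PySpark sketch of the in-memory idea (a minimal example, assuming pyspark is installed and running in local mode; the dataset and variable names are made up): an intermediate result is cached in memory and reused across actions instead of being written back to disk between steps.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("in-memory-demo").getOrCreate()
sc = spark.sparkContext

nums = sc.parallelize(range(1_000_000))
evens = nums.filter(lambda x: x % 2 == 0).cache()  # keep this intermediate RDD in memory

print(evens.count())                       # first action materialises the cached data
print(evens.map(lambda x: x * x).take(5))  # second action reuses it, no disk round-trip

spark.stop()
```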
- Some due diligence is good, but there is no point wasting time trying to anticipate every possible challenge up front. Agile is the only proven way to build something as complex as software.
4. Encoding and Evolution
- Data lives in two representations (see the sketch after this list):
  - in memory: objects, structs, lists, hash maps, etc., optimised for access by the CPU
  - on disk or over the network: encoded as a self-contained sequence of bytes, e.g. JSON, XML
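A minimal sketch of the two representations using Python's standard json module (the user record here is made up for illustration):

```python
import json

# In-memory representation: a Python dict, optimised for in-process access.
user = {"id": 42, "name": "Alice", "interests": ["databases", "distributed systems"]}

# Encoding (serialisation): turn it into a self-contained byte sequence for disk or network.
encoded = json.dumps(user).encode("utf-8")
print(type(encoded), encoded)      # <class 'bytes'> b'{"id": 42, ...}'

# Decoding (parsing): turn the bytes back into an in-memory structure on the other side.
decoded = json.loads(encoded.decode("utf-8"))
print(decoded["name"])             # Alice
```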
- Message passing using Kafka has become a standard. You can easily decouple the producers and the consumers, enabling horizontal scaling. A minimal sketch below.
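A rough sketch with the kafka-python library, assuming a broker at localhost:9092 and a topic named "events" (both are assumptions, not from the notes). The producer publishes without knowing who consumes; consumers in the same group can be added to scale out, and Kafka rebalances partitions among them.

```python
from kafka import KafkaProducer, KafkaConsumer

# Producer side: fire-and-forget publish; it does not know or care who consumes.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b'{"type": "signup", "user_id": 42}')
producer.flush()

# Consumer side: typically a separate process; more consumers in the same
# consumer group = horizontal scaling of message processing.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)
```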
- It's very easy to horizontally scale stateless systems/components, but scaling stateful components like databases is very challenging. Concurrency with read-only data (e.g. CDNs for static files) is also easy, but concurrent writes and keeping replicas in sync in a distributed setup are both complex and expensive.