Intro to Kafka

Intro to Kafka

Tags
Engineering
Kafka
Created
Mar 4, 2022 07:12 AM
Edited
Oct 4, 2022
Description
Kafka learning notes
  • Distributed event streaming platform
a Kafka cluster is highly scalable and fault-tolerant: if any of its servers fails, the other servers will take over their work to ensure continuous operations without any data loss

Core Capabilities

  • High throughput
  • Scalable
  • Permanent storage
  • High availability
 
 
  • Batching
  • Message/event order
  • Connect
 
Data format → Serialization + de-serialization
Kafka message: key-value
key: can be int, string, or something more complex
 
Use key and secret to access Kafka cluster
 

Topic

  • Named container like a table in a database
  • Stored similar events
  • Events are immutable
  • Events are durable and stored on disks
  • Retention period is configurable

Partition

  • Separate a topic into different pieces stored in different places
  • Feed the partition key to a hash function
  • The partition key will guarantee the order of the events
Kafka guarantees that any consumer of a given topic-partition will always read that partition's events in exactly the same order as they were written.

Broker

  • An computer, instance, or container running the Kafka process
  • Manage partition
  • Manage replication of partition
  • Handle write and read

Replication

  • Copies of data for fault tolerance
  • Copies are called follower replicas
  • One lead partition and N-1 followers
  • The main partition is the leader replica
  • Write and read to the leader, followers will follow up
A common production setting is a replication factor of 3

Producer

  • Client application
  • Puts messages into topics
  • Connection pooling
  • Network buffering
  • Partitioning (decide where or which broker to send)

Consumer

  • Client application
  • Read messages from topics
  • Connection pooling
  • Network protocol
  • Horizontally and elastically scalable
  • Maintains ordering within partitions at scale
  • Consumer does not destroy messages, and multiple consumers can read from the same topic
  • rebalance between consumer nodes in a consumer group
  • Preserve message order by consumer groups

Consumer groups

  • A list of consumers
The partitions of all the topics are divided among the consumers in the group
  • More consumers in a group reduce bottleneck
  • One consumer from the group takes events from a partition of a topic
    • Larger number of partition allows adding more consumers to process data
    • Better throughput as it scales
  • One application → one consumer group

Cluster and Topic

    Service account

    • Define how we access Confluent Cloud
    • Usually tied with ACLs
    • Similar to IAM User

    ACL (Access Control List)

    • Define permission or authorization of Kafka clusters. For example, read or write to a specific topic
    • Tie to API Keys and service account
      • Service account is the Pincipal
    • Similar to IAM Policy or IoT Policy

    Principal

    • An entity authenticated by the authorizer
    • Different security protocol has different principals
      • Ex: User:admin, Group:developers, or User:CN=quickstart.confluent.io,OU=TEST,O=Sales,L=PaloAlto,ST=Ca,C=US

    SSL

    • Subject name

    SASL/GSSAPI

    • A host and a path

    SASL/PLAIN or SCRAM

    • simple text string
    • Ex: admin

    API Key

    • Used to control access to Confluent Cloud components and resources
    • Similar to IoT Certificate
    • Bind with service account (required) and managed clusters (optional)
      • Service account is the owner of the API Key

    Authentication

    SASL

    • Simple Authentication Security Layer

    SASL/PLAIN

    • Still use TLS to encrypt
    • Use username and password

    bootstrap server

    • A server with an url that gives metadata of topics, partitions, brokers
    • Used by producers or consumers to talk to Kafka clusters

    Source