Saturday, 30 March 2019

Apache Kafka CCDAK Exam Notes

Hi Readers,

click here for updated kafka notes

If you are planning or preparing for Apache Kafka Certification then this is the right place for you.There are many Apache Kafka Certifications are available in the market but CCDAK (Confluent Certified Developer for Apache Kafka) is the most known certification as Kafka is now maintained by Confluent.


Confluent has introduced CCOAK certification recently. CCOAK is mainly for devOps engineer focusing on build and manage Kafka cluster. CCDAK is mainly for developers and Solution architects focusing on design, producer and consumer. If you are still not sure, I recommend to go for CCDAK as it is more comprehensive exam as compared to CCOAK.


From here onward, we will talk about how to prepare for CCDAK.


I have recently cracked CCDAK and would suggest that prepare well for the exam. This exam verify your theoretical as well as practical understanding of Kafka. You need to answer 60 questions in 90 minutes from your laptop under the supervision of online proctor. There is no mention of number of questions need to be correct in order to pass the exam. They just tell you pass or fail at the end of exam. At least 40-50 hours of preparation is required.


I have prepared for CCDAK using following:



You should prepare well in following areas:
  1. Kafka Architecture
    • Read Confluent Kafka Definitive Guide PDF
    • Read Apache Kafka Documentation
    • Once you read all these, revise using KAFKA THEORY section in this blog. You can expect most of the questions from these notes.
  2. Kafka Java APIs
    • Read Apache Kafka Documentation how to create producer and consumer in Java
  3. Kafka CLI
    • Read Confluent Kafka Definitive Guide PDF
    • Once you read all these, revise using KAFKA CLI section in this blog. You can expect most of the questions from these notes.
  4. Kafka Streams
    • Read Confluent Kafka Definitive Guide PDF
  5. Kafka Monitoring (Metrics)
    • Read Confluent Kafka Definitive Guide PDF
    • Read Apache Kafka Documentation for important metrics
    • Read Confluent Kafka Documentation as well
  6. Kafka Security
    • Read Apache Kafka Documentation
  7. Confluent KSQL
    • Read about KSQL from Confluent Documentation
  8. Confluent REST Proxy
    • Read about KSQL from Confluent Documentation
  9. Confluent Schema Registry

Questions from CCDAK Exam

  1. Kafka Theory
    • Kafka is a .... ? pub-sub system
    • Mostly Kafka is written in which language? Scala
  2. Kafka Streams (Read Kafka Streams notes to get answers of below questions)
    • Which of the Kafka Stream operators are stateful ?
    • Which of the Kafka Stream operators are stateless ?
    • Which window is not having gap ?
  3. Confluent Schema Registry (Read Confluent Schema Registry notes to get answers of below questions)
    • Which of the following is not a primitive type of Avro ?
    • Which of the following in not a complex type of Avro?
    • Which of the following is not a required field in Avro Schema?
    • Delete a field without default value in Avro schema is ...... compatibility? 

KAFKA THEORY

1. Cluster
2. Rack
3. Broker
  • Every broker in Kafka is a "bootstrap server" which knows about all brokers, topics and partitions (metadata) that means Kafka client (e.g. producer,consumer etc) only need to connect to one broker in order to connect to entire cluster.
  • At all times, only one broker should be the controller, and one broker must always be the controller in the cluster
4. Topic
  • Kafka takes bytes as input without even loading them into memory (that's called zero copy)
  • Brokers have defaults for all the topic configuration parameters    
5. Partition
  • Topic can have one or more partition.
  • It is not possible to delete a partition of topic once created.
  • Order is guaranteed within the partition and once data is written into partition, its immutable!
  • If producer writes at 1 GB/sec and consumer consumes at 250MB/sec then requires 4 partition!
6. Segment
  • Partitions are made of segments (.log files)
  • At a time only one segment is active in a partition
  • log.segment.bytes=1 GB (default) Max size of a single segment in bytes
  • log.segment.ms=1 week (default) Time kafka will wait before closing the segment if not full
  • Segment come with two indexes (files)
    • An offset to position index (.index file): Allows kafka where to read to find a message
    • A timestamp to offset index (.timeindex file): Allows kafka to find a message with a timestamp
  • log.cleanup.policy=delete (Kafka default for all user topics) Delete data based on age of data (default is 1 week)
  • log.cleanup.policy=compact Delete based on keys of your messages. Will delete old duplicate keys after the active segment is committed. (Kafka default  for topic __consumer_offsets)
  • Log cleanup happen on partition segments. Smaller/more segments means the log cleanup will happen more often!
  • The cleaner checks for work every 15 seconds (log.cleaner.backoff.ms) 
  • log.retention.hours= 1 week (default) number of hours to keep data for
  • log.retention.bytes = -1 (infinite default) max size in bytes for each partition
  • Old segments will be deleted based on log.retention.hours or log.retention.bytes rule
  • The offset of message is immutable.
  • Deleted records can still be seen by consumers for a period of delete.retention.ms=24 hours (default)
7. Offset
  • Partition is having its own offset starting from 0.
8. Topic Replication
  • Replication factor = 3 and partition = 2 means there will be total 6 partition distributed across Kafka cluster. Each partition will be having 1 leader and 2 ISR (in-sync replica).
  • Broker contains leader partition called leader of that partition and only leader can receive and serve data for partition.
  • Replication factor can not be greater then number of broker in the kafka cluster. If topic is having a replication factor of 3 then each partition will live on 3 different brokers.
9. Producer
  • Automatically recover from errors: LEADER_NOT_AVAILABLE, NOT_LEADER_FOR_PARTITION, REBALANCE_IN_PROGRESS
  • Non retriable errors: MESSAGE_TOO_LARGE
  • When produce to a topic which doesn't exist and auto.create.topic.enable=true then kafka creates the topic automatically with the broker/topic settings num.partition and default.replication.factor
10. Producer Acknowledgment
  • acks=0: Producer do not wait for ack (possible data loss)
  • acks=1: Producer wait for leader ack (limited data loss)
  • acks=all: Producer wait for leader+replica ack (no data loss)
acks=all must be used in conjunction with min.insync.replicas (can be set at broker or topic level)
min.insync.replica=2 implies that at least 2 brokers that are ISR(including leader) must acknowledge
e.g. replication.factor=3, min.insync.replicas=2,acks=all can only tolerate 1 broker going down, otherwise the producer will receive an exception NOT_ENOUGH_REPLICAS on send

11. Safe Producer Config
  • min.insync.replicas=2 (set at broker or topic level)
  • retries=MAX_INT: number of reties by producer in case of transient failure/exception. (default is 0)
  • max.in.flight.per.connection number=5: number of producer request can be made in parellel (default is 5)
  • acks=all
  • enable.idempotence=true: producer send producerId with each message to identify for duplicate msg at kafka end. When kafka receives duplicate message with same producerId which it already committed. It do not commit it again and send ack to producer (default is false)
12. High Throughput Producer using compression and batching
  • compression.type=snappy: value can be none(default), gzip, lz4, snappy. Compression is enabled at the producer level and doesn't require any config change in broker or consumer Compression is more effective in case of bigger batch of messages being sent to kafka
  • linger.ms=20:Number of millisecond a producer is willing to wait before sending a batch out. (default 0). Increase linger.ms value increase the chance of batching.
  • batch.size=32KB or 64KB: Maximum number of bytes that will be included in a batch (default 16KB). Any message bigger than the batch size will not be batched
10. Message Key
  • Producer can choose to send a key with message. 
  • If key= null, data is send in round robin
  • If key is sent, then all message for that key will always go to same partition. This can be used to order the messages for a specific key since order is guaranteed in same partition.
  • Adding a partition to the topic will loose the guarantee of same key go to same partition.
  • Keys are hashed using "murmur2" algorithm by default.
13. Consumer
  • Per thread one consumer is the rule. Consumer must not be multi threaded.
  • records-lag-max (monitoring metrics) The maximum lag in terms of number of records for any partition in this window. An increasing value over time is your best indication that the consumer group is not keeping up with the producers.
14. Consumer Group

15. Consumer Offset
  • When consumer in a group has processed the data received from Kafka, it commits the offset in Kafka topic named _consumer_offset which is used when a consumer dies, it will be able to read back from where it left off.
14. Delivery Semantics
  • At most once : Offset are committed as soon as message batch is received. If the processing goes wrong, the message will be lost (it won't be read again)
  • At least once (default): Offset are committed after the message is processed.If the processing goes wrong, the message will be read again. This can result in duplicate processing of message. Make sure your processing is idempotent. (i.e. re-processing the message won't impact your systems). For most of the application, we use this and ensure processing are idempotent.
  • Exactly once: Can only be achieved for Kafka=>Kafka workflows using Kafka Streams API. For Kafka=>Sink workflows, use an idempotent consumer.
16. Consumer Offset commit strategy
  • enable.auto.commit=true & synchronous processing of batches: with auto commit, offset will be committed automatically for you at regular interval (auto.commit.interval.ms=5000 by default) every time you call .poll(). If you don't use synchronous processing, you will be in "at most once" behavior because offsets will be committed before your data is processed.
  • enable.auto.commit=false & manual commit of offsets (recommended)
17. Consumer Offset reset behavior
  • auto.offset.reset=latest: will read from the end of the log
  • auto.offset.reset=earliest: will read from the start of the log
  • auto.offset.reset=none: will throw exception of no offset is found
  • Consumer offset can be lost if hasn't read new data in 7 days. This can be controlled by broker setting offset.retention.minutes 
18. Consumer Poll Behavior
  • fetch.min.bytes = 1 (default): Control how much data you want to pull at least on each request. Help improving throughput and decreasing request number. At the cost of latency.
  • max.poll.records = 500 (default): Controls how many records to receive per poll request. Increase if your messages are very small and have a lot of available RAM.
  • max.partition.fetch.bytes = 1MB (default): Maximum data returned by broker per partition. If you read from 100 partition, you will need a lot of memory (RAM)
  • fetch.max.bytes = 50MB (default): Maximum data returned for each fetch request (covers multiple partition). Consumer performs multiple fetches in parallel.  
19. Consumer Heartbeat Thread
  • Heartbeat mechanism is used to detect if consumer application in dead.
  • session.timeout.ms=10 seconds (default) If heartbeat is not sent in 10 second period, the consumer is considered dead. Set lower value to faster consumer rebalances
  • heartbeat.interval.ms=3 seconds (default) Heartbeat is sent in every 3 seconds interval. Usually 1/3rd of session.timeout.ms
20. Consumer Poll Thread
  • Poll mechanism is also used to detect if consumer application is dead.
  • max.poll.interval.ms = 5 minute (default) Max amount of time between two .poll() calls before declaring consumer dead. If processing of message batch takes more time in general in application then should increase the interval.
21. Kafka Guarantees
  • Messages are appended to a topic-partition in the order they are sent
  • Consumer read the messages in the order stored in topic-partition
  • With a replication factor of N, producers and consumers can tolerate upto N-1 brokers being down
  • As long as number of partitions remains constant for a topic ( no new partition), the same key will always go to same partition
22. Client Bi-Directional Compatibility
  • an Older client (1.1) can talk to Newer broker (2.0)
  • a Newer client (2.0) can talk to Older broker (1.1)
23. Kafka Connect
  • Source connect: Get data from common data source to Kafka
  • Sink connect: Publish data from Kafka to common data source
24. Zookeeper
  • ZooKeeper servers will be deployed on multiple nodes. This is called an ensemble. An ensemble is a set of 2n + 1 ZooKeeper servers where n is any number greater than 0. The odd number of servers allows ZooKeeper to perform majority elections for leadership. At any given time, there can be up to n failed servers in an ensemble and the ZooKeeper cluster will keep quorum. If at any time, quorum is lost, the ZooKeeper cluster will go down. 
  • In Zookeeper multi-node configuration, initLimit and syncLimit are used to govern how long following ZooKeeper servers can take to initialize with the current leader and how long they can be out of sync with the leader. 
    • If tickTime=2000, initLimit=5 and syncLimit=2 then a follower can take (tickTime*initLimit) = 10000ms to initialize and may be out of sync for up to (tickTime*syncLimit) = 4000ms
  • In Zookeeper multi-node configuration, The server.* properties set the ensemble membership. The format is server.<myid>=<hostname>:<leaderport>:<electionport>. Some explanation:
    • myid is the server identification number. In this example, there are three servers, so each one will have a different myid with values 1, 2, and 3 respectively. The myid is set by creating a file named myid in the dataDir that contains a single integer in human readable ASCII text. This value must match one of the myid values from the configuration file. If another ensemble member has already been started with a conflicting myid value, an error will be thrown upon startup.
    • leaderport is used by followers to connect to the active leader. This port should be open between all ZooKeeper ensemble members.
    • electionport is used to perform leader elections between ensemble members. This port should be open between all ZooKeeper ensemble members.

KAFKA CLI

1. Start a zookeeper at default port 2181

> bin/zookeeper-server-start.sh config/zookeeper.properties

2. Start a kafka server at default port 9092


> bin/kafka-server-start.sh config/server.properties

3. Create a kafka topic with name my-first-topic

> bin/kafka-topics.sh --zookeeper localhost:2181 --topic my-first-topic --create --replication-factor 1 --partitions 1

4. List all kafka topics

> bin/kafka-topics.sh --zookeeper localhost:2181 --list

5. Describe kafka topic my-first-topic

> bin/kafka-topics.sh --zookeeper localhost:2181 --topic my-first-topic --describe

6. Delete kafka topic my-first-topic

> bin/kafka-topics.sh --zookeeper localhost:2181 --topic my-first-topic --delete
Note: This will have no impact if delete.topic.enable is not set to true


7. Find out all the partitions without a leader

> bin/kafka-topics.sh --zookeeper localhost:2181 --describe --unavailable-partitions

8. Produce messages to Kafka topic my-first-topic

> bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my-first-topic --producer-property acks=all
>hello ashish
>learning kafka
>^C

9. Start Consuming messages from kafka topic my-first-topic

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my-first-topic --from-beginning
>hello ashish
>learning kafka

10. Start Consuming messages in a consumer group from kafka topic my-first-topic

> bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my-first-topic --group my-first-consumer-group --from-beginning

11. List all consumer groups

> bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list

12. Describe consumer group

> bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe -group my-first-consumer-group

13. Reset offset of consumer group to replay all messages

 > bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe -group my-first-consumer-group --reset-offsets --to-earliest --execute --topic my-first-topic

14. Shift offsets by 2 (forward) as another strategy

> bin/kafka-consumer-groups --bootstrap-server localhost:9092 --group my-first-consumer-group --reset-offsets --shift-by 2 --execute --topic my-first_topic

15. Shift offsets by 2 (backward) as another strategy

> bin/kafka-consumer-groups --bootstrap-server localhost:9092 --group my-first-consumer-group --reset-offsets --shift-by -2 --execute --topic my-first_topic


KAFKA API

  • Click here to find out how we can create a Safe and high throughput Kafka Producer using Java.
  • Click here to find out how we can create a Kafka consumer using Java with manual auto commit enabled.

Default Ports

  • Zookeeper: 2181
  • Zookeeper Leader Port 3888
  • Zookeeper Election Port (Peer port) 2888
  • Broker: 9092
  • REST Proxy: 8082
  • Schema Registry: 8081
  • KSQL: 8088

Top CSS Interview Questions

These CSS interview questions are based on my personal interview experience. Likelihood of question being asked in the interview is from to...