
Loading Files To Local File System Using Flume

Apache Flume is one of the most preferred options for providing a distributed, reliable, and available service to efficiently collect, aggregate, and move large volumes of data. Moving data at such volume is a complex task, and Flume is configured to minimize the latency of these transfers.
First, we will see how to set up Flume.
Users can follow Installation of Flume and Fetching Data from Twitter Using Flume to understand the Flume installation steps and how to use it.
We also recommend going through our blog on using Flume to copy a file from the local file system to HDFS using a spool directory.

Before we proceed, let us understand the architecture of Flume.

  • Event – A singular unit of data that is transported by Flume (usually a single log entry).
  • Source – The entity through which data enters the Flume channel. Sources either actively sample the data or passively wait for data to be delivered to the Flume channel.
  • Sink – The unit that delivers data to the destination by streaming it to a range of destinations. Example: the HDFS sink writes events to HDFS.
  • Channel – The conduit between the Source and the Sink. Events are ingested into the Channel from the Source and then drained from the Channel into the Sink. (The location of the sink is specified in the Flume configuration file.)
  • Agent – A physical Java virtual machine running Flume; it is a collection of Sources, Sinks, and Channels.
  • Client – The component that produces and transmits the Event to a Source operating within the Agent.

In this post, we are creating a spool directory to transfer files locally. This file transfer is also called "rolling of files."
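As a quick orientation, here is a minimal sketch of how these components are declared in a Flume properties file; the agent and component names (agent1, source1, channel1, sink1) are illustrative, not taken from the original configuration.

  # Declare the components that make up the agent (illustrative names)
  agent1.sources  = source1
  agent1.channels = channel1
  agent1.sinks    = sink1

  # Wire the source and the sink to the channel
  agent1.sources.source1.channels = channel1
  agent1.sinks.sink1.channel      = channel1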
    • First, create a configuration file inside the conf directory of the Flume setup used in the Twitter example. In this case, we have named the configuration file "AcadgildLocal.conf." Continue reading for more details about the configuration file.
      *Note: Create two directories named source_sink_dir and destination_sink_dir, and update their paths in the configuration file.
      The configuration file uses a memory channel and an agent defined as "agent1."
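      The following is a minimal sketch of what such a configuration might look like, assuming a spooling directory source, a memory channel, and a file_roll sink; the directory paths under /home/acadgild are assumptions.

        # AcadgildLocal.conf - sketch of a spooldir -> memory -> file_roll pipeline
        agent1.sources  = source1
        agent1.channels = channel1
        agent1.sinks    = sink1

        # Spooling directory source: watches source_sink_dir for new files
        agent1.sources.source1.type     = spooldir
        agent1.sources.source1.spoolDir = /home/acadgild/source_sink_dir
        agent1.sources.source1.channels = channel1

        # Memory channel: buffers events between the source and the sink
        agent1.channels.channel1.type     = memory
        agent1.channels.channel1.capacity = 1000

        # file_roll sink: writes events to the local destination directory
        agent1.sinks.sink1.type              = file_roll
        agent1.sinks.sink1.sink.directory    = /home/acadgild/destination_sink_dir
        agent1.sinks.sink1.sink.rollInterval = 30
        agent1.sinks.sink1.channel           = channel1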

      Explanation for the configuration file

      Property Name     | Default | Description
      channel           | –       | The channel the sink is attached to (a memory channel in this example).
      type              | –       | The component type name; it needs to be file_roll.
      sink.directory    | –       | The directory where files will be stored.
      spoolDir          | –       | The directory where files will be spooled from (set on the source).

      Optional
      sink.rollInterval | 30      | Roll the file every 30 seconds. Specifying 0 disables rolling and causes all events to be written to a single file.
      sink.serializer   | TEXT    | Other possible options include avro_event or the FQCN of an implementation of the EventSerializer.Builder interface.
      batchSize         | 100     | The number of events written per transaction.
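      For example, to write everything to a single output file instead of rolling every 30 seconds, the roll interval can be set to 0 (assuming the same agent1/sink1 names as in the sketch above):

        # 0 disables rolling, so all events go to one file in the destination directory
        agent1.sinks.sink1.sink.rollInterval = 0
        # TEXT is the default serializer; avro_event is another built-in option
        agent1.sinks.sink1.sink.serializer   = TEXT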

      *Note: Make sure Hadoop daemons are up and running.
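      A quick way to verify this, assuming a standard Hadoop installation with its sbin scripts on the PATH, is:

        # List running Java processes; NameNode, DataNode, etc. should appear
        jps

        # If they are not running, start HDFS (and YARN if required)
        start-dfs.sh
        start-yarn.sh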

      • Now start the Flume agent, passing the configuration file with its complete path, as shown below.
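      A typical invocation looks like the following; the $FLUME_HOME paths and the agent name agent1 are assumptions based on the setup above.

        flume-ng agent \
          --conf $FLUME_HOME/conf \
          --conf-file $FLUME_HOME/conf/AcadgildLocal.conf \
          --name agent1 \
          -Dflume.root.logger=INFO,console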

      If everything is lined up correctly, a message stating “spool started” will show up. (Please refer to the screenshot below.)

      • Once started, you will find a file inside destination_sink_dir. This indicates that the Flume agent is running and delivering events to the correct destination folder.
      • Next, we will apply file rolls to a few test files.
      • Let us drag and drop the test files into the spool directory, source_sink_dir. (Please refer to the screenshot below.)
        • Once the files have been dropped into source_sink_dir, we find that they are renamed with a new suffix, ".COMPLETED", which marks them as ingested.
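        The same can be done from a terminal; the file name test1.txt and the directory path are placeholders.

          # Copy a test file into the spool directory watched by the source
          cp test1.txt /home/acadgild/source_sink_dir/

          # After Flume ingests it, the file shows up with a .COMPLETED suffix
          ls /home/acadgild/source_sink_dir/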

        In destination_sink_dir, we can see newly generated files. (Please refer to the following screenshot for a better understanding.) You will find several files in the destination directory, because the default roll interval is 30 seconds: as long as the Flume agent runs and data keeps flowing into the sink, a new file is rolled every 30 seconds. (Refer to the screenshot below for the result of the test file rolled into the destination_sink_dir directory.)
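        To observe the rolling behaviour, you can list the destination directory repeatedly; the path is a placeholder.

          # A new file should appear roughly every 30 seconds while events flow
          watch -n 10 ls -l /home/acadgild/destination_sink_dir/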

        • To stop the Flume agent, press Ctrl + C in the terminal running the agent. Every time you need the agent again, you have to start it manually with the configuration file.

        For more trending big data topics, keep visiting our website acadgild.com.
