From the course: Big Data Analytics with Hadoop and Apache Spark
Parallel writes with bucketing
- [Instructor] As reviewed in the earlier videos, bucketing can be used to partition data when a column has a large number of unique values. In this case, we create buckets based on the product column. We will create three buckets. To bucket the data, we use the bucketBy method, specifying the number of buckets and the column to bucket by. We also want to save this data as a Hive table; calling saveAsTable with a table name saves the data to Hive. The Hive table is created under the spark-warehouse folder, as we are using a dummy HDFS system. We run an example query against the table to verify its contents. Let's execute this code now. We can see the list of databases and their locations using the spark.catalog.listDatabases method. This shows the folder where the actual files are stored. We can go to HDFS, under the spark-warehouse folder, to examine its contents. We can see the product bucket table stored here. It is stored as multiple part files, each…