AWS Glue: TIL Best Practice for S3 Folder Design with Long Names
Update (Jun 6): Upon further testing, I realized that it was not the folder names but the folder count and the schemas within them that were messing up my table creation via the automatic crawl! I think the default crawler has some percentage-match algorithm where it considers folders with similar schemas to be partitions and not tables.
The premise of this learning is now incorrect…
Amazon Athena and AWS Glue are really powerful tools that help curate and build an efficient data lake and ETL infrastructure, and they provide a simple mechanism to query data in S3.
The documentation, however, is a bit sparse on how to design your bucket/folders to ensure that the default crawler creates the right tables/partitions.
Working Folder Names :
The figure below represents a bucket structure where the folders become tables
global-sales → global_sales (table name)
regional-sales → regional_sales (table name)
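The folder-to-table mapping above can be sketched in a few lines of Python. This is only an approximation for sanity-checking folder names before a crawl: the helper name folder_to_table_name is hypothetical, and the rule it applies (lowercase the name, then replace anything that is not a letter, digit, or underscore with an underscore) is an assumption inferred from the two examples above, not the crawler's documented algorithm.

```python
import re


def folder_to_table_name(folder: str) -> str:
    """Approximate the table name a crawler might derive from an S3 folder name.

    Assumption (not confirmed by the AWS docs quoted here): the name is
    lowercased and every character outside [a-z0-9_] becomes an underscore.
    """
    # Drop leading/trailing slashes so "global-sales/" and "global-sales" agree.
    cleaned = folder.strip("/").lower()
    return re.sub(r"[^a-z0-9_]", "_", cleaned)


if __name__ == "__main__":
    for folder in ("global-sales", "regional-sales"):
        print(folder, "->", folder_to_table_name(folder))
```

Running it over the top-level folders of a bucket gives a quick preview of the table names to expect, which makes it easier to spot when the crawler has produced partitions instead.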
Incorrect Folder Names :
The figure below represents a bucket structure where the folders do not become table names