External Tables in Azure Synapse

Comprehensive Guide to External Tables in Azure Synapse Analytics


Table of Contents

  1. Introduction
    • What are External Tables?
    • Benefits of Using External Tables in Azure Synapse
  2. Architecture Overview
    • Dedicated SQL Pool vs. Serverless SQL Pool
    • Components Involved in External Tables
  3. Prerequisites
    • Required Permissions
    • Necessary Azure Resources
    • Supported File Formats
  4. Creating External Tables
    • Step-by-Step Guide
    • Example: Creating a Parquet External Table
    • Example: Creating a CSV External Table
  5. Querying External Tables
    • Writing Queries on External Tables
    • Performance Considerations
  6. Managing External Tables
    • Viewing Existing External Tables
    • Modifying External Tables
    • Dropping External Tables
  7. Security and Access Control
    • Using Database Scoped Credentials
    • Managing Permissions
    • Securing Access to External Data
  8. Advanced Features
    • Partition Elimination
    • Filter Predicate Pushdown
    • Handling Appendable Files
  9. Best Practices
    • Optimizing Performance
    • Managing Large Datasets
    • Monitoring and Troubleshooting
  10. Use Cases
    • Data Warehousing
    • Data Lakes Integration
    • Real-Time Analytics
  11. Conclusion
    • Summary of Key Points
    • Future Trends in Data Integration

1. Introduction

What are External Tables?

External tables in Azure Synapse Analytics let you query data stored outside the SQL pool's own storage, such as files in Azure Blob Storage or Azure Data Lake Storage. The data is analyzed in place: the table definition holds only metadata that points at the files, so nothing has to be loaded into the pool before it can be queried.

Benefits of Using External Tables in Azure Synapse

  • Cost Efficiency: Avoids the need to load large datasets into the SQL pool, saving on storage and compute costs.
  • Flexibility: Enables querying of diverse data formats and sources.
  • Performance: Reads files in parallel and, with columnar formats such as Parquet, can skip columns that a query does not reference.
  • Scalability: Supports large-scale data processing across distributed environments.

2. Architecture Overview

Dedicated SQL Pool vs. Serverless SQL Pool

  • Dedicated SQL Pool: A provisioned compute environment whose resources are allocated up front. External tables in this pool can read formats such as Parquet and CSV.
  • Serverless SQL Pool: An on-demand compute environment that queries data without pre-provisioned resources. Its external tables can also read Delta Lake. JSON files are generally queried with OPENROWSET rather than through an external table.

Components Involved in External Tables

  • External Data Source: Defines the connection to the external data storage.
  • External File Format: Specifies the format of the data files (e.g., Parquet, CSV).
  • External Table: Represents the structure of the external data and maps it to a SQL table.

3. Prerequisites

Required Permissions

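  • CREATE EXTERNAL DATA SOURCE: The ability to create external data sources, which typically requires CONTROL permission on the database.
  • CREATE EXTERNAL FILE FORMAT: The ability to define external file formats, which requires the ALTER ANY EXTERNAL FILE FORMAT permission.
  • CREATE EXTERNAL TABLE: The ability to create external tables, which relies on CREATE TABLE, ALTER ANY SCHEMA, ALTER ANY EXTERNAL DATA SOURCE, and ALTER ANY EXTERNAL FILE FORMAT.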
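
As a rough sketch, the database permissions behind these statements could be granted as shown below. The principal [etl_user] is hypothetical, and the exact set required can differ between dedicated and serverless pools, so treat this as a starting point rather than a definitive list.

```sql
-- Hypothetical principal: replace [etl_user] with a real database user or role.

-- Needed to define external file formats.
GRANT ALTER ANY EXTERNAL FILE FORMAT TO [etl_user];

-- Needed, together with the table and schema permissions below,
-- to create external tables that reference a data source.
GRANT ALTER ANY EXTERNAL DATA SOURCE TO [etl_user];
GRANT CREATE TABLE TO [etl_user];
GRANT ALTER ANY SCHEMA TO [etl_user];
```

Creating the data source itself usually calls for broader rights on the database, so that step is often left to an administrator.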

Necessary Azure Resources

  • Azure Storage Account: Where the external data is stored.
  • Database Scoped Credential: For authenticating access to the external data.

Supported File Formats

  • Parquet: A columnar storage file format optimized for analytical queries.
  • CSV: A common text-based format for tabular data.
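  • Delta Lake: A storage layer that brings ACID transactions to Apache Spark and big data workloads. External tables over Delta Lake are a serverless SQL pool feature.
  • JSON: A lightweight data-interchange format. In Synapse, JSON files are generally read with OPENROWSET rather than through an external file format.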
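
For illustration, a Delta Lake external table in a serverless SQL pool might be declared as follows. This is a sketch under assumptions: the names DeltaFileFormat and OrdersDelta, the columns, and the folder path are hypothetical, and MyDataSource refers to the data source created in section 4.

```sql
-- Assumption: this runs in a serverless SQL pool database;
-- dedicated SQL pools do not accept FORMAT_TYPE = DELTA.
CREATE EXTERNAL FILE FORMAT DeltaFileFormat   -- hypothetical name
WITH (
    FORMAT_TYPE = DELTA
);

-- LOCATION points at the Delta table's root folder,
-- the one that contains the _delta_log directory.
CREATE EXTERNAL TABLE OrdersDelta (           -- hypothetical table and columns
    OrderID INT,
    OrderDate DATE,
    Amount DECIMAL(10, 2)
)
WITH (
    LOCATION = 'delta/orders/',
    DATA_SOURCE = MyDataSource,
    FILE_FORMAT = DeltaFileFormat
);
```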

4. Creating External Tables

Step-by-Step Guide

  1. Create a Database Scoped Credential

     -- A database master key must exist before a credential can be created.
     -- Skip this statement if the database already has one.
     CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong_password>';

     -- Supply the SAS token without a leading '?'.
     CREATE DATABASE SCOPED CREDENTIAL MyCredential
     WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
          SECRET = '<SAS_Token>';

  2. Create an External Data Source

     CREATE EXTERNAL DATA SOURCE MyDataSource
     WITH (
         LOCATION = 'https://<storage_account>.blob.core.windows.net',
         CREDENTIAL = MyCredential
     );

  3. Create an External File Format

     CREATE EXTERNAL FILE FORMAT MyFileFormat
     WITH (
         FORMAT_TYPE = PARQUET
     );

  4. Create an External Table

     CREATE EXTERNAL TABLE MyExternalTable (
         Column1 INT,
         Column2 VARCHAR(100)
     )
     WITH (
         LOCATION = 'path/to/data/',
         DATA_SOURCE = MyDataSource,
         FILE_FORMAT = MyFileFormat
     );

Example: Creating a Parquet External Table

CREATE EXTERNAL TABLE SalesData (
    SaleID INT,
    ProductName VARCHAR(100),
    Quantity INT,
    Price DECIMAL(10, 2)
)
WITH (
    LOCATION = 'sales_data/',
    DATA_SOURCE = MyDataSource,
    FILE_FORMAT = MyFileFormat
);

Example: Creating a CSV External Table

CREATE EXTERNAL FILE FORMAT CsvFileFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"'
    )
);

CREATE EXTERNAL TABLE CustomerData (
    CustomerID INT,
    CustomerName VARCHAR(100),
    ContactName VARCHAR(100),
    Country VARCHAR(50)
)
WITH (
    LOCATION = 'customers/',
    DATA_SOURCE = MyDataSource,
    FILE_FORMAT = CsvFileFormat
);
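
CSV files often carry a header line, which the format above would read as a data row. A minimal variant that skips it might look like this; the name CsvWithHeaderFormat is hypothetical, and the sketch assumes the header occupies exactly one line.

```sql
-- Variant of CsvFileFormat for files whose first line holds column names.
CREATE EXTERNAL FILE FORMAT CsvWithHeaderFormat   -- hypothetical name
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        FIRST_ROW = 2            -- start reading at line 2, skipping the header
    )
);
```

An external table such as CustomerData would then reference CsvWithHeaderFormat in its FILE_FORMAT option.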

5. Querying External Tables

Writing Queries on External Tables

SELECT * FROM MyExternalTable
WHERE Column1 > 100;

Performance Considerations

  • Data Proximity: Ensure that the Synapse workspace and the external data source are in the same region to minimize latency.
  • File Format Optimization: Use columnar formats like Parquet for better performance on large datasets.
  • Partitioning: Organize data into partitions to improve query performance.
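
To make the file format point concrete, here is a sketch of a query against the SalesData table from section 4 that names only the columns it needs. With Parquet this generally lets the engine read just those columns instead of every column in each file. The alias TotalRevenue is illustrative.

```sql
-- Name the required columns instead of using SELECT *.
SELECT ProductName,
       SUM(Quantity * Price) AS TotalRevenue   -- illustrative alias
FROM SalesData
WHERE Quantity > 0                             -- filter early to reduce rows processed
GROUP BY ProductName
ORDER BY TotalRevenue DESC;
```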

6. Managing External Tables

Viewing Existing External Tables

SELECT * FROM sys.external_tables;
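
The catalog view can also be joined to its companions to show which data source and file format each table depends on. The following is a sketch using the standard catalog views; the column aliases are illustrative.

```sql
-- List each external table with its storage location, data source, and file format.
SELECT t.name        AS table_name,
       t.location    AS table_location,
       ds.name       AS data_source_name,
       ds.location   AS data_source_location,
       ff.name       AS file_format_name,
       ff.format_type
FROM sys.external_tables AS t
JOIN sys.external_data_sources AS ds
    ON t.data_source_id = ds.data_source_id
JOIN sys.external_file_formats AS ff
    ON t.file_format_id = ff.file_format_id
ORDER BY t.name;
```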

Modifying External Tables

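External tables cannot be altered in place. To change one, drop it and recreate it with the new definition. Dropping removes only the table metadata; the underlying files are not affected.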
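
A drop-and-recreate cycle might be sketched as follows, reusing the objects from section 4. Column3 is a hypothetical new column, and the sketch assumes the underlying files actually contain it.

```sql
-- Remove the old definition if it exists. Only metadata is dropped;
-- the files in storage stay where they are.
IF EXISTS (SELECT 1 FROM sys.external_tables WHERE name = 'MyExternalTable')
    DROP EXTERNAL TABLE MyExternalTable;

-- Recreate the table with the changed column list.
CREATE EXTERNAL TABLE MyExternalTable (
    Column1 INT,
    Column2 VARCHAR(100),
    Column3 DATE              -- hypothetical added column
)
WITH (
    LOCATION = 'path/to/data/',
    DATA_SOURCE = MyDataSource,
    FILE_FORMAT = MyFileFormat
);
```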

Dropping External Tables

DROP EXTERNAL TABLE MyExternalTable;

7. Security and Access Control

Using Database Scoped Credentials

Database scoped credentials are used to authenticate access to external data sources. They can be created using SAS tokens or managed identities.

Managing Permissions

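Grant users and roles only the permissions they need. Querying an external table requires SELECT on the table, plus access to the underlying storage, either directly or through the credential that the data source uses.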
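
As a minimal sketch, read access could be granted like this. The principal [report_reader] and the dbo schema are assumptions, and whether the second grant is needed depends on the pool type and on how the data source authenticates.

```sql
-- [report_reader] is a hypothetical database user or role.
GRANT SELECT ON OBJECT::dbo.MyExternalTable TO [report_reader];

-- When the data source authenticates through a database scoped credential,
-- the reader may also need permission to reference that credential.
GRANT REFERENCES ON DATABASE SCOPED CREDENTIAL::MyCredential TO [report_reader];
```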

Securing Access to External Data

  • SAS Tokens: Provide scoped, time-limited access to resources in Azure Storage.
  • Managed Identity: Use Microsoft Entra ID (formerly Azure Active Directory) identities to access resources securely, without storing secrets in the database.
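
For comparison with the SAS-based credential in section 4, a managed identity setup might look like the sketch below. It assumes a serverless SQL pool database, and the names WorkspaceIdentity and MyLakeSource are hypothetical. The workspace's managed identity also needs a storage role such as Storage Blob Data Reader on the account for queries to succeed.

```sql
-- Serverless SQL pool spelling of the identity name. No secret is stored.
CREATE DATABASE SCOPED CREDENTIAL WorkspaceIdentity   -- hypothetical name
WITH IDENTITY = 'Managed Identity';

-- In a dedicated SQL pool the identity name is typically written as
-- 'Managed Service Identity' instead.

CREATE EXTERNAL DATA SOURCE MyLakeSource              -- hypothetical name
WITH (
    LOCATION = 'https://<storage_account>.dfs.core.windows.net/<container>',
    CREDENTIAL = WorkspaceIdentity
);
```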

8. Advanced Features

Partition Elimination

Partition elimination allows queries to skip irrelevant partitions, improving performance. This is supported in native external tables when the data
