# Data Delivery Setup Guide

Peach delivers your loan and portfolio data as Parquet files to Google Cloud Storage (GCS), from which you ingest into your preferred data platform. This guide walks you through configuring data delivery and connecting it to your environment.

Data delivery is provisioned as part of your Peach implementation. Your Customer Success contact will confirm your delivery configuration and access details.

**What you get:**

- Data delivered as Parquet files (compressed with GZIP) to a Peach-managed GCS bucket: `gs://peach-data-outbox/{client_name}/`
- Configurable delivery frequency: Daily (6 AM local), Twice Daily (6 AM and 6 PM local), or Hourly
- Native compatibility with Snowflake, Databricks, BigQuery, AWS Athena, and most modern ETL tools


Data delivery pricing and frequency options are defined in your order form. Contact your Customer Success representative for details.

## Setup Steps

| Step | Action | Owner | Details |
|  --- | --- | --- | --- |
| 1 | Confirm data delivery in your order form | Client + Peach CS | Select your delivery frequency (Daily, Twice Daily, or Hourly) and file format (Parquet). |
| 2 | Create service account and share with Peach | Client | Create a service account in your cloud environment (GCP, Azure, or AWS). Send the service account email address to your Peach contact. No keys or secrets are exchanged. |
| 3 | Peach grants access to GCS bucket | Peach | Peach adds your service account to the GCS bucket with read permissions. After this step, your data is accessible. |
| 4 | Configure ingestion pipeline | Client | Set up your ETL pipelines to pull data from the GCS bucket into your data warehouse. See platform-specific instructions below. |
| 5 | Validate data | Client | Confirm data completeness and accuracy in your environment before going live. |


## Which Instructions Apply to You?

The setup steps above are the same for all clients. The platform-specific instructions below depend on your data infrastructure. Navigate to the section that matches your data destination:

- [Databricks Delta](#databricks-delta)
- [Azure Storage](#azure-storage)
- [Amazon S3 via Fivetran](#amazon-s3-via-fivetran)
- [Amazon S3 Push Delivery](#amazon-s3-push-delivery)
- [BigQuery / Snowflake](#bigquery--snowflake)
- [DuckDB (Local Testing)](#duckdb-local-testing)


If your platform is not listed, contact your Peach technical representative.

### Alternative Delivery Methods

In addition to the GCS-based options above, Peach also offers Snowflake-native data sharing:

- **Snowflake Reader Accounts:** Access your data via a Peach-provided Snowflake Reader Account using Snowflake Secure Data Sharing. See [Access Peach Data w/ Snowflake Reader Accounts](/data-reporting/snowflake-reader-accounts) for setup instructions.
- **Snowflake Direct Share:** Peach shares your data directly to your existing Snowflake account via Snowflake Secure Data Sharing. See [Access Peach Data w/ Snowflake Direct Share](/data-reporting/snowflake-direct-share) for setup instructions.


## Databricks Delta

This section covers ingesting Parquet files from GCS into Databricks as Delta tables.

### Prerequisites

- GCP Project with Owner or Storage Admin permissions
- Databricks workspace with cluster access
- Databricks CLI installed
- gcloud CLI installed


### Install Required Tools

1. Install gcloud CLI: Follow Google's installation guide for your OS.
2. Authenticate gcloud:


```shell
gcloud auth login
gcloud auth application-default login
```

3. Install Databricks CLI: Follow the Databricks CLI installation guide.
4. Authenticate Databricks CLI:


```shell
databricks configure --token
# Host: https://<YOUR-DATABRICKS-WORKSPACE-URL>
# Token: <YOUR-PAT>
```

### Create GCP Service Account


```shell
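# Set the GCP project that will own the service account
PROJECT_ID=$(gcloud config get-value project)

# Create the service account
gcloud iam service-accounts create databricks-sa \
  --display-name="Databricks GCS Import SA"

# Create and download a key (used only inside your Databricks workspace)
gcloud iam service-accounts keys create ./gcs-key.json \
  --iam-account="databricks-sa@${PROJECT_ID}.iam.gserviceaccount.com"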
```

> **Hand-off Point:** Send the service account email (`databricks-sa@$PROJECT_ID.iam.gserviceaccount.com`) to your Peach contact. Peach will grant read access to your GCS folder. Share only the email address; the key file stays in your environment.


### Store Credentials in Databricks Secret Manager


```shell
# Create a secret scope
databricks secrets create-scope gcs-creds

# Upload the service account key
databricks secrets put-secret gcs-creds service-account.json \
  --string-value "$(cat ./gcs-key.json)"
```

> **Important:** All secrets must live in Databricks Secret Manager. Do not embed credentials in notebooks.


### Create Python Ingestion Script

Save this as `gcs_to_delta.py`:


```python
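from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils

# When run as a job's Python file (not a notebook), create these handles explicitly
spark = SparkSession.builder.getOrCreate()
dbutils = DBUtils(spark)

# Load the JSON key from Secret Manager
key_json = dbutils.secrets.get("gcs-creds", "service-account.json")

# Write it to DBFS so Spark can pick it up (path must match the job's spark_conf)
dbutils.fs.put("dbfs:/tmp/gcs-key.json", key_json, True)

# Read from GCS (replace {client_name} with your folder name)
df = spark.read.parquet(
    "gs://peach-data-outbox/{client_name}/transactions"
)

# Print the schema and a few rows as a sanity check
df.printSchema()
df.show(5)

# Write as a managed Delta table
df.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("default.transactions")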
```

### Deploy Databricks Job

1. Upload the script:


```shell
databricks fs mkdir dbfs:/FileStore/scripts
databricks fs cp ./gcs_to_delta.py dbfs:/FileStore/scripts/gcs_to_delta.py
```

2. Create `job_spec.json`:


```json
{
  "name": "GCS_to_Delta_Import",
  "tasks": [
    {
      "task_key": "run_ingestion_script",
      "new_cluster": {
        "spark_version": "17.0.x-scala2.13",
        "node_type_id": "e2-standard-16",
        "num_workers": 1,
        "spark_conf": {
          "spark.hadoop.fs.gs.auth.service.account.enable": "true",
          "spark.hadoop.fs.gs.auth.service.account.json.keyfile": "/dbfs/tmp/gcs-key.json"
        }
      },
      "spark_python_task": {
        "python_file": "dbfs:/FileStore/scripts/gcs_to_delta.py"
      }
    }
  ],
  "max_concurrent_runs": 1,
  "format": "MULTI_TASK",
  "timeout_seconds": 3600
}
```

3. Create and run the job:


```shell
databricks jobs create --json @job_spec.json
databricks jobs list
databricks jobs run-now <JOB_ID>
```

### Verify

1. Run `databricks jobs list-runs --job-id <JOB_ID>` to check the run status.
2. In your Databricks workspace, navigate to Data > Tables.
3. Verify that `default.transactions` exists and contains data.


## Azure Storage

This section covers transferring data from GCS to Azure Blob Storage using keyless authentication.

### Prerequisites

- Azure subscription with Owner or Contributor access
- Azure Portal access


### Create Azure Storage Account

1. In the Azure Portal, search for Storage accounts
2. Click + Create
3. Configure:
  - Resource group: Choose existing or create new
  - Storage account name: Choose a globally unique name (e.g., peachdatainbox)
  - Region: Select a region close to your workloads
4. Click Review + Create → Create


### Create a Container

1. Navigate into your new Storage Account
2. Under Data storage, select Containers
3. Click + Container
4. Configure:
  - Name: e.g., peach-data
  - Public access level: Private (no anonymous access)
5. Click Create


### Register an Application in Microsoft Entra ID

1. In Azure Portal, search for Microsoft Entra ID → App registrations
2. Click + New registration
3. Enter:
  - Name: gcs-to-azure-ingest
  - Leave Redirect URI blank
4. Click Register
5. Save the Application (client) ID and Directory (tenant) ID


### Grant Storage Permissions to the App

1. Go to your Storage Account
2. Select Access control (IAM)
3. Click + Add → Add role assignment
4. Configure:
  - Role: Storage Blob Data Contributor
  - Scope: Apply at the container level (preferred)
  - Assign access to: User, group, or service principal
  - Select: Your app registration (gcs-to-azure-ingest)
5. Save


### Configure Federated Identity Credential

1. Return to your App registration
2. Go to Certificates & secrets → Federated credentials
3. Click + Add credential
4. Configure:
  - Federated credential scenario: Other issuer
  - Issuer: https://accounts.google.com
  - Type: Explicit subject identifier
  - Value: 117789079584147916823 (Peach's Google Service Account identifier)
  - Name: peach-data
  - Audience: api://AzureADTokenExchange
5. Save
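
If you would rather script this step than click through the portal, the same settings can be written as JSON and passed to the Azure CLI (`az ad app federated-credential create --id <APP_ID> --parameters credential.json`). The sketch below only builds that JSON from the values listed above; the function name and the `credential.json` filename are illustrative, and using the Azure CLI at all is an assumption about your workflow.

```python
import json


def federated_credential(subject: str, name: str = "peach-data") -> dict:
    """Return the federated credential settings from the portal steps as a dict."""
    if not subject.isdigit():
        raise ValueError("subject should be the numeric Google service account identifier")
    return {
        "name": name,
        "issuer": "https://accounts.google.com",
        "subject": subject,
        "audiences": ["api://AzureADTokenExchange"],
    }


if __name__ == "__main__":
    # Subject value taken from the portal steps above
    print(json.dumps(federated_credential("117789079584147916823"), indent=2))
```

Redirect the output to a file if you want to keep it for the CLI command.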


### Hand-off Point

Send the following to your Peach contact:

| Item | Example |
|  --- | --- |
| Storage account name | peachdatainbox |
| Container name | peach-data |
| Tenant ID | a1b2c3d4-e5f6-7890-abcd-ef1234567890 |
| Client ID | 12345678-abcd-ef90-1234-567890abcdef |


Peach will configure a job to export your data into your Azure Storage Container.

## Amazon S3 via Fivetran

This section covers using Fivetran to pull Parquet files from GCS into your Amazon S3 bucket.

### Prerequisites

- Fivetran account
- Amazon S3 bucket owned by your organization
- IAM roles and bucket policies configured


### Set Up S3 Destination in Fivetran

1. Log in to Fivetran
2. Set up an Amazon S3 Data Lake destination following Fivetran's S3 destination guide
3. Configure IAM roles, encryption, and bucket policies as required


### Create GCS Files Connector in Fivetran

1. Navigate to Connectors > Add Connector > Files > Google Cloud Storage
2. Configure:
  - Bucket Name: (Provided by Peach)
  - Folder Path: (Agreed upon with Peach)
  - File Type: parquet
  - Compression: gzip
3. After entering the bucket name, Fivetran displays a unique service account email: `fivetran-connector-XXXXXX@fivetran-gcs-prod.iam.gserviceaccount.com`


### Hand-off Point

Send the Fivetran service account email to your Peach contact.

> **Note:** Do not download or exchange any keys — this is a keyless setup.


Peach will grant the Fivetran service account roles/storage.objectViewer (read-only) access to your GCS folder.

### Complete Connector Setup

1. After Peach confirms access is granted, return to Fivetran
2. Click Test Connection to verify connectivity
3. Set the sync frequency to run at least as often as your delivery cadence (e.g., every 15 minutes so that hourly files are picked up promptly)
4. Save and run the connector


### Verify

1. Monitor the connector's status dashboard in Fivetran
2. Confirm rows processed and file ingestion success
3. Validate data appears in your S3 bucket as expected


## Amazon S3 Push Delivery

This section covers configuring your AWS environment so that Peach can push data files directly into your Amazon S3 bucket. Unlike the Fivetran option above, this is a direct delivery — Peach writes files to your S3 bucket using OIDC federation. No static credentials are exchanged.

Peach assumes an IAM role in your AWS account using Web Identity Federation. Peach obtains a short-lived token and exchanges it for temporary AWS credentials via `sts:AssumeRoleWithWebIdentity`.

### What Peach Provides

Before you begin, Peach will share the following:

| Item | Description |
|  --- | --- |
| GCP Service Account Email | `s3-file-delivery-sa@<project>.iam.gserviceaccount.com` |
| GCP Service Account Unique ID | A numeric identifier used in the IAM trust policy (not a secret) |


### File Organization

Files are delivered as compressed Parquet under a prefix you choose. The path structure mirrors the source layout:


```
s3://<your-bucket>/<prefix>/table_name/year=YYYY/month=MM/day=DD/hour=HH/batch=ID/file.parquet
```

For example, if you configure the prefix `vendor/peach/`, a delivery might look like:


```
s3://your-data-lake/vendor/peach/transactions/year=2026/month=04/day=03/hour=12/batch=202604031200/0000.parquet
```
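For validation, it can be useful to parse delivered object keys back into their partition values and look for gaps. The following is a minimal sketch, assuming hourly delivery and keys that follow the layout shown above exactly; the function names are illustrative, and whether the `hour=` partition is in local time or UTC is something to confirm with your Peach contact.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, List, Optional

# Mirrors the documented layout: <prefix>/<table>/year=/month=/day=/hour=/batch=/<file>.parquet
KEY_PATTERN = re.compile(
    r"^(?P<prefix>.*?)(?P<table>[^/]+)"
    r"/year=(?P<year>\d{4})/month=(?P<month>\d{2})/day=(?P<day>\d{2})"
    r"/hour=(?P<hour>\d{2})/batch=(?P<batch>[^/]+)/(?P<file>[^/]+\.parquet)$"
)


def parse_key(key: str) -> Optional[dict]:
    """Split an object key into its partition parts, or return None if it does not match."""
    match = KEY_PATTERN.match(key)
    if match is None:
        return None
    parts = match.groupdict()
    parts["partition_hour"] = datetime(
        int(parts["year"]), int(parts["month"]), int(parts["day"]), int(parts["hour"])
    )
    return parts


def missing_hours(keys: Iterable[str], table: str, start: datetime, end: datetime) -> List[datetime]:
    """Return every hour in [start, end] that has no delivered file for `table`."""
    seen = set()
    for key in keys:
        parts = parse_key(key)
        if parts is not None and parts["table"] == table:
            seen.add(parts["partition_hour"])
    missing = []
    current = start
    while current <= end:
        if current not in seen:
            missing.append(current)
        current += timedelta(hours=1)
    return missing
```

Feed it the keys returned by an S3 listing of your prefix; an empty result means every hour in the window has at least one file.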

Peach only writes objects under the prefix you designate. No reads, deletes, or writes outside that prefix are required.

### Step 1: Create an S3 Bucket (or Use an Existing One)

If you do not already have a destination bucket:

1. In the AWS Console, navigate to S3 > Create bucket.
2. Choose a bucket name (e.g., `your-company-data-lake`) and region.
3. Under Object Ownership, select **Bucket owner enforced** (recommended — this disables ACLs and ensures all objects are owned by your account).
4. Leave other settings as default or adjust to your requirements.
5. Click Create bucket.


### Step 2: Create an OIDC Identity Provider (One-Time per AWS Account)

> **Already have Google as a provider?** In IAM > Identity providers, look for `accounts.google.com`. If it exists, verify that both `sts.amazonaws.com` and the service account unique ID (provided by Peach) are listed in its audiences. If so, skip to Step 3. If either audience is missing, click on the provider and select Add audience.


If you need to create the provider:

**Via AWS Console:**

1. Navigate to IAM > Identity providers > Add provider.
2. Select OpenID Connect.
3. Enter:
  - Provider URL: `https://accounts.google.com`
  - Audience: `sts.amazonaws.com`
4. Click Add provider.
5. Click on the newly created provider and select **Add audience**. Add the service account unique ID provided by Peach.


**Via AWS CLI:**


```shell
aws iam create-open-id-connect-provider \
  --url https://accounts.google.com \
  --client-id-list sts.amazonaws.com <SA_UNIQUE_ID> \
  --thumbprint-list 0000000000000000000000000000000000000000
```

> **Note:** AWS no longer validates thumbprints for well-known OIDC providers like Google — it uses its own trusted CA library instead. The CLI requires the `--thumbprint-list` parameter syntactically, but the value is not used for validation. The console handles this automatically.


### Step 3: Create an IAM Role

**Via AWS Console:**

1. Navigate to IAM > Roles > Create role.
2. Select **Web identity** as the trusted entity type.
3. Choose the identity provider `accounts.google.com` and audience `sts.amazonaws.com`.
4. Name the role (e.g., `PeachDataDelivery`).
5. Complete the wizard.


> **Important:** The console wizard generates a trust policy with an `aud` condition but no `sub` condition. After the role is created, you must replace the entire trust policy with the JSON below to restrict access to Peach's specific service account.


6. Go to the role > Trust relationships > Edit trust policy. Replace the entire policy with the following, substituting `<YOUR_ACCOUNT_ID>` with your AWS account ID and `<SA_UNIQUE_ID>` with the numeric ID Peach provides:


```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<YOUR_ACCOUNT_ID>:oidc-provider/accounts.google.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "accounts.google.com:sub": "<SA_UNIQUE_ID>"
        }
      }
    }
  ]
}
```
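To avoid hand-editing the placeholders, the trust policy can also be generated. This is a minimal sketch that reproduces the JSON above from two inputs; the function name is illustrative, and the IDs in the example are placeholders rather than real values.

```python
import json
import re


def build_trust_policy(account_id: str, sa_unique_id: str) -> dict:
    """Build the trust policy shown above for one AWS account and one service account ID."""
    if not re.fullmatch(r"\d{12}", account_id):
        raise ValueError("AWS account IDs are 12 digits")
    if not sa_unique_id.isdigit():
        raise ValueError("the service account unique ID should be numeric")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/accounts.google.com"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {"accounts.google.com:sub": sa_unique_id}
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values: substitute your account ID and the ID Peach provides
    print(json.dumps(build_trust_policy("123456789012", "100000000000000000000"), indent=2))
```

Redirecting the output to `trust-policy.json` gives you a file you can paste into the console or pass to the AWS CLI.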

**Via AWS CLI:**

Save the trust policy JSON above to a file called `trust-policy.json` (replacing the placeholders), then run:


```shell
aws iam create-role \
  --role-name PeachDataDelivery \
  --assume-role-policy-document file://trust-policy.json
```

> **Important:** The `sub` condition ensures that only Peach's specific service account can assume this role. Do not remove it. Audience validation is handled automatically by AWS via the OIDC provider's registered client ID list.


### Step 4: Attach an S3 Write Policy

Create and attach an inline or managed policy to the role that grants write access only to your chosen prefix:


```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::<YOUR_BUCKET>/vendor/peach/*"
    }
  ]
}
```

Replace `<YOUR_BUCKET>` with your bucket name and `vendor/peach/*` with the prefix you want Peach to write under.

**Via AWS CLI:**

Save the policy JSON above to `s3-policy.json`, then run:


```shell
aws iam put-role-policy \
  --role-name PeachDataDelivery \
  --policy-name PeachS3Write \
  --policy-document file://s3-policy.json
```

If you use KMS encryption, also add the following statement to the policy:


```json
{
  "Effect": "Allow",
  "Action": [
    "kms:GenerateDataKey",
    "kms:Decrypt"
  ],
  "Resource": "arn:aws:kms:<REGION>:<YOUR_ACCOUNT_ID>:key/<KMS_KEY_ID>"
}
```
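The write policy and the optional KMS statement can be assembled in one place so the prefix and key ARN stay consistent. The sketch below is one way to do that, not a required tool; the function name and the example bucket, prefix, and key ARN are placeholders.

```python
import json
from typing import Optional


def build_write_policy(bucket: str, prefix: str, kms_key_arn: Optional[str] = None) -> dict:
    """Build an S3 write policy scoped to one prefix, plus KMS access if a key ARN is given."""
    clean_prefix = prefix.strip("/")
    if not clean_prefix:
        # An empty prefix would grant writes to the whole bucket
        raise ValueError("a non-empty prefix is required to keep the policy narrowly scoped")
    statements = [
        {
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{bucket}/{clean_prefix}/*",
        }
    ]
    if kms_key_arn:
        statements.append(
            {
                "Effect": "Allow",
                "Action": ["kms:GenerateDataKey", "kms:Decrypt"],
                "Resource": kms_key_arn,
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    # Placeholder bucket and prefix
    print(json.dumps(build_write_policy("your-company-data-lake", "vendor/peach/"), indent=2))
```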

> **Note:** `kms:Decrypt` is not strictly required for the write operation, but is included so you can read the delivered files with the same role and for compatibility with future multipart uploads.


### Step 5: Verify Your Setup (Optional)

Before handing off to Peach, you can confirm the role is configured correctly:


```shell
aws iam get-role --role-name PeachDataDelivery --query 'Role.AssumeRolePolicyDocument'
```
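The JSON printed by this command can also be checked programmatically instead of by eye. The following sketch assumes you have loaded that output into a Python dict (for example with `json.load`); the function name and the wording of the returned messages are illustrative.

```python
from typing import List


def check_trust_policy(doc: dict, account_id: str, sa_unique_id: str) -> List[str]:
    """Return a list of problems found in a role trust policy; an empty list means it looks right."""
    expected_principal = f"arn:aws:iam::{account_id}:oidc-provider/accounts.google.com"
    statements = doc.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]

    def actions_of(statement: dict) -> list:
        actions = statement.get("Action", [])
        return [actions] if isinstance(actions, str) else list(actions)

    matching = [
        s for s in statements
        if s.get("Effect") == "Allow" and "sts:AssumeRoleWithWebIdentity" in actions_of(s)
    ]
    if not matching:
        return ["no Allow statement for sts:AssumeRoleWithWebIdentity"]

    problems = []
    for statement in matching:
        principal = statement.get("Principal")
        federated = principal.get("Federated") if isinstance(principal, dict) else None
        if federated != expected_principal:
            problems.append("federated principal is not the accounts.google.com provider in this account")
        sub = statement.get("Condition", {}).get("StringEquals", {}).get("accounts.google.com:sub")
        if sub != sa_unique_id:
            problems.append("sub condition is missing or does not match the service account unique ID")
    return problems
```

A trust policy left as the console wizard generated it (an `aud` condition only) is reported as missing the `sub` condition.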

Verify the output matches the trust policy JSON from Step 3, with your account ID and Peach's service account unique ID filled in.

### Hand-off Point

Send the following to your Peach contact:

| Item | Example |
|  --- | --- |
| S3 bucket name | `your-company-data-lake` |
| S3 prefix | `vendor/peach/` |
| IAM Role ARN | `arn:aws:iam::123456789012:role/PeachDataDelivery` |
| AWS region | `us-east-1` |
| KMS key ARN (optional) | Required only if you use server-side encryption with a customer-managed key |


Peach will configure the delivery pipeline and begin sending files to your bucket.

### Recommendations

- **Prefix scoping:** Always scope the IAM policy to the narrowest prefix possible. Peach does not need access to any objects outside the delivery prefix.
- **Versioning:** Consider enabling S3 versioning on your bucket for an additional layer of protection against accidental overwrites.
- **Encryption:** If you require server-side encryption with a customer-managed KMS key (SSE-KMS), share the KMS key ARN with Peach during the hand-off. Otherwise, S3 default encryption (SSE-S3) will apply.


### Troubleshooting

| Symptom | Likely Cause | Resolution |
|  --- | --- | --- |
| Files are not arriving | The IAM trust policy `sub` condition does not match Peach's service account ID | Verify the numeric ID in the trust policy matches the value Peach provided |
| AccessDenied on S3 writes | The IAM policy Resource does not cover the write path | Ensure the policy includes the full prefix (e.g., `arn:aws:s3:::bucket/vendor/peach/*`) |
| KMS.AccessDeniedException | The delivery role does not have `kms:GenerateDataKey` on the KMS key | Add KMS permissions to the role's policy. If your KMS key has a restrictive key policy, also add the role ARN to the key policy. |


If you encounter issues not covered above, reach out to your Peach contact and we will help resolve them.

## BigQuery / Snowflake

This section covers ingesting Parquet files from GCS directly into BigQuery or Snowflake.

> **Note:** If you prefer Snowflake-native data sharing instead of GCS-based ingestion, Peach also offers [Snowflake Reader Accounts](/data-reporting/snowflake-reader-accounts) and [Snowflake Direct Share](/data-reporting/snowflake-direct-share) as alternative delivery methods that do not require GCS ingestion.


### Prerequisites

- GCP Project with Owner or Storage Admin permissions
- Google Cloud Service Account
- For BigQuery: roles/bigquery.dataEditor on your dataset
- For Snowflake: Account with CREATE INTEGRATION privileges


### Create and Share Service Account


```shell
# Create the service account (IDs may contain lowercase letters, digits, and hyphens only)
gcloud iam service-accounts create peach-data \
  --display-name="Peach Data Ingestion"

# Get the service account email
SERVICE_ACCOUNT=peach-data@YOUR_PROJECT.iam.gserviceaccount.com
```
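Google Cloud rejects service account IDs that are not 6 to 30 characters of lowercase letters, digits, and hyphens, starting with a letter and ending with a letter or digit. A small check like the one below can catch a bad ID before you run the command or send the email address to Peach; the function names are illustrative.

```python
import re

# 6 to 30 characters: a lowercase letter, then letters/digits/hyphens, ending in a letter or digit
SA_ID_PATTERN = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def is_valid_service_account_id(sa_id: str) -> bool:
    """Check a service account ID against Google Cloud's naming rules."""
    return SA_ID_PATTERN.fullmatch(sa_id) is not None


def service_account_email(sa_id: str, project_id: str) -> str:
    """Build the email address that identifies the service account."""
    if not is_valid_service_account_id(sa_id):
        raise ValueError(f"invalid service account ID: {sa_id!r}")
    return f"{sa_id}@{project_id}.iam.gserviceaccount.com"


if __name__ == "__main__":
    print(service_account_email("peach-data", "YOUR_PROJECT"))
```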

> **Hand-off Point:** Send the service account email to your Peach contact. Peach will grant access to your GCS folder (gs://peach-data-outbox/{client_name}).


### Grant BigQuery Permissions (if using BigQuery)


```shell
gcloud projects add-iam-policy-binding YOUR_PROJECT \
  --member="serviceAccount:$SERVICE_ACCOUNT" \
  --role="roles/bigquery.dataEditor"
```

### BigQuery Ingestion

### Option 1: External Table via SQL


```sql
CREATE OR REPLACE EXTERNAL TABLE
`YOUR_PROJECT.YOUR_DATASET.TABLE_NAME`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://peach-data-outbox/{client_name}/TABLE_NAME/*']
);
```
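If you have many tables, the statement above can be generated per table rather than written by hand. This is a minimal sketch that only produces the SQL text; running it against BigQuery is left to your own client or console, and the function name and example values are placeholders.

```python
import re

# Conservative identifier check so table names cannot inject extra SQL
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def external_table_ddl(project: str, dataset: str, client_name: str, table: str) -> str:
    """Return the CREATE EXTERNAL TABLE statement for one delivered table."""
    if not IDENTIFIER.fullmatch(table):
        raise ValueError(f"unexpected table name: {table!r}")
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{project}.{dataset}.{table}`\n"
        "OPTIONS (\n"
        "  format = 'PARQUET',\n"
        f"  uris = ['gs://peach-data-outbox/{client_name}/{table}/*']\n"
        ");"
    )


if __name__ == "__main__":
    # Placeholder project, dataset, client folder, and table names
    for name in ["transactions", "loans"]:
        print(external_table_ddl("YOUR_PROJECT", "YOUR_DATASET", "your_client_name", name))
        print()
```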

### Option 2: Dataform-Based Ingestion

1. Generate table list:


```shell
export TABLES=$(gsutil ls gs://peach-data-outbox/{client_name}/ \
  | sed 's#gs://peach-data-outbox/{client_name}/##; s#/$##' \
  | paste -sd"," -)
```

2. Configure `dataform.json`:


```json
{
  "defaultSchema": "YOUR_DATASET",
  "defaultCredentials": {
    "project_id": "YOUR_PROJECT",
    "keyFile": "/path/to/your-service-account.json"
  }
}
```

3. Create `definitions/incremental_tables.js`:


```javascript
const tables = process.env.TABLES.split(',');

tables.forEach(name => {
  publish(name, {
    type: "external",
    bigquery: {
      external: {
        sourceFormat: "PARQUET",
        compression: "GZIP",
        sourceUris: [
          `gs://peach-data-outbox/{client_name}/${name}/*`
        ]
      }
    }
  });
});
```

4. Run Dataform:


```shell
dataform compile
dataform run
```

### Snowflake Ingestion

1. Create Storage Integration:


```sql
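CREATE STORAGE INTEGRATION gcs_peach_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'GCS'
  ENABLED = TRUE
  STORAGE_ALLOWED_LOCATIONS = ('gcs://peach-data-outbox/{client_name}/');

-- Snowflake generates its own GCP service account for the integration.
-- Read it from the STORAGE_GCP_SERVICE_ACCOUNT property and send that
-- email to your Peach contact so it can be granted read access.
DESC STORAGE INTEGRATION gcs_peach_int;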
```

2. Create Stage:


```sql
CREATE OR REPLACE STAGE peach_data_outbox
  URL = 'gcs://peach-data-outbox/{client_name}/'
  STORAGE_INTEGRATION = gcs_peach_int
  FILE_FORMAT = (TYPE = PARQUET COMPRESSION = AUTO);
```

3. Create External Table:


```sql
-- Refresh manually (ALTER EXTERNAL TABLE ... REFRESH) or on a schedule;
-- AUTO_REFRESH on GCS requires a notification integration.
CREATE OR REPLACE EXTERNAL TABLE YOUR_DB.YOUR_SCHEMA.TABLE_NAME
  WITH LOCATION = @peach_data_outbox/TABLE_NAME/
  AUTO_REFRESH = FALSE
  FILE_FORMAT = (TYPE = PARQUET COMPRESSION = AUTO);
```

## DuckDB (Local Testing)

This section covers querying and ingesting Parquet files locally using *DuckDB*, an open-source, lightweight, serverless SQL database management system. DuckDB is designed to run complex analytical queries on large datasets efficiently without a separate server process, which makes it a good fit for quick data exploration, validation, or teams without existing cloud data warehouse infrastructure.

> **Note:** DuckDB is best suited for local testing and data exploration, not production data pipelines. To access your Parquet files in GCS, you will need a GCP account and a service account. Share the service account email with Peach so we can grant read access to your GCS bucket. No keys or secrets are exchanged.


Key features of DuckDB:

- **In-process**: Runs directly within your application, no separate server needed
- **Fast analytics**: Optimized for analytical workloads with columnar storage
- **SQL compliant**: Supports standard SQL with advanced analytical functions
- **Zero dependencies**: Single binary installation, no external dependencies
- **Native Parquet support**: First-class support for reading and writing Parquet files


### Installation

### Option 1: Command Line Interface (CLI)

The simplest way to get started with DuckDB is through its CLI tool.

**macOS**


```shell
# Using Homebrew
brew install duckdb
```

**Linux**


```shell
# Download the latest release
wget https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip
unzip duckdb_cli-linux-amd64.zip
chmod +x duckdb
sudo mv duckdb /usr/local/bin/
```

**Windows**


```shell
# Using winget
winget install DuckDB.cli

# Or download directly from:
# https://github.com/duckdb/duckdb/releases/latest
```

**Verify Installation**


```shell
duckdb --version
```

### Option 2: Python Integration

If you're working with Python, install the DuckDB Python package:


```shell
pip install duckdb
```

### Ingesting Parquet Files into DuckDB

### Method 1: Using DuckDB CLI

This method is perfect for quick data exploration and one-off queries.

**Step 1: Launch DuckDB**

Open your terminal and start DuckDB:


```shell
# Start with an in-memory database (temporary)
duckdb

# Or create/open a persistent database file
duckdb mydata.db
```

You should see the DuckDB prompt at this point.

**Step 2: Query Parquet Files Directly**

DuckDB can query Parquet files without importing them first:


```sql
-- Use glob patterns to query multiple files
SELECT * FROM 'data/*.parquet';

-- Query multiple files with different patterns
SELECT * FROM 'data/year=2024/month=*/day=*/*.parquet';
```

**Step 3: Create a View (Optional)**

For easier querying, create a view that points to your Parquet file(s):


```sql
-- Create a view
CREATE VIEW my_data AS
SELECT * FROM 'path/to/data.parquet';

-- Now query the view
SELECT * FROM my_data WHERE category = 'A';
```

**Step 4: Import Parquet Data into a Table**

To permanently store the data in your DuckDB database:


```sql
-- Method A: Create table from Parquet file directly
CREATE TABLE my_table AS
SELECT * FROM 'data.parquet';

-- Method B: Create table with explicit schema first
-- (OR REPLACE avoids an error if my_table already exists from Method A)
CREATE OR REPLACE TABLE my_table (
    id INTEGER,
    name VARCHAR,
    value DOUBLE,
    timestamp TIMESTAMP
);

-- Then insert data from Parquet
INSERT INTO my_table
SELECT * FROM 'data.parquet';

-- Method C: Import multiple Parquet files at once
CREATE TABLE combined_data AS
SELECT * FROM 'data/*.parquet';
```

**Step 5: Verify the Import**


```sql
-- Check row count
SELECT COUNT(*) FROM my_table;

-- Inspect schema
DESCRIBE my_table;

-- Preview data
SELECT * FROM my_table LIMIT 5;

-- Get basic statistics
SELECT
    COUNT(*) as total_rows,
    COUNT(DISTINCT id) as unique_ids,
    MIN(timestamp) as earliest_date,
    MAX(timestamp) as latest_date
FROM my_table;
```

### Method 2: Using Python with DuckDB

This method is ideal for data pipelines and programmatic workflows.

**Step 1: Setup**


```python
import duckdb

# Create an in-memory database connection
conn = duckdb.connect()

# Or connect to a persistent database
conn = duckdb.connect('mydata.db')
```

**Step 2: Query Parquet Files Directly**


```python
# Simple query
result = conn.execute("""
    SELECT * FROM 'data.parquet' LIMIT 10
""").fetchall()

# Or use fetchdf() to get a pandas DataFrame
df = conn.execute("""
    SELECT * FROM 'data.parquet'
    WHERE category = 'A'
""").fetchdf()

print(df.head())
```

**Step 3: Create a Table from Parquet**


```python
# Method A: Direct table creation
conn.execute("""
    CREATE TABLE my_table AS
    SELECT * FROM 'data.parquet'
""")

# Method B: Using read_parquet function
# (OR REPLACE lets this run after Method A without a "table already exists" error)
conn.execute("""
    CREATE OR REPLACE TABLE my_table AS
    SELECT * FROM read_parquet('data/*.parquet')
""")

# Method C: From pandas DataFrame (if you load Parquet via pandas)
import pandas as pd
df = pd.read_parquet('data.parquet')
conn.execute("CREATE OR REPLACE TABLE my_table AS SELECT * FROM df")
```

**Step 4: Work with the Data**


```python
# Query the table
result = conn.execute("""
    SELECT category, COUNT(*) as count, AVG(value) as avg_value
    FROM my_table
    GROUP BY category
""").fetchdf()

print(result)

# Update data
conn.execute("""
    UPDATE my_table
    SET value = value * 1.1
    WHERE category = 'A'
""")

# Create indexes for faster queries
conn.execute("""
    CREATE INDEX idx_category ON my_table(category)
""")
```

**Step 5: Close Connection**


```python
conn.close()
```

### Troubleshooting

**Issue: Memory Errors with Large Files**


```sql
-- Increase memory limit
SET memory_limit='16GB';

-- Or use streaming/chunked processing
SELECT * FROM 'large.parquet' WHERE id BETWEEN 1 AND 1000;
```

**Issue: Schema Mismatch**


```sql
-- Check actual schema
DESCRIBE SELECT * FROM 'data.parquet';

-- Use explicit SELECT to handle mismatches
CREATE TABLE my_table AS
SELECT
    column1,
    column2,
    CAST(column3 AS INTEGER) as column3
FROM 'data.parquet';
```

### Best Practices

1. **Use Persistence:** To keep imported data between sessions, use a file-based database (`duckdb mydata.db`) rather than an in-memory one.
2. **Leverage Direct Queries:** For read-only operations, query Parquet files directly without importing.
3. **Index Frequently Queried Columns:** Create indexes on columns used in WHERE clauses and JOINs.
4. **Compress Output:** Always use compression when exporting to Parquet (GZIP for better compression ratio, SNAPPY for faster performance).
5. **Set Appropriate Memory Limits:** Configure `memory_limit` based on your system resources.
6. **Use Explicit Schemas:** When creating tables, specify schemas explicitly for better type control.
7. **Batch Operations:** For large imports, use a single CREATE TABLE AS statement rather than multiple INSERTs.


### Additional Resources

- DuckDB official documentation
- DuckDB Parquet guide
- DuckDB Python API reference
- DuckDB GitHub repository


## Frequently Asked Questions

**How often are files delivered?**

Delivery frequency depends on your selected plan: Daily (once at 6 AM local), Twice Daily (6 AM and 6 PM local), or Hourly.

**What format and compression are used?**

Parquet files compressed with GZIP, including embedded schema metadata.

**What permissions does my service account need?**

Your service account requires `roles/storage.objectViewer` on the GCS folder, which Peach grants during setup. Additional permissions depend on your destination platform (e.g., `roles/bigquery.dataEditor` for BigQuery).

**How long is data retained in GCS?**

Default retention is 365 days of files. Retention can be extended on request.

**What if I have a sandbox environment?**

If your sandbox environment includes replica access, data delivery for sandbox is configured in parallel with production.

**What if my platform isn't listed here?**

Contact your Peach technical representative. The setup steps vary with your specific data ingestion stack.

## Next Steps

Your Customer Success contact will:

1. Confirm your data delivery configuration as part of your order form
2. Coordinate technical setup steps with your team
3. Provide your Peach-managed GCS bucket path once provisioned


## Questions

Contact [support@peachfinance.com](mailto:support@peachfinance.com)