Multiple Catalog
- RFC PR: datafuselabs/databend#8255
- Tracking Issue: datafuselabs/databend#8300
Summary
Allow users to maintain multiple catalogs for the databend.
Motivation
Databend organize data in three layers:
catalog -> database -> table
- catalog: the biggest layer for databend, contains all databases and tables, provided by- Catalog
- database: the container of tables, provided by- Database
- table: the smallest unit in databend, provided by- Table
By default, all databases and tables will be stored in default catalog (powered by metasrv).
Databend supports multiple catalogs now, but only in a static way.
To allow accessing the hive catalog, users need to configure hive inside databend-query.toml in this way:
[catalog]
meta_store_address = "127.0.0.1:9083"
protocol = "binary"
Users can't add/alter/remove the catalogs during runtime.
By allowing users to maintain multiple catalogs for the databend, we can integrate more catalogs like iceberg more quickly.
Guide-level explanation
After this RFC has been implemented, users can create new catalogs like:
CREATE CATALOG my_hive
  TYPE=HIVE
  CONNECTION = (URL='<hive-meta-store>' THRIFT_PROTOCOL=BINARY);
SELECT * FROM my_hive.DB.table;
Besides, users can alter or drop a catalog:
DROP CATALOG my_hive;
Users can add more catalogs like:
CREATE CATALOG my_iceberg
  TYPE=ICEBERG
  CONNECTION = (URL='s3://my_bucket/path/to/iceberg');
SELECT * FROM my_iceberg.DB.table;
With this feature, users can join data from different catalogs now:
select
    my_iceberg.DB.purchase_records.Client_ID,
    my_iceberg.DB.purchase_records.Item,
    my_iceberg.DB.purchase_records.QTY
from my_hive.DB.vip_info
inner join my_iceberg.DB.purchase_records
    on my_hive.DB.vip_info.Client_ID = my_iceberg.DB.purchase_records.Client_ID;
Reference-level explanation
Databend has a framework for multiple catalogs now. The only change for us is to store catalog-related information in metasrv instead.
To make it possible to start a query without metasrv, we will also support configuring catalogs in config like:
[catalogs.my_hive]
meta_store_address = "127.0.0.1:9083"
protocol = "binary"
[catalogs.my_iceberg]
URL = "s3://bucket"
Static catalogs will always be loaded from configs and can't be altered or dropped.
Drawbacks
None.
Rationale and alternatives
None.
Prior art
Presto
Presto is an open-source SQL query engine that's fast, reliable, and efficient at scale. It doesn't have persisted states, so all its connectors will be configured.
Take iceberg as an example:
connector.name=iceberg
hive.metastore.uri=hostname:port
iceberg.catalog.type=hive
While using:
USE iceberg.tpch;
CREATE TABLE IF NOT EXISTS ctas_nation AS (SELECT * FROM nation);
DESCRIBE ctas_nation;
Unresolved questions
None.
Future possibilities
Iceberg Catalog
Discussed in RFC Iceberg External Table
Delta Sharing Catalog
Discussed in Tracking issues of integration with delta sharing