# Data Types
Sail supports Arrow data types, which form a superset of the data types available in Spark SQL. For more background, refer to the Data Types guide for the Spark DataFrame API.
The following table shows the SQL type syntax along with the corresponding Spark and Arrow data types. Many data types have aliases that are not supported in JVM Spark; these aliases are Sail extensions (see the example after the table).
Many Arrow data types have no corresponding SQL type syntax, but they are still supported in Sail. You can work with these types in Python UDFs or data sources.
| SQL Type Syntax | Spark Data Type | Arrow Data Type |
| --- | --- | --- |
| `NULL`, `VOID` | `NullType` | `Null` |
| `BOOLEAN`, `BOOL` | `BooleanType` | `Boolean` |
| `BYTE`, `TINYINT`, `INT8` | `ByteType` | `Int8` |
| `SHORT`, `SMALLINT`, `INT16` | `ShortType` | `Int16` |
| `INTEGER`, `INT`, `INT32` | `IntegerType` | `Int32` |
| `LONG`, `BIGINT`, `INT64` | `LongType` | `Int64` |
| `UNSIGNED BYTE`, `UNSIGNED TINYINT`, `UINT8` | - | `UInt8` |
| `UNSIGNED SHORT`, `UNSIGNED SMALLINT`, `UINT16` | - | `UInt16` |
| `UNSIGNED INTEGER`, `UNSIGNED INT`, `UINT32` | - | `UInt32` |
| `UNSIGNED LONG`, `UNSIGNED BIGINT`, `UINT64` | - | `UInt64` |
| - | - | `Float16` |
| `FLOAT`, `REAL`, `FLOAT32` | `FloatType` | `Float32` |
| `DOUBLE`, `FLOAT64` | `DoubleType` | `Float64` |
| `DATE`, `DATE32` | `DateType` | `Date32` |
| `DATE64` | - | `Date64` |
| - | - | `Time32(Second)`, `Time32(Millisecond)`, `Time64(Microsecond)`, `Time64(Nanosecond)` |
| `TIMESTAMP[(p)]` | `TimestampType`, `TimestampNTZType` | `Timestamp(_, _)` |
| `TIMESTAMP_LTZ[(p)]`, `TIMESTAMP[(p)] WITH [LOCAL ]TIME ZONE` | `TimestampType` | `Timestamp(_, TimeZone(_))` |
| `TIMESTAMP_NTZ[(p)]`, `TIMESTAMP[(p)] WITHOUT TIME ZONE` | `TimestampNTZType` | `Timestamp(_, NoTimeZone)` |
| `STRING` | `StringType` | `Utf8`, `LargeUtf8` |
| `TEXT` | - | `LargeUtf8` |
| `CHAR(n)`, `CHARACTER(n)` | `CharType(n)` | `Utf8`, `LargeUtf8` |
| `VARCHAR(n)` | `VarcharType(n)` | `Utf8`, `LargeUtf8` |
| - | - | `Utf8View` |
| `BINARY`, `BYTEA` | `BinaryType` | `Binary`, `LargeBinary` |
| - | - | `FixedSizeBinary`, `BinaryView` |
| - | - | `Decimal32`, `Decimal64` |
| `DECIMAL[(p[, s])]`, `DEC[(p[, s])]`, `NUMERIC[(p[, s])]` | `DecimalType` | `Decimal128`, `Decimal256` |
| - | - | `Duration(Second)`, `Duration(Millisecond)`, `Duration(Nanosecond)` |
| `INTERVAL YEAR`, `INTERVAL YEAR TO MONTH`, `INTERVAL MONTH` | `YearMonthIntervalType` | `Interval(YearMonth)` |
| - | - | `Interval(DayTime)` |
| `INTERVAL DAY`, `INTERVAL DAY TO HOUR`, `INTERVAL DAY TO MINUTE`, `INTERVAL DAY TO SECOND`, `INTERVAL HOUR`, `INTERVAL HOUR TO MINUTE`, `INTERVAL HOUR TO SECOND`, `INTERVAL MINUTE`, `INTERVAL MINUTE TO SECOND`, `INTERVAL SECOND` | `DayTimeIntervalType` | `Duration(Microsecond)` |
| `INTERVAL` | `CalendarIntervalType` | `Interval(MonthDayNano)` |
| `ARRAY<type>` | `ArrayType` | `List` |
| - | - | `LargeList`, `FixedSizeList`, `ListView`, `LargeListView` |
| `MAP<key-type, value-type>` | `MapType` | `Map` |
| `STRUCT<name[:] type(, name[:] type)*>` | `StructType` | `Struct` |
| - | - | `Union` |
| - | - | `Dictionary` |
| - | - | `RunEndEncoded` |
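As a usage sketch, the SQL type syntax above can appear anywhere a type is expected, such as in `CAST` expressions. The snippet below assumes a Sail server reachable over Spark Connect; the `sc://localhost:50051` URL is an assumption to adjust for your deployment.

```python
from pyspark.sql import SparkSession

# Connect to a Sail server over Spark Connect. The URL is an assumption;
# change it to match your deployment.
spark = SparkSession.builder.remote("sc://localhost:50051").getOrCreate()

# `UINT8` and `TEXT` are Sail extensions from the table above; JVM Spark
# would reject this query.
df = spark.sql("SELECT CAST(200 AS UINT8) AS u, CAST('hello' AS TEXT) AS t")
df.printSchema()
```

Types that exist only in Arrow (such as `Float16` or `Utf8View`) cannot be named in SQL; they surface through Python UDFs or data sources that carry Arrow schemas.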
## Notes
- The SQL string types (except `TEXT`) are mapped to either the `Utf8` or `LargeUtf8` type in Arrow, depending on the `spark.sql.execution.arrow.useLargeVarTypes` configuration option (see the configuration sketch below).
- The SQL binary types are mapped to either the `Binary` or `LargeBinary` type in Arrow, depending on the `spark.sql.execution.arrow.useLargeVarTypes` configuration option.
- The SQL `TIMESTAMP` type can represent either timestamps with local time zone (`TIMESTAMP_LTZ`, the default) or timestamps without time zone (`TIMESTAMP_NTZ`), depending on the `spark.sql.timestampType` configuration option.
- For the SQL timestamp types, the optional `p` parameter specifies the precision of the timestamp. A value of `0`, `3`, `6`, or `9` represents second, millisecond, microsecond, or nanosecond precision respectively. The default is `6` (microsecond precision). Note that only the microsecond precision timestamp is compatible with Spark.
- For the SQL decimal types, the optional `p` and `s` parameters specify the precision and scale of the decimal number respectively. The default precision is `10` and the default scale is `0`. The decimal type maps to either the `Decimal128` or `Decimal256` type in Arrow, depending on the specified precision (see the decimal sketch below).
- The SQL `INTERVAL` type is mapped to the `Interval(MonthDayNano)` Arrow type, which has nanosecond precision. `CalendarIntervalType` in Spark has microsecond precision, so the supported value ranges differ (see the interval sketch below).
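The two configuration options mentioned above can be illustrated with a short sketch. This assumes the `spark` session from the earlier example and that both options may be changed at runtime (an assumption; they may also need to be set at session creation):

```python
# Controls whether a plain TIMESTAMP means TIMESTAMP_LTZ or TIMESTAMP_NTZ.
spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")
# Controls whether STRING/BINARY map to Utf8/Binary or LargeUtf8/LargeBinary.
spark.conf.set("spark.sql.execution.arrow.useLargeVarTypes", "true")

df = spark.sql("""
    SELECT
        CAST('2024-01-01 00:00:00' AS TIMESTAMP)        AS ts_default, -- follows spark.sql.timestampType
        CAST('2024-01-01 00:00:00' AS TIMESTAMP_LTZ)    AS ts_ltz,     -- Timestamp(_, TimeZone(_))
        CAST('2024-01-01 00:00:00' AS TIMESTAMP_NTZ(9)) AS ts_nanos    -- nanosecond precision; not Spark-compatible
""")
df.printSchema()
```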
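Decimal precision and scale can be exercised the same way. This sketch assumes that Sail accepts precisions above 38 in SQL, which is inferred from the `Decimal256` mapping in the table; in Arrow, `Decimal128` holds at most 38 digits:

```python
df = spark.sql("""
    SELECT
        CAST('12345.678' AS DECIMAL)        AS d_default, -- DECIMAL(10, 0): scale 0 rounds to 12346
        CAST('12345.678' AS DECIMAL(20, 3)) AS d_narrow,  -- fits in Decimal128 (up to 38 digits)
        CAST('12345.678' AS DECIMAL(40, 3)) AS d_wide     -- precision beyond 38 digits needs Decimal256
""")
df.printSchema()
```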
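Finally, interval literals show the distinct interval mappings. This sketch uses standard Spark SQL interval syntax with the same assumed session:

```python
df = spark.sql("""
    SELECT
        INTERVAL '1-2' YEAR TO MONTH        AS ym, -- YearMonthIntervalType -> Interval(YearMonth)
        INTERVAL '3 04:05:06' DAY TO SECOND AS dt  -- DayTimeIntervalType -> Duration(Microsecond)
""")
df.printSchema()
```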