
Table LLM*

  • Related Project: Private
  • Category: Paper Review
  • Date: 2024-03-02

Large Language Models(LLMs) on Tabular Data: Prediction, Generation, and Understanding – A Survey

  • url: https://arxiv.org/abs/2402.17944
  • pdf: https://arxiv.org/pdf/2402.17944
  • abstract: Recent breakthroughs in large language modeling have facilitated rigorous exploration of their application in diverse tasks related to tabular data modeling, such as prediction, tabular data synthesis, question answering, and table understanding. Each task presents unique challenges and opportunities. However, there is currently a lack of comprehensive review that summarizes and compares the key techniques, metrics, datasets, models, and optimization approaches in this research domain. This survey aims to address this gap by consolidating recent progress in these areas, offering a thorough survey and taxonomy of the datasets, metrics, and methodologies utilized. It identifies strengths, limitations, unexplored territories, and gaps in the existing literature, while providing some insights for future research directions in this vital and rapidly evolving field. It also provides relevant code and datasets references. Through this comprehensive review, we hope to provide interested readers with pertinent references and insightful perspectives, empowering them with the necessary tools and knowledge to effectively navigate and address the prevailing challenges in the field.


1. Introduction

LLMs are deep learning models trained on large-scale data, offering problem-solving capabilities that extend well beyond natural language processing (NLP) tasks. Recent research has revealed emergent abilities of LLMs, such as improved few-shot learning. This improved performance has attracted strong interest from both academia and industry, and LLMs are even regarded by some as the foundation for Artificial General Intelligence (AGI) of this era.

Before LLMs, researchers studied ways to integrate tabular data with neural networks for NLP and data management tasks. Today, researchers are interested in investigating the abilities of LLMs on tabular data across a variety of tasks, including prediction, table understanding, quantitative reasoning, and data generation.

This work provides a comprehensive review of recent advances in tabular data modeling with LLMs. The first section introduces the characteristics of tabular data and briefly reviews traditional deep learning and LLM methods tailored to this area. Section 2 introduces key techniques for adapting tabular data to LLMs, followed by prediction tasks in Section 3, data augmentation and enrichment tasks in Section 4, and question answering/table understanding tasks in Section 5. Finally, Section 6 discusses limitations and future directions, and Section 7 concludes.

1.1 Characteristics of tabular data

Tabular data, commonly known as structured data, is organized into rows and columns, where each column represents a specific feature. This subsection discusses the common characteristics of tabular data and its inherent challenges.

  1. Heterogeneity: Tabular data can contain diverse feature types: categorical, numerical, binary, and textual.
  2. Sparsity: Real-world applications such as clinical trials, epidemiological research, and fraud detection often deal with imbalanced class labels and missing values, resulting in long-tailed distributions in the training samples.
  3. Dependency on pre-processing: Pre-processing is essential and application-dependent when working with tabular data. For numerical values, common techniques include normalization or scaling, categorical value encoding, missing value imputation, and outlier removal.
  4. Context-based interconnection: Features in tabular data can be correlated. For example, age, education, and alcohol consumption in a demographic table are interconnected.
  5. Order invariance: Samples and features in tabular data can be sorted, but unlike text or image data, tabular data is relatively invariant to position.
  6. Lack of prior knowledge: For image or audio data, there is often prior knowledge about the spatial or temporal structure of the data that can be exploited during training. For tabular data, such prior knowledge is usually missing, making it hard for models to understand the inherent relationships between features.

1.2 Traditional and deep learning on tabular data

Traditional tree-based ensemble methods remain the state of the art for prediction on tabular data. In boosting ensembles, base learners are trained sequentially to reduce the errors of previous learners, which makes them more stable and accurate than a single learner. However, they have some limitations compared to deep learning models: for example, tree-based models are sensitive to feature engineering, especially for categorical features, whereas deep learning can learn representations implicitly during training.

In recent years, much work has applied deep learning to tabular data modeling. These methods can be broadly grouped into data transformation, differentiable trees, attention-based methods, and regularization methods. Despite these research efforts, GBDT algorithms still outperform deep learning methods on most datasets.

1.3 Overview of large language models (LLMs)

A language model (LM) is a probabilistic model that predicts the generative likelihood of future or missing tokens in a word sequence. LLMs pursue better performance by scaling data and model size beyond earlier PLMs. These large-scale PLMs exhibit emergent abilities: the capability to solve more general and complex tasks beyond traditional language modeling.

1.3.1 Applications of LLMs to tabular data

Although LMs have shown impressive capabilities on NLP tasks, their use for tabular data learning has been limited by differences in the underlying data structure. Some research efforts have focused on learning contextual representations of tabular data with PLMs, predominantly BERT-based models. These approaches serialize tabular data into text and fine-tune the PLM with a masked language modeling (MLM) objective.

1.3.2 Opportunities for LLMs in tabular data modeling

Many studies explore the potential of LLMs for various tabular data tasks. This exploration is driven by LLMs' unique abilities, such as in-context learning, instruction following, and step-wise reasoning. Converting tabular data into LLM-readable natural language also helps address the curse of dimensionality for high-dimensional categorical data.

1.4 Contributions

The key contributions of this work are as follows.

  1. A systematic classification of the key techniques for applying LLMs to tabular data, organized into a taxonomy that researchers and practitioners can use.
  2. A survey and taxonomy of metrics for LLM applications on tabular data, providing a broad set of metrics for evaluating each application.
  3. Identification of commonly used benchmark datasets for each application, with recommended datasets.
  4. A step-by-step breakdown of the key techniques so that researchers and practitioners can easily find and understand them.
  5. An overview of key open problems and challenges that future research should address, including addressing bias, developing better numerical data representations, improving model capacity, forming standard benchmarks, improving model interpretability, designing integrated workflows, developing better fine-tuning strategies, and improving downstream application performance.


2 Key techniques: adapting large language models to tabular data

  • Studies how tabular data can be modeled with large language models (LLMs)
  • Covers table serialization, table manipulation, prompt engineering, and building end-to-end systems
  • Application-specific fine-tuning of LLMs is discussed in Sections 3 and 5

2.1 Serialization

Serialization is the process of converting tabular data into a text format. It is essential because LLMs are sequence-to-sequence models.

\[\text{serialization} : \text{tabular data} \to \text{text}\]
  • Text-based: commonly, the table is fed in directly as a programming-language-readable data structure or converted to Markdown.
  • Embedding-based: tabular data is encoded into numerical representations using pretrained language models such as BERT; examples include TAPAS and TABERT.
  • Graph- and tree-based: less common; the table is converted into a graph or tree data structure and then re-converted into text.

Since markup languages such as Markdown are formats LLMs frequently encounter in web data, one hypothesis is that such formats are better understood by LLMs. Indeed, HTML and XML formats have been shown to yield better performance from GPT models on table question answering tasks.

2.2 Table manipulation

Tabular data varies in structure and content, and controlling table size is important so that LLMs can digest the data efficiently.

  • Managing table context length: small tables can be included in a prompt in full, but large tables pose several challenges:
    • Some models support only short context lengths.
    • LLMs tend to handle long inputs inefficiently.
    • Longer prompts incur higher costs.
\[\text{table size control} : \text{optimized LLM input} \to \text{better performance and lower cost}\]

Controlling table size lets the LLM reach the important information faster and more accurately, by pruning unnecessary information and emphasizing what matters.

2.3 Prompt engineering

Prompt engineering is the process of designing the text fed into an LLM, and it is an important research topic for table-related tasks in particular.

  • In-context learning: include similar examples so the LLM understands the desired output.
  • Chain of thought and self-consistency: for complex reasoning problems, sample diverse reasoning paths and select the most consistent answer.
\[\text{prompt design} : \text{table data} + \text{prompting strategy} \to \text{improved LLM performance}\]

Prompts guide the LLM to understand the table data and perform the desired task; a good prompt can substantially improve LLM performance.

2.4 End-to-end systems

End-to-end systems include models that convert natural language queries into structured queries executable against a database or spreadsheet.

  • SQL query generation: translate a natural language query into a SQL query and execute it against the database.
  • Iterative query refinement: use the LLM to fix SQL errors and refine the query iteratively.
\[\text{end-to-end system} : \text{natural language query} \to \text{structured query} \to \text{data processing}\]

End-to-end systems use LLMs to automate the process of translating a user's natural language query into an accurate, efficient database query and returning the results, combining the LLM's text generation ability with its capability to interpret programming languages.


3 LLMs for Predictions

3.1 Dataset

For task-specific fine-tuning, the datasets used are drawn from UCI ML, OpenML, and a combination of 9 datasets created by Manikandan et al. (2023). In particular, the 9-dataset combination is recommended because it offers larger datasets and more diverse features.

3.2 Tabular Prediction

Tabular prediction methods start from text-based serialization and encompass a variety of techniques for improving prediction accuracy with LLMs. For example, serializing each feature as a natural sentence of the form “Column name is Value” has been shown to achieve higher prediction accuracy on low-dimensional tasks.

3.3 Time Series Forecasting

Time series forecasting pays greater attention to numerical features and temporal relationships. In this area, it is important to serialize time series data into a form suitable for LLMs and to map the outputs to predictions appropriately.

The surveyed methods rely mainly on effective preprocessing, serialization, and target augmentation to adapt LLMs. During serialization, converting the dataset's features into text is crucial so the LLM can process them effectively. For example, the conversion of data into text can be expressed as:

\[\text{Serialized Data} = f(\text{Data})\]

where $f(\text{Data})$ is a function that converts the data into a text form the LLM can process.

Target augmentation then maps the LLM's text output to a concrete label:

\[\text{Label} = g(\text{LLM Output})\]

where $g(\text{LLM Output})$ is a function that converts the LLM output into a concrete target label.
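
To make the two mappings concrete, here is a minimal Python sketch of $f$ (serialization) and $g$ (target augmentation) for a binary prediction task. The feature template, the diabetes question, and the fallback label are illustrative assumptions rather than the survey's exact implementation, and `llm_generate` stands in for any LLM call.

```python
# Minimal sketch of the serialize / target-augmentation pipeline described above.
# The templates and label-mapping rules are illustrative assumptions.

def serialize_row(row: dict) -> str:
    """f(Data): turn one tabular record into LLM-readable text."""
    features = ". ".join(f"The {col} is {val}" for col, val in row.items())
    return features + ". Does this patient have diabetes? Answer yes or no:"

def map_output_to_label(llm_output: str) -> int:
    """g(LLM Output): map free-form model text to a concrete target label."""
    text = llm_output.strip().lower()
    if text.startswith("yes"):
        return 1
    if text.startswith("no"):
        return 0
    return -1  # invalid generation; caller may retry or fall back to a default

row = {"age": 54, "bmi": 31.2, "blood glucose": 148}
prompt = serialize_row(row)
# label = map_output_to_label(llm_generate(prompt))  # llm_generate is an assumed LLM API
```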

Experiments and findings

Experiments on each dataset and method verify how these serialization and target-augmentation techniques improve LLM prediction performance. In particular, analyzing how different serialization techniques behave on specific tasks and datasets helps in selecting the most effective method.

Overall, tabular data prediction with LLMs can be substantially improved through appropriate serialization and target-augmentation methods. With an understanding of how each method can be optimized for particular conditions and datasets, future research can push toward greater generality and predictive power for LLMs.


4 Tabular data generation with LLMs

  1. Using LLMs for data generation and understanding
  2. Evaluating the quality and privacy of synthetic data
  3. Exploring LLM-based table question answering (QA) methods

4.1 Methods

This section covers the data generation process essential to tabular data synthesis. Several methods are introduced, with a focus on how each mimics the realistic properties of the data.

  • GReaT

    The GReaT model converts tabular data into meaningful text. Mathematically, if $x$ denotes the features of an original record, a transformation function $f(x)$ maps it to text $t$, i.e., $t = f(x)$; a GPT-2 model then re-expresses the tabular data as text while preserving its characteristics (a minimal sketch follows this list). This approach focuses on securing the diversity and realism of the data.

  • REaLTabFormer

    REaLTabFormer extends GReaT to generate both non-relational and relational tabular data. The model uses GPT-2 to generate a parent table, then conditions a sequence-to-sequence model on it to generate relational datasets. Mathematically, given a parent table $P$, a conditional probability model $M$ generates the child table $C$:

    \[C = M(P)\]

    This method shows improved performance in capturing relational structure across tables, and it introduces a target-masking technique to prevent copying of existing data.

  • TAPTAP

    TAPTAP introduces several improvements, including better numerical encoding, and uses an external model such as GBDT to generate pseudo-labels. This distinguishes it from prior language-model-based approaches:

    \[\text{New Label} = \text{GBDT}(X)\]

    where $X$ denotes the original data.
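
As referenced in the GReaT item above, here is a minimal Python sketch of the textual encoding $t = f(x)$. The column names are invented for illustration, and the feature-order permutation mirrors GReaT's training-time shuffling; this is a simplified sketch, not the authors' implementation.

```python
# GReaT-style textual encoding: each record becomes feature sentences with a
# randomly permuted feature order, so the language model learns an
# order-invariant joint distribution over features.
import random

def great_encode(row: dict, rng: random.Random) -> str:
    parts = [f"{col} is {val}" for col, val in row.items()]
    rng.shuffle(parts)  # feature-order permutation used during training
    return ", ".join(parts) + "."

rng = random.Random(0)
row = {"age": 39, "education": "Bachelors", "income": ">50K"}
print(great_encode(row, rng))
# e.g. "education is Bachelors, income is >50K, age is 39."
```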

4.2 Evaluation

Experimental evaluation analyzes data quality along multiple dimensions, including the following measures.

  • Low-order statistics

    Evaluate the density of each column and the correlation between pairs of columns:

    \[\rho(X_i, X_j) = \frac{\text{Cov}(X_i, X_j)}{\sigma_{X_i} \sigma_{X_j}}\]

    where $X_i$ and $X_j$ are columns of the dataset and $\rho$ is the correlation coefficient.

  • High-order metrics

    Compute $\alpha$-precision and $\beta$-recall scores to measure the overall fidelity and diversity of the data:

    \[\alpha\text{-precision} = \frac{TP}{TP + FP}, \quad \beta\text{-recall} = \frac{TP}{TP + FN}\]

  • Privacy

    Compute the distance to closest record (DCR) to assess the privacy-preservation level relative to the original data:

    \[\text{DCR} = \text{Median}(\min_{i \neq j} d(x_i, x_j))\]

    where $d$ is the distance between two records.
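
As a concrete illustration, here is a small Python sketch of two of these quantities: pairwise column correlation and DCR. It assumes purely numerical, already-preprocessed arrays, and computes DCR as the median distance from each synthetic row to its closest real row (one common reading of the formula above); the toy data is invented.

```python
import numpy as np

def column_correlations(X: np.ndarray) -> np.ndarray:
    """Correlation matrix across columns; compare real vs. synthetic versions."""
    return np.corrcoef(X, rowvar=False)

def dcr(synthetic: np.ndarray, real: np.ndarray) -> float:
    """Median distance from each synthetic row to its closest real row.
    A DCR near zero signals potential copying of training records."""
    dists = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=-1)
    return float(np.median(dists.min(axis=1)))

real = np.random.default_rng(0).normal(size=(1000, 5))
synth = real + np.random.default_rng(1).normal(scale=0.1, size=real.shape)
print(dcr(synth, real))  # small value: synthetic rows hug the real data
```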


1 Introduction

Large language models (LLMs) are deep learning models trained on extensive data, endowing them with versatile problem-solving capabilities that extend far beyond the realm of natural language processing (NLP) tasks (Fu & Khot, 2022). Recent research has revealed emergent abilities of LLMs, such as improved performance on few-shot prompted tasks (Wei et al., 2022b). The remarkable performance of LLMs have incited interest in both academia and industry, raising beliefs that they could serve as the foundation for Artificial General Intelligence (AGI) of this era (Chang et al., 2024; Zhao et al., 2023b; Wei et al., 2022b). A noteworthy example is ChatGPT, designed specifically for engaging in human conversation, that demonstrates the ability to comprehend and generate human language text (Liu et al., 2023g).

Before LLMs, researchers have been investigating ways to integrate tabular data with neural network for NLP and data management tasks (Badaro et al., 2023). Today, researchers are keen to investigate the abilities of LLMs when working with tabular data for various tasks, such as prediction, table understanding, quantitative reasoning, and data generation (Hegselmann et al., 2023; Sui et al., 2023c; Borisov et al., 2023a).

Tabular data stands as one of the pervasive and essential data formats in machine learning (ML), with widespread applications across diverse domains such as finance, medicine, business, agriculture, education, and other sectors that heavily rely on relational databases (Sahakyan et al., 2021; Rundo et al., 2019; Hernandez et al., 2022; Umer et al., 2019; Luan & Tsai, 2021).

In the current work, we provide a comprehensive review of recent advancements in modeling tabular data using LLMs. In the first section, we introduce the characteristics of tabular data, then provide a brief review of traditional, deep-learning and LLM methods tailored for this area. In Section 2, we introduce key techniques related to the adaptation of tabular data for LLMs. Subsequently, we cover the applications of LLMs in prediction tasks (Section 3), data augmentation and enrichment tasks (Section 4), and question answering/table understanding tasks (Section 5). Finally, Section 6 discusses limitations and future directions, while Section 7 concludes. The overview of this paper is shown in Figure 1 and Figure 2.

Figure 1: Overview of LLM on Tabular Data: the paper discusses application of LLM for prediction, data generation, and table understanding tasks

1.1 Characteristics of tabular data

Tabular data, commonly known as structured data, refers to data organized into rows and columns, where each column represents a specific feature. This subsection discusses the common characteristics and inherited challenges with tabular data:

  1. Heterogeneity: Tabular data can contain different feature types: categorical, numerical, binary, and textual. Therefore, features can range from being dense numerical features to sparse or high-cardinality categorical features (Borisov et al., 2022).
  2. Sparsity: Real-world applications, such as clinical trials, epidemiological research, fraud detection, etc., often deal with imbalanced class labels and missing values, which results in long-tailed distribution in the training samples (Sauber-Cole & Khoshgoftaar, 2022).
  3. Dependency on pre-processing: Data pre-processing is crucial and application-dependent when working with tabular data. For numerical values, common techniques include data normalization or scaling, categorical value encoding, missing value imputation, and outlier removal. For categorical values, common techniques include label encoding or one-hot encoding. Improper pre-processing may lead to information loss, sparse matrix, and introduce multi-collinearity (e.g. with one-hot encoding) or synthetic ordering (e.g. with ordinal encoding) (Borisov et al., 2023a).
  4. Context-based interconnection: In tabular data, features can be correlated. For example, age, education, and alcohol consumption from a demographic table are interconnected: it is hard to get a doctoral degree at a young age, and there is a minimum legal drinking age. Including correlated regressors in regressions lead to biased coefficients, hence, a modeler must be aware of such intricacies (Liu et al., 2023d).
  5. Order invariant: In tabular data, samples and features can be sorted. However, as opposed to text-based and image-based data that is intrinsically tied to the position of the word/token or pixel in the text or image, tabular data are relatively order-invariant. Therefore, position-based methodologies (e.g., spatial correlation, impeding inductive bias, convolutional neural networks (CNN)) are less applicable for tabular data modeling (Borisov et al., 2022).
  6. Lack of prior knowledge: In image or audio data, there is often prior knowledge about the spatial or temporal structure of the data, which can be leveraged by the model during training. However, in tabular data, such prior knowledge is often lacking, making it challenging for the model to understand the inherent relationships between features (Borisov et al., 2022; 2023a).

1.2 Traditional and deep learning in tabular data

Traditional tree-based ensemble methods such as gradient-boosted decision trees (GBDT) remain the state-of-the-art (SOTA) for predictions on tabular data (Borisov et al., 2022; Gorishniy et al., 2021). In boosting ensemble methods, base learners are learned sequentially to reduce the previous learner's error until no significant improvement is made, making the ensemble more stable and accurate than a single learner (Chen & Guestrin, 2016). Traditional tree-based models are known for their high performance, efficiency in training, ease of tuning, and ease of interpretation. However, they have limitations compared to deep learning models: 1. Tree-based models can be sensitive to feature engineering, especially with categorical features, while deep learning can learn representations implicitly during training (Goodfellow et al., 2016). 2. Tree-based models are not naturally suited for processing sequential data, such as time series, while deep learning models such as Recurrent Neural Networks (RNNs) and transformers excel at handling sequential dependencies. 3. Tree-based models sometimes struggle to generalize to unseen data, particularly if the training data is not representative of the entire distribution, while deep learning methods may generalize better to diverse datasets with their ability to learn intricate representations (Goodfellow et al., 2016).

In the recent years, many works have delved into using deep learning for tabular data modeling. The methodologies can be broadly grouped into the following categories:

  1. Data transformation. These models either strive to convert heterogeneous tabular input into homogeneous data more suitable for neural networks, like an image, on which CNN-like mechanisms can be applied (SuperTML (Sun et al., 2019), IGTD (Zhu et al., 2021b), 1D-CNN (Kiranyaz et al., 2019)), or focus on combining feature transformation with deep neural networks (Wide&Deep (Cheng et al., 2016; Guo & Berkhahn, 2016), DeepFM (Guo et al., 2017), DNN2LR (Liu et al., 2021)).
  2. Differentiable trees. Inspired by the performance of ensembled trees, this line of methods seeks to make trees differentiable by smoothing the decision function (NODE (Popov et al., 2019), SDTR (Luo et al., 2021), Net-DNF (Katzir et al., 2020)). Another subcategory of methods combines tree-based models with deep neural networks, and thus can maintain trees' capabilities in handling sparse categorical features (DeepGBM (Ke et al., 2019a)), borrow prior structural knowledge from the tree (TabNN (Ke et al., 2019b)), or exploit topological information by converting structured data into a directed graph (BGNN (Ivanov & Prokhorenkova, 2021)).
  3. Attention-based methods. These models incorporate attention mechanisms for feature selection and reasoning (TabNet (Arik & Pfister, 2020)), feature encoding (TransTab (Wang & Sun, 2022), TabTransformer (Huang et al., 2020)), feature interaction modeling (ARMnet (Cai et al., 2021)), or aiding intrasample information sharing (SAINT (Somepalli et al., 2021), NPT (Kossen et al., 2022)).
  4. Regularization methods. The importance of features varies in tabular data, in contrast to image or text data. Thus, this line of research seeks to design an optimal and dynamic regularization mechanism to adjust the sensitivity of the model to certain inputs (e.g. RLN (Shavitt & Segal, 2018), Regularization Cocktails (Kadra et al., 2021)).

In spite of rigorous attempts at applying deep learning to tabular data modeling, GBDT algorithms, including XGBoost, LightGBM, and CatBoost (Prokhorenkova et al., 2019), still outperform deep-learning methods on most datasets, with the additional benefits of fast training time, high interpretability, and easy optimization (Shwartz-Ziv & Armon, 2022; Gorishniy et al., 2021; Grinsztajn et al., 2022). Deep learning models, however, may have advantages over traditional methods in some circumstances, for example, when facing very large datasets, or when the data is primarily comprised of categorical features (Borisov et al., 2022).

Another important task for tabular data modeling is data synthesis. The ability to synthesize realistic, high-quality data is essential for model development. Data generation is used for augmentation when the data is sparse (Onishi & Meguro, 2023), imputing missing values (Jolicoeur-Martineau et al., 2023), and class rebalancing in imbalanced data (Sauber-Cole & Khoshgoftaar, 2022). Traditional methods for synthetic data generation are mostly based on Copulas (Patki et al., 2016; Li et al., 2020) and Bayesian networks (Zhang et al., 2017; Madl et al., 2023), while recent advancements in generative models such as Variational Autoencoders (VAEs) (Ma et al., 2020; Darabi & Elor, 2021; Vardhan & Kok, 2020; Liu et al., 2023d; Xu et al., 2023b), generative adversarial networks (GANs) (Park et al., 2018; Choi et al., 2018; Baowaly et al., 2019; Xu et al., 2019), diffusion (Kotelnikov et al., 2022; Xu et al., 2023a; Kim et al., 2022b;a; Lee et al., 2023; Zhang et al., 2023c), and LLMs have opened up many new opportunities. These deep learning approaches have demonstrated superior performance over classical methods such as Bayesian networks (Xu et al., 2019).

Table question answering (QA) is a natural language research problem over tabular data. Many earlier methods fine-tune BERT (Devlin et al., 2019) to become table encoders for table-related tasks, like TAPAS (Herzig et al., 2020), TABERT (Yin et al., 2020b), TURL (Deng et al., 2022a), TUTA (Wang et al., 2021) and TABBIE (Iida et al., 2021). For example, TAPAS extended BERT's masked language model objective to structured data by incorporating additional embeddings designed to capture tabular structure. It also integrates two classification layers to facilitate the selection of cells and predict the corresponding aggregation operator. A particular table QA task, Text2SQL, involves translating natural language questions into structured query language (SQL). Earlier research conducted semantic parsing through hand-crafted features and grammar rules (Pasupat & Liang, 2015b). Semantic parsing is also used when the table comes from non-database sources such as web tables, spreadsheet tables, and others (Jin et al., 2022). Seq2SQL is a sequence-to-sequence deep neural network using reinforcement learning to generate the conditions of a query on the WikiSQL task (Zhong et al., 2017a). Some methodologies are sketch-based, wherein a natural language question is translated into a sketch. Subsequently, programming language techniques such as type-directed sketch completion and automatic repair are utilized in an iterative manner to refine the initial sketch, ultimately producing the final query (e.g. SQLizer (Yaghmazadeh et al., 2017)). Another example is SQLNet (Xu et al., 2017), which uses a column attention mechanism to synthesize the query based on a dependency graph-dependent sketch. A derivative of SQLNet is TYPESQL (Yu et al., 2018a), also a sketch-based, slot-filling method that entails extracting essential features to populate their respective slots. Unlike the previous supervised end-to-end models, TableQuery is a NL2SQL model pretrained on QA over free text that obviates the necessity of loading the entire dataset into memory and serializing databases.

Figure 2: Tabular data characteristics and machine learning models for tabular data prediction, data synthesis and question answering before LLMs.

1.3 Overview of large language models (LLMs)

A language model (LM) is a probabilistic model that predicts the generative likelihood of future or missing tokens in a word sequence. Zhao et al. (2023b) thoroughly reviewed the development of LMs, and characterized it into four different stages: The first stage is Statistical Language Models (SLM), which learn the probability of word occurrence in an example sequence from previous words (e.g. N-Gram) based on the Markov assumption (Saul & Pereira, 1997). Although a more accurate prediction can be achieved by increasing the context window, SLM is limited by the curse of high dimensionality and high demand for computation power (Bengio et al., 2000). Next, Neural Language Models (NLM) utilize neural networks (e.g. Recurrent Neural Networks (RNN)) as a probabilistic classifier (Kim et al., 2016). In addition to learning the probabilistic function for word sequences, a key advantage of NLMs is that they can learn the distributed representation (i.e. word embedding) of each word so that similar words are mapped close to each other in the embedding space (e.g. Word2Vec), making the model generalize well to unseen sequences that are not in the training data and helping alleviate the curse of dimensionality (Bengio et al., 2000). Later, rather than learning a static word embedding, context-aware representation learning was introduced by pretraining the model on large-scale unannotated corpora using bidirectional LSTMs that take context into consideration (e.g., ELMo (Peters et al., 2018a)), which shows a significant performance boost in various natural language processing (NLP) tasks (Wang et al., 2022a; Peters et al., 2018b). Along this line, several other Pretrained Language Models (PLM) were proposed utilizing a transformer architecture with self-attention mechanisms, including BERT and GPT2 (Ding et al., 2023). The pre-training and fine-tuning paradigm, closely related to transfer learning, allows the model to gain general syntactic and semantic understanding of the text corpus and then be trained on task-specific objectives to adapt to various tasks. The final and most recent stage of LM is the Large Language Models (LLMs), which will be the focus of this paper. Motivated by the observation that scaling the data and model size usually leads to improved performance, researchers sought to test the boundaries of PLM performance at larger sizes, such as text-to-text transfer transformers (T5) (Raffel et al., 2023), GPT-3 (Brown et al., 2020), etc. Intriguingly, some advanced abilities emerge as a result. These large-sized PLMs (i.e. LLMs) show unprecedentedly powerful capabilities (also called emergent abilities) that go beyond traditional language modeling and start to gain the capability to solve more general and complex tasks, which was not seen in PLMs. Formally, we define an LLM as follows:

Definition 1 (Large Language Model). A large language model (LLM) M, parameterized by θ, is a Transformer-based model with an architecture that can be autoregressive, autoencoding, or encoder-decoder. It has been trained on a large corpus comprising hundreds of millions to trillions of tokens. LLMs encompass pre-trained models and, for our survey, refer to models that have at least 1 billion parameters.

Figure 3: Development of language models and their applications in tabular data modeling.

Several key emergent abilities of LLMs are critical for data understanding and modeling, including in-context learning, instruction following, and multi-step reasoning. In-context learning refers to designing large auto-regressive language models that generate responses on unseen tasks without gradient updates, learning only through a natural language task description and a few in-context examples provided in the prompt. The GPT3 model (Brown et al., 2020) with 175 billion parameters presented an impressive in-context learning ability that was not seen in smaller models. LLMs have also demonstrated the ability to complete new tasks by following only the instructions of the task descriptions (also known as zero-shot prompts). Some papers also fine-tuned LLMs on a variety of tasks presented as instructions (Thoppilan et al., 2022). However, instruction-tuning is reported to work best only for larger-size models (Wei et al., 2022a; Chung et al., 2022). Solving complex tasks involving multiple steps has been challenging for LLMs. By including intermediate reasoning steps, prompting strategies such as chain-of-thought (CoT) have been shown to help unlock the LLM's ability to tackle complex arithmetic, commonsense, and symbolic reasoning tasks (Wei et al., 2023). These new abilities of LLMs lay the groundwork for exploring their integration into intricate tasks extending beyond traditional NLP applications across diverse data types.

1.3.1 Applications of LLMs in tabular data

Despite the impressive capabilities of LMs in addressing NLP tasks, their utilization for tabular data learning has been constrained by differences in the inherent data structure. Some research efforts have sought to utilize the generic semantic knowledge contained in PLMs, predominantly BERT-based models, for modeling tabular data (Figure 3). This involves employing PLMs to learn contextual representations with semantic information, taking header information into account (Chen et al., 2020b). The typical approach includes transforming tabular data into text through serialization (detailed explanation in Section 2) and employing a masked-language-modeling (MLM) approach for fine-tuning the PLM, similar to that in BERT (PTab, CT-BERT, TABERT (Liu et al., 2022a; Ye et al., 2023a; Yin et al., 2020a)). In addition to being able to incorporate semantic knowledge from column names, converting heterogeneous tabular data into textual representation enables PLMs to accept inputs from diverse tables, thus enabling cross-table training. Also, due to the lack of a locality property in tabular data, models need to exhibit permutation invariance of feature columns (Ye et al., 2023a). In this fashion, TABERT was proposed as a PLM trained on both natural language sentences and structured data (Yin et al., 2020a), PTab demonstrated the importance of cross-table training for enhanced representation learning (Liu et al., 2022a), and CT-BERT employs masked table modeling (MTM) and contrastive learning for cross-table pretraining, outperforming tree-based models (Ye et al., 2023a). However, previous research primarily focuses on using LMs for representation learning, which is quite limited.

1.3.2 Opportunities for LLMs in tabular data modeling

Many studies today explore the potential of using LLMs for various tabular data tasks, ranging from prediction, data generation, to data understanding (further divided into question answering and data reasoning). This exploration is driven by LLMs’ unique capabilities such as in-context learning, instruction following, and step-wise reasoning. The opportunities for applying LLMs to tabular data modeling are as follows:

  1. Deep learning methods often exhibit suboptimal performance on datasets they were not initially trained on, making transfer learning using the pre-training and fine-tuning paradigm highly promising (Shwartz-Ziv & Armon, 2022).
  2. The transformation of tabular data into LLM-readable natural language addresses the curse of dimensionality associated with one-hot encoding of high-dimensional categorical data during tabular preprocessing.
  3. The emergent capabilities, such as step-by-step reasoning through CoT, have transformed LM from language modeling to a more general task-solving tool. Research is needed to test the limit of LLM’s emergent abilities on tabular data modeling.

1.4 Contribution

The key contributions of this work are as follows:

  1. A formal breakdown of key techniques for LLMs' applications on tabular data. We split the applications of LLMs on tabular data into tabular data prediction, tabular data synthesis, and tabular data question answering and table understanding. We further extract key techniques that apply across all applications. We organize these key techniques in a taxonomy that researchers and practitioners can leverage to describe their methods, find relevant techniques, and understand the differences between these techniques. We further break down each technique into subsections so that researchers can easily find relevant benchmark techniques and properly categorize their proposed techniques.
  2. A survey and taxonomy of metrics for LLMs' applications on tabular data. For each application, we categorize and discuss a wide range of metrics that can be used to evaluate the performance of that application. We document the metrics of all relevant methods and identify the benefits/limitations of each class of metrics in capturing the application's performance. We also provide recommended metrics where necessary.
  3. A survey and taxonomy of datasets for LLMs' applications on tabular data. For each application, we identify datasets that are commonly used for benchmarking. For table understanding and question answering, we further categorize datasets by their downstream applications: Question Answering, Natural Language Generation, Classification, Natural Language Inference, and Text2SQL. We further provide recommended datasets based on tasks, with their GitHub links. Practitioners and researchers can look at the section and find a relevant dataset easily.
  4. A survey and taxonomy of techniques for LLMs' applications on tabular data. For each application, we break down an extensive range of tabular data modeling methods by steps. For example, tabular data prediction can be broken down into pre-processing (modifying model inputs), target augmentation (modifying the outputs), and fine-tuning (fine-tuning the model). We construct granular subcategories at each stage to draw out similarities and trends between classes of methods, with illustrated examples of main techniques. Practitioners and researchers can look at the section and understand the differences between techniques. We recommend only benchmark methods and provide GitHub links for these techniques for reference and benchmarking.
  5. An overview of key open problems and challenges that future work should address. We challenge future research to solve bias problem in tabular data modeling, mitigate hallucination, find better representations of numerical data, improve capacity, form standard benchmark, improve model interpretability, create an integrated workflow, design better fine-tuning strategies and improve the performance of downstream applications.

Table 1: Text-based serialization methods.

2 Key techniques for LLMs’ applications on tabular data

While conducting our survey, we noticed a few common components in modeling tabular data with LLMs across tasks. We discuss common techniques, like serialization, table manipulations, prompt engineering, and building end-to-end systems, in this section. Fine-tuning LLMs is also popular, but tends to be application-specific, so we leave that discussion to Sections 3 and 5.

2.1 Serialization

Since LLMs are sequence-to-sequence models, in order to feed tabular data as inputs into an LLM, we have to convert the structured tabular data into a text format (Sui et al., 2023b; Jaitly et al., 2023).

Text-based Table 1 describes the common text-based serialization methods in the literature. A straightforward way is to directly input a programming-language-readable data structure (e.g. a Pandas DataFrame loader for Python, line-separated JSON format, a data matrix represented by a list of lists, HTML code reflecting tables, etc.). Alternatively, the table can be converted into X-separated values, where X is any reasonable delimiter like a comma or tab. Some papers convert tables into human-readable sentences using templates based on the column headers and cell values. The most common approach, based on our survey, is the Markdown format.
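
For illustration, the snippet below produces several of these text-based serializations for a toy table using only pandas built-ins (`to_markdown` additionally requires the tabulate package); the sentence template is one illustrative variant, not a canonical one.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [34, 29], "city": ["Seoul", "Busan"]})

markdown = df.to_markdown(index=False)                  # Markdown, the most common format
json_lines = df.to_json(orient="records", lines=True)   # line-separated JSON
csv_text = df.to_csv(index=False)                       # X-separated values (comma here)
sentences = "\n".join(                                  # template-based natural sentences
    ". ".join(f"The {col} is {row[col]}" for col in df.columns) + "."
    for _, row in df.iterrows()
)
print(markdown)
```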

Embedding-based Many papers also employ table encoders, fine-tuned from PLMs, to encode tabular data into numerical representations as the input for LLMs. There are multiple table encoders built on BERT (Devlin et al., 2019) for table-related tasks, like TAPAS (Herzig et al., 2020), TABERT (Yin et al., 2020b), TURL (Deng et al., 2022a), TUTA (Wang et al., 2021), TABBIE (Iida et al., 2021) and UTP (Chen et al., 2023a). For LLMs with >1B parameters, there are UniTabPT (Sarkar & Lausen, 2023) with 3B parameters (based on the T5 and Flan-T5 models), TableGPT (Gong et al., 2020) with 1.5B parameters (based on GPT2), and TableGPT (Zha et al., 2023)2 with 7B parameters (based on Phoenix (Chen et al., 2023b)).
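
A short sketch of the embedding-based route using the TAPAS encoder from Hugging Face transformers; the checkpoint name is one public example, and the toy table and query are invented. TAPAS expects the table cells as strings.

```python
import pandas as pd
from transformers import TapasTokenizer, TapasModel

tokenizer = TapasTokenizer.from_pretrained("google/tapas-base")
model = TapasModel.from_pretrained("google/tapas-base")

# TAPAS requires all cells to be strings
table = pd.DataFrame({"city": ["Seoul", "Busan"], "population": ["9.4M", "3.3M"]})
inputs = tokenizer(table=table, queries=["Which city is larger?"], return_tensors="pt")
outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # numerical representation of table + query tokens
```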

Graph-based & Tree-based A possible, but less commonly explored, serialization method involves converting a table to a graph or tree data structure. However, when working with sequence-to-sequence models, these structures must still be converted back to text. For Zhao et al. (2023a), after converting the table into a tree, each cell's hierarchical structure, position information, and content were represented as a tuple and fed into GPT3.5.

Comparisons Research has shown that LLM performance is sensitive to the input tabular format. Singha et al. (2023) found that DFLoader and JSON formats are better for fact-finding and table transformation tasks. Meanwhile, Sui et al. (2023a) found that HTML and XML table formats are better understood by GPT models on tabular QA and FV tasks, though they require increased token consumption. Likewise, Sui et al. (2023b) found that markup languages, specifically HTML, outperformed X-separated formats for GPT3.5 and GPT4. Their hypothesis is that the GPT models were trained on a significant amount of web data, which probably exposed the LLMs to more HTML and XML formats for interpreting tables.

Apart from manual templates, Hegselmann et al. (2023) also used LLMs (fine-tuned BLOOM on ToTTo (Parikh et al., 2020b), T0++ (Sanh et al., 2022), GPT-3 (Ouyang et al., 2022)) to generate descriptions of a table as sentences, blurring the line between text-based and embedding-based serialization methodologies. However, for the few-shot classification task, they find that traditional list and text templates outperformed the LLM-based serialization method. Amongst LLMs, the more complex and larger the LLM, the better the performance (GPT-3 has 175B, T0 11B, and the fine-tuned BLOOM model 0.56B parameters). A key reason why LLMs are worse at serializing tables into sentences is their tendency to hallucinate: LLMs respond with unrelated expressions, add new data, or return unfaithful features.

2.2 Table Manipulations

One important characteristic of tabular data is its heterogeneity in structure and content. Tables often come in large sizes with different dimensions, encompassing various feature types. In order for LLMs to ingest tabular data efficiently, it is important to compact tables to fit context lengths, for better performance and reduced costs.

Compacting tables to fit context lengths, for better performance and reduced costs For smaller tables, it might be possible to include the whole table within a prompt. However, for larger tables, there are three challenges:

Firstly, some models have short context lengths (e.g. Flan-UL2 (Tay et al., 2023b) supports 2048 tokens, Llama 2 (Touvron et al., 2023b) supports 4096 context tokens), and even models that support large context lengths might still be insufficient if the table is over, say, 200K rows (Claude 2.1 supports up to 200K tokens).

Secondly, even if the table could fit the context length, most LLMs are inefficient in dealing with long sentences due to the quadratic complexity of self-attention (Sui et al., 2023b; Tay et al., 2023a; Vaswani et al., 2017). When dealing with long contexts, the performance of LLMs significantly degrades when models must access relevant information in the middle of long contexts, even for explicitly long-context models (Liu et al., 2023b). For tabular data, Cheng et al. (2023); Sui et al. (2023c) highlight that noisy information becomes an issue in large tables for LMs. Chen (2023) found that for table sizes beyond 1000 tokens, GPT-3's performance degrades to random guessing.

Thirdly, longer prompts incur higher costs, especially for applications built upon LLM APIs.

To address these issues, Herzig et al. (2020); Liu et al. (2022c) proposed naive methods that truncate the input based on a maximum sequence length. Sui et al. (2023b) introduced certain predefined constraints to meet the LLM call request. Another strategy is to search for and retrieve only highly relevant tables, rows, columns, or cells, which we discuss later in Section 5.
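
A naive sketch of such truncation: keep whole rows until a token budget is exhausted. The four-characters-per-token heuristic is a rough assumption standing in for a real tokenizer; production code would count tokens with the target model's own tokenizer.

```python
import pandas as pd

def truncate_table(df: pd.DataFrame, max_tokens: int = 2048) -> pd.DataFrame:
    """Keep leading rows of df while the estimated token count stays in budget."""
    header_cost = len(",".join(df.columns)) // 4  # crude ~4 chars/token estimate
    budget, kept = max_tokens - header_cost, []
    for _, row in df.iterrows():
        row_cost = len(",".join(map(str, row))) // 4
        if row_cost > budget:
            break  # stop before overflowing the model's context window
        budget -= row_cost
        kept.append(row)
    return pd.DataFrame(kept, columns=df.columns)
```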

2 Same name, different group of authors.

Additional information about tables for better performance Apart from the table itself, some papers explored including table schemas and statistics as part of the prompt. Sui et al. (2023c) explored including additional information about the tables: information like “dimension, measure, semantic field type” helps the LLM achieve higher accuracy across all six datasets explored. “Statistics features” improved performance for tasks and datasets that include a higher proportion of statistical cell contents, like FEVEROUS (Aly et al., 2021). Meanwhile, “document references” and “term explanations” add context and semantic meaning to the tables. “Table size” had minimal improvements, while “header hierarchy” added unnecessary complexity and hurt performance.

Robustness of LLM performance to table manipulations Liu et al. (2023e) critically analyzed the robustness of GPT3.5 to structural perturbations in tables (transpose and shuffle). They find that LLMs suffer from structural bias in the interpretation of table orientations, and when tasked to transpose the table, LLMs perform miserably (~50% accuracy). However, LLMs can identify whether the first row or first column is the header (94-97% accuracy). Zhao et al. (2023e) investigated the robustness of SOTA Table QA models under manipulations of the table header, table content, and the natural language question (phrasing).3 They find that all examined Table QA models (TaPas, TableFormer, TaPEX, OmniTab, GPT3) are not robust under adversarial attacks.

2.3 Prompt Engineering

A prompt is an input text that is fed into an LLM. Designing an effective prompt is a non-trivial task, and many research topics have branched out from prompt engineering alone. In this subsection, we cover the popular techniques in prompt engineering, and how researchers have used them for tasks involving tables.

Prompt format The simplest format concatenates the task description with the serialized table as a string. An LLM then attempts to perform the task described and returns a text-based answer. Clearly-defined and well-formatted task descriptions are reported to be effective prompts (Marvin et al., 2023). Some other strategies to improve performance are described in the next few paragraphs. Sui et al. (2023b) recommended that external information (such as questions and statements) be placed before the tables in prompts for better performance.

  • In-context learning As one of the emergent abilities of LLMs (see 1.3), in-context learning refers to incorporating similar examples to help the LLM understand the desired output. Sui et al. (2023b) observed significant performance drops, with overall accuracy decreasing by 30.38% across all tasks, when changing their prompts from a 1-shot to a 0-shot setting. In terms of choosing appropriate examples, Narayan et al. (2022) found their manually curated examples outperformed randomly selected examples by an average of 14.7 F1 points. For Chen (2023), increasing from 1-shot to 2-shot could often benefit the model; however, further increases did not lead to more performance gains.
  • Chain-of-Thought and Self-consistency Chain-of-Thought (CoT) (Wei et al., 2022c) induces LLMs to decompose a task into step-by-step thinking, resulting in better reasoning. Program-of-Thoughts (Chen et al., 2022) guides the LLMs using code-related comments like “Let's write a program step-by-step…”. Zhao et al. (2023d) explored CoT and PoT strategies for the numerical QA task. Yang et al. (2023) prompt the LLMs with a one-shot CoT demonstration example to generate a reasoning and answer. Subsequently, they included the reasoning texts, indicated by a special “” token, as part of the inputs to fine-tune smaller models to generate the final answer.
  • Self-consistency (SC) (Wang et al., 2023b) leverages the intuition that a complex reasoning problem typically admits multiple different ways of thinking leading to its unique correct answer. SC samples a diverse set of reasoning paths from an LLM, then selects the most consistent answer by marginalizing out the sampled reasoning paths. Inspired by these strategies, Zhao et al. (2023a); Ye et al. (2023b) experimented with multi-turn dialogue strategies, where they decompose the original question into sub-tasks or sub-questions to guide the LLM's reasoning. Sui et al. (2023c) instructed the LLM to “identify critical values and ranges of the last table related to the statement” to obtain additional information that was fed to the final LLM, obtaining increased scores on five datasets. Liu et al. (2023e) also investigated strategies around SC, along with self-evaluation, which guides the LLM to choose between two reasoning approaches based on the question's nature and each answer's clarity. Deng et al. (2022b) sampled a set of candidate sequences and selected the final response by plurality (consensus) voting.

3 For table headers, they explored synonym and abbreviation replacement perturbations. For table content, they explored five perturbations: (1) row shuffling, (2) column shuffling, (3) extending column names content into semantically equivalent expressions, (4) masking correlated columns (E.g. “Ranking” and “Total Points” can be inferred from one another), and (5) introducing new columns that are derived from existing columns. For the question itself, they perturbed questions at the word-level or sentence-level.

Chen (2023) investigated the effects of both CoT and SC on QA and FV tasks. When investigating the explainability of LLM’s predictions, Dinh et al. (2022) experimented with a multi-turn approach of asking GPT3 to explain its own prediction from the previous round, and guided the explanation response using CoT by adding the line “Let’s think logically. This is because”.
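
The CoT and SC strategies above combine naturally: sample several step-by-step answers over the serialized table and keep the modal one. Below is a minimal Python sketch of that loop; `llm_generate` is an assumed stand-in for any sampling LLM API, and the prompt wording is illustrative. Following Sui et al. (2023b), the question is placed before the table.

```python
from collections import Counter

def llm_generate(prompt: str, temperature: float = 0.7) -> str:
    """Stand-in for a sampling LLM call; replace with your API of choice."""
    raise NotImplementedError

def answer_with_self_consistency(table_text: str, question: str, n_paths: int = 5) -> str:
    # question first, serialized table second, then a CoT trigger phrase
    prompt = (
        f"{question}\n\n{table_text}\n\n"
        "Let's think step by step, then give the final answer after 'Answer:'."
    )
    answers = []
    for _ in range(n_paths):  # sample diverse reasoning paths (temperature > 0)
        reasoning = llm_generate(prompt, temperature=0.7)
        answers.append(reasoning.split("Answer:")[-1].strip())
    # marginalize out the reasoning paths: keep the most frequent final answer
    return Counter(answers).most_common(1)[0][0]
```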

  • Retrieval-augmented generation (RAG) Retrieval-augmented generation relies on the intuition that LLMs are general models but can be guided to a domain-specific answer if the user includes the relevant context within the prompt. By incorporating tables as part of the prompts, most papers described in this survey can be viewed as RAG systems. A particular challenge in RAG is extracting the most relevant information out of a large pool of data to better inform the LLMs. This challenge overlaps slightly with the table sampling strategies mentioned earlier in Section 2.2. Apart from the aforementioned methods, Sundar & Heck (2023) designed a dual-encoder-based Dense Table Retrieval (DTR) model to rank cells of the table by relevance to the query. The ranked knowledge sources are incorporated within the prompt, leading to top ROUGE scores.

Role-play Another popular prompt engineering technique is role-play, which refers to including descriptions in the prompt about the person the LLM should portray as it completes a task. For example, Zhao et al. (2023a) experimented with the prompt “Suppose you are an expert in statistical analysis.”.

2.4 End-to-end systems

Since LLMs can generate any text-based output, apart from generating human-readable responses, they can also generate code readable by other programs. Abraham et al. (2022) designed a model that converts natural language queries into structured queries, which can be run against a database or a spreadsheet. Liu et al. (2023e) designed a system where the LLM could interact with Python to execute commands, process data, and scrutinize results (within a Pandas DataFrame), iteratively over a maximum of five iterations. Zhang et al. (2023d) demonstrated that errors from the SQL tool can be fed back to the LLM: by implementing this iterative process of calling the LLM, they improved the success rate of SQL query generation. Finally, Liu et al. (2023c) propose a no-code data analytics platform that uses LLMs to generate data summaries, including generating pertinent questions required for analysis, and queries into the data parser. A survey by Zhang et al. (2023g) covers further concepts about natural language interfaces for tabular data querying and visualization, diving deeper into recent advancements in the Text-to-SQL and Text-to-Vis domains.
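
A minimal sketch of such an iterative repair loop, in the spirit of Zhang et al. (2023d): generate SQL, execute it, and feed the database error back into the prompt. `llm_generate` is again an assumed stand-in for an LLM API, and sqlite3 stands in for whatever database the application targets.

```python
import sqlite3

def llm_generate(prompt: str) -> str:
    """Stand-in for an LLM call that returns a SQL string."""
    raise NotImplementedError

def nl_to_sql(question: str, schema: str, db_path: str, max_iters: int = 5):
    conn = sqlite3.connect(db_path)
    prompt = f"Schema:\n{schema}\n\nWrite a SQLite query answering: {question}"
    for _ in range(max_iters):
        sql = llm_generate(prompt)
        try:
            return conn.execute(sql).fetchall()  # success: return the query results
        except sqlite3.Error as err:
            # feed the tool's error message back so the LLM can repair the query
            prompt += f"\n\nPrevious query:\n{sql}\nFailed with error: {err}\nPlease fix it."
    return None  # give up after max_iters failed attempts
```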

3 LLMs for predictions

Several studies endeavor to leverage LLMs for prediction tasks on tabular data. This section delves into existing methodologies and advancements for two categories of tabular data: standard feature-based tabular data and time series data. Time series prediction differs from feature-based tabular prediction in that its predictive power relies heavily on past time-series values. For each category, we divide the work into different steps, which include preprocessing, fine-tuning, and target augmentation. Preprocessing explains how different prediction methods generate input to the language model; it includes serialization, table manipulation, and prompt engineering. Target augmentation maps the textual output from LLMs to a target label for prediction tasks. At the end, we briefly touch on domain-specific prediction methods using LLMs.

3.1 Dataset

For task-specific fine-tuning, most datasets for the prediction task are chosen from UCI ML, OpenML, or a combo of 9 datasets created by Manikandan et al. (2023). We put all details in Table 2. Using the combo of 9 datasets is recommended4 since it contains larger datasets and a more diverse feature set compared to OpenML and UCI ML. For general fine-tuning, existing methods choose the Kaggle API5, as it offers 169 very diverse datasets.


Table 2: Combo is the combination of the following datasets, in the form dataset name (number of rows, number of features): Bank (45,211 rows, 16 feats), Blood (748, 4), California (20,640, 8), Car (1,728, 8), Creditg (1,000, 20), Income (48,842, 14), Jungle (44,819, 6), Diabetes (768, 8), and Heart (918, 11).

3.2 Tabular prediction

Type        | Algorithm                               | Fine-tuned?
Tabular     | TABLET (Slack & Singh, 2023)            | No
Tabular     | SummaryBoost (Manikandan et al., 2023)  | No
Tabular     | LIFT (Dinh et al., 2022)                | Yes
Tabular     | TabLLM (Hegselmann et al., 2023)        | Yes
Tabular     | UniPredict (Wang et al., 2023a)         | Yes
Tabular     | GTL (Zhang et al., 2023a)               | Yes
Tabular     | SerializeLLM (Jaitly et al., 2023)      | Yes
Time Series | PromptCast (Xue & Salim, 2022)          | Yes
Time Series | ZeroTS (Gruver et al., 2023)            | No
Time Series | TEST (Sun et al., 2023a)                | Yes
Time Series | TimeLLM (Jin et al., 2023a)             | Yes
Medical     | MediTab (Wang et al., 2023c)            | Yes
CTR         | CTRL (Li et al., 2023)                  | Yes
Finance     | FinPT (Yin et al., 2023)                | Yes

Table 3: Prediction methods. Resource is high if it has to finetune a model with size ≥ 1B, even if it is PEFT. Used Model includes all models used in the paper, covering serialization, preprocessing, and model fine-tuning. ACC stands for accuracy. AUC stands for Area under the ROC Curve. MAE stands for mean absolute error. RMSE stands for root-mean-square error. F1 score is calculated from the precision and recall of the test, where the precision is the number of true positive results divided by the number of all samples predicted to be positive, including those not identified correctly, and the recall is the number of true positive results divided by the number of all samples that should have been identified as positive. CRPS is the continuous ranked probability score. We will introduce other metrics in the relevant sections.

Preprocessing Serialization in the prediction task is mostly text-based (Section 2.1). Table manipulation includes statistics and metadata of datasets (Section 2.2). Prompt engineering includes task-specific cues and relevant samples (Section 2.3). We give an illustration of different preprocessing methods in Table 4.

As one of the earliest endeavors, LIFT (Dinh et al., 2022) tried a few different serialization methods, such as rendering each feature and value as a natural sentence (“The column name is Value”) or as a list of equations (col1 = val1, col2 = val2, …). The former is shown to achieve higher prediction accuracy, especially in low-dimensional tasks. The same conclusion was found by TabLLM (Hegselmann et al., 2023), where they evaluated 9 different serialization methods. They found that a textual enumeration of all features (“The column name is Value”) performs best. They also added a description for the classification problem. For medical prediction, they mimic the thinking process of a medical professional as prompt engineering.

They found that LLMs actually make use of column names and their relationships in few-shot learning settings. In a subsequent study, TABLET (Slack & Singh, 2023) included naturally occurring instructions along with examples for serialization. In this case, where the task is medical condition prediction, naturally occurring instructions come from consumer-friendly sources, such as government health websites, or technical references such as the Merck Manual. The input includes instructions, examples, and the test data point. They found that these instructions significantly enhance zero-shot F1 performance.

However, LLMs still ignore instructions sometimes, leading to prediction failures. In this fashion, more studies tested more complex serialization and prompt engineering methods rather than a simple concatenation of features and values. Schema-based prompt engineering usually includes background information on the dataset, a task description, a summary, and example data points. Summary Boosting (Manikandan et al., 2023) serializes data and metadata into text prompts for summary generation. This includes categorizing numerical features and using a representative dataset subset selected via weighted stratified sampling based on language embeddings. Serilize-LM (Jaitly et al., 2023) introduces 3 novel serialization techniques that boost LLM performance on domain-specific datasets. They combine related features into one sentence to make the prompt more descriptive and easier for the LLM to understand. Take car classification as an example: attributes like make, color, and body type are combined into a single, richer sentence.

It leverages covariance to identify the most relevant features and either labels them critical or adds a sentence to explain the most important features. Finally, they converted tabular data into LaTeX code format. This LaTeX representation of the table was then used as the input for fine-tuning the LLM, passing just a row representation preceded by an \hline tag, without any headers. UniPredict (Wang et al., 2023a) reformats metadata by consolidating arbitrary input M into a description of the target and semantic descriptions of the features. Feature serialization follows a “column name is value” format. The objective is to minimize the difference between the output sequence generated by the adapted LLM function and the reference output sequence generated from target augmentation (represented by serialize target). Generative Tabular Learning (GTL) was proposed by Zhang et al. (2023a) and includes two parts: 1) the first part specifies the task background and description, optionally with some examples as in-context examples (prompt engineering); 2) the second part describes feature meanings and values of the current instance to be inferred (serialization). For researchers and practitioners, we recommend benchmarking LIFT, TABLET, and TabLLM for new preprocessing methods, since their methods are representative and clearly documented. The code is available.6

4 Here is the GitHub repository to get the data https://Github.com/clinicalml/TabLLM/tree/main/datasets

5 Here is the website to get the pretrained data https://Github.com/Kaggle/kaggle-api

Some other methods leverage an LLM to rewrite the serialization or do the prompt engineering. TabLLM (Hegselmann et al., 2023) showed that LLMs are not good at serialization because they are not faithful and may hallucinate. Summary Boosting (Manikandan et al., 2023) uses GPT3 to convert metadata into a data description and to generate a summary for a subset of the dataset in each sampling round. TABLET (Slack & Singh, 2023) fits a simple model, such as a one-layer rule set model or a prototype with the 10 most important features, on the task's full training data. It then serializes the logic into text using a template and revises the templates using GPT3. Based on their experiments, generated instructions do not significantly improve performance. Thus, unless the serialization requires summarizing a long input, it is not recommended to use an LLM to rewrite the serialization.

Target Augmentation LLMs can solve complex tasks through text generation; however, the output is not always controllable (Dinh et al., 2022). As a result, mapping the textual output from LLMs to a target label for prediction tasks is essential. We call this target augmentation. A straightforward but labor-intensive approach is manual labeling, as used by Serilize-LM (Jaitly et al., 2023). LIFT (Dinh et al., 2022) employs ### and @@@ for question-answer separation and end of generation, respectively, placing answers in between. To mitigate invalid inferences, LIFT conducts five inference attempts, defaulting to the training set's average value if all fail. TabLLM (Hegselmann et al., 2023) uses a verbalizer (Cui et al., 2022) to map the answer to a valid class. UniPredict (Wang et al., 2023a) has the most complicated target augmentation: they transform the target label into a set of probabilities for each class via a function called “augment”. Formally, for target T in an arbitrary dataset D, they define a function augment(T) = {C, P}, where C are new categories of targets with semantic meaning and P are the probabilities assigned to each category. They extend the target into a categorical one-hot encoding and then use an external predictor to create calibrated probability distributions. This replaces the 0/1 one-hot encoding while maintaining the final prediction outcome. Formally, given target classes t ∈ {0, …, |C|} and target probabilities p ∈ P, they define a function serialize_target(t, p) that serializes target classes and probabilities into a sequence formatted as “class t1: p1, t2: p2, …”. We give an example of each method in Table 5. While customized target augmentation can be useful in some cases, the simple verbalizer is recommended: it is convenient to implement and can assign a probability to the output.
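
To make the augment/serialize_target scheme concrete, here is a small sketch using scikit-learn's gradient boosting as the external predictor; the toy data and the two-decimal formatting are illustrative assumptions, not UniPredict's exact implementation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)  # toy stand-in data
model = GradientBoostingClassifier().fit(X, y)  # external predictor

def serialize_target(classes, probs) -> str:
    """Render 'class t1: p1, t2: p2, ...' in place of a 0/1 one-hot label."""
    return ", ".join(f"class {t}: {p:.2f}" for t, p in zip(classes, probs))

probs = model.predict_proba(X[:1])[0]  # calibrated-ish distribution for one sample
print(serialize_target(model.classes_, probs))  # e.g. "class 0: 0.03, class 1: 0.97"
```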

6 Here is the Github repo for TABLET https://Github.com/dylan-slack/Tablet, TabLLM https://Github.com/clinicalml/TabLLM and LIFT https://Github.com/UW-Madison-Lee-Lab/LanguageInterfacedFineTuning

Inference Only Prediction Some work uses LLMs directly for prediction without fine-tuning; we refer to these approaches as inference-only prediction. TABLET (Slack & Singh, 2023) runs inference with models like Tk-Instruct 11B (Wang et al., 2022b), Flan-T5 11B (Chung et al., 2022), GPT-J 6B (Black et al., 2022), and ChatGPT, but finds that a KNN approach with feature weights from XGBoost surpasses Flan-T5 11B when given similar examples and instructions. Summary Boosting (Manikandan et al., 2023) creates multiple inputs through its serialization step; the AdaBoost algorithm then builds an ensemble of summary-based weak learners. While non-fine-tuned LLMs struggle with continuous attributes, summary boosting is effective on smaller datasets. Furthermore, its performance is enhanced by GPT-generated descriptions that leverage existing model knowledge, underscoring the potential of LLMs in new domains with limited data. However, it does not perform well when there are many continuous variables. For any new LLM-based prediction method without fine-tuning, we suggest benchmarking against LIFT and TABLET: LIFT is the first LLM-based method for inference-only prediction, TABLET shows significantly better performance than LIFT, and both methods have code available.

Fine-tuning Studies involving fine-tuning typically employ one of two distinct approaches. The first trains an LLM on large datasets to learn fundamental features before adapting it to specific prediction tasks. The second takes a pre-trained LLM and trains it further on a smaller, task-specific prediction dataset to specialize its knowledge and improve its performance. LIFT (Dinh et al., 2022) fine-tunes pretrained language models like GPT-3 and GPT-J using Low-Rank Adaptation (LoRA) on the training set. They found that an LLM with general pretraining improves performance; however, this method does not surpass the in-context learning result. TabLLM (Hegselmann et al., 2023) uses the T0 model (Sanh et al., 2021) together with the T-Few recipe (Liu et al., 2022b) for fine-tuning. TabLLM has demonstrated remarkable few-shot learning capabilities, outperforming traditional deep-learning methods and gradient-boosted trees. Its efficacy is highlighted by its ability to leverage the extensive knowledge encoded in pre-trained LLMs while requiring minimal labeled data; however, TabLLM's sample efficiency is highly task-dependent. Jaitly et al. (2023) use T0 (Sanh et al., 2021), trained with IA3, a parameter-efficient fine-tuning method (Liu et al., 2022b). However, this method only works for few-shot learning and is worse than the baseline when the number of shots is 128 or more. The T0 model (Sanh et al., 2021) is commonly used as the base model for tabular prediction fine-tuning.
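As a rough illustration of the LoRA-style fine-tuning used by LIFT, here is a minimal sketch assuming GPT-2 and the Hugging Face peft library (LIFT itself adapts GPT-3/GPT-J; the serialized example and hyperparameters are placeholders):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["c_attn"],  # GPT-2's fused attention projection
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)  # only a tiny fraction of weights is trainable

# One serialized example in the "column name is value" style; real training would
# batch many rows and mask the prompt tokens out of the loss.
text = "Car Brand is Land Rover. Year is 2017. Fraudulent? Answer: No"
batch = tokenizer(text, return_tensors="pt")
out = model(**batch, labels=batch["input_ids"])
out.loss.backward()  # plug into any optimizer/Trainer loop from here
```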

UniPredict (Wang et al., 2023a) trains a single LLM (GPT-2) on an aggregation of 169 tabular datasets with diverse targets and observes an advantage over existing methods. This model does not require fine-tuning the LLM on specific datasets. Its accuracy and ranking are better than XGBoost when the number of samples is small, and the model with target augmentation performs noticeably better than the model without it. It does not perform well when there are too many columns or too few representative features. TabFMs (Zhang et al., 2023a) fine-tunes LLaMA to predict the next token. After filtering, the authors are left with 115 tabular datasets; to balance the number of instances across datasets, they randomly sample up to 2,048 instances from each tabular dataset for GTL. GTL significantly improves LLaMA in most zero-shot scenarios. Based on the current evidence, we believe that fine-tuning on a large number of datasets could further improve performance. However, neither UniPredict nor GTL has released code yet.

Metric We suggest reporting AUC for classification and RMSE for regression, since they are the most commonly used metrics in the literature.3

Example prompts from the different preprocessing methods (flattened from Table 4):

- “The task is about fraud repair claim prediction. The brand of car is Land Rover. The produce year is 2017. The repair claim of the car is Larger car is always more expensive. This is a 2017 Land Rover. Therefore, this car repair claim is (Fraudulent or Not Fraudulent):”
- “\hline Land Rover & 2017 … Is this car repair claim fraudulent? Yes or No?”
- “Identify if car repair claim is fraudulent. Older car is more likely to have fraudulent repair claim. Features: Car Brand: Land Rover. Year: 2017. Answer with one of the following: Yes / No”
- “The dataset is about fraud repair claim. Car Brand is the brand of car. Year is the age when the car is produced. The features are: Car Brand is Land Rover. Year is 2017. Predict if this car repair claim is fraudulent: Yes for fraudulent, No for not fraudulent.”

Table 4: Methods and examples for different preprocessing approaches in general prediction. The example task is to predict whether a car repair claim is fraudulent.

3.3 Time Series Forecasting

Compared to prediction on feature-based tabular data with numerical and categorical features, time series prediction pays more attention to numerical features and temporal relations. Thus, serialization and target augmentation are mostly concerned with how best to represent numerical features. Many papers claim to use LLMs for time series; however, most of them use models smaller than 1B parameters, which we do not discuss here. Please refer to Jin et al. (2023b) for a complete introduction to these methods.

Preprocessing PromptCast (Xue & Salim, 2022) uses the input time series data as-is, converting it to a text format with a minimal description of the task and phrasing the target as an output sentence. ZeroTS (Gruver et al., 2023) argues that numbers are not encoded well by standard LLM tokenizers. It therefore encodes numbers by breaking them into groups of a few digits for GPT-3 and into single digits for LLaMA, using spaces and commas for separation and omitting decimal points. Time-LLM (Jin et al., 2023a) patches time series into embeddings and integrates them with word embeddings to create a comprehensive input, complemented by dataset context, task instructions, and input statistics as a prefix. TEST (Sun et al., 2023a) introduces an embedding layer tailored for LLMs, using exponentially dilated causal convolution networks for time series processing. The embedding is generated through contrastive learning with unique positive pairs, aligning text and time series tokens via similarity measures. Serialization involves two QA templates, treating a multivariate time series as univariate series for sequential template filling.
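A minimal sketch of this digit-level encoding, following the single-digit variant described by Gruver et al. (2023); the precision handling here is an illustrative assumption:

```python
# Serialize a numeric series: fix the precision, drop the decimal point,
# space-separate digits (one token each), comma-separate values.
def serialize_series(values, precision=2):
    out = []
    for v in values:
        digits = f"{v:.{precision}f}".replace(".", "")  # 7.13 -> "713"
        out.append(" ".join(digits))
    return " , ".join(out)

print(serialize_series([7.13, 45.0, 0.25]))
# -> "7 1 3 , 4 5 0 0 , 0 2 5"
```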

Target Augmentation For output mapping, ZeroTS (Gruver et al., 2023) draws multiple samples and uses statistical methods or quantiles to form point estimates or ranges. For Time-LLM (Jin et al., 2023a), output processing is done through flattening and a linear projection. The target augmentation of ZeroTS is easy to implement,7 while Time-LLM’s code is not available.

Inference Only Prediction As with feature-based tabular prediction, researchers have explored LLMs’ performance on time series forecasting without fine-tuning. ZeroTS (Gruver et al., 2023) examines using LLMs like GPT-3 (Brown et al., 2020) and LLaMA-70B (Touvron et al., 2023a) directly for forecasting. It evaluates models using mean absolute error (MAE), scaled MAE, and continuous ranked probability score (CRPS), noting LLMs’ preference for simple rule-based completions and their tendency towards repetition and capturing trends. The study observes LLMs’ ability to capture time series distributions and to handle missing data without special treatment. However, the approach is constrained by window size and arithmetic ability, which prevents further improvement.

Fine-tuning Fine-tuning the model for time series prediction is more common in current research. PromptCast (Xue & Salim, 2022) evaluates both inference-only prediction and fine-tuning on task-specific datasets, showing that larger models consistently perform better. Time-LLM (Jin et al., 2023a) presents a novel approach to time series forecasting by fine-tuning LLMs like LLaMA (Touvron et al., 2023a) and GPT-2. Time-LLM is evaluated using symmetric mean absolute percentage error (SMAPE), mean absolute scaled error (MASE), and overall weighted average (OWA), and demonstrates notable performance in few-shot scenarios where only 5 or 10 percent of the data is used. This technique underscores the versatility of LLMs in handling complex forecasting tasks. For TEST (Sun et al., 2023a), soft prompts are used for fine-tuning. The paper evaluates models like BERT, GPT-2, ChatGLM (Zeng et al., 2023), and LLaMA (Touvron et al., 2023a), using metrics like classification accuracy and RMSE. However, the results show that this method is not as efficient and accurate as training a small task-oriented model. In general, LLaMA is currently the most commonly used base model, and soft prompting appears to be a suitable approach for fine-tuning.

Metric MAE is the most common metric. The continuous ranked probability score (CRPS) captures distributional qualities, allowing comparison of models that generate samples without likelihoods; it is considered an improvement over MAE because it does not ignore structure in the data, such as correlations between time steps. Symmetric mean absolute percentage error (SMAPE) measures accuracy based on percentage errors; mean absolute scaled error (MASE) is a scale-independent error metric normalized by the in-sample mean absolute error of a naive benchmark model; and overall weighted average (OWA) is a combined metric that averages the ranks of SMAPE and MASE to compare methods. Despite the introduction of these newer metrics, MAE and RMSE remain the most commonly used in the literature, and we still recommend them as they are simple to implement and easy to benchmark.
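For reference, the two M4-style metrics can be implemented in a few lines (standard textbook definitions, not code from any of the surveyed papers):

```python
import numpy as np

def smape(y, yhat):
    """Symmetric mean absolute percentage error, in percent."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    return 100.0 * np.mean(2.0 * np.abs(yhat - y) / (np.abs(y) + np.abs(yhat)))

def mase(y, yhat, y_train, m=1):
    """MAE normalized by a (seasonal) naive forecast's in-sample MAE; m is the season length."""
    y, yhat, y_train = (np.asarray(a, float) for a in (y, yhat, y_train))
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y - yhat)) / naive_mae
```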

Table 5: Target augmentation methods, the papers that use them, and examples

3.4 Application of Prediction using LLM

Medical Prediction Pretrained language models such as DeBERTa have been shown to perform better than XGBoost on electronic health record (EHR) prediction tasks (McMaster et al., 2023). For preprocessing, MediTab (Wang et al., 2023c) utilizes GPT-3.5 to convert tabular data into textual format with a focus on extracting key values, then employs techniques such as linearization, prompting, and sanity checks to ensure accuracy and mitigate errors. For fine-tuning, the system leverages multitask learning on domain-specific datasets, generates pseudo-labels for additional data, and refines them using data Shapley scores; pretraining on the refined dataset is followed by fine-tuning on the original data. The resulting model supports both zero-shot and few-shot learning on new datasets. GPT-3.5, accessed via OpenAI’s API, facilitates data consolidation and augmentation, while UnifiedQA-v2-T5 (Khashabi et al., 2022) is employed for sanity checks; MediTab additionally utilizes a pretrained BioBERT classifier (Lee et al., 2019). The system is thoroughly evaluated across supervised, few-shot, and zero-shot learning scenarios within the medical domain, demonstrating superior performance compared to gradient boosting methods and existing LLM-based approaches, although it may have limited applicability beyond the medical domain. We recommend exploring the provided code8 for tabular prediction tasks in the medical domain. On top of AUROC, they also use the area under the precision-recall curve (PRAUC) for evaluation; PRAUC is useful for imbalanced datasets, which is usually the case for medical data.

7 The code is in https://Github.com/ngruver/TextGenerationLLMtime

Financial Prediction FinPT (Yin et al., 2023) presents an LLM-based approach to financial risk prediction. The method fills tabular financial data into a pre-defined template and prompts LLMs like ChatGPT and GPT-4 to generate natural-language customer profiles. These profiles are then used to fine-tune large foundation models such as BERT (Devlin et al., 2019), employing the models’ official tokenizers. The process enhances the models’ ability to predict financial risks, with Flan-T5 emerging as the most effective backbone in this context across eight datasets. For financial data, we suggest using the FinBench dataset9 and benchmarking against FinPT.10

Recommendation Prediction CTRL (Li et al., 2023) proposes a novel method for click-through rate (CTR) prediction that converts tabular data into text using human-designed prompts, making it understandable to language models. The model treats the tabular data and the generated textual data as separate modalities, feeding them into a collaborative CTR model and a pre-trained language model such as ChatGLM (Zeng et al., 2023), respectively. CTRL employs a two-stage training process: the first stage performs cross-modal contrastive learning for fine-grained knowledge alignment, while the second fine-tunes a lightweight collaborative model for downstream tasks. The approach outperforms all SOTA baselines, including semantic and collaborative models, over three datasets by a significant margin, demonstrating superior prediction capabilities and the effectiveness of combining collaborative and semantic signals. However, the code for this method is not available. They use LogLoss and AUC for evaluation; for LogLoss, the lower bound of 0 indicates that the two distributions are perfectly matched, and smaller values indicate better performance.
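The first-stage contrastive alignment can be sketched as a symmetric InfoNCE loss between tabular and textual embeddings. This is an assumption-based illustration; CTRL's exact loss and encoders are those described in Li et al. (2023):

```python
import torch
import torch.nn.functional as F

def info_nce(tab_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss aligning a batch of tabular/textual embedding pairs."""
    tab = F.normalize(tab_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = tab @ txt.T / temperature      # (B, B) similarity matrix
    labels = torch.arange(len(tab))         # matched pairs lie on the diagonal
    # average the table->text and text->table directions
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
```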

4 LLMs for tabular data generation

In this section, we focus on the pivotal role of data generation. The growing demand for rich, realistic datasets motivates novel methodologies that leverage LLMs to augment tabular data. This section examines methods that demonstrate the potential of combining LLMs and tabular data for data synthesis.

Table 6: Data synthesis methods. “DCR” stands for Distance to the Closest Record and “MLE” stands for Machine Learning Efficiency.

8 Available at https://Github.com/RyanWangZf/MediTab.

9 The dataset is in https://huggingface.co/datasets/yuweiyin/FinBench

10 The code is in https://Github.com/YuweiYin/FinPT

4.1 Methodologies

Borisov et al. (2023b) proposed GReaT11 (Generation of Realistic Tabular data) to generate synthetic samples with the characteristics of the original tabular data. The GReaT pipeline involves a textual encoding step that transforms tabular data into meaningful text using the sentence serialization method shown in Table 1, followed by fine-tuning GPT-2 or a distilled GPT-2 model. Additionally, a feature-order permutation step precedes the use of the obtained sentences for LLM fine-tuning.
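Usage of the GReaT pipeline is compact; the sketch below follows the interface documented in the authors' be_great repository (treat the exact argument names as assumptions if the package has since changed):

```python
from be_great import GReaT
from sklearn.datasets import fetch_california_housing

data = fetch_california_housing(as_frame=True).frame  # any pandas DataFrame works
model = GReaT(llm="distilgpt2", batch_size=32, epochs=50)  # fine-tune distilled GPT-2
model.fit(data)                        # rows are serialized to sentences internally
synthetic = model.sample(n_samples=100)
```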

REaLTabFormer (Solatorio & Dupriez, 2023) extends GReaT by generating synthetic non-relational and relational tabular data. It uses an autoregressive GPT-2 model to generate a parent table and a sequence-to-sequence model conditioned on the parent table for the relational dataset. The model implements target masking to prevent data copying and introduces statistical methods to detect overfitting. It demonstrates superior performance in capturing relational structures and achieves state-of-the-art results in predictive tasks without needing fine-tuning.

Following a similar paradigm, Zhang et al. (2023e) proposed TAPTAP12 (Table Pretraining for Tabular Prediction), which incorporates several enhancements. The method pre-fine-tunes GPT-2 on 450 Kaggle/UCI/OpenML tables and generates label columns using a machine learning model. Claimed improvements include a revised numerical encoding scheme and the use of external models like GBDT for pseudo-label generation, deviating from conventional language-model-based approaches. However, the work lacks a comparison with diffusion-based models like TabDDPM, and the benefit of the numerical encoding scheme, as highlighted in Gruver et al. (2023), depends on the model used.

In a related work (Wang et al., 2023a), a similar approach is employed for generating pseudo-labels, where the labels are represented as probability vectors.

TabuLa (Zhao et al., 2023f) addresses the long training times of LLMs by advocating a randomly initialized model as the starting point and shows the potential for continuous refinement through iterative fine-tuning on successive tabular data tasks.13 It introduces a token-sequence compression method and a middle-padding strategy to simplify the training data representation and enhance performance, achieving a significant reduction in training time while maintaining or improving synthetic data quality.

Seedat et al. (2023) introduce Curated LLM (CLLM), a framework that leverages learning dynamics and two novel curation metrics, confidence and uncertainty, to filter out undesirable generated samples during the training of a classifier, aiming to produce high-quality synthetic data. Both metrics are calculated per sample using the classifier trained on those samples. Additionally, CLLM distinguishes itself by not requiring any fine-tuning of LLMs, using GPT-4 directly.

TabMT (Gulati & Roysdon, 2023) employs a masked transformer-based architecture. The design allows efficient handling of various data types and supports missing data imputation. It leverages a masking mechanism to enhance privacy and data utility, ensuring a balance between data realism and privacy preservation. TabMT’s architecture is scalable, making it suitable for diverse datasets and demonstrating improved performance in synthetic data generation tasks.

4.2 Evaluation

As outlined in Zhang et al. (2023c), the quality of synthetic data can be evaluated along four dimensions: 1) low-order statistics – column-wise density and pair-wise column correlation, estimating individual column densities and the relational dynamics between pairs of columns; 2) high-order metrics – α-precision and β-recall scores that measure the overall fidelity and diversity of the synthetic data; 3) privacy preservation – the DCR score, i.e., the median Distance to the Closest Record, to evaluate the privacy level relative to the original data; and 4) performance on downstream tasks – such as machine learning efficiency (MLE) and missing value imputation. MLE compares the testing accuracy on real data of models trained on synthetically generated tabular datasets. Generation quality can also be assessed through missing value imputation, which focuses on replenishing incomplete features/labels using the available partial column data.
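As an illustration of MLE, the sketch below trains the same classifier on real and on synthetic rows and compares test AUC on held-out real data (the dataset, the "label" column name, and the choice of classifier are placeholders):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

def mle_gap(real_train, real_test, synthetic, target="label"):
    """Compare AUC of models trained on real vs. synthetic rows, tested on real rows."""
    X_test, y_test = real_test.drop(columns=target), real_test[target]
    aucs = {}
    for name, df in [("real", real_train), ("synthetic", synthetic)]:
        clf = GradientBoostingClassifier().fit(df.drop(columns=target), df[target])
        aucs[name] = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    # the closer the two AUCs, the higher the downstream utility of the synthetic data
    return aucs
```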

11 The code is in https://github.com/kathrinse/be_great

12 The code is in https://github.com/ZhangTP1996/TapTap

13 The code is in https://github.com/zhao-zilong/Tabula

Figure 4: General data generation pipeline

5 LLMs for table understanding

In this section, we cover the datasets, trends, and methods explored by researchers for question answering (QA), fact verification (FV), and table reasoning tasks. Many papers work on database manipulation, management, and integration (Lobo et al., 2023; Fernandez et al., 2023; Narayan et al., 2022; Zhang et al., 2023b), which also feed instructions and tabular inputs to LLMs; however, these are not typically framed as QA tasks and are not covered by this paper.

5.1 Dataset

Table 7 outlines some of the popular datasets and benchmarks in the literature for tabular QA tasks.

Table QA For table QA datasets, we recommend benchmarking FeTaQA (Nan et al., 2022) over WikiTableQuestions (Pasupat & Liang, 2015a). Unlike WikiTableQuestions, which focuses on evaluating a QA system’s ability to understand queries and retrieve short-form answers from tabular data, FeTaQA introduces elements that require deeper reasoning and integration of information: generating free-form text answers that involve the retrieval, inference, and integration of multiple discontinuous facts from structured knowledge sources like tables. This requires the model to generate long, informative, free-form answers. NQ-TABLES (Herzig et al., 2021) is larger than the previously mentioned datasets; its advantage lies in its emphasis on open-domain questions that can be answered using structured table data. The datasets are in footnote 14.

Table and Conversation QA For QA tasks that involve both conversation and tables, we recommend HybriDialogue (Nakamura et al., 2022). HybriDialogue includes conversations grounded in both Wikipedia text and tables, addressing a significant challenge in current dialogue systems: conversing on topics whose information is distributed across different modalities, specifically text and tables. The dataset is in footnote 15.

14 The dataset for NQ-Tables is in https://github.com/google-research-datasets/natural-questions. The dataset for WikiTableQuestions is in https://ppasupat.github.io/WikiTableQuestions/. The dataset for FetaQA is in https://github.com/Yale-LILY/FeTaQA.

15 The dataset is in https://github.com/entitize/HybridDialogue

Table 7: Overview of various datasets and related work on LLMs for tabular QA. We only include datasets that have been used by more than one relevant method.

Table Classification We recommend benchmarking FEVEROUS (Aly et al., 2021) if the task involves fact verification over both unstructured text and structured tables. We recommend Dresden Web Tables (Eberius et al., 2015) for tasks requiring the classification of web table layouts, which is particularly useful in data extraction and web content analysis where table structure is crucial. The datasets are in footnote 16.

Text2SQL If you want to create a SQL executor, you can use TAPEX (Liu et al., 2022c) and WIKISQL (Zhong et al., 2017b), which contain tables, SQL queries, and answers. If you want to test the ability to write a SQL query, you can use Spider (Yu et al., 2018b),17 Magellan (Das et al.), or WIKISQL (Zhong et al., 2017b). Overall, WIKISQL is preferable since it is large and has been benchmarked by many existing methods (Chen et al., 2023a; Abraham et al., 2022; Zhang et al., 2023f; Jiang et al., 2023). The datasets are in footnote 18.

16 The dataset for FEVEROUS is in https://fever.ai/dataset/feverous.html. The dataset for Dresden Web Tables is in https://ppasupat.github.io/WikiTableQuestions/.

17 Leaderboard for Spider: https://yale-lily.github.io/spider

18 The dataset for TAPEX is in https://github.com/microsoft/Table-Pretraining/tree/main/data_generator. The dataset for spider is in https://drive.usercontent.google.com/download?id=1iRDVHLr4mX2wQKSgA9J8Pire73Jahh0m&export=download&authuser=0. The dataset for WIKISQL is in https://github.com/salesforce/WikiSQL.

Table NLG ToTTo (Parikh et al., 2020a) aims to create natural yet faithful descriptions of the source table. It is rich in size and can be used to benchmark table-conditional text generation. HiTAB (Cheng et al., 2022) allows for more standardized and comparable evaluation across different NLG models and tasks, potentially leading to more reliable and consistent benchmarking in the field. The datasets are in footnote 19.

Table NLI InfoTabs (Gupta et al., 2020) uses Wikipedia infoboxes and is designed to facilitate understanding of semi-structured tabulated text, which involves comprehending both text fragments and their implicit relationships. InfoTabs is particularly useful for studying complex, multi-faceted reasoning over semi-structured, multi-domain, heterogeneous data. TabFact (Chen et al., 2020a) consists of human-annotated natural language statements about Wikipedia tables and requires both linguistic and symbolic reasoning to get the right answer. The datasets are in footnote 20.

Domain Specific For airline-industry table question answering, we recommend AIT-QA (Katsis et al., 2022), which highlights the unique challenges posed by domain-specific tables, such as complex layouts, hierarchical headers, and specialized terminology. For syntax description, we recommend TranX (Yin & Neubig, 2018), which uses an abstract syntax description language for the target representations, enabling high accuracy and generalizability across different types of meaning representations. For finance-related table question answering, we recommend TAT-QA (Zhu et al., 2021a). This dataset demands numerical reasoning for answer inference, involving operations like addition, subtraction, and comparison; thus, TAT-QA can serve as a complex-task benchmark. The datasets are in footnote 21.

Pretraining For pretraining on large datasets for table understanding, we recommend TaBERT (Yin et al., 2020c) and TAPAS (Herzig et al., 2020). TAPAS is pretrained on 6.2 million tables and is useful for semantic parsing, while TaBERT is trained on 26 million tables and their associated English contexts, helping the model gain a better understanding of both text and tables. The datasets are in footnote 22.

5.2 General ability of LLMs in QA

Table 8 outlines the papers that investigated the effectiveness of LLMs on QA and reasoning, along with the models explored. The most popular LLMs used today are GPT-3.5 and GPT-4. Although these GPT models were not specifically optimized for table-based tasks, many of these papers found them competent at complex table reasoning, especially when combined with prompt engineering tricks like CoT. In this section, we summarize the general findings of LLMs on QA tasks and highlight models that have been reported to work well.

Numerical QA A niche QA task involves answering questions that require mathematical reasoning, e.g., “What is the average payment volume per transaction for American Express?” Many real-world QA applications (e.g., working with financial documents or annual reports) involve such mathematical reasoning. So far, Akhtar et al. (2023) conclude that LLMs like Flan-T5 and GPT-3.5 perform better than other models on various numerical reasoning tasks. On the DOCMATH-EVAL dataset (Zhao et al., 2023d), GPT-4 with CoT significantly outperforms other LLMs, while open-source LLMs (LLaMA-2, Vicuna, Mistral, Starcoder, MPT, Qwen, AquilaChat2, etc.) lag behind.

Text2SQL Liu et al. (2023c) designed a question matcher that identifies three keyword types: 1) column-name-related terms, 2) restriction-related phrases (e.g., “top ten”), and 3) algorithm or module keywords. Once these keywords are identified, the module merges the specific restrictions associated with each column into a unified combination, which is then matched with the SQL algorithm or module indicated by the third keyword type. Zhang et al. (2023d) opted for a more straightforward approach, tasking LLaMA-2 with generating an SQL statement from a question and table schema. Sun et al. (2023b) finetuned PaLM-2 on the Text2SQL task, achieving considerable performance on Spider. The top-scoring models for Spider today are Dong et al. (2023); Gao et al. (2023); Pourreza & Rafiei (2023), all building off OpenAI’s GPT models. SQL generation is popular in industry, with many open-source fine-tuned models available.23

19 The dataset for ToTTo is in https://github.com/google-research-datasets/ToTTo. The dataset for HiTAB is in https://github.com/microsoft/HiTab

20 The dataset for InfoTabs is in https://infotabs.github.io/. The dataset for TabFact is in https://tabfact.github.io/

21 The dataset for AIT-QA is in https://github.com/IBM/AITQA. The dataset for TranX is in https://github.com/pcyin/tranX. The dataset for TAT-QA is in https://github.com/NExTplusplus/TAT-QA

22 The dataset for TaBERT is in https://github.com/facebookresearch/TaBERT. The dataset for TAPAS is in https://github.com/google-research/tapas

Models explored in the papers of Table 8 include GPT-4, GPT-3.5, WizardLM, Llama-2 (7/13/70B), CodeLlama 34B, Baichuan, Qwen, WizardMath, Vicuna, Mistral, TAPAS, DeBERTa, TAPEX, NT5, LUNA, PASTA, ReasTAP, Flan-T5, PaLM, GPT-2, GPT-3, Codex, T5, CodeT5, Phoenix-7B, TableLlama, and several custom pipelines (e.g., RoBERTa-based dense table retrieval with coarse state tracking and GPT-3.5 response generation, retrieval trained with a contrastive loss and generation built on T5, or a Table Selector + Known & Unknown Fields Extractor + AggFn Classifier).

Table 8: Overview of papers and models for LLM-based tabular QA tasks. We only include papers that work with models of >1B parameters. Models described as “Custom” indicate papers that fine-tuned specific portions of their pipeline for the task, whereas the other papers focus on non-fine-tuning methods like prompt engineering. NumQA: Numerical QA.

Impact of model size on performance Chen (2023) found that size does matter: on WikiTableQuestions, the 6.7B GPT-3 model achieved only half the score of the 175B model, and on TabFact smaller models (≤6.7B) obtained almost random accuracy.

Finetuning or No fine-tuning? Based on our survey, there is minimal work in the tabular QA space that finetunes large LLMs (>70B parameters). This might be due to the general ability of LLMs (GPT-3.5, GPT-4) to perform many QA tasks without fine-tuning. For SQL generation on Spider, DIN-SQL (Pourreza & Rafiei, 2023) and DAIL-SQL are inference-based techniques using GPT-4 that surpassed previously fine-tuned smaller models. Papers that finetune smaller LLMs for QA are not the focus of this paper and were mentioned previously in Section 2.1 under embeddings-based serialization. Instead, most papers on LLM-based tabular QA focus on prompt engineering, search and retrieval, and end-to-end pipelines (user interfaces), which we describe in the next section.

23 https://huggingface.co/NumbersStation

5.3 Key components in QA

In the simplest QA architecture, an LLM takes an input prompt (a query plus a serialized table)24 and returns an answer. In more involved architectures, the system may be connected to external databases or programs. Most of the time, the knowledge base will not fit in the LLM’s context length or memory. Unique challenges for tabular QA with LLMs therefore include query intent disambiguation, search and retrieval, output types and formats, and multi-turn settings where iterative calls between programs are needed. We describe these components in this section.

5.3.1 Query intent disambiguation

Zha et al. (2023) introduced the concept of chain-of-command (CoC), which translates user inputs into a sequence of intermediate command operations. For example, an LLM first needs to check whether the task requires retrieval, mathematical reasoning, or table manipulations, and/or whether the question cannot be answered because the instructions are too vague. They constructed a dataset of command-chain instructions to fine-tune LLMs to generate these commands. Deng et al. (2022b) proposed splitting the QA task into three subtasks: Clarification Need Prediction (CNP), to determine whether to ask a question to clarify uncertainty; Clarification Question Generation (CQG), to generate a clarification question as the response if CNP detects the need; and Conversational Question Answering (CQA), to directly produce the answer as the response if no clarification is required. They trained a UniPCQA model that unifies all subtasks through multi-task learning.

5.3.2 Search and retrieval

The ability to accurately search and retrieve information from specific positions within structured data is crucial for LLMs. There are two types of search and retrieval use-cases: (1) to find the information (table, column, row, cell) relevant to the question, and (2) to obtain additional information and examples.

For the main table Zhao et al. (2023d) observed that better performance of a retriever module (which returns the top-n most relevant documents) consistently enhances the final accuracy of LLMs in numerical QA. Sui et al. (2023c) explored multiple table sampling methods (over rows and columns) and table packing (based on a token-limit parameter). The best technique was query-based sampling, which retrieves the rows with the highest semantic similarity to the question, surpassing no sampling, clustering, random or even sampling, and content snapshots. Dong et al. (2023) used ChatGPT to rank tables by relevance to the question with self-consistency (SC): they generate ten sets of retrieval results, each containing the top four tables, then select the set that appears most frequently among the ten. To further filter the columns, all columns are ranked by relevance to the question by instructing ChatGPT to match column names against the question words, with foreign keys placed ahead to improve recall; the SC method is again used. cTBLS (Sundar & Heck, 2023) designed a three-step architecture to retrieve and generate dialogue responses grounded in retrieved tabular information. In the first step, a dual-encoder-based Dense Table Retrieval (DTR) model, initialized from RoBERTa (Liu et al., 2019), identifies the most relevant table for the query. In the second step, a coarse system-state tracker, trained with a triplet loss, ranks cells. Finally, GPT-3.5 is prompted to generate a natural-language response to the follow-up query, conditioned on the table cells ranked by relevance as obtained from the coarse state tracker; the prompt includes the dialogue history, ranked knowledge sources, and the query to be answered. Their method produced more coherent responses than previous methods, suggesting that improvements in table retrieval, knowledge retrieval, and response generation lead to better downstream performance. Zhao et al. (2023d) used OpenAI’s Ada embedding and Contriever (Izacard et al., 2022) as dense retrievers along with BM25 (Robertson et al., 1995) as the sparse retriever. These retrievers extract the top-n most related textual and tabular evidence from the source document, which is then provided as input context to answer the question.
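A minimal sketch of query-based row sampling in the spirit of Sui et al. (2023c), with TF-IDF standing in for the stronger semantic encoders used in practice (the function name and top-k choice are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def sample_rows(df, question, k=5):
    """Keep the k rows of a DataFrame most similar to the question text."""
    row_texts = df.astype(str).apply(lambda r: " ".join(r), axis=1).tolist()
    vec = TfidfVectorizer().fit(row_texts + [question])
    sims = cosine_similarity(vec.transform([question]), vec.transform(row_texts))[0]
    return df.iloc[sims.argsort()[::-1][:k]]  # top-k rows fitting the token budget
```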

24 For the scope of our paper, we do not consider images, videos and audio inputs.

For additional information Some papers explore techniques to curate samples for in-context learning. Gao et al. (2023) explored a few methods: (1) random: randomly selecting k examples; (2) question similarity selection: choosing the k examples closest to the question Q under a predefined distance metric (e.g., Euclidean or negative cosine similarity) over question/example embeddings, using a kNN algorithm; (3) masked question similarity selection: like (2), but first masking domain-specific information (table names, column names, and values) in the question; (4) query similarity selection: selecting k examples similar to the target SQL query s∗, which relies on another model to generate an SQL query s′ from the target question and database, so that s′ approximates s∗; output queries are encoded into binary discrete syntax vectors. Narayan et al. (2022) explored manually curated and random example selection. A sketch of method (2) follows.
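This is an illustrative implementation of question-similarity selection under the negative-cosine-similarity metric mentioned above; the embedding source is assumed to be precomputed:

```python
import numpy as np

def select_examples(q_emb, example_embs, k=4):
    """Return indices of the k training examples most similar to the target question."""
    sims = example_embs @ q_emb / (
        np.linalg.norm(example_embs, axis=1) * np.linalg.norm(q_emb)
    )
    return np.argsort(-sims)[:k]  # kNN under negative cosine similarity
```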

5.3.3 Multi-turn tasks

Some papers design pipelines that call LLMs iteratively. We categorize the use-cases into three buckets: (1) decomposing a challenging task into manageable sub-tasks, (2) updating the model outputs based on new user inputs, and (3) working around specific constraints or resolving errors.

Intermediate, sub-tasks This section overlaps with the concepts around CoT and SC discussed in Section 2.3. In a nutshell, since the reasoning task might be complex, LLMs may require guidance to decompose it into manageable sub-tasks. For example, to improve downstream tabular reasoning, Sui et al. (2023b) proposed a two-step self-augmented prompting approach: first prompting the LLM to generate additional knowledge (intermediate output) about the table, then incorporating the response into a second prompt that requests the final answer for the downstream task. Ye et al. (2023b) similarly guided the LLM to decompose a huge table into a small table and to convert a complex question into simpler sub-questions for text reasoning. Their strategy achieved significantly better results than competitive baselines for table-based reasoning, outperforming human performance for the first time on the TabFact dataset. Liu et al. (2023e), to encourage symbolic CoT reasoning pathways, allowed the model to interact with a Python shell that could execute commands, process data, and scrutinize results, particularly within a pandas dataframe, limited to a maximum of five iterative steps.

Dialogue-based applications In various applications where the users are interacting with the LLMs, like in chatbots, the pipeline must allow for LLMs to be called iteratively. Some dialogue-based Text2SQL datasets to consider are the SParC (Yu et al., 2019b) and CoSQL (Yu et al., 2019a) datasets. For SParC, the authors designed subsequent follow-up questions based on Spider (Yu et al., 2018b).

Working around constraints or error debugging Zhao et al. (2023a) used multi-turn prompts to work around cases where tables exceed the API input limit. In other cases, especially when the generated LLM output is code, an iterative process of feeding errors back to the LLM can help it generate correct code; Zhang et al. (2023d) did so to improve SQL query generation.
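A sketch of such an error-feedback loop: execute the generated code and, on failure, re-prompt with the traceback. Here `call_llm` is a placeholder for any chat-completion client, the `answer` variable convention is assumed, and the turn cap mirrors the five-step limit reported by Liu et al. (2023e):

```python
import traceback

def run_with_repair(call_llm, prompt, max_turns=5):
    """Iteratively execute LLM-generated code, feeding errors back until it runs."""
    code = call_llm(prompt)
    for _ in range(max_turns):
        try:
            scope = {}
            exec(code, scope)            # run the generated pandas/SQL-building code
            return scope.get("answer")   # assumes the code stores its result in `answer`
        except Exception:
            err = traceback.format_exc()
            code = call_llm(f"{prompt}\n\nYour code failed with:\n{err}\nFix it.")
    return None
```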

5.3.4 Output evaluation and format

If the QA output is a number or category, F1 or accuracy are common evaluation metrics. For open-ended responses, apart from typical measures like ROUGE and BLEU, some papers also hire annotators to rate the informativeness, coherence, and fluency of the LLM responses (Zhang et al., 2023g). When connected to programs like Python or Power BI, LLM outputs are not limited to text and code; for example, creating visualizations from text and table inputs is a popular task too (Zhang et al., 2023g; Zha et al., 2023).

6 Limitations and future directions

LLMs have already been used in many tabular data applications, such as prediction, data synthesis, question answering, and table understanding. Here we outline some practical limitations and considerations for future research.

Bias and fairness LLMs tend to inherit social biases from their training data, which significantly impact their fairness in tabular prediction and question answering tasks. Liu et al. (2023f) use GPT-3.5 with few-shot learning to evaluate the fairness of tabular prediction under in-context learning, and conclude that LLMs inherit social biases from their training data that significantly impact fairness in tabular prediction; the fairness-metric gap between subgroups remains larger than in traditional machine learning models. The research further reveals that flipping the labels of the in-context examples significantly narrows the fairness gap across subgroups, but at the expected cost of a reduction in predictive performance. The inherent bias of LLMs is hard to mitigate through prompting (Hegselmann et al., 2023); thus, promising approaches to mitigate bias are pre-processing (Shah et al., 2020) or optimization (Bassi et al., 2024).

Hallucination LLMs risk producing content that is inconsistent with real-world facts or user inputs (Huang et al., 2023), raising concerns over their reliability and usefulness in real-world applications; when working with patient records and medical data, for example, hallucinations have critical consequences. Akhtar et al. (2023) found that hallucination led to performance drops in reasoning for LLMs. To address these issues, Wang et al. (2023c) incorporated an audit module that utilizes LLMs to perform self-checks and self-correction. They generated pseudo-labels, then used a data audit module that filters the data based on data Shapley scores, yielding a smaller but cleaner dataset. They also removed any cells with False values, eliminating the chance of the LLM drawing false inferences from invalid values. Finally, they performed a sanity check via LLM reflection: they queried the LLM with the input template “What is the {column}? {x}” to check whether the answer matches the original values; if not, the descriptions are corrected by re-prompting the LLM. However, this method is far from efficient, and better ways to deal with hallucination could make LLMs’ application to tabular data modeling more practical.

Numerical representation It has been shown that LLMs’ built-in embeddings are not suitable for representing the intrinsic relations of numerical features (Gruver et al., 2023), so specialized embeddings are needed. Tokenization significantly impacts pattern formation and operations in language models. Traditional methods like Byte Pair Encoding (BPE), used in GPT-3, often split numbers into non-aligned tokens (e.g., 42235630 into [422, 35, 630]), complicating arithmetic; newer models like LLaMA tokenize each digit separately. Both approaches make it difficult for the LLM to understand the number as a whole. Also, as noted by Spathis & Kawsar (2023), the tokenization of integers lacks a coherent decimal representation, leading to a fragmented approach in which even basic mathematical operations require memorization rather than algorithmic processing. The development of new tokenizers, like those in LLaMA (Touvron et al., 2023b), which outperformed GPT-4 on arithmetic tasks, involves rethinking tokenizer design to handle mixed textual and numerical data more effectively, for example by splitting each digit into an individual token for consistent number tokenization (Gruver et al., 2023). This method has shown promise in improving the understanding of symbolic and numerical data; however, it hugely increases the input length, which makes it impractical for large datasets with many features.
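The fragmentation is easy to see with a BPE tokenizer; a quick sketch using the tiktoken package (the exact split can vary by vocabulary, so run it to verify on your version):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")
tokens = [enc.decode([t]) for t in enc.encode("42235630")]
print(tokens)  # e.g. ['422', '35', '630'] -- digit groups that do not align by place value
```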

Categorical representation Tabular datasets very often contain an excessive number of columns, which can lead to serialized input strings surpassing the language model’s context limit and to increased cost. This is problematic because parts of the data must then be sampled or truncated away, negatively impacting the model’s performance. Additionally, there are issues with poorly represented categorical features, such as nonsensical characters, which the model struggles to process and understand. Another concern is inadequate or ambiguous metadata: unclear or meaningless column names confuse the model’s interpretation of its inputs. Better categorical feature encoding is needed to solve these problems.

Standard benchmark LLMs for tabular data could greatly benefit from standardized benchmark datasets to enable fair and transparent comparisons between models. In this survey, we strive to summarize commonly used datasets/metrics and provide recommendations for dataset selection to researchers and practitioners. However, the heterogeneity in tasks and datasets remains a significant challenge, hindering fair comparisons of model performance. Therefore, there is a pressing need for more standardized and unified datasets to bridge this gap effectively.

Model interpretability Like many deep learning algorithms, LLM output suffers from a lack of interpretability; only a few systems expose a justification for their model output, such as TabLLM (Hegselmann et al., 2023). One direction is to use Shapley values to derive interpretations: Shapley values have been used to evaluate prompts for LLMs (Liu et al., 2023a) and could also help explain how each feature influences the result. For instance, in disease prediction, providing an explanation is crucial; a basic Shapley explanation would show the features that led to the final decision. Future research is needed to explore the mechanisms behind LLMs’ emerging capabilities for tabular data understanding.
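A basic sketch of such an explanation with the shap package, treating the (serialized-input) predictor as a black-box function; the function and variable names are illustrative:

```python
import shap

def explain_predictions(predict_fn, X_background, X_explain):
    """Estimate per-feature Shapley values for a black-box tabular predictor."""
    explainer = shap.Explainer(predict_fn, X_background)  # model-agnostic by default
    shap_values = explainer(X_explain)
    shap.plots.bar(shap_values)  # global view: which features drive the decisions
    return shap_values
```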

Easy to use Currently, most relevant models require fine-tuning or data serialization, which makes them hard to access. Pretrained models such as that of Wang et al. (2023c) are easier to use out of the box. Access would be much easier if these models, together with automatic data preprocessing and serialization, were integrated into existing platforms such as Hugging Face.

Fine-tuning strategy design Designing appropriate tasks and learning strategies for LLMs is crucial. While LLMs demonstrate emergent abilities such as in-context learning, instruction following, and step-by-step reasoning, these capabilities may not be fully evident in certain tasks, depending on the model used. LLMs are also sensitive to the various serialization and prompt engineering methods, which are the primary ways to adapt an LLM to unseen tasks. Thus, researchers and practitioners need to carefully design tasks and learning strategies tailored to specific models to achieve optimal performance.

Model grafting The performance of LLM for tabular data modeling could be improved through model grafting. Model grafting involves mapping non-text data into the same token embedding space as text using specialized encoders, as exemplified by the HeLM model (Belyaeva et al., 2023), which integrates spirogram sequences and demographic data with text tokens. This approach is efficient and allows integration with high-performing models from various domains but adds complexity due to its non-end-to-end training nature and results in communication between components that is not human-readable. This approach could be adapted to LLM for tabular data to improve the encoding of non-text data.

7 Conclusion

This survey represents the first comprehensive investigation into the utilization of LLMs for modeling heterogeneous tabular data across various tasks, including prediction, data synthesis, question answering, and table understanding. We delve into the essential steps required for tabular data to be ingested by an LLM, covering serialization, table manipulation, and prompt engineering. Additionally, we systematically compare datasets, methodologies, metrics, and models for each task, emphasizing the principal challenges and recent advancements in understanding, inferring, and generating tabular data. We provide recommendations for dataset and model selection tailored to specific tasks, aimed at aiding both ML researchers and practitioners in selecting appropriate solutions for tabular data modeling using different LLMs. Moreover, we examine the limitations of current approaches, such as susceptibility to hallucination, fairness concerns, data preprocessing intricacies, and result interpretability challenges. In light of these limitations, we discuss directions that warrant further exploration in future research endeavors.

With the rapid development of LLMs and their impressive emergent capabilities, there is a growing demand for new ideas and research to explore their potential in modeling structured data for a variety of tasks. Through this comprehensive review, we hope to provide interested readers with pertinent references and insightful perspectives, empowering them with the necessary tools and knowledge to effectively navigate and address the prevailing challenges in the field.
