Data science in Microsoft Fabric using Visual Studio Code

Name: Visual Studio Code
Author: Microsoft

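You can build and develop data science and data engineering solutions for Microsoft Fabric in VS Code. The Microsoft Fabric extensions for VS Code provide an integrated development experience for working with Fabric items, lakehouses, notebooks, and user data functions.

What is Microsoft Fabric?

Microsoft Fabric is an enterprise-ready, end-to-end data analytics platform. It unifies data movement, data processing, ingestion, transformation, real-time event routing, and report building. It supports these capabilities through integrated services such as Data Engineering, Data Factory, Data Science, Real-Time Intelligence, Data Warehouse, and Databases. Sign up for free and explore Microsoft Fabric for 60 days, no credit card required.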

Diagram that shows what is Microsoft Fabric?

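Prerequisites

Before you get started with the Microsoft Fabric extensions for VS Code, you need:

Visual Studio Code: Install the latest version of VS Code.
Microsoft Fabric account: You need access to a Microsoft Fabric workspace. You can sign up for a free trial to get started.
Python: Install Python 3.8 or later to work with notebooks and user data functions in VS Code.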
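To confirm the Python prerequisite before setting anything else up, a quick interpreter check can help. The following is a minimal sketch; the helper name `meets_minimum` is hypothetical, and the minimum version is simply the 3.8 figure stated above.

```python
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in the prerequisites


def meets_minimum(version_info=None, minimum=MIN_PYTHON):
    """Return True when the given (or running) interpreter is at least `minimum`."""
    current = tuple((version_info or sys.version_info)[:2])
    return current >= minimum


if __name__ == "__main__":
    status = "OK" if meets_minimum() else "too old"
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```

Run it with the same interpreter you plan to select in VS Code, since a machine can have several Python installations.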

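Installation and setup

You can find and install the extensions from the Visual Studio Marketplace or directly in VS Code. Select the Extensions view (⇧⌘X on macOS, Ctrl+Shift+X on Windows and Linux), then search for "Microsoft Fabric".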

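Which extensions to use

Extension	Best for	Key features	Recommended if you...	Documentation
Microsoft Fabric extension	General workspace management, item management, and working with item definitions	- Manage Fabric items (lakehouses, notebooks, pipelines) - Microsoft account sign-in and tenant switching - Unified or grouped item views - Edit Fabric notebooks with IntelliSense - Command Palette integration (`Fabric:` commands)	You want a single extension to manage workspaces, notebooks, and items in Fabric directly from VS Code.	What is the Fabric VS Code extension
Fabric User data functions	Developers building custom transformations and workflows	- Author serverless functions in Fabric - Debug locally with breakpoints - Manage data source connections - Install and manage Python libraries - Deploy functions directly to a Fabric workspace	You need to build automation or data transformation logic and want to debug and deploy it from VS Code.	Develop user data functions in VS Code
Fabric Data Engineering	Data engineers working with large-scale data and Spark	- Explore lakehouses (tables, raw files) - Develop and debug Spark notebooks - Build and test Spark job definitions - Sync notebooks between local VS Code and Fabric - Preview schemas and sample data	You work with Spark, lakehouses, or large-scale data pipelines and want to explore, develop, and debug locally.	Develop Fabric notebooks in VS Code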

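Getting started

After you install the extensions and sign in, you can start working with Fabric workspaces and items. In the Command Palette (⇧⌘P on macOS, Ctrl+Shift+P on Windows and Linux), type "Fabric" to list the commands specific to Microsoft Fabric.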

Diagram that shows all microsoft Fabric commands

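Fabric workspace and items explorer

The Fabric extension provides a seamless way to work with both remote and local Fabric items.

In the Fabric extension, the Fabric Workspaces section lists all items in your remote workspace, organized by type (lakehouses, notebooks, pipelines, and so on).
In the Fabric extension, the Local folder section shows the Fabric item folders opened in VS Code. It mirrors the structure of the item definition for each type of Fabric item you have open. This lets you develop locally and publish your changes to the current workspace or a new one.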

Screenshot that shows how to view your workspaces and items?

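Use user data functions for data science

In the Command Palette (⇧⌘P on macOS, Ctrl+Shift+P on Windows and Linux), type "Fabric: Create Item".
Select your workspace, then select "User data function". Provide a name and select Python as the language.
You are notified that a Python virtual environment needs to be set up. Continue with the setup locally.
Install libraries with pip install, or select the user data function item in the Fabric extension to add libraries. Update the requirements.txt file to specify the dependencies.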
```
fabric-user-data-functions ~= 1.0
pandas == 2.3.1
numpy == 2.3.2
requests == 2.32.5
scikit-learn == 1.2.0
joblib == 1.2.0
```
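An exact version pin in requirements.txt uses `==`; a single `=` is not a valid version specifier and pip rejects the line. The following is a rough sketch of a pre-flight check that flags such lines before you install. The function name `find_invalid_requirements` is hypothetical, and the pattern only covers simple `name <operator> version` entries, not URLs, `-r` includes, or environment markers.

```python
import re

# One version clause: an operator followed by a version string.
_CLAUSE = r"(?:===|==|~=|!=|<=|>=|<|>)\s*[A-Za-z0-9.*+!_-]+"
# Package name, optional extras, then zero or more comma-separated clauses.
_REQUIREMENT = re.compile(
    rf"^[A-Za-z0-9][A-Za-z0-9._-]*(?:\[[A-Za-z0-9._,\s-]+\])?"
    rf"\s*(?:{_CLAUSE}(?:\s*,\s*{_CLAUSE})*)?$"
)


def find_invalid_requirements(text):
    """Return (line_number, line) pairs that do not look like valid simple pins."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if not _REQUIREMENT.match(line):
            problems.append((number, line))
    return problems


if __name__ == "__main__":
    sample = "pandas == 2.3.1\nscikit-learn=1.2.0\n"
    for number, line in find_invalid_requirements(sample):
        print(f"line {number}: check the version specifier in '{line}'")
```

To check a real file, pass its contents, for example `find_invalid_requirements(open("requirements.txt").read())`.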

Open functions_app.py. The following is an example of developing a user data function for data science with scikit-learn.

import datetime
import fabric.functions as fn
import logging

# Import additional libraries
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib

udf = fn.UserDataFunctions()
@udf.function()
def train_churn_model(data: list, targetColumn: str) -> dict:
    '''
    Description: Train a Random Forest model to predict customer churn using pandas and scikit-learn.

    Args:
    - data (list): List of dictionaries containing customer features and churn target
    Example: [{"Age": 25, "Income": 50000, "Churn": 0}, {"Age": 45, "Income": 75000, "Churn": 1}]
    - targetColumn (str): Name of the target column for churn prediction
    Example: "Churn"

    Returns: dict: Model training results including accuracy and feature information
    '''
    # Convert data to DataFrame
    df = pd.DataFrame(data)

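    # Prepare features and target; the target is only in the numeric list
    # when its labels are numeric, so guard the removal
    numeric_features = df.select_dtypes(include=['number']).columns.tolist()
    if targetColumn in numeric_features:
        numeric_features.remove(targetColumn)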

    X = df[numeric_features]
    y = df[targetColumn]

    # Split and scale data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Train model
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train_scaled, y_train)

    # Evaluate and save
    accuracy = accuracy_score(y_test, model.predict(X_test_scaled))
    joblib.dump(model, 'churn_model.pkl')
    joblib.dump(scaler, 'scaler.pkl')

    return {
        'accuracy': float(accuracy),
        'features': numeric_features,
        'message': f'Model trained with {len(X_train)} samples and {accuracy:.2%} accuracy'
    }

@udf.function()
def predict_churn(customer_data: list) -> list:
    '''
    Description: Predict customer churn using trained Random Forest model.

    Args:
    - customer_data (list): List of dictionaries containing customer features for prediction
    Example: [{"Age": 30, "Income": 60000}, {"Age": 55, "Income": 80000}]

    Returns: list: Customer data with churn predictions and probability scores
    '''
    # Load saved model and scaler
    model = joblib.load('churn_model.pkl')
    scaler = joblib.load('scaler.pkl')

    # Convert to DataFrame and scale features
    df = pd.DataFrame(customer_data)
    X_scaled = scaler.transform(df)

    # Make predictions
    predictions = model.predict(X_scaled)
    probabilities = model.predict_proba(X_scaled)[:, 1]

    # Add predictions to a copy of each record so the caller's input is not mutated
    results = [dict(row) for row in customer_data]
    for i, (pred, prob) in enumerate(zip(predictions, probabilities)):
        results[i]['churn_prediction'] = int(pred)
        results[i]['churn_probability'] = float(prob)

    return results

Test your function locally by pressing F5.
In the Fabric extension, under Local folder, select the function and publish it to your workspace.
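When you test locally, it helps to have input shaped like the docstring examples. The sketch below generates synthetic customer records and prepares prediction input whose columns match the `features` list that `train_churn_model` returns; scikit-learn generally expects the same feature columns, in the same order, that the scaler saw during fitting. The helper names, the churn rule, and the JSON body keyed by parameter name are all assumptions made for illustration, not part of the Fabric API.

```python
import json
import random


def make_sample_customers(count, seed=0):
    """Generate synthetic records shaped like the docstring example."""
    rng = random.Random(seed)
    rows = []
    for _ in range(count):
        age = rng.randint(18, 80)
        income = rng.randint(20_000, 120_000)
        # Arbitrary rule plus noise, only so that both classes appear
        churn = int(income < 50_000 or rng.random() < 0.2)
        rows.append({"Age": age, "Income": income, "Churn": churn})
    return rows


def align_to_features(rows, features):
    """Return copies of `rows` holding only `features`, in that order."""
    aligned = []
    for index, row in enumerate(rows):
        missing = [name for name in features if name not in row]
        if missing:
            raise KeyError(f"row {index} is missing {missing}")
        aligned.append({name: row[name] for name in features})
    return aligned


# Training input: keys mirror the parameter names of train_churn_model
# (assumed request shape)
training_rows = make_sample_customers(200, seed=42)
payload = json.dumps({"data": training_rows, "targetColumn": "Churn"})

# Prediction input: extra keys dropped, columns reordered to match training
new_customers = [{"Income": 60000, "Age": 30, "CustomerId": "c-1"}]
print(align_to_features(new_customers, ["Age", "Income"]))
# prints [{'Age': 30, 'Income': 60000}]
```

Passing the aligned rows to `predict_churn` avoids feature-name mismatches when callers send extra fields or a different key order.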

Learn more about how to invoke the function from:

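Use Fabric notebooks for data science

Fabric notebooks are interactive workbooks in Microsoft Fabric for writing and running code, visualizations, and Markdown side by side. Notebooks support multiple languages (Python, Spark, SQL, Scala, and more) and are well suited to data exploration, transformation, and model development in Fabric using your existing data in OneLake.

Example

The following cell reads a CSV with Spark, converts it to pandas, and trains a logistic regression model with scikit-learn. Replace the column names and path with values for your dataset.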

def train_logistic_from_spark(spark, csv_path):
    # Read CSV with Spark, convert to pandas
    sdf = spark.read.option("header", "true").option("inferSchema", "true").csv(csv_path)
    df = sdf.toPandas().dropna()

    # Adjust these to match your dataset
    X = df[['feature1', 'feature2']]
    y = df['label']

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)

    preds = model.predict(X_test)
    return {'accuracy': float(accuracy_score(y_test, preds))}

# Example usage in a Fabric notebook cell
# train_logistic_from_spark(spark, '/path/to/data.csv')
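The function above reports accuracy only. On imbalanced labels, accuracy can look high even when the model never predicts the positive class, so it can be worth inspecting precision and recall as well. The following is a minimal, dependency-free sketch; the helper name `binary_classification_report` is hypothetical, and scikit-learn offers equivalent metrics if you prefer library calls.

```python
def binary_classification_report(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for a binary classifier."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    true_neg = len(pairs) - true_pos - false_pos - false_neg

    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    return {
        "accuracy": (true_pos + true_neg) / len(pairs) if pairs else 0.0,
        # Defined as 0.0 when there is nothing to divide by
        "precision": true_pos / predicted_pos if predicted_pos else 0.0,
        "recall": true_pos / actual_pos if actual_pos else 0.0,
    }


# A model that always predicts 0 on data with 20% positives
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
always_negative = [0] * 10
print(binary_classification_report(labels, always_negative))
# prints {'accuracy': 0.8, 'precision': 0.0, 'recall': 0.0}
```

In the notebook function, you could call this with `y_test` and `preds` converted to lists and include the result in the returned dictionary.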

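For more information, see the Microsoft Fabric Notebooks documentation.

Git integration

Microsoft Fabric supports Git integration for version control and collaboration across data and analytics projects. You can connect a Fabric workspace to a Git repository (primarily Azure DevOps or GitHub), and only supported items are synced. The integration also supports CI/CD workflows, so teams can manage releases efficiently and maintain high-quality analytics environments.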

GIF that shows how to use Git integration with User data functions

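Next steps

Now that you have the Microsoft Fabric extensions set up in VS Code, explore these resources to deepen your knowledge:

Engage with the community and get support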

10/9/2025