GPU Accelerates AI Model Development and Analytics Utilizing Elasticsearch and Azure AI


April 30, 2021

#nvidia #gpu #azure #azure ml #azure ai #elastic stack #e #AI #Azure #Elasticsearch #MLOps #GPU

Slide overview

NVIDIA #GTC21
AI Model Creation and Analysis: Utilizing Elasticsearch and Azure AI [S32677]
https://gtc21.event.nvidia.com/esearch/search?keyword=S32677

Shotaro Suzuki

@shosuz


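Executive Evangelist, FPT Data & AI Integration, FPT Japan. Chief Digital Advisor and Chief Information Security Advisor, National Printing Bureau (an incorporated administrative agency). Organizer of the AI-Driven Development Study Group. From his time as a Microsoft evangelist, through Dell, Accenture, Elastic, and VMware, to his current role, he has consistently promoted the latest technologies to developers. He advocates GPU cloud technology and AI-driven development. In government, he served as a Government CIO Advisor at the Cabinet Secretariat and as a PM at the Digital Agency before taking on his current concurrent post. Locofy.ai Regional Developer Advocate. Google Cloud Partner All Certifications Engineer 2025.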



Text of each page

GPU-Accelerated AI Model Creation and Analysis: Utilizing Elasticsearch and Azure AI
Shotaro Suzuki, Elastic Technical Product Marketing Manager/Evangelist; Government CIO Advisor, IT Strategy Office, Cabinet Secretariat

Shotaro Suzuki. Twitter: @shosuz. Elastic Technical Product Marketing Manager/Evangelist. Government CIO Advisor, IT Strategy Office, Cabinet Secretariat. Former Microsoft Technical Evangelist.

Agenda
• Challenges of AI solutions
• Solving them with Microsoft AI + Elastic

Challenges of AI solutions

Machine learning knowledge + cloud infrastructure knowledge

Solving them with Microsoft AI + Elastic

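Artificial intelligence on Microsoft Azure: a scalable, reliable cloud platform for AI
• Data storage
• Compute services
• A platform for training, deploying, and managing machine learning models
• A set of services developers can use to build AI solutions
• A cloud-based platform for developing and managing bots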

Python & R SDK; Azure Cloud Services; Compute (Container) / Storage
✓ Data preparation
✓ Model training
✓ Model management
✓ Model deployment and tracking
Best practices for doing machine learning / deep learning on Azure

Build and deploy models, from the individual to the enterprise level: Notebooks, Reproducibility, Automated ML UX, Automation, CPU/GPU/FPGAs, Designer, Deployment, IoT Edge, Re-training

10.

Elastic technology overview
Three solutions: Elastic Enterprise Search, Elastic Observability, Elastic Security
Built on the Elastic Stack: Kibana, Elasticsearch, Beats, Logstash
A wide choice of deployment options:
• Elastic Cloud: SaaS (AWS/Azure/GCP)
• Elastic Cloud Enterprise: IaaS (cloud & on-premises)
• Elastic Cloud on Kubernetes: Kubernetes (cloud & on-premises)

11.

Integration patterns with Azure services (example): seamless connectivity with Beats, Logstash and Azure
Sources: Serverless, Web App, Azure Monitor, VM (log files, Windows event logs, metrics, audit, etc.)
Beats modules: Azure Blob Storage, Azure Event Hub, Azure IoT Hub, HTTP, etc.
Logstash modules and connectors: Azure Blob Storage, Azure Table Storage, Azure Service Bus, Azure Event Hub, Azure IoT Hub, Azure SQL Database, Azure Database for MySQL, Azure Database for PostgreSQL, Azure Database for MariaDB, Application Insights, SaaS
Also shown: Azure Functions, Logic App, Storage, Event Hub, IoT Hub, Database
Destinations: Elasticsearch Service, Elastic Cloud on Kubernetes, Elastic Cloud Enterprise, Elastic Stack

12.

• For every skill level
• Model lifecycle management (MLOps)
• Open and interoperable
• Responsible ML

13.

• Supports every skill level

14.

AutoML in GUI Designer SDK for Code

15.

Full lifecycle management with MLOps

16.

Collaborate
App Developer (CI/CD tools): Build app, Test app, Release app, Monitor app
Data Scientist (Azure Machine Learning): Train model, Validate model, Deploy model, Monitor model, Retrain model
Automation with GitHub Actions or Azure DevOps; audit trail management and model interpretability
Model reproducibility, Model validation, Model deployment, Model retraining

17.

Software development and delivery methods keep evolving: CI/CD, serverless, containers, orchestration, microservices, cloud

18.

65% of organizations use 10 or more kinds of monitoring tools

19.

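Current state: a typical set of observability tools
• Operations, log monitoring (log tools): web logs, app logs, database logs, container logs
• Operations, infrastructure monitoring (metric tools): container metrics, host metrics, database metrics, network metrics, storage metrics
• Development team and operations, service monitoring (APM tools, uptime tools): real user monitoring, transaction performance monitoring, distributed tracing, availability, response time
• Business team (business tools): business KPIs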

20.

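Elastic's approach to observability
Development, operations, and business teams: log data, metric data, APM data, uptime data, business data
All operations-related data is consolidated into one powerful data store: Elasticsearch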

21.

Open, interoperable, and responsible ML

22.

Development tools, languages, frameworks

23.

Model interpretation, privacy protection, deployment control

24.

• Data management for machine learning

25.

Sharing and operational management of important assets such as metrics, data, and models
• Workspace
• Experiment: experiment metrics, parameter values, visualization of model accuracy
• Dataset: data definition management, snapshots
• Model: version management, on-premises deployment
https://docs.microsoft.com/ja-jp/azure/machine-learning/service/how-to-track-experiments

26.

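A visual workflow for building, testing, and deploying machine learning pipelines
• Pipeline construction with intuitive mouse operations
• Feature engineering
• Model training (regression, classification, clustering)
• Inference (real-time & batch inference)
• Custom models and scripts (Python, R)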

27.

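Applications. Azure Portal: operable programmatically or via GUI. Cloud: real-time analytics. High-performance edge: Model in Docker; visual inspection, customer analytics, manufacturing process automation …, autonomous driving … Lightweight edge.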

28.

Workspace https://docs.microsoft.com/ja-jp/azure/machine-learning/service/concept-workspace https://docs.microsoft.com/ja-jp/azure/machine-learning/service/concept-azure-machine-learning-architecture

29.

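• A logical unit for each training task
  – Compute resources
  – Trained models
• Multiple Workspaces
• Azure resources
  – Azure Container Registry: manages models securely as Docker containers
  – Azure Storage: storage for training data, telemetry, model files, etc.
  – Azure Application Insights: model monitoring
  – Azure Key Vault: secure management of information such as credentials for training and inference compute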

30.

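• Model: the file produced by machine learning
  – Models from a variety of machine learning / deep learning frameworks
  – Models created outside the Azure Machine Learning service can also be handled
  – Managed inside the Workspace
• Model Registry
  – Version management by labeling
  – Additional metadata
  – A model that is in use as an Image cannot be deleted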

31.

• 1: Register in the Model Registry
• 2: Register in the Image Registry (Azure Container Registry)
• 3: Deploy the Image
• 4: Monitor the Model

32.

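• Training job for a Python script
  – Managed in the Workspace
  – Execution logs are saved
    • Timestamp, duration, etc.
    • Standard input/output
  – An execution Compute environment is attached
    • Can be changed per job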

33.

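• Training environment
  – An abstraction of Azure Compute
  – Attached per training job
Compute targets are compared on GPU acceleration, automated hyperparameter tuning, automated machine learning, and pipeline friendliness:
  Local computer: GPU acceleration possible ("maybe"), ✓
  Azure Machine Learning compute: ✓ ✓ ✓ ✓
  Remote VM: ✓ ✓ ✓ ✓
  Azure Databricks, Azure Data Lake Analytics, Azure HDInsight: ✓ or ✓* in some columns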

34.

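• Image
  – Model
  – A script that abstracts the Model's inputs and outputs
  – Dependencies for running the Model or scoring
• Two kinds
  – FPGA Image: deployed to FPGA clusters inside Azure
  – Docker Image: deployed to a Docker runtime anywhere
• Image Registry
  – Management of Images created from Models
  – Metadata, search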

35.

Challenges: Sources + Environments + Formats = insecure and fragile, increasing storage costs, difficult to track & audit

36.

• An abstraction of an Azure Storage Account
  – Azure Blob
  – Azure File
• Each Workspace has a default Data Store
  – More can be added
• Controlled from the Python SDK or the Azure CLI

37.

Security

38.

Easy integration
- Filebeat Azure Module
- Metricbeat Azure Module

39.

[Metricbeat] Options for collecting metrics from Azure: any machine running the Metricbeat Azure module against Azure Monitor metrics; Metricbeat on Azure instances; Elasticsearch (Ingest Node, Data Node)

40.

Supported Azure metrics
Metricsets (via Azure Monitor)
- Monitor
- compute_vm
- compute_vm_scaleset
- Storage (Blob, Table, Queue, File)
- Database account
- Container
Azure metrics features
- Aggregations
- Dimensions
- Timegrain
metrics:
- name: "Requests"
  namespace: "Microsoft.ApiManagement/service"
  aggregations: ["Maximum"]
  timegrain: "PT1M"
  dimensions:
  - name: "Hostname"
    value: "apimanagement.azure-api.net"

https://www.elastic.co/guide/en/beats/metricbeat/master/metricbeat-module-azure.html

41.

[Filebeat] Options for collecting logs/events on Azure: Event Hub; Filebeat; Filebeat on VM instances or containers (daemonset); Elasticsearch (Ingest Node, Data Node)

42.

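Azure logs module: activity, sign-in, audit
• Activity logs
  ‒ Provide insight into the operations performed on resources in the subscription
• Sign-in logs
  ‒ Provide information on the usage of managed applications and on user sign-in activity
• Audit logs
  ‒ Provide traceability, through logs, for all changes made by the various features within Azure AD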

43.

44.

45.


46.


47.

DATA SCIENTIST DATA ENGINEER DEVELOPER

48.

Abstraction classes for the major deep learning and machine learning libraries
from azureml.train.estimator import Estimator

script_params = {'--learning-rate': 0.3, '--regularization': 0.8}
est = Estimator(source_directory=script_folder,
                script_params=script_params,
                compute_target=compute_target,
                entry_script='train.py',
                conda_packages=['scikit-learn'])

49.

LightGBM, Horovod
Reference: Train models with Azure Machine Learning using an Estimator
https://docs.microsoft.com/ja-JP/azure/machine-learning/service/how-to-train-ml-models

50.

• Steps for training with the Python SDK

51.

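• Configure the connection to the Workspace
• Configure Compute
• Configure the DataStore
  – Upload the data
• Configure the script runtime (Estimator)
  – Entry point
  – Dependent packages
• Submit the job
• Check the results and save the model
• Deploy the model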

52.

• Logistic regression on the MNIST dataset with scikit-learn
  – MNIST
    • Handwritten digits, 0 to 9
    • 70,000 samples
    • 28x28 pixels
  – Digit classification

53.

Step 1 – Create a workspace
from azureml.core import Workspace

ws = Workspace.create(name='myworkspace',
                      subscription_id='<azure-subscription-id>',
                      resource_group='myresourcegroup',
                      create_resource_group=True,
                      location='eastus2')  # or other supported Azure region
# see workspace details
ws.get_details()

Step 2 – Create an Experiment
from azureml.core import Experiment

experiment_name = 'my-experiment-1'
exp = Experiment(workspace=ws, name=experiment_name)

54.

Step 3 – Create remote compute target
import os
from azureml.core.compute import AmlCompute, ComputeTarget

# choose a name for your cluster, specify min and max nodes
# (environment variables are strings, so the node counts are converted to int)
compute_name = os.environ.get("BATCHAI_CLUSTER_NAME", "cpucluster")
compute_min_nodes = int(os.environ.get("BATCHAI_CLUSTER_MIN_NODES", 0))
compute_max_nodes = int(os.environ.get("BATCHAI_CLUSTER_MAX_NODES", 4))
# This example uses a CPU VM. To use a GPU VM, set the SKU to STANDARD_NC6
vm_size = os.environ.get("BATCHAI_CLUSTER_SKU", "STANDARD_D2_V2")
provisioning_config = AmlCompute.provisioning_configuration(vm_size=vm_size,
                                                            min_nodes=compute_min_nodes,
                                                            max_nodes=compute_max_nodes)
# create the cluster
print('creating a new compute target...')
compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
# You can poll for a minimum number of nodes and for a specific timeout.
# If no min node count is provided, the scale settings of the cluster are used.
compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

55.

Step 4 – Upload data to the cloud
Load the compressed data into numpy arrays. 'load_data' is a custom function.
# Note: while loading, the intensity values (X) are scaled from 0-255 to 0-1 so that the model converges faster.
X_train = load_data('./data/train-images.gz', False) / 255.0
y_train = load_data('./data/train-labels.gz', True).reshape(-1)
X_test = load_data('./data/test-images.gz', False) / 255.0
y_test = load_data('./data/test-labels.gz', True).reshape(-1)
Configure the Data Store. This makes it possible to read from and write to Azure Storage from anywhere.
ds = ws.get_default_datastore()
print(ds.datastore_type, ds.account_name, ds.container_name)
ds.upload(src_dir='./data', target_path='mnist', overwrite=True, show_progress=True)
The preparation for training is now complete.
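The slides do not show the custom `load_data` function. The following is a minimal sketch of what such a loader might look like, assuming the `.gz` files are in the standard MNIST IDX format (a big-endian header with a magic number and an item count, followed by row and column counts for image files). The names `load_idx_gz` and `scale_pixels` are hypothetical, and this version uses only the standard library and returns plain lists, whereas the helper used in the talk evidently returns numpy arrays.

```python
import gzip
import struct


def load_idx_gz(path, is_label):
    """Read a gzip-compressed MNIST IDX file (hypothetical stand-in for load_data).

    Returns a list of labels when is_label is True, otherwise a list of
    flattened images (one list of rows*cols pixel values per image).
    """
    with gzip.open(path, "rb") as gz:
        # Header: 4-byte magic number, then 4-byte item count (big-endian).
        _magic, n_items = struct.unpack(">II", gz.read(8))
        if is_label:
            return list(gz.read(n_items))
        # Image files carry two more header fields: rows and columns.
        n_rows, n_cols = struct.unpack(">II", gz.read(8))
        size = n_rows * n_cols
        raw = gz.read(n_items * size)
        return [list(raw[i * size:(i + 1) * size]) for i in range(n_items)]


def scale_pixels(images):
    """Scale 0-255 intensities to 0-1, mirroring the `/ 255.0` step in Step 4."""
    return [[p / 255.0 for p in img] for img in images]
```

With the files from Step 4 in place, `scale_pixels(load_idx_gz('./data/train-images.gz', False))` would give one list of 784 floats per training image.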

56.

Step 5 – Train a local model
Run a scikit-learn logistic regression training job. It usually finishes within a few minutes.
%%time
import numpy as np
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)
# Next, make predictions using the test set and calculate the accuracy
y_hat = clf.predict(X_test)
print(np.average(y_hat == y_test))
The model's accuracy is displayed (about 0.915).

57.

Step 6 – Train model on remote cluster
Running on a remote compute target requires the following steps:
• 6.1: Create a directory
• 6.2: Create a training script
• 6.3: Create an estimator object
• 6.4: Submit the job
Step 6.1 – Create a directory
import os
script_folder = './sklearn-mnist'
os.makedirs(script_folder, exist_ok=True)

58.

Step 6.2 – Create a Training Script (1/2)
%%writefile $script_folder/train.py
import os
import numpy as np
import joblib
from sklearn.linear_model import LogisticRegression
from azureml.core import Run
from utils import load_data  # custom helper from Step 4; the module name is an assumption

# Load the train and test sets into numpy arrays.
# Note: pixel intensities are scaled to 0-1 (divided by 255.0) so the model converges faster.
# The 'data_folder' variable holds the location of the data files (from the datastore).
reg = 0.8  # regularization rate of the logistic regression model
X_train = load_data(os.path.join(data_folder, 'train-images.gz'), False) / 255.0
X_test = load_data(os.path.join(data_folder, 'test-images.gz'), False) / 255.0
y_train = load_data(os.path.join(data_folder, 'train-labels.gz'), True).reshape(-1)
y_test = load_data(os.path.join(data_folder, 'test-labels.gz'), True).reshape(-1)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape, sep='\n')

# get hold of the current run
run = Run.get_context()

# Train a logistic regression model with regularization rate 'reg'
clf = LogisticRegression(C=1.0/reg, random_state=42)
clf.fit(X_train, y_train)

59.

Step 6.2 – Create a Training Script (2/2)
print('Predict the test set')
y_hat = clf.predict(X_test)
# calculate accuracy on the prediction
acc = np.average(y_hat == y_test)
print('Accuracy is', acc)
run.log('regularization rate', float(reg))
run.log('accuracy', float(acc))
os.makedirs('outputs', exist_ok=True)
# The training script saves the model into a directory named 'outputs'.
# Files saved in the outputs folder are automatically uploaded into the experiment record and the workspace.
joblib.dump(value=clf, filename='outputs/sklearn_mnist_model.pkl')
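The training script refers to `data_folder` and to a regularization rate, and the Estimator in Step 6.3 passes `--data-folder` and `--regularization` as script parameters, but the slides do not show how the script receives them. One plausible way, sketched here with the standard `argparse` module, is to parse them at the top of train.py. The function name `parse_training_args` and the default of 0.8 are assumptions made for illustration; only the two option names come from the slides.

```python
import argparse


def parse_training_args(argv=None):
    """Parse the two options that script_params would pass to train.py."""
    parser = argparse.ArgumentParser(description="MNIST logistic regression training")
    parser.add_argument("--data-folder", dest="data_folder", type=str, required=True,
                        help="mount point or path of the training data")
    parser.add_argument("--regularization", dest="reg", type=float, default=0.8,
                        help="regularization rate (scikit-learn's C is 1 / reg)")
    return parser.parse_args(argv)


# Example invocation with explicit arguments (the path is a placeholder).
args = parse_training_args(["--data-folder", "/mnt/mnist", "--regularization", "0.5"])
print(args.data_folder, 1.0 / args.reg)  # prints: /mnt/mnist 2.0
```

Inside the real script, `parse_training_args()` would be called without arguments so that it reads the command line, after which `args.data_folder` and `args.reg` can stand in for the `data_folder` and regularization values the script uses.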

60.

• Metrics logged during training are saved individually
  – In addition to standard input/output
https://docs.microsoft.com/ja-jp/azure/machine-learning/service/how-to-track-experiments

61.

/output/
• Saved to the default location of the DataStore
model_file_name = 'ridge_{0:.2f}.pkl'.format(alpha)
# save the model in the outputs folder so that it is uploaded automatically
joblib.dump(value=reg, filename=os.path.join('./outputs/', model_file_name))

62.

Step 6.3 – Create an Estimator
The Estimator runs the training job.
from azureml.train.estimator import Estimator

script_params = {'--data-folder': ds.as_mount(), '--regularization': 0.8}
est = Estimator(source_directory=script_folder,
                script_params=script_params,
                compute_target=compute_target,
                entry_script='train.py',
                conda_packages=['scikit-learn'])

Step 6.4 – Submit the job to the cluster for training
run = exp.submit(config=est)
run

63.

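Image creation: A Docker image is built from the parameters specified in the Estimator and registered in the Workspace. This takes about 5 minutes and happens only the first time; from the second run on, the image is loaded from cache. The progress of the build can be checked in the Docker build log.
Scaling: When the training cluster needs more compute resources, they are added automatically. Scaling out usually takes about 5 minutes.
Running: The required scripts are copied to the compute target. The Data Store is then mounted, or the data is copied, and the script file specified in entry_script is executed. While the job runs, stdout is streamed to /logs, so it can be checked during execution.
Post-Processing: The results are written to the ./outputs directory, and each of them can be accessed from the Workspace.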

64.

Step 7 – Monitor a run
Monitor the job status with the Jupyter widget. It updates asynchronously with a delay of about 10-15 seconds.
from azureml.widgets import RunDetails
RunDetails(run).show()
Example of the Azure Machine Learning service widget:

65.

Step 8 – See the results
The model training job runs asynchronously. Here, wait_for_completion is used to wait until it has finished.
run.wait_for_completion(show_output=False)
# now there is a trained model on the remote cluster
print(run.get_metrics())
{'regularization rate': 0.8, 'accuracy': 0.9204}

66.

Step 9 – Register the model
The last step of the training job calls:
joblib.dump(value=clf, filename='outputs/sklearn_mnist_model.pkl')
This writes the file to 'outputs/sklearn_mnist_model.pkl'. The 'outputs' directory is on the virtual machine that ran the job.
• outputs is a special directory: every file in it is copied to the Workspace storage
  • Job run history
  • Model file
  • etc.
# register the model in the workspace
model = run.register_model(model_name='sklearn_mnist',
                           model_path='outputs/sklearn_mnist_model.pkl')
The Model is registered in the Workspace and can then be queried.

67.

Step 9 – Deploy the Model

68.

Step 9.1 – Create the scoring script
Create the scoring script, score.py. A web service requires two functions: init() and run(input_data).
import json
import numpy as np
import joblib
from azureml.core.model import Model

def init():
    global model
    # retrieve the path to the model file using the model name
    model_path = Model.get_model_path('sklearn_mnist')
    model = joblib.load(model_path)

def run(raw_data):
    data = np.array(json.loads(raw_data)['data'])
    # make prediction
    y_hat = model.predict(data)
    return json.dumps(y_hat.tolist())
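The init()/run() contract can be exercised locally before anything is deployed. The sketch below replaces the registered scikit-learn model with a hypothetical `StubModel` (its brightest-pixel rule is invented purely for illustration) so that the JSON-in, JSON-out shape of `run` can be checked without Azure or a trained model.

```python
import json


class StubModel:
    """Hypothetical stand-in for the unpickled scikit-learn classifier."""

    def predict(self, rows):
        # Toy rule: the "digit" is the index of the brightest pixel, modulo 10.
        return [max(range(len(r)), key=r.__getitem__) % 10 for r in rows]


model = None


def init():
    """Called once at service start; the real script loads the registered model here."""
    global model
    model = StubModel()


def run(raw_data):
    """Called per request: takes {"data": [[...], ...]} as JSON, returns a JSON list."""
    data = json.loads(raw_data)["data"]
    return json.dumps(list(model.predict(data)))


init()
print(run(json.dumps({"data": [[0.0, 0.9, 0.1], [0.7, 0.2, 0.1]]})))  # prints: [1, 0]
```

The same request body shape, an object with a "data" key holding a list of rows, is what the deployed endpoint receives in Step 10.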

69.

Step 9.2 – Create environment file
Create the environment file, here myenv.yml. It specifies the packages the script depends on.
It is used when the Docker image is created.
In this sample, scikit-learn and azureml-sdk are specified.
from azureml.core.conda_dependencies import CondaDependencies

myenv = CondaDependencies()
myenv.add_conda_package("scikit-learn")
with open("myenv.yml", "w") as f:
    f.write(myenv.serialize_to_string())

Step 9.3 – Create configuration file
Create the deployment configuration, setting the parameters needed to create the ACI container, such as the number of CPUs and the amount of RAM in GB.
Default values: 1 core, 1 gigabyte of RAM
from azureml.core.webservice import AciWebservice

aciconfig = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1,
                                               tags={"data": "MNIST", "method": "sklearn"},
                                               description='Predict MNIST with sklearn')

70.

Step 9.4 – Deploy the model to ACI
%%time
from azureml.core.webservice import Webservice
from azureml.core.image import ContainerImage

# configure the image
image_config = ContainerImage.image_configuration(execution_script="score.py",
                                                  runtime="python",
                                                  conda_file="myenv.yml")
service = Webservice.deploy_from_model(workspace=ws,
                                       name='sklearn-mnist-svc',
                                       deployment_config=aciconfig,
                                       models=[model],
                                       image_config=image_config)
service.wait_for_deployment(show_output=True)

71.

Step 10 – Test the deployed model
using the HTTP end point
import requests
import json
# send a random row from the test set to score
random_index = np.random.randint(0, len(X_test)-1)
input_data = "{\"data\": [" + str(list(X_test[random_index])) + "]}"
headers = {'Content-Type': 'application/json'}
resp = requests.post(service.scoring_uri, input_data, headers=headers)
print("POST to url", service.scoring_uri)
#print("input data:", input_data)
print("label:", y_test[random_index])
print("prediction:", resp.text)
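The request body above is assembled by string concatenation, which only yields valid JSON if every element happens to print as a plain number. A slightly more robust alternative, sketched here with a hypothetical helper named `build_scoring_payload`, is to convert each value to a Python float and let `json.dumps` do the serialization.

```python
import json


def build_scoring_payload(rows):
    """Serialize feature rows into the {"data": [...]} body that run() expects."""
    return json.dumps({"data": [[float(v) for v in row] for row in rows]})


payload = build_scoring_payload([[0.0, 0.5, 1.0]])
print(payload)  # prints: {"data": [[0.0, 0.5, 1.0]]}
```

In the Step 10 code this would be used as `input_data = build_scoring_payload([X_test[random_index]])`, leaving the `requests.post` call unchanged.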

72.

• Monitoring NVIDIA GPU metrics with Elastic Observability

73.

Dependencies (1)
To get NVIDIA GPU metrics flowing, build NVIDIA's GPU monitoring tools from source (Go).
• NVIDIA GPUs are available from many cloud providers, including Microsoft Azure, Google Cloud, and Amazon Web Services (AWS) (the figure shows Genesis Cloud)
• Install NVIDIA Data Center GPU Manager (DCGM) by following the installation section of NVIDIA's DCGM getting started guide for Ubuntu 18.04
• Take particular care to replace the <architecture> parameter with your own
• Use the uname command to find the architecture
uname -a
• The answer shows that x86_64 is the architecture, so step 1 of the getting started guide becomes:
echo "deb http://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64 /" | sudo tee /etc/apt/sources.list.d/cuda.list
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/7fa2af80.pub

74.

Dependencies (2)
• After installation, run the nvidia-smi command to see the GPU details

75.

• Building NVIDIA's gpu-monitoring-tools requires Golang to be installed
cd /tmp
wget https://golang.org/dl/go1.15.7.linux-amd64.tar.gz
sudo mv go1.15.7.linux-amd64.tar.gz /usr/local/
cd /usr/local/
sudo tar -zxf go1.15.7.linux-amd64.tar.gz
sudo rm go1.15.7.linux-amd64.tar.gz
• Install NVIDIA's gpu-monitoring-tools from GitHub to finish the NVIDIA setup
cd /tmp
git clone https://github.com/NVIDIA/gpu-monitoring-tools.git
cd gpu-monitoring-tools/
sudo env "PATH=$PATH:/usr/local/go/bin" make install

76.

• Everything is now ready for installing Metricbeat, so check elastic.co for the latest version of Metricbeat
• Adjust the version number in the commands below
cd /tmp
wget https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-7.10.2-amd64.deb
sudo dpkg -i metricbeat-7.10.2-amd64.deb
• In this case the version number is 7.10.2

77.

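• Get the Elastic Stack up and running
• The new GPU monitoring data needs a home, so create a new deployment on Elastic Cloud
• If you are new to Elastic Cloud, sign up for the 14-day free trial
• You can also set up your own deployment locally
• Create a new Elastic Observability deployment on Elastic Cloud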

78.

• The Metricbeat configuration file is /etc/metricbeat/metricbeat.yml
• Edit the cloud.id and cloud.auth parameters obtained during the setup on the previous page
• Example configuration change:
cloud.id: "staging:dXMtY2VudHJhbDEuZ2NwLmNsb3VkLmVzLmlvJDM4ODZkYmUwMWNjODQ2NDM4YjRlNzg5OWEyZDAwNGM5JDBiMTc0YzYyMTVlYTQwYWQ5M2NmMGY4MjVhNzJmOGRk"
cloud.auth: "elastic:J7KYiDku2wP7DFr62zV4zL4y"
• Metricbeat's input configuration is modular
• NVIDIA gpu-monitoring-tools exposes GPU metrics via Prometheus, so first enable the Prometheus Metricbeat module
sudo metricbeat modules enable prometheus
• Use Metricbeat's test and modules commands to confirm that the Metricbeat configuration succeeded
sudo metricbeat test config
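It can help to know what a `cloud.id` value contains when checking that the right deployment was pasted into metricbeat.yml. As far as I know, an Elastic Cloud ID is a label, a colon, and a base64 string that decodes to a host name and deployment identifiers separated by `$`. The sketch below decodes that structure; the function name `decode_cloud_id` and the sample value are made up for illustration, and the field layout should be treated as an assumption rather than a specification.

```python
import base64


def decode_cloud_id(cloud_id):
    """Split a Cloud ID into its label and the '$'-separated parts it encodes."""
    label, _, encoded = cloud_id.partition(":")
    # Restore any base64 padding that may have been stripped.
    encoded += "=" * (-len(encoded) % 4)
    decoded = base64.b64decode(encoded).decode("utf-8")
    parts = decoded.split("$")
    return {
        "label": label,
        "host": parts[0],
        "elasticsearch_id": parts[1] if len(parts) > 1 else "",
        "kibana_id": parts[2] if len(parts) > 2 else "",
    }


# Synthetic example value, not a real deployment.
example = "demo:" + base64.b64encode(b"region.example.test$es123$kb456").decode("ascii")
print(decode_cloud_id(example))
```

Note that `cloud.auth` is a plain `user:password` pair and is not encoded, so the configuration file should be readable only by the Metricbeat user.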

79.

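sudo metricbeat test output
• If the configuration test does not succeed as in the example on the left, see the Metricbeat troubleshooting guide
• As the last step of the Metricbeat configuration, run the setup command
• It loads some default dashboards and sets up the index mappings
• The setup command usually takes a few minutes
sudo metricbeat setup
sudo metricbeat modules list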

https://www.elastic.co/guide/en/beats/metricbeat/current/troubleshooting.html

80.

• Start NVIDIA's dcgm-exporter
dcgm-exporter --address localhost:9090
# Output
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded
INFO[0000] Pipeline starting
INFO[0000] Starting webserver
• Note: the DCP warning can be ignored
• The dcgm-exporter metrics are defined in the file /etc/dcgm-exporter/defaultcounters.csv; by default, 38 different metrics are defined
• For the complete list of available values, see the DCGM library API reference guide
• In another console, start Metricbeat
sudo metricbeat -e
• Refresh the "metricbeat-*" index pattern in Kibana
• Go to [Stack Management] > [Kibana] > [Index Patterns] and select the metricbeat-* index pattern from the list
• Then click "Refresh field list"
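Before wiring up Metricbeat, it can be useful to confirm what dcgm-exporter actually publishes. The exporter serves metrics in the Prometheus text format, and the sketch below parses such text and keeps only the `DCGM_` samples. The function name `parse_dcgm_metrics` is hypothetical, the sample values are invented, and the parser handles only the simple `name{labels} value` form, so it is an illustration rather than a complete Prometheus parser.

```python
import re

# name, optional {labels}, then the sample value
_SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
_LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_dcgm_metrics(text):
    """Parse Prometheus text-format lines, keeping only DCGM_* samples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = _SAMPLE_RE.match(line)
        if not m or not m.group("name").startswith("DCGM_"):
            continue
        labels = dict(_LABEL_RE.findall(m.group("labels") or ""))
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples


# Invented sample in the shape the exporter is expected to serve.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 41
DCGM_FI_DEV_POWER_USAGE{gpu="0"} 27.5
process_cpu_seconds_total 1.5
"""

for name, labels, value in parse_dcgm_metrics(SAMPLE):
    print(name, labels, value)
```

With the exporter started as shown above, the same function could be fed the body of an HTTP GET against the exporter's metrics endpoint on localhost:9090 (the exact path is an assumption; `/metrics` is the usual Prometheus convention).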

https://docs.nvidia.com/datacenter/dcgm/dcgm-api/group__dcgmFieldIdentifiers.html

81.

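• GPU metrics are now available in Kibana
• The new field names are prefixed with prometheus.metrics.DCGM_
• Check them in Kibana's Discover
• You are now ready to analyze GPU metrics with Elastic Observability
• Metrics Explorer lets you compare GPU and CPU performance
• The Inventory view helps you find hotspots of GPU utilization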

82.

• These are only some of the ways to monitor
• With Elastic Observability you can work toward all of your goals
• A few other GPU metrics that NVIDIA suggests are worth monitoring:
  • GPU temperature: check for hotspots
  • GPU power usage: higher than expected power draw may indicate a hardware problem
  • Current clock speeds: lower than expected speeds may indicate power capping or a hardware problem
• If you need to simulate GPU load, use the dcgmproftester10 command
dcgmproftester10 --no-dcgm-validation -t 1004 -d 30
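The three signals listed above (temperature, power usage, clock speed) translate naturally into simple threshold checks. The sketch below shows the idea on plain dictionaries; the function name `find_gpu_alerts`, the reading keys, and every limit are placeholders chosen for illustration, not values recommended by NVIDIA or Elastic. In practice the same conditions would more likely be expressed as alert rules in Kibana.

```python
def find_gpu_alerts(readings, max_temp_c=85.0, max_power_w=250.0, min_clock_mhz=1000.0):
    """Return warnings for readings outside the given (placeholder) limits.

    `readings` maps a GPU id to a dict with optional keys
    "temp_c", "power_w" and "clock_mhz".
    """
    alerts = []
    for gpu, r in sorted(readings.items()):
        if "temp_c" in r and r["temp_c"] > max_temp_c:
            alerts.append(f"GPU {gpu}: temperature {r['temp_c']} C above {max_temp_c} C")
        if "power_w" in r and r["power_w"] > max_power_w:
            alerts.append(f"GPU {gpu}: power usage {r['power_w']} W above {max_power_w} W")
        if "clock_mhz" in r and r["clock_mhz"] < min_clock_mhz:
            alerts.append(f"GPU {gpu}: clock {r['clock_mhz']} MHz below {min_clock_mhz} MHz")
    return alerts


# Invented readings for two GPUs.
readings = {
    "0": {"temp_c": 90.0, "power_w": 100.0, "clock_mhz": 1500.0},
    "1": {"temp_c": 60.0, "power_w": 300.0, "clock_mhz": 800.0},
}
for alert in find_gpu_alerts(readings):
    print(alert)
```

Here GPU 0 trips the temperature check, while GPU 1 trips both the power and the clock checks.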

https://docs.nvidia.com/datacenter/dcgm/latest/dcgm-user-guide/feature-overview.html

83.

Summary

84.

Summary
• Challenges of AI solutions
• Solving them with Microsoft AI + Elastic

85.

Open source repositories and links
• Azure ML Notebook Examples: official Azure Machine Learning sample code. https://aka.ms/ml-notebooks
• BERT Large: sample code for the BERT natural language model. http://aka.ms/azure-bert
• Microsoft Recommenders: recommendation sample code. http://aka.ms/recommenders
• LightGBM: LightGBM home page. https://aka.ms/lightgbm
• Natural Language Recipes: natural language sample code. https://aka.ms/nlp-recipes
• ONNX: ONNX home page. https://aka.ms/onnx
• ONNX RT: ONNX Runtime home page. https://aka.ms/onnx-rt
• Kubeflow & MLOps: Kubeflow + Azure ML + DevOps sample code. https://aka.ms/kubeflow-and-mlops
• Azure Open Datasets: Azure Open Datasets web page. https://aka.ms/azure-open-datasets
• Azure ML Free Trial: Azure free trial. https://aka.ms/amlfree
• Azure ML Docs: Azure Machine Learning documentation. https://aka.ms/azureml-ja-docs

86.

• Azure ML Studio: https://ml.azure.com
• Demo Notebook: https://aka.ms/ignite2019brk3303democode
• Documentation
  – Datasets
  – Creating data labeling project
  – Labeling data
• Contact for ML assisted labeling: [email protected]

87.

• Microsoft Responsible AI Resource Center
  – https://aka.ms/RAIresources
• Azure Machine Learning
  – https://azure.microsoft.com/en-us/services/machine-learning/
  – https://docs.microsoft.com/en-us/azure/machine-learning/concept-responsible-ml
• OpenDP
  – http://opendp.io/
  – https://twitter.com/opendp_io
• WhiteNoise
  – https://github.com/opendifferentialprivacy
  – https://docs.microsoft.com/azure/machine-learning/concept-differential-privacy
  – https://docs.microsoft.com/azure/machine-learning/how-to-differential-privacy
  – https://aka.ms/WhiteNoiseWhitePaper
• SEAL
  – https://github.com/Microsoft/SEAL
  – https://docs.microsoft.com/azure/machine-learning/how-to-homomorphic-encryption-seal
  – https://aka.ms/SEALonAML

88.

Elastic resources
• Official documentation
  – https://www.elastic.co/guide/index.html
• Elasticsearch.Net & NEST documentation
  – https://www.elastic.co/guide/en/elasticsearch/client/net-api/current/index.html
• Elastic customer stories
  – https://www.elastic.co/jp/customers/

89.

Application development on-demand webinar series
• Coding Elastic's Search API in Visual Studio Code (1) - (3)
• Visualizing the various logs, metrics, and APM data of Azure Kubernetes Services with Elastic Cloud
• Collecting and analyzing ASP.NET Core 3.x web app logs with Elastic Cloud
https://www.microsoft.com/ja-jp/events/top/apps-innovation-webinars.aspx

90.

.NET lab 2021.5 https://dotnetlab.connpass.com/event/208867/ Session title and abstract: TBD

91.

Google Cloud Day Digital 2021 https://cloudonair.withgoogle.com/events/google-cloud-day-digital-21?talk=d2-gl-27 An overview of Elastic APM in the migration to cloud native

92.

Thank you for your attention!
