I have deployed Kubeflow following the Charmed Kubeflow guide. I also enabled GPU support for microk8s and deployed mlmd via juju:
microk8s enable gpu
juju deploy mlmd
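To sanity-check that mlmd is reachable from inside the cluster, I run a quick query from a notebook. This is just my own check, using the same service name and port that I later put into the gRPC metadata config:

from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Connect to the MLMD gRPC service deployed by juju.
client_config = metadata_store_pb2.MetadataStoreClientConfig(
    host='metadata-grpc-service.kubeflow',
    port=8080,
)
store = metadata_store.MetadataStore(client_config)
# Any successful response proves the gRPC endpoint is up.
print(store.get_contexts())

This returns contexts without errors, so the metadata service itself seems healthy.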
I used the TensorFlow-with-CUDA image - kubeflownotebookswg/jupyter-tensorflow-cuda-full:v1.6.0 - for notebooks and local runs, and a custom image hosted on Docker Hub for the KubeflowDagRunner (PIPELINE_IMAGE below).
I also installed TFX inside the notebook container; here are my dependencies:
tensorflow 2.9.2
tensorflow-data-validation 1.10.0
tensorflow-estimator 2.9.0
tensorflow-gpu 2.5.0
tensorflow-hub 0.12.0
tensorflow-io-gcs-filesystem 0.27.0
tensorflow-metadata 1.10.0
tensorflow-model-analysis 0.41.1
tensorflow-serving-api 2.9.2
tensorflow-transform 1.10.1
kfp 1.6.3
kfp-pipeline-spec 0.1.16
kfp-server-api 1.6.0
tfx 1.10.0
tfx-bsl 1.10.1
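For completeness, I collected these versions inside the notebook kernel with a trivial check:

import kfp
import tensorflow
import tfx

# Print the installed versions of the relevant packages.
for mod in (tensorflow, tfx, kfp):
    print(mod.__name__, mod.__version__)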
During a local run everything works fine: data ingestion, transformation, and training all succeed, and I can inspect the SQLite metadata.db. But when I use the Kubeflow runner it does not work.
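For reference, the local run is just the stock LocalDagRunner with a SQLite-backed metadata store, roughly like this (the metadata_connection_config argument is how my create_pipeline forwards the config; the path is illustrative):

from tfx import v1 as tfx

def run_local():
    """Run the same pipeline locally against a SQLite metadata.db."""
    tfx.orchestration.LocalDagRunner().run(
        create_pipeline(PIPELINE_NAME, PIPELINE_ROOT,
                        CORTEX_DATA_ROOT, TF_DATA_ROOT,
                        TRANSFORM_FILE, TRAINER_FILE,
                        SERVING_MODEL_DIR,
                        metadata_connection_config=tfx.orchestration.metadata
                        .sqlite_metadata_connection_config('metadata.db')))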
KubeflowDagRunner creation:
from kfp import onprem
from tfx import v1 as tfx

# PIPELINE_NAME, PIPELINE_IMAGE, DATA_PVC, TFX_PVC, etc. are defined elsewhere.

def run():
    """Define a Kubeflow pipeline."""
    metadata_config = tfx.orchestration.experimental.get_default_kubeflow_metadata_config()
    # I have read https://github.com/tensorflow/tfx/issues/1287 and tried
    # with and without the MySQL config:
    # metadata_config.mysql_db_service_host.value = 'mysql.kubeflow'
    # metadata_config.mysql_db_service_port.value = '3306'
    # metadata_config.mysql_db_name.value = 'metadb'
    # metadata_config.mysql_db_user.value = 'root'
    # metadata_config.mysql_db_password.value = ''
    metadata_config.grpc_config.grpc_service_host.value = 'metadata-grpc-service.kubeflow'
    metadata_config.grpc_config.grpc_service_port.value = '8080'

    output_filename = '{}.yaml'.format(PIPELINE_NAME)
    runner_config = tfx.orchestration.experimental.KubeflowDagRunnerConfig(
        kubeflow_metadata_config=metadata_config,
        tfx_image=PIPELINE_IMAGE,
        pipeline_operator_funcs=[
            onprem.mount_pvc(
                DATA_PVC.pvc_name,     # I am attaching the PV that was used during
                DATA_PVC.volume_name,  # the local run; it was created via the
                DATA_PVC.mount_path,   # Kubeflow web UI.
            ),
            onprem.mount_pvc(
                TFX_PVC.pvc_name,
                TFX_PVC.volume_name,
                TFX_PVC.mount_path,
            ),
        ],
    )
    tfx.orchestration.experimental.KubeflowDagRunner(
        config=runner_config,
        output_dir='./',
        output_filename=output_filename,  # I upload the output config file manually.
    ).run(create_pipeline(PIPELINE_NAME, PIPELINE_ROOT,
                          CORTEX_DATA_ROOT, TF_DATA_ROOT,
                          TRANSFORM_FILE, TRAINER_FILE,
                          SERVING_MODEL_DIR))
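I upload the generated YAML manually through the Pipelines UI; the programmatic equivalent would be something like this (standard kfp 1.x client; host and auth details depend on the Charmed Kubeflow setup, and the run name is made up):

import kfp

client = kfp.Client()  # host/auth omitted; depends on the deployment
client.create_run_from_pipeline_package(
    '{}.yaml'.format(PIPELINE_NAME),
    arguments={},
    run_name='image-has-watermark-debug',  # hypothetical run name
)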
But during the run I hit this exception in the Trainer component:
INFO:absl:MetadataStore with gRPC connection initialized
WARNING:absl:ContextQuery.property_predicate is not supported.
WARNING:absl:ContextQuery.property_predicate is not supported.
WARNING:absl:ContextQuery.property_predicate is not supported.
INFO:root:Adding KFP pod name image-has-watermark-t2jjq-3690395791 to execution
INFO:absl:MetadataStore with gRPC connection initialized
INFO:root:Component Trainer is finished.
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.7/site-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 508, in <module>
main(sys.argv[1:])
File "/opt/conda/lib/python3.7/site-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 504, in main
_dump_ui_metadata(pipeline_node, execution_info, args.metadata_ui_path)
File "/opt/conda/lib/python3.7/site-packages/tfx/orchestration/kubeflow/container_entrypoint.py", line 347, in _dump_ui_metadata
output_model = execution_info.output_dict[name][0]
KeyError: 'model_run'
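In case the pipeline definition matters, the Trainer inside create_pipeline is declared roughly like this (abbreviated sketch; transform and schema_gen are the upstream components, and the step counts are placeholders):

trainer = tfx.components.Trainer(
    module_file=TRAINER_FILE,
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=schema_gen.outputs['schema'],
    train_args=tfx.proto.TrainArgs(num_steps=1000),
    eval_args=tfx.proto.EvalArgs(num_steps=100),
)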
I do not know what I should do. Can anyone help?