
Does CaffeOnSpark Support a Fault Recovery Strategy?

by Leatha Covert (2020-09-09)


Does CaffeOnSpark support a fault recovery strategy?

My command is:

export SPARK_WORKER_INSTANCES=2

export DEVICES=1

spark-submit --master yarn --deploy-mode cluster \
    --num-executors $SPARK_WORKER_INSTANCES \
    --files ./data/cifar10_quick_solver.prototxt,./data/cifar10_quick_train_test.prototxt,./data/mean.binaryproto \
    --conf spark.driver.extraLibraryPath="$LD_LIBRARY_PATH" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="$LD_LIBRARY_PATH" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    $CAFFE_ON_SPARK/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -features accuracy,loss \
    -label label \
    -conf cifar10_quick_solver.prototxt \
    -devices $DEVICES \
    -connection ethernet \
    -model result/cifar10.model.h5 \
    -output result/cifar10_features_result
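For what it's worth, plain Caffe's usual recovery mechanism is solver snapshotting, and the solver prototxt shipped with the CIFAR-10 example accepts snapshot settings. As a hedged sketch (I have not verified that CaffeOnSpark automatically resumes from these after an executor failure; the field values below are illustrative), one could ask the solver to checkpoint periodically:

```
# Hypothetical additions to cifar10_quick_solver.prototxt:
# write a checkpoint every 500 iterations so training can be
# restarted manually from the latest snapshot after a failure.
snapshot: 500
snapshot_prefix: "result/cifar10_quick"
snapshot_format: HDF5
```

If CaffeOnSpark honors these, a failed run could at least be resubmitted starting from the last saved state rather than from iteration 0.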


Below are the logs:

layer {
  name: "ip2"
  type: "InnerProduct"
  bottom: "ip1"
  top: "ip2"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 10
    weight_filler {
      type: "gaussian"
      std: 0.1
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "ip2"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip2"
  bottom: "label"
  top: "loss"
}

I :52:39. layer_factory.hpp:77] Creating layer data

I :52:39. net.cpp:106] Creating Layer data

I :52:39. net.cpp:411] data -> data

I :52:39. net.cpp:411] data -> label

I :52:39. net.cpp:150] Setting up data

I :52:39. net.cpp:157] Top shape: (307200)

I :52:39. net.cpp:157] Top shape: 100 (100)

I :52:39. net.cpp:165] Memory required for data:

I :52:39. layer_factory.hpp:77] Creating layer label_data_1_split

I :52:39. net.cpp:106] Creating Layer label_data_1_split

I :52:39. net.cpp:454] label_data_1_split <- label

I :52:39. net.cpp:411] label_data_1_split -> label_data_1_split_0

I :52:39. net.cpp:411] label_data_1_split -> label_data_1_split_1

I :52:39. net.cpp:150] Setting up label_data_1_split

I :52:39. net.cpp:157] Top shape: 100 (100)

I :52:39. net.cpp:157] Top shape: 100 (100)

I :52:39. net.cpp:165] Memory required for data:

I :52:39. layer_factory.hpp:77] Creating layer conv1

I :52:39. net.cpp:106] Creating Layer conv1

I :52:39. net.cpp:454] conv1 <- data

I :52:39. net.cpp:411] conv1 -> conv1

I :52:39. net.cpp:150] Setting up conv1

I :52:39. net.cpp:157] Top shape: ( )

I :52:39. net.cpp:165] Memory required for data:

I :52:39. layer_factory.hpp:77] Creating layer pool1

I :52:39. net.cpp:106] Creating Layer pool1

I :52:39. net.cpp:454] pool1 <- conv1

I :52:39. net.cpp:411] pool1 -> pool1

I :52:39. net.cpp:150] Setting up pool1

I :52:39. net.cpp:157] Top shape: (819200)

I :52:39. net.cpp:165] Memory required for data:

I :52:39. layer_factory.hpp:77] Creating layer relu1

I :52:39. net.cpp:106] Creating Layer relu1

I :52:39. net.cpp:454] relu1 <- pool1

I :52:39. net.cpp:397] relu1 -> pool1 (in-place)

I :52:39. net.cpp:150] Setting up relu1

I :52:39. net.cpp:157] Top shape: (819200)

I :52:39. net.cpp:165] Memory required for data:

I :52:39. layer_factory.hpp:77] Creating layer conv2

I :52:39. net.cpp:106] Creating Layer conv2

I :52:39. net.cpp:454] conv2 <- pool1

I :52:39. net.cpp:411] conv2 -> conv2

I :52:39. net.cpp:150] Setting up conv2

I :52:39. net.cpp:157] Top shape: (819200)

I :52:39. net.cpp:165] Memory required for data:

I :52:39. layer_factory.hpp:77] Creating layer relu2

I :52:39. net.cpp:106] Creating Layer relu2

I :52:39. net.cpp:454] relu2 <- conv2

I :52:39. net.cpp:397] relu2 -> conv2 (in-place)

I :52:39. net.cpp:150] Setting up relu2

I :52:39. net.cpp:157] Top shape: (819200)

I :52:39. net.cpp:165] Memory required for data:

I :52:39. layer_factory.hpp:77] Creating layer pool2

I :52:39. net.cpp:106] Creating Layer pool2

I :52:39. net.cpp:454] pool2 <- conv2

I :52:39. net.cpp:411] pool2 -> pool2

I :52:39. net.cpp:150] Setting up pool2

I :52:39. net.cpp:157] Top shape: (204800)

I :52:39. net.cpp:165] Memory required for data:

I :52:39. layer_factory.hpp:77] Creating layer conv3

I :52:39. net.cpp:106] Creating Layer conv3

I :52:39. net.cpp:454] conv3 <- pool2

I :52:39. net.cpp:411] conv3 -> conv3

I :52:39. net.cpp:150] Setting up conv3

I :52:39. net.cpp:157] Top shape: (409600)

I :52:39. net.cpp:165] Memory required for data:

I :52:39. layer_factory.hpp:77] Creating layer relu3

I :52:39. net.cpp:106] Creating Layer relu3

I :52:39. net.cpp:454] relu3 <- conv3

I :52:39. net.cpp:397] relu3 -> conv3 (in-place)

I :52:39. net.cpp:150] Setting up relu3

I :52:39. net.cpp:157] Top shape: (409600)

I :52:39. net.cpp:165] Memory required for data:

I :52:39. layer_factory.hpp:77] Creating layer pool3

I :52:39. net.cpp:106] Creating Layer pool3

I :52:39. net.cpp:454] pool3 <- conv3

I :52:39. net.cpp:411] pool3 -> pool3

I :52:39. net.cpp:150] Setting up pool3

I :52:39. net.cpp:157] Top shape: (102400)

I :52:39. net.cpp:165] Memory required for data:

I :52:39. layer_factory.hpp:77] Creating layer ip1

I :52:39. net.cpp:106] Creating Layer ip1

I :52:39. net.cpp:454] ip1 <- pool3

I :52:39. net.cpp:411] ip1 -> ip1

I :52:39. net.cpp:150] Setting up ip1

I :52:39. net.cpp:157] Top shape: (6400)

I :52:39. net.cpp:165] Memory required for data:

I :52:39. layer_factory.hpp:77] Creating layer ip2

I :52:39. net.cpp:106] Creating Layer ip2

I :52:39. net.cpp:454] ip2 <- ip1

I :52:39. net.cpp:411] ip2 -> ip2

I :52:39. net.cpp:150] Setting up ip2

I :52:39. net.cpp:157] Top shape: (1000)

I :52:39. net.cpp:165] Memory required for data:

I :52:39. layer_factory.hpp:77] Creating layer ip2_ip2_0_split

I :52:39. net.cpp:106] Creating Layer ip2_ip2_0_split

I :52:39. net.cpp:454] ip2_ip2_0_split <- ip2

I :52:39. net.cpp:411] ip2_ip2_0_split -> ip2_ip2_0_split_0

I :52:39. net.cpp:411] ip2_ip2_0_split -> ip2_ip2_0_split_1

I :52:39. net.cpp:150] Setting up ip2_ip2_0_split

I :52:39. net.cpp:157] Top shape: (1000)

I :52:39. net.cpp:157] Top shape: (1000)

I :52:39. net.cpp:165] Memory required for data:

I :52:39. layer_factory.hpp:77] Creating layer accuracy

I :52:39. net.cpp:106] Creating Layer accuracy

I :52:39. net.cpp:454] accuracy <- ip2_ip2_0_split_0

I :52:39. net.cpp:454] accuracy <- label_data_1_split_0

I :52:39. net.cpp:411] accuracy -> accuracy

I :52:39. net.cpp:150] Setting up accuracy

I :52:39. net.cpp:157] Top shape: (1)

I :52:39. net.cpp:165] Memory required for data:

I :52:39. layer_factory.hpp:77] Creating layer loss

I :52:39. net.cpp:106] Creating Layer loss

I :52:39. net.cpp:454] loss <- ip2_ip2_0_split_1

I :52:39. net.cpp:454] loss <- label_data_1_split_1

I :52:39. net.cpp:411] loss -> loss

I :52:39. layer_factory.hpp:77] Creating layer loss

I :52:39. net.cpp:150] Setting up loss

I :52:39. net.cpp:157] Top shape: (1)

I :52:39. net.cpp:160] with loss weight 1

I :52:39. net.cpp:165] Memory required for data:

I :52:39. net.cpp:226] loss needs backward computation.

I :52:39. net.cpp:228] accuracy does not need backward computation.

I :52:39. net.cpp:226] ip2_ip2_0_split needs backward computation.

I :52:39. net.cpp:226] ip2 needs backward computation.

I :52:39. net.cpp:226] ip1 needs backward computation.

I :52:39. net.cpp:226] pool3 needs backward computation.

I :52:39. net.cpp:226] relu3 needs backward computation.

I :52:39. net.cpp:226] conv3 needs backward computation.

I :52:39. net.cpp:226] pool2 needs backward computation.

I :52:39. net.cpp:226] relu2 needs backward computation.

I :52:39. net.cpp:226] conv2 needs backward computation.

I :52:39. net.cpp:226] relu1 needs backward computation.

I :52:39. net.cpp:226] pool1 needs backward computation.

I :52:39. net.cpp:226] conv1 needs backward computation.

I :52:39. net.cpp:228] label_data_1_split does not need backward computation.

I :52:39. net.cpp:228] data does not need backward computation.

I :52:39. net.cpp:270] This network produces output accuracy

I :52:39. net.cpp:270] This network produces output loss

I :52:39. net.cpp:283] Network initialization done.

I :52:39. solver.cpp:60] Solver scaffolding done.

I :52:39. socket.cpp:219] Waiting for valid port [0]

I :52:39. socket.cpp:158] Assigned socket server port [55211]

I :52:39. socket.cpp:171] Socket Server ready []

I :52:39. socket.cpp:219] Waiting for valid port [55211]

I :52:39. socket.cpp:227] Valid port found [55211]

I :52:39. CaffeNet.cpp:186] Socket adapter: yuntu2:

I :52:39. CaffeNet.cpp:325] 0-th Socket addr:

I :52:39. CaffeNet.cpp:325] 1-th Socket addr: yuntu2:

I :52:39. JniCaffeNet.cpp:110] 0-th local addr:

I :52:39. JniCaffeNet.cpp:110] 1-th local addr: yuntu2:

16/03/18 10:52:39 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 919 bytes result sent to driver

16/03/18 10:52:40 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 3

16/03/18 10:52:40 INFO executor.Executor: Running task 1.0 in stage 1.0 (TID 3)

16/03/18 10:52:40 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 2

16/03/18 10:52:40 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1582.0 B, free 6.7 KB)

16/03/18 10:52:40 INFO broadcast.TorrentBroadcast: Reading broadcast variable 2 took 17 ms

16/03/18 10:52:40 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.6 KB, free 9.3 KB)

16/03/18 10:52:40 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 1

16/03/18 10:52:40 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 87.0 B, free 9.4 KB)

16/03/18 10:52:40 INFO broadcast.TorrentBroadcast: Reading broadcast variable 1 took 15 ms

16/03/18 10:52:40 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 344.0 B, free 9.7 KB)

I :52:40. common.cpp:61] 0-th string is NULL

I :52:40. socket.cpp:250] Trying to connect with ...[yuntu1:54029]

I :52:40. socket.cpp:309] Connected to server [yuntu1:54029] with client_fd [282]

I :52:40. socket.cpp:184] Accepted the connection from client [yuntu1]

I :52:50. parallel.cpp:392] GPUs pairs

I :52:50. MemoryInputAdapter.cpp:15] MemoryInputAdapter is used

I :52:50. data_transformer.cpp:25] Loading mean file from: /home/atlas/work/caffe_spark/CaffeOnSpark-master/data/mean.binaryproto

16/03/18 10:52:50 INFO executor.Executor: Finished task 1.0 in stage 1.0 (TID 3). 899 bytes result sent to driver

16/03/18 10:52:51 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 4

16/03/18 10:52:51 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 4)

16/03/18 10:52:51 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3

16/03/18 10:52:51 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 1412.0 B, free 11.1 KB)

16/03/18 10:52:51 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 20 ms

16/03/18 10:52:51 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 2.2 KB, free 13.3 KB)

16/03/18 10:52:51 INFO spark.CacheManager: Partition rdd_5_0 not found, computing it

16/03/18 10:52:51 INFO spark.CacheManager: Partition rdd_0_0 not found, computing it

16/03/18 10:52:51 INFO caffe.LmdbRDD: Processing partition 0

16/03/18 10:52:53 INFO caffe.LmdbRDD: Completed partition 0

16/03/18 10:52:53 INFO storage.BlockManager: Found block rdd_0_0 locally

16/03/18 10:52:53 INFO storage.MemoryStore: Block rdd_5_0 stored as values in memory (estimated size 40.0 B, free 13.3 KB)

16/03/18 10:52:53 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 4). 1549 bytes result sent to driver

16/03/18 10:52:53 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 7

16/03/18 10:52:53 INFO executor.Executor: Running task 0.0 in stage 3.0 (TID 7)

16/03/18 10:52:53 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 4

16/03/18 10:52:53 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 1411.0 B, free 14.7 KB)

16/03/18 10:52:53 INFO broadcast.TorrentBroadcast: Reading broadcast variable 4 took 18 ms

16/03/18 10:52:53 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.2 KB, free 16.8 KB)

16/03/18 10:52:53 INFO storage.BlockManager: Found block rdd_5_0 locally

16/03/18 10:52:53 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 7). 2003 bytes result sent to driver

16/03/18 10:52:53 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 8

16/03/18 10:52:53 INFO executor.Executor: Running task 0.0 in stage 4.0 (TID 8)

16/03/18 10:52:53 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 5

16/03/18 10:52:53 INFO storage.MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 1380.0 B, free 18.2 KB)

16/03/18 10:52:53 INFO broadcast.TorrentBroadcast: Reading broadcast variable 5 took 17 ms

16/03/18 10:52:53 INFO storage.MemoryStore: Block broadcast_5 stored as values in memory (estimated size 2.1 KB, free 20.2 KB)

16/03/18 10:52:53 INFO storage.BlockManager: Found block rdd_0_0 locally

I :52:54. solver.cpp:237] Iteration 0, loss = 2.

I :52:54. solver.cpp:253] Train net output #0: loss = 2.30203 (* 1 = 2.30203 loss)

I :52:54. sgd_solver.cpp:106] Iteration 0, lr = 0.001

E :53:01. socket.cpp:61] ERROR: Read partial messageheader [4 of 12]

16/03/18 11:22:15 INFO storage.BlockManager: Removing RDD 5

16/03/21 08:26:08 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM

16/03/21 08:26:08 INFO storage.DiskBlockManager: Shutdown hook called

16/03/21 08:26:08 INFO util.ShutdownHookManager: Shutdown hook called
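The `Read partial messageheader` socket error followed by the SIGTERM suggests YARN tore the executor down rather than the job recovering by itself. As a hedged sketch (these are generic Spark-on-YARN retry settings, not anything CaffeOnSpark-specific, and a synchronized multi-node training step may still abort when a peer disappears), one could at least allow task and application retries when resubmitting:

```
# Generic Spark/YARN retry knobs. Assumption: these only help if the
# job can be re-run from scratch or from a saved Caffe snapshot; they
# do not make an in-flight training step survive a lost executor.
spark-submit --master yarn --deploy-mode cluster \
    --conf spark.task.maxFailures=4 \
    --conf spark.yarn.maxAppAttempts=2 \
    ...
```

(The `...` stands for the rest of the original command above.)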
