
The YOLO network has two components as do most networks:
The paper’s author explains that they used GoogLeNet (inception) inspired architecture for their feature extractor, that was trained on PASCAL VOC data-set prior to making it part of the object detection network. We can skip this step and use a pre-trained network, that performs well on classification tasks. I’ve chosen ResNet for this purpose.
I then add two dense/fully connected layers to the feature extractor’s output that has random weight initialization and produces an output with the desired dimensions.

