Action localization using super-voxel segmentation

We partition the video volume into a set \(C^N\) of non-overlapping regions using GBH segmentation. The segmentation is based on appearance and motion similarity between local regions. Each segment \(c_i \in C^N\) is a point cloud of arbitrary shape and size, \(x_i=\{x^0_i, x^1_i, \ldots, x^P_i\}\), in the video volume space \(\mathbb{R}^3\).

The practical challenge is to represent a segment \(c_i\) efficiently without compromising on memory or accuracy, because it is difficult to fit a regular structure such as a 3D bounding box or an ellipsoid to such a region. Our solution is to divide the video into regular cells of size \(m \times m \times m\) and to build the representation on that grid. This reduces the memory load by a factor of \(m^3\), and the cells act as building blocks from which 3D regions of arbitrary shape and size can be constructed.
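The cell representation can be illustrated with a minimal sketch. It assumes voxels are integer `(x, y, t)` coordinates and that a segment is stored as the set of cells it touches; the function name `voxels_to_cells` and the toy segment are hypothetical, and plain Python is used only for clarity.

```python
def voxels_to_cells(points, m):
    """Map integer (x, y, t) voxel coordinates to the set of m*m*m cells they fall in."""
    return {(x // m, y // m, t // m) for (x, y, t) in points}


# Hypothetical segment: a dense 4 x 4 x 2 block of voxels starting at the origin.
segment = [(x, y, t) for x in range(4) for y in range(4) for t in range(2)]

# With m = 2, each cell covers 2 * 2 * 2 = 8 voxels.
cells = voxels_to_cells(segment, m=2)
print(len(segment), len(cells))
```

For this dense block the 32 voxels collapse into 4 cells, which is the \(m^3\) reduction for \(m = 2\). A sparse or irregular segment would save less, since a partially covered cell is still stored as one cell.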

Each region \(c_i\) constitutes a vertex \(v_i\) in the video graph \(G(V,E)\), so the cardinality \(|V|\) equals the number of segmented regions, \(|C|\). The BoVW histogram \(h_i\) of the local features \(f_i \in c_i\) represents the unary potential of vertex \(v_i\).

Each node \(v_i\) is associated with two histogram components, a *foreground histogram* \(h_i^{fg}\) and a *background histogram* \(h_i^{bg}\):

\(h_i^{fg} = \sum_{j \in c_i} bow(f_j)\) - the frequency of quantized local features \(f_j\) extracted inside the region \(c_i\).

\(h_i^{bg} = \sum_{j \notin c_i} bow(f_j)\) - the frequency of quantized local features \(f_j\) extracted outside the region \(c_i\).

Hence, the histogram representation of node \(v_i\) can be defined as:

\[h_i = \|h_i^{fg}\| + \alpha_{bg} \|h_i^{bg}\|\]
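The combination of the two histograms can be sketched as follows. This is an illustration under assumptions, not the original implementation: \(\|\cdot\|\) is read here as L1 normalization, each local feature is assumed to be already quantized to a visual-word index, and the names `bovw_histogram`, `l1_normalize` and `node_histogram` are hypothetical.

```python
def bovw_histogram(word_ids, vocab_size):
    """Count how often each visual word occurs."""
    hist = [0.0] * vocab_size
    for w in word_ids:
        hist[w] += 1.0
    return hist


def l1_normalize(hist):
    """Scale a histogram so its bins sum to 1 (assumed meaning of the norm bars)."""
    total = sum(hist)
    return [v / total for v in hist] if total > 0 else list(hist)


def node_histogram(words, segment_ids, i, vocab_size, alpha_bg):
    """h_i = normalized foreground histogram + alpha_bg * normalized background histogram."""
    fg = bovw_histogram([w for w, s in zip(words, segment_ids) if s == i], vocab_size)
    bg = bovw_histogram([w for w, s in zip(words, segment_ids) if s != i], vocab_size)
    fg, bg = l1_normalize(fg), l1_normalize(bg)
    return [f + alpha_bg * b for f, b in zip(fg, bg)]


# Toy data: six features, their visual-word indices and the segment each one lies in.
words = [0, 0, 1, 2, 2, 2]
segment_ids = [0, 0, 0, 1, 1, 1]
h0 = node_histogram(words, segment_ids, i=0, vocab_size=3, alpha_bg=0.5)
print(h0)
```

With \(\alpha_{bg} = 0\) the node is described by its own features only; larger values mix in more of the surrounding context.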

\(mAP\) performance for different kernel functions versus the background histogram weight \(\alpha_{bg}\):

| Kernel Type | \(\alpha_{bg} = 0\) | \(\alpha_{bg} = 0.5\) | \(\alpha_{bg} = 0.75\) |
|---|---|---|---|
| Linear | 31.91 % | 56.69 % | 65.29 % |
| Intersection | 35.65 % | 58.98 % | 60.89 % |
| Chi-Square | 39.03 % | 63.08 % | 65.96 % |
| Jensen-Shannon | 39.50 % | 63.53 % | 66.38 % |

Runtime (in minutes) for different kernel functions versus the background histogram weight \(\alpha_{bg}\):

| Kernel Type | \(\alpha_{bg} = 0\) | \(\alpha_{bg} = 0.5\) | \(\alpha_{bg} = 0.75\) |
|---|---|---|---|
| Linear | 2.5 | 22.3 | 15.1 |
| Intersection | 3.3 | 20.3 | 16.8 |
| Chi-Square | 2.4 | 14.8 | 11.2 |
| Jensen-Shannon | 3.0 | 78.7 | 58.8 |
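For reference, the four kernels compared in the tables above can be sketched as below. The text does not give the exact definitions used in the experiments, so these are common textbook forms for non-negative histograms and should be read as assumptions (in particular the additive chi-square form and the base-2 Jensen-Shannon kernel).

```python
import math


def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def intersection_kernel(x, y):
    return sum(min(a, b) for a, b in zip(x, y))


def chi_square_kernel(x, y):
    # Additive chi-square kernel; bins where both entries are zero contribute nothing.
    return sum(2.0 * a * b / (a + b) for a, b in zip(x, y) if a + b > 0)


def jensen_shannon_kernel(x, y):
    # Jensen-Shannon kernel; terms with a zero entry are dropped (0 * log(.) = 0).
    total = 0.0
    for a, b in zip(x, y):
        if a > 0:
            total += 0.5 * a * math.log2((a + b) / a)
        if b > 0:
            total += 0.5 * b * math.log2((a + b) / b)
    return total


# Two toy L1-normalized histograms that overlap in the middle bin only.
p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
for kernel in (linear_kernel, intersection_kernel, chi_square_kernel, jensen_shannon_kernel):
    print(kernel.__name__, kernel(p, q))
```

In these forms the intersection, chi-square and Jensen-Shannon kernels all return 1 when an L1-normalized histogram is compared with itself, which makes their values easier to compare than those of the linear kernel.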

The edge set \(E\) governs the relationships between the segmented regions \(C = \{c_0, c_1, \ldots, c_N\}\), i.e. the vertices \(V\) of the video graph \(G(V,E)\). In essence, \(e_{ij} \in E\) should reflect the likelihood that vertices \(v_i\) and \(v_j\) belong to the same action category, i.e. \(\mathrm{P_{ij}}(l_i = l_j)\), where \(l_i, l_j \in L\).

A video sequence is partitioned into a set of non-overlapping supervoxel regions \(S = \{s_0, s_1, \ldots, s_N\}\). The supervoxels are mapped onto a graph: \(f_{map}: S \mapsto G(V,E)\).

Each node \(v_i \in V\) has the following attributes:

\(size_i\): the total number of cells making up the region \(s_i\), \(\mathbb{R}\).

\(min_i\): the minimum corner of the 3D bounding box in cell coordinates, \(\mathbb{R}^3\).

\(max_i\): the maximum corner of the 3D bounding box in cell coordinates, \(\mathbb{R}^3\).

\(mean_i\): the mean cell location, \(\mathbb{R}^3\).

\(hist_i\): the BoW histogram over the local descriptors \(f_i \in s_i\), \(\mathbb{R}^{4k \times 5}\) (5 different descriptors, each with a dictionary size of \(4k\)).

\(sparsity_i\): the number of non-zero bins in the BoW histogram \(hist_i\).

\(hist^c_i\): the concatenated color histogram of region \(s_i\) over 3 channels (RGB), \(\mathbb{R}^{10 \times 3}\).

\(hist^g_i\): the orientation histogram of region \(s_i\) with 50 directional bins, \(\mathbb{R}^{50}\).

\(l_i\): the action label of region \(s_i\), \(l_i \in L\).
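The geometric attributes and the sparsity count in the list above follow directly from a region's cell set and its histogram. A minimal sketch, assuming cells are integer `(x, y, t)` tuples and using a hypothetical helper `node_attributes` (the color and orientation histograms and the label are left out):

```python
def node_attributes(cells, hist):
    """Compute size, bounding box, mean location and histogram sparsity for one region."""
    cells = list(cells)
    n = len(cells)
    return {
        "size": n,
        "min": tuple(min(c[d] for c in cells) for d in range(3)),
        "max": tuple(max(c[d] for c in cells) for d in range(3)),
        "mean": tuple(sum(c[d] for c in cells) / n for d in range(3)),
        "sparsity": sum(1 for v in hist if v != 0),
    }


# Toy region of four cells and a five-bin histogram.
cells = {(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 1)}
hist = [0, 3, 0, 1, 2]
attrs = node_attributes(cells, hist)
print(attrs)
```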

Each edge \(e_{ij} \in E\) between nodes \(v_i\) and \(v_j\) has the following attribute:

\(intsize_{ij}\): the number of mutually adjacent cells between supervoxels \(s_i\) and \(s_j\), \(\mathbb{R}^{N \times N}\) (a \(27\)-neighborhood is used to compute this attribute).
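One way to compute this attribute is sketched below. The text does not say exactly what is counted, so the sketch makes an assumption: it counts pairs of cells, one from each supervoxel, that lie within each other's 27-neighborhood (all offsets in \(\{-1, 0, 1\}^3\), including the zero offset, so a cell shared by both supervoxels also counts). The function name `intsize` and the toy cell sets are hypothetical.

```python
from itertools import product

# All 27 offsets of the 3 x 3 x 3 neighborhood, including (0, 0, 0).
OFFSETS = list(product((-1, 0, 1), repeat=3))


def intsize(cells_i, cells_j):
    """Count pairs (a, b) with a in cells_i, b in cells_j and b in the 27-neighborhood of a."""
    cells_j = set(cells_j)
    return sum(
        (x + dx, y + dy, t + dt) in cells_j
        for (x, y, t) in cells_i
        for (dx, dy, dt) in OFFSETS
    )


# Toy supervoxels: s_i touches s_j along the x-axis; (5, 5, 5) is far away.
s_i = {(0, 0, 0), (1, 0, 0)}
s_j = {(2, 0, 0), (2, 1, 0), (5, 5, 5)}
print(intsize(s_i, s_j))
```

Under this definition the count is symmetric, so only the upper triangle of the \(N \times N\) matrix needs to be computed.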

The SVM classifier is trained at the node level, where each training instance is the histogram vector \(hist_i \in \mathbb{R}^{20k}\) of node \(v_i\) together with its corresponding label \(l_i\). We use the chi-square kernel and a one-vs-rest training strategy. For a given test node \(v_j\), the trained classifier returns the label likelihood \(Pr(l|hist_j)\).
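As a sketch of what this amounts to, one common (additive) form of the chi-square kernel and the usual one-vs-rest decision rule are given below. Both are assumptions for illustration: the text does not state which chi-square variant was used, nor how a final label is chosen from the likelihoods.

\[K_{\chi^2}(hist_i, hist_j) = \sum_{k} \frac{2\, hist_i[k]\, hist_j[k]}{hist_i[k] + hist_j[k]}, \qquad \hat{l}_j = \arg\max_{l \in L} Pr(l|hist_j)\]

Here \(k\) runs over the \(20k\) histogram bins, and bins where both entries are zero are skipped.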
