# End-to-end Learning of Action Detection from Frame Glimpses in Videos

This paper proposes a model that takes a long video as input and outputs the temporal bounds of detected action instances. The key intuition is that the process of detecting an action is one of continuous, iterative observation and refinement.

### Method

The goal is to take a long video as input and output the temporal bounds of every instance of a given action. The authors formulate this as a reinforcement-learning problem in which an agent interacts with the video over time.

![](/files/-MEzM36iPu-vEAAWMyzm)

The model consists of two main components: an observation network and a recurrent network.

* **Observation network**: encodes a single observed frame into a feature vector $$O\_n$$ and provides it as input to the recurrent network. $$O\_n$$ encodes both where in the video the observation was taken (its timestamp) and what was seen.
* **Recurrent network**: as the agent reasons about the video, it produces three outputs at each timestep: a candidate detection $$d\_n$$, a binary indicator $$p\_n$$ signaling whether to emit $$d\_n$$ as a prediction, and a temporal location $$l\_{n+1}$$ indicating which frame to observe next (note that the agent may skip both forward and backward through the video). A candidate detection is a tuple of (start time, end time, confidence).
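The per-timestep loop above can be sketched as follows. This is a minimal illustration, not the paper's architecture: the random-projection "encoder", the network dimensions, and all weight matrices are placeholders standing in for the trained CNN and RNN parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME_DIM, FEAT_DIM, HID_DIM = 32, 8, 16

# Randomly initialized weights stand in for trained parameters (assumption).
W_obs = rng.standard_normal((FRAME_DIM, FEAT_DIM)) * 0.1
W_h   = rng.standard_normal((HID_DIM, HID_DIM)) * 0.1
W_o   = rng.standard_normal((HID_DIM, FEAT_DIM + 1)) * 0.1
W_d   = rng.standard_normal((3, HID_DIM)) * 0.1
w_p   = rng.standard_normal(HID_DIM) * 0.1
w_l   = rng.standard_normal(HID_DIM) * 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def observation_network(frame, loc):
    """O_n encodes both what was seen (frame features) and where (timestamp)."""
    return np.concatenate([frame @ W_obs, [loc]])

def agent_step(o_n, h_prev):
    """One recurrent step producing the three outputs described above."""
    h_n = np.tanh(W_h @ h_prev + W_o @ o_n)   # recurrent state update
    d_n = sigmoid(W_d @ h_n)                  # (start, end, confidence) in [0, 1]
    p_n = sigmoid(w_p @ h_n) > 0.5            # emit d_n as a prediction?
    l_next = sigmoid(w_l @ h_n)               # normalized location of next glimpse
    return h_n, d_n, p_n, l_next

# Run the agent for a few glimpses over a synthetic "video".
video = rng.standard_normal((100, FRAME_DIM))
h, loc = np.zeros(HID_DIM), 0.0
detections = []
for _ in range(6):
    frame = video[int(loc * (len(video) - 1))]
    h, d, p, loc = agent_step(observation_network(frame, loc), h)
    if p:
        detections.append(d)  # emitted candidate (start, end, confidence)
```

Because `l_next` is produced by the network rather than incremented, the agent is free to jump to any point in the video, forward or backward, which is what lets it observe only a small fraction of the frames.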

The candidate detection $$d\_n$$ is trained with standard backpropagation, while the non-differentiable decisions $$p\_n$$ and $$l\_{n+1}$$ are trained with REINFORCE.
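REINFORCE is needed because $$p\_n$$ and $$l\_{n+1}$$ are sampled decisions, so no gradient flows through them; instead the score-function estimator weights the gradient of the log-probability of each sampled action by the reward it led to. A toy sketch on a single Bernoulli decision (analogous to the emit indicator $$p\_n$$, with a made-up reward of 1 for choosing action 1):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy policy: one Bernoulli action with emission probability sigmoid(theta).
# Reward is 1 when the sampled action is 1, so the optimal policy
# drives the probability toward 1.
theta, lr = 0.0, 0.5
for _ in range(200):
    grads, rewards = [], []
    for _ in range(16):                  # Monte Carlo rollouts
        p = sigmoid(theta)
        a = float(rng.random() < p)      # sample the binary action
        r = a                            # reward for this rollout (assumed)
        grads.append(a - p)              # d/dtheta log Bernoulli(a; p)
        rewards.append(r)
    baseline = np.mean(rewards)          # variance-reducing baseline
    grad = np.mean([g * (r - baseline) for g, r in zip(grads, rewards)])
    theta += lr * grad                   # gradient ascent on expected reward
```

In the paper the reward instead scores the quality of the emitted detections, and the same estimator trains both the emission indicator and the continuous next-glimpse location.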

### Evaluation

The model is evaluated on THUMOS'14 and ActivityNet. Results show that the approach outperforms state-of-the-art methods while observing, in total, only 2% or less of the video frames.

