#pkineto #masterofinformationscience
Trying UNISAL on lecture videos
- Surprisingly, it’s working too well
- The plan below didn’t work out
- Now, what should we do?
- How about evaluating the importance of lecture videos after using this?
- It would be interesting if we could extract a part of the network like UNISAL.
Meeting on 20200319
- What we want to do: Distinguish between important and unimportant timings in the videos.
- Method
- Focus on Saliency
- Existing research
- Saliency evaluation in images
- Saliency in videos
- Decent enough
- Any issues with lecture videos?
- (Still testing)
- What we’re trying (long-term) (tentative)
- First, aim to evaluate the saliency of lecture videos
- It would be great if we could use that saliency information to evaluate importance
- What we’ve tried
- http://predimportance.mit.edu/ (covers both natural and artificial)
- Still in the process of evaluating the saliency of videos
- Weakly supervised learning
- Trying to transfer the character recognition system, for example
- CAM https://qiita.com/bukei_student/items/698383a7118f95c12cce
- Focus on Saliency
- It seems more practical and interesting to evaluate the high-saliency areas of lecture videos rather than individual frames.
- With Kineto, we can make student annotations lighter in high-saliency areas, for example
- This is actually a good idea
- And then, it would be interesting if we could predict saliency a few seconds later
- Existing research
- Salient Object Detection in the Deep Learning Era: An In-depth Survey
- Is there saliency in slides or posters?
- http://predimportance.mit.edu/
- However, there is no video version of this (haven’t seen any movement)
- Existing video saliency
- https://youtu.be/JNe6A7dszPw
- Focusing on moving objects and temporal changes is not possible with images (obviously)
- But with existing video saliency, it’s difficult to handle artificial things like slides or blackboards
- What to do
- Submit a scene from a lecture video to predimportance.mit.edu
- Submit lecture videos to existing saliency evaluation methods
- What does it mean for a place in a lecture video to have high saliency (difference from general saliency)?
- It’s likely to be different from sports, for example, in terms of saliency persistence
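The "saliency persistence" point above could be made measurable as frame-to-frame correlation of saliency maps: in a lecture, attention should stay put far longer than in sports footage. A minimal numpy sketch on synthetic maps (the function name and the synthetic data are assumptions for illustration):

```python
import numpy as np

def persistence(sal_maps):
    """Mean Pearson correlation between consecutive saliency maps.

    sal_maps: array of shape (T, H, W). Returns a value in [-1, 1];
    higher means saliency stays in the same place longer.
    """
    flat = sal_maps.reshape(len(sal_maps), -1)
    corrs = [np.corrcoef(flat[t], flat[t + 1])[0, 1]
             for t in range(len(flat) - 1)]
    return float(np.mean(corrs))

rng = np.random.default_rng(0)
# "Lecture-like": one map repeated with tiny noise; "sports-like": random maps.
static = np.tile(rng.random((1, 8, 8)), (5, 1, 1)) + 0.01 * rng.random((5, 8, 8))
moving = rng.random((5, 8, 8))
assert persistence(static) > persistence(moving)
```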
- Q. Can’t we use saliency in lecture images?
- A. It depends on the context, so we need videos
- Or find a way to convey contextual information
- I want to be more specific about “context”
- For now, let’s run existing methods on lecture videos and find any issues
- Things to consider
- What about the teacher?
- I want to try something like extracting parts of existing models in lecture videos
- What would be the existing model in lecture videos?
- How fine-grained should the annotation be?
- Pixel-level or bounding box level, for example
- Trade-off with computational cost
- Should we deal with audio?
- In the end, what is importance?
- Let’s start by getting our hands dirty and thinking
- Try the complexity-based approach
- Want to focus more on saliency
- Idea
- Predict a few seconds ahead in lecture videos
- Based on the teacher’s movements and previous annotations
- Video prediction
- Hmm, but it’s a bit tricky
- Various forms of change
- Writing a little bit on the blackboard at a time, or having everything already written, like “try it”
- Different people have different ways of using animations.
20210312
- I haven’t made much progress on the image-processing research because I’ve been busy with final exams, my thesis, and development for the MITOU program.
- The finals are over, and today I finished all the work for the MITOU program.
- From now on, I think I can dedicate more time to my research on image processing.
- In the presentation on March 25th, I plan to focus on presenting the developed product and also make as much progress as possible in my research in the remaining two weeks.
- Product
- Future of the research
- Focus on lecture videos?
- If it’s lecture videos, we can collect them by filming at school every day, right?
- What I want to do: Distinguish between important and unimportant timing in videos
- Distinguish between points where the teacher is providing information and points where there is no significant change
- (I haven’t been able to think about the means or search for previous research yet 💦)
- Voice, intonation
- Previous research
- https://ieeexplore.ieee.org/document/8269997 (Can’t read)
- [[SmartPlayer: User-Centric Video Fast-Forwarding]]
- [[Content-Aware Dynamic Timeline for Video Browsing]]
- The approach seems useful.
- https://dl.acm.org/doi/10.1145/2970930.2970933
- Obtaining information from gestures.
- [[A bottom-up summarization algorithm for videos in the wild]]
20210121
- I couldn’t work on image processing.
- I wrote a description for the general public about /kineto/What is Kineto.
- Instead of “lecture videos,” I think of it as a “shared blackboard.”
- Placing it as an extension of Jamboard and Miro.
20210115 Meeting
- Previous research on Visual Complexity
- Novelty of the research
- The visual complexity of video frames.
- Adaptive Fast Playback-Based Video Skimming Using a Compressed-Domain Visual Complexity Measure
- It’s doing something similar, but it focuses on spatio-temporal complexity (motion).
- Removing the teacher from the video
- It would be good to do it based on line drawings.
- Reason: By binarizing, we can eliminate variations in light, etc.
- To be more precise, both solid square and line square become the same thing, approaching human perception of complexity.
- Labeling process: https://dronebiz.net/tech/opencv/labeling
- Shape recognition?
- Snake
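The labeling process linked above can be prototyped without OpenCV; a minimal 4-connectivity flood-fill labeler on a binary mask (a numpy-only sketch, not the library's implementation):

```python
import numpy as np
from collections import deque

def label_components(mask):
    """4-connected component labeling of a binary mask.

    Returns (labels, count): labels is 0 for background, 1..count for
    each connected region of nonzero pixels.
    """
    labels = np.zeros(mask.shape, dtype=int)
    count = 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue                      # already part of a component
        count += 1
        queue = deque([start])
        labels[start] = count
        while queue:                      # breadth-first flood fill
            y, x = queue.popleft()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                        and mask[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = count
                    queue.append((ny, nx))
    return labels, count

mask = np.array([[1, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 1, 0, 1]])
_, n = label_components(mask)
assert n == 3  # top-left pair, right-column pair, lone pixel
```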
- Slide change detection
- When there are lines on the whiteboard that interfere, simply taking the average pixel value doesn’t capture the changes cleanly.
- It seems better to observe changes in complexity, etc. to eliminate interfering frames.
- But even that alone may not be enough.
- It’s difficult even for simple tasks for humans.
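The idea above — watch a complexity signal rather than the mean pixel value — could be sketched as follows. Edge density stands in for "complexity", and the thresholds and frame sizes are placeholder assumptions, not tuned values:

```python
import numpy as np

def edge_density(frame):
    """Crude complexity proxy: fraction of pixels whose horizontal or
    vertical gradient exceeds a threshold (grayscale frame, 0-255)."""
    gx = np.abs(np.diff(frame.astype(float), axis=1))
    gy = np.abs(np.diff(frame.astype(float), axis=0))
    return (np.mean(gx > 30) + np.mean(gy > 30)) / 2

def slide_changes(frames, jump=0.05):
    """Indices where the complexity signal jumps: a rough
    slide-change detector that ignores uniform lighting shifts."""
    dens = [edge_density(f) for f in frames]
    return [t + 1 for t in range(len(dens) - 1)
            if abs(dens[t + 1] - dens[t]) > jump]

blank = np.full((32, 32), 255, dtype=np.uint8)
busy = blank.copy()
busy[::4, :] = 0                      # "text lines" every 4th row
frames = [blank, blank, busy, busy]   # slide content appears at t = 2
assert slide_changes(frames) == [2]
```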
- I want to confirm my understanding of Visual Complexity Analysis Using Deep Intermediate-Layer Features.
- (I haven’t made much progress, so) What to do next
20210105 Meeting
- What I did
- For now, I want to evaluate the “size” of the changes on the slides/blackboard.
- Detect the timing of slide changes, take the difference before and after the change, and then draw contours.
- Checking for changes every 30 frames.
- The detected parts come out scattered into small fragments.
- Tried adding blur (Gaussian Blur)
- Enclose the characters as text using OCR.
- In terms of ↑, numbers and such may not be that important.
- The parts surrounded by yellow indicate the detected changes.
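The pipeline described above (take the difference before/after a detected change, blur it so the scattered fragments merge, then extract changed regions) might look like this; a box blur stands in for the Gaussian blur, and all sizes and thresholds are assumptions:

```python
import numpy as np

def box_blur(img, k=3):
    """Simple box blur (a stand-in for the Gaussian blur in the notes)."""
    pad = k // 2
    padded = np.pad(img.astype(float), pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def change_mask(frame_a, frame_b, thresh=20):
    """Blurred absolute difference, thresholded to a binary change mask.
    Blurring first merges scattered fragments into larger regions,
    which contour extraction can then outline."""
    diff = np.abs(frame_a.astype(float) - frame_b.astype(float))
    return box_blur(diff) > thresh

before = np.full((16, 16), 255, dtype=np.uint8)   # frame sampled 30 frames earlier
after = before.copy()
after[4:8, 4:12] = 0                              # new writing appears
assert change_mask(before, after).sum() > 0       # the change is detected
assert not change_mask(before, before).any()      # identical frames: no change
```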
- Things I understand:
- Different responses are required for slide changes and animation changes.
- In the case of slides, if the animation is slow, the difference between each frame may be below the threshold.
- It is a problem when images suddenly appear mid-slide.
- ⭐️It is difficult to measure the magnitude of the changes.
- Simply taking the mean of the differences is problematic because of the variation.
- It is heavily influenced by the number of pixels occupied by the object.
- For example, even if a white figure appears on a white background, it cannot be detected because the number of changed pixels is small.
- It is also difficult to classify the type of change.
- Animation changes are not a problem, but adding slides is difficult.
- Detecting slide additions requires conditional branching.
- Consulted with a teacher:
- It is said that it is better not to change the font size on the blackboard (apparently) (common knowledge among teachers?).
- It is said that using colors is more suitable for expressing importance.
- Knowing this may make OCR easier as well.
- Direction:
- Classification of changes in “lecture videos” (live lectures, slide lectures, etc.).
- Persistent differences: writing on the blackboard, slide animations, etc., things that remain afterwards.
- Temporary differences: teacher’s movements (teacher on the blackboard in live lectures, wipe window in slide lectures), etc.
- Reset differences: erasing the blackboard, moving slides, etc.
- (Other noises)
- How to classify:
- Comparing frame differences can determine whether they are persistent.
- People and moving objects can use Lucas-Kanade, etc.
- Want to classify differences and evaluate the complexity of persistent differences.
- Temporary differences become non-semantic due to human bodies, so their complexity may not be meaningful.
- Image of complexity (comparison when occupying the same area):
- Simple shapes < maps
- Monochrome maps < maps with multiple colors
- Shapes < text
- Can use evaluation axes specific to slide/blackboard images?
- Want to consider lectures on blackboards (fragmented temporary differences), slide lectures with many animations, and slide lectures with few animations, all within the same framework (linear structure).
- How to actually implement?
- Searching for “complexity” of images doesn’t yield relevant results.
- If there are better terms, please let me know. 🙏
- “Wide” and “send”
- Histogram variance, etc.?
- Outliers
- Additionally, could try line drawing and counting the number of lines.
- Counting the number of outline lines that intersect with straight lines, etc.
- Or rotate the straight lines 360 times, etc.
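Two of the candidate measures above, sketched: histogram variance, and counting intersections with scan lines (a crude stand-in for "rotating straight lines 360 times"; function names and the test images are illustrative assumptions):

```python
import numpy as np

def histogram_variance(gray):
    """Variance of the normalized 256-bin intensity histogram."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    return float(np.var(hist / hist.sum()))

def scanline_crossings(binary):
    """Count black/white transitions along every row: a rough proxy for
    how many outline lines a horizontal scan line intersects."""
    return int(np.abs(np.diff(binary.astype(int), axis=1)).sum())

simple = np.ones((16, 16), dtype=int)
simple[4:12, 4:12] = 0                 # one solid square
texty = np.ones((16, 16), dtype=int)
texty[:, ::2] = 0                      # many thin strokes, like dense text

# Text-like content crosses scan lines far more often than a simple shape.
assert scanline_crossings(texty) > scanline_crossings(simple)
assert histogram_variance((simple * 255).astype(np.uint8)) >= 0.0
```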
- Previous research:
- Novelty:
- Doing this with video differences.
- Adaptive Fast Playback-Based Video Skimming Using a Compressed-Domain Visual Complexity Measure is doing something similar, but it focuses on spatio-temporal complexity (motion, etc.).
- Mine focuses more on still images than motion.
- Obtain prior knowledge by consulting with teachers.
- Various things: what can be skipped (contrasting with speaking speed), etc.
- Basis for changing speed.
- As research:
- Tip: If you increase the scale, you can avoid complete failure.
- Aim for a form like this:
- Lecta: http://mprg.jp/data/MPRG/F_group/F028_yokoi2005.pdf
- But it’s from 2005.
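The persistent/temporary distinction in the taxonomy above could be prototyped by checking whether a changed region is still changed a few frames later (Lucas-Kanade tracking for moving objects is omitted; the horizon and thresholds are assumptions):

```python
import numpy as np

def classify_change(frames, t, horizon=3, thresh=20):
    """Classify the change at time t against the notes' taxonomy:
    'persistent' if it is still present `horizon` frames later
    (writing, slide animation), 'temporary' otherwise (e.g. the
    teacher passing in front of the board). Grayscale uint8 frames."""
    before = frames[t - 1].astype(float)
    after = frames[t].astype(float)
    changed = np.abs(after - before) > thresh
    if not changed.any():
        return "none"
    later = frames[min(t + horizon, len(frames) - 1)].astype(float)
    still = np.abs(later - before) > thresh
    overlap = (changed & still).sum() / changed.sum()
    return "persistent" if overlap > 0.5 else "temporary"

blank = np.full((8, 8), 255, dtype=np.uint8)
written = blank.copy(); written[2:4, 2:6] = 0      # chalk marks that stay
passerby = blank.copy(); passerby[:, 3:5] = 0      # occluder that moves on
assert classify_change([blank, written, written, written, written], 1) == "persistent"
assert classify_change([blank, passerby, blank, blank, blank], 1) == "temporary"
```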
20201202 Meeting
- There are several studies that select important frames.
- For example, selecting from images on whiteboards, etc.
- Is there no evaluation of the importance of each frame?
- We want to change the speed, so we need this.
- Also, there are few approaches that focus on “removing unnecessary parts” rather than “selecting necessary parts”.
- Is the concept of an elastic timeline unusual? (It may seem obvious after thinking about it for a few months)
- I could only find research on elastic timelines for cows and content-awareness.
- Hypothesis: Does the size of the written content determine its importance?
- Discussion:
- I want to manipulate the speed by combining it with other elements.
- How should we combine them?
- There might be a better method than simple addition or multiplication.
- Also, for the summary when viewed later:
- What should we do for viewers who can’t catch up even with our best summary, whether because they joined late or did not attend?
20201118 Meeting
- Potential target videos:
- Live-action classroom videos
- Slide-based lectures
- Instructional videos using various materials
- (Should we narrow down the scope?)
- Yes, we should narrow it down.
- Real-life and lectures have different characteristics.
- With slide-based lectures, text and charts are important.
- Audio seems to be crucial.
- Video: 3-dimensional
- Audio: 1-dimensional (lightweight)
- If we can vectorize them, we can use the same framework.
- Real-time situations make deep learning challenging.
- Especially with video processing.
- Goal: Bend the lines of this graph.
- Change the speed.
- Find parts that can be changed in speed without affecting understanding of the content.
- Evaluation criteria could be how humans perceive it.
- Skip unnecessary parts.
- Apply video summarization techniques (weaken them?).
- Utilize information from previous studies.
- There might be other methods that we haven’t thought of yet.
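"Bending the lines of the graph" by changing speed could start from a simple mapping of per-segment importance to playback rate; a linear sketch with assumed parameter names (any monotone curve would do):

```python
import numpy as np

def playback_speeds(importance, base=1.0, max_speed=3.0):
    """Map per-segment importance in [0, 1] to a playback speed:
    fully important segments play at `base`, fully unimportant ones
    at `max_speed`, with a linear ramp in between."""
    imp = np.clip(np.asarray(importance, dtype=float), 0.0, 1.0)
    return base + (1.0 - imp) * (max_speed - base)

# Three segments: key explanation, routine material, filler.
speeds = playback_speeds([1.0, 0.5, 0.0])
assert speeds[0] == 1.0 and speeds[-1] == 3.0
```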