This is a demonstration of the abilities of OpenAI's CLIP neural network. The tool will download a YouTube video, extract frames at regular intervals and precompute a feature vector for each frame using CLIP.
You can then use natural language search queries to find a particular frame of the video. The results are really amazing in my opinion...
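For anyone curious, the precomputation step looks roughly like this (a minimal sketch using the openai/CLIP package and OpenCV; the video.mp4 filename and the one-frame-per-second sampling rate are illustrative assumptions, not the notebook's exact code):

```python
import cv2
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

cap = cv2.VideoCapture("video.mp4")       # video already downloaded, e.g. with youtube-dl
fps = cap.get(cv2.CAP_PROP_FPS) or 30     # fall back if the FPS metadata is missing

features, timestamps = [], []
frame_idx = 0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    if frame_idx % int(round(fps)) == 0:  # sample roughly one frame per second
        image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        with torch.no_grad():
            feat = model.encode_image(preprocess(image).unsqueeze(0).to(device))
        features.append(feat / feat.norm(dim=-1, keepdim=True))  # normalize for cosine similarity
        timestamps.append(frame_idx / fps)
    frame_idx += 1
cap.release()

video_features = torch.cat(features)      # one normalized vector per sampled frame
```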
Just a heads up: in the demo's setup block, installing CLIP pulls in PyTorch 1.7.0 + cu101 (since CLIP requires it), and the very next command then uninstalls it to reinstall PyTorch 1.7.1, which takes at least 5 minutes. If 1.7.1 isn't really needed, removing the manual PyTorch installation line would save some time.
Not yet, but I had this idea as well. You basically get a score describing how well a phrase is matching each of the images so it won't be difficult to do. I'll look into that!
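For reference, here is a minimal sketch of how those per-frame scores can be computed (it reuses model, video_features and timestamps from the sketch above; the query string is just an example):

```python
# Encode the text query once, then compare it against every precomputed frame feature.
with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize(["a red sports car"]).to(device))
    text_feat /= text_feat.norm(dim=-1, keepdim=True)

scores = (video_features @ text_feat.T).squeeze(1)   # cosine similarity per frame
best = scores.topk(3)                                 # three best-matching frames
for score, idx in zip(best.values.tolist(), best.indices.tolist()):
    print(f"{timestamps[idx]:.0f}s  similarity={score:.3f}")
```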
I think it can. However, you will likely need a bigger model. Currently, OpenAI shares only their small model and I hope they will soon release bigger ones!
If it could take a set of images as input, then perhaps we could use this to identify ourselves in a random Internet video, e.g. a lengthy tourist video that you suspect you might appear in because you were at that place on that day.
There are already people looking for such a solution (I've added the link to that discussion on my profile).
I think this is a valuable application, but I don't think CLIP is well suited for it. The power of CLIP comes from training a model to jointly "understand" text and images. If you are looking at identifying a particular person there are more suitable designs for face recognition.
Wondering whether it would be more efficient to extract frames only where the content has changed (e.g. when the difference exceeds a threshold, and/or all I-frames)?
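Something like this rough OpenCV sketch is what I mean (the 64x64 downscale and the pixel-difference threshold are made-up values that would need tuning; extracting only I-frames via ffmpeg would be an alternative):

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("video.mp4")
kept_frames = []
last_kept = None
threshold = 15.0                          # mean absolute pixel difference; needs tuning

while True:
    ret, frame = cap.read()
    if not ret:
        break
    small = cv2.cvtColor(cv2.resize(frame, (64, 64)), cv2.COLOR_BGR2GRAY).astype(np.float32)
    if last_kept is None or np.abs(small - last_kept).mean() > threshold:
        kept_frames.append(frame)         # only frames that differ enough get encoded later
        last_kept = small
cap.release()
```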
Also, could this be used to identify event types in videos? I'd love to run my 25 years of home videos through this and have it annotate: "Christmas, birthday, park, camping...".
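Presumably something like this zero-shot sketch would do it (reusing the normalized video_features and timestamps from the sketch above; the label list is just my example):

```python
# Score each sampled frame against a hand-written label list and keep the best match.
labels = ["Christmas", "a birthday party", "a park", "camping"]
with torch.no_grad():
    label_feats = model.encode_text(clip.tokenize(labels).to(device))
    label_feats /= label_feats.norm(dim=-1, keepdim=True)

label_scores = video_features @ label_feats.T   # frames x labels
best = label_scores.argmax(dim=1)
for t, idx in zip(timestamps, best.tolist()):
    print(f"{t:.0f}s: {labels[idx]}")
```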
No, that wouldn't work too well. It's for YouTubers who stream their desktop screens and I need to extract some information to automatically process it. The desktop streams always look very similar so I don't need advanced AI/neural nets to extract that.
There is one caveat to be aware of - the image is cropped to a square in the center and scaled down to 224x224. So small details will be lost, for example if you want to run it on scanned documents. Photos work great though.
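For illustration, the bundled preprocess is roughly equivalent to these torchvision transforms (an approximation written for a recent torchvision, not the exact pipeline shipped with CLIP):

```python
from torchvision import transforms

clip_like_preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])
# Anything outside the central 224x224 square (after resizing the short side)
# is simply cropped away before the model ever sees the image.
```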
For info, the same tool works well with 2 million images found in the Unsplash dataset [1]. Features only have to be computed once for the dataset, and only the feature vector for the user query has to be computed on the fly. Then matching features can be done in a manner that scales well.
So, the present tool does not scale because the videos are part of the user query, but a company with easy access to the videos and the computational power to pre-encode the frames as features could create a search engine based on CLIP.
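As a sketch of what the query-time matching can look like once the features are precomputed (the features.npy file, the 512-dimensional ViT-B/32 features and the re-loading of the CLIP model are assumptions for illustration):

```python
import numpy as np
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Computed offline, once: one L2-normalized 512-d vector per photo (ViT-B/32).
photo_features = np.load("features.npy").astype(np.float32)   # shape (N, 512)

# Only the text query is encoded at query time.
with torch.no_grad():
    q = model.encode_text(clip.tokenize(["two dogs playing in the snow"]).to(device))
    q = (q / q.norm(dim=-1, keepdim=True)).cpu().numpy().astype(np.float32)

scores = photo_features @ q.T                 # one dot product per photo
top = np.argsort(-scores[:, 0])[:5]           # indices of the five best matches
```

A single dot product per photo is cheap even for a couple of million vectors; for much larger collections an approximate-nearest-neighbour index could be swapped in.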
Seems like a key part of ‘organizing all the world’s information’, right? Making YouTube searchable would open up their massive content library to discovery, and enable people to find content without having to rely on The Algorithm guessing what they might want to see...
In general the switch from users going to the internet and asking it to find them certain things, to going to the internet and having things selected for them and offered up as ‘recommendations’ seems like a key shift in how the internet functions to disseminate ideas...
CLIP is quite a new model published by OpenAI in January. Their work is novel and pushed the state-of-the-art in this area by a lot. I'm sure that Google is also working on similar applications.
And I think they already have something similar. Recently, I've seen search results on Google that point you to a specific time of a YouTube video...
Years ago, when I prototyped an orders-of-magnitude physical-properties playground web app, I found the development bottleneck was searching for video clips and images. For example, find clips showing the heartbeat of a {goldfish,mouse,cat,dog,child,adult,horse,elephant,whale}, for a <kg - body mass - metabolic/heart rate> association (metabolic rate scales with body mass). Jiggling an oom kg scale, maybe you're shown a cat or mouse, and maybe a mouse heart patter, and a whale's slow swish. Providing a massive (sorry) hidden curriculum. They all exist on YouTube, and might be fair use for OER content, but finding them was not plausible. And still isn't, even with commercial use of stock videos.
In the 1950s, the first Powers of Ten zoom book was hand drawn from books over years. Around 1980, a PoT film and book could use photos, but even good people still made mistakes. Now creating a PoT zoom book can be homework. A video, a school project. An XR, a professional project. Technology, media search, acquisition, and handling costs throttle science education content.
Necessary but not sufficient, of course. The first book might have been imagined and created earlier, but wasn't. 1950s astronomy textbooks needn't have had the color of the Sun wrong, and the same is true now in 2020. Though, anticipated difficulty of creation does throttle imagination...
If OP's search were deployed on YouTube, and fair use in its current form were allowed to survive, providing a historic step-change in the abundance and accessibility of reusable content, how might you imagine using that?
I think we will get there soon! CLIP is a new model that OpenAI published in January and I'm sure Google is working on similar technology, which can be used for both video and image search.
Permissions and Restrictions
You may access and use the Service as made available to you, as long as you comply with this Agreement and applicable law. You may view or listen to Content for your personal, non-commercial use. You may also show YouTube videos through the embeddable YouTube player.
The following restrictions apply to your use of the Service. You are not allowed to:
access, reproduce, download, distribute, transmit, broadcast, display, sell, license, alter, modify or otherwise use any part of the Service or any Content except: (a) as expressly authorized by the Service; or (b) with prior written permission from YouTube and, if applicable, the respective rights holders;
access the Service using any automated means (such as robots, botnets or scrapers) except (a) in the case of public search engines, in accordance with YouTube’s robots.txt file; or (b) with YouTube’s prior written permission.
I haven't read the Terms of Use in detail, but I guess it depends on what you do with the data. I actually don't store, distribute or sell the videos, so I hope it is OK in the scope of such a personal project :)
The problem still exists that you have to provide the YouTube video to search within; it would be nice if there were a tool to search across all of YouTube.
Hmm, this is strange... The Colab notebook should load even if you are not logged in with your Google account (you will need to log in if you want to run it, though).
If you want to experiment with it yourself, I prepared a Colab notebook that can easily be run: https://colab.research.google.com/github/haltakov/natural-la...