This is a demonstration of the abilities of OpenAI's CLIP neural network. The tool will download a YouTube video, extract frames at regular intervals and precompute a feature vector for each frame using CLIP.
You can then use natural language search queries to find a particular frame of the video. The results are really amazing in my opinion...
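For anyone curious, the precomputation step looks roughly like this (a minimal sketch using the openai/CLIP package and OpenCV; the video.mp4 filename and the one-frame-per-second sampling rate are illustrative assumptions, not the notebook's exact code):

```python
import cv2
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

cap = cv2.VideoCapture("video.mp4")       # video already downloaded, e.g. with youtube-dl
fps = cap.get(cv2.CAP_PROP_FPS) or 30     # fall back if the FPS metadata is missing

features, timestamps = [], []
frame_idx = 0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    if frame_idx % int(round(fps)) == 0:  # sample roughly one frame per second
        image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        with torch.no_grad():
            feat = model.encode_image(preprocess(image).unsqueeze(0).to(device))
        features.append(feat / feat.norm(dim=-1, keepdim=True))  # normalize for cosine similarity
        timestamps.append(frame_idx / fps)
    frame_idx += 1
cap.release()

video_features = torch.cat(features)      # one normalized vector per sampled frame
```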
Just a heads up: in the demo's setup block, installing CLIP pulls in PyTorch 1.7.0 + cu101 (since CLIP requires it), and the very next command then uninstalls it to reinstall PyTorch 1.7.1, which takes at least 5 minutes. If 1.7.1 isn't really needed, removing the manual PyTorch installation line would save some time.
Not yet, but I had this idea as well. You basically get a score describing how well a phrase is matching each of the images so it won't be difficult to do. I'll look into that!
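For reference, here is a minimal sketch of how those per-frame scores can be computed (it reuses model, video_features and timestamps from the sketch above; the query string is just an example):

```python
# Encode the text query once, then compare it against every precomputed frame feature.
with torch.no_grad():
    text_feat = model.encode_text(clip.tokenize(["a red sports car"]).to(device))
    text_feat /= text_feat.norm(dim=-1, keepdim=True)

scores = (video_features @ text_feat.T).squeeze(1)   # cosine similarity per frame
best = scores.topk(3)                                 # three best-matching frames
for score, idx in zip(best.values.tolist(), best.indices.tolist()):
    print(f"{timestamps[idx]:.0f}s  similarity={score:.3f}")
```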
I think it can. However, you will likely need a bigger model. Currently, OpenAI shares only their small model and I hope they will soon release bigger ones!
If it could take a set of images as input, then perhaps we could use this to identify ourselves in a random Internet video, e.g. a lengthy tourist video that you suspect you might appear in because you were at that place on that day.
There are already people looking for such a solution (I've added the link to that discussion on my profile).
I think this is a valuable application, but I don't think CLIP is well suited for it. The power of CLIP comes from training a model to jointly "understand" text and images. If you are looking at identifying a particular person there are more suitable designs for face recognition.
Wondering whether it would be more efficient to extract frames only where the content has changed (e.g. when the difference exceeds a threshold, and/or all I-frames)?
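Something like this rough OpenCV sketch is what I mean (the 64x64 downscale and the pixel-difference threshold are made-up values that would need tuning; extracting only I-frames via ffmpeg would be an alternative):

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("video.mp4")
kept_frames = []
last_kept = None
threshold = 15.0                          # mean absolute pixel difference; needs tuning

while True:
    ret, frame = cap.read()
    if not ret:
        break
    small = cv2.cvtColor(cv2.resize(frame, (64, 64)), cv2.COLOR_BGR2GRAY).astype(np.float32)
    if last_kept is None or np.abs(small - last_kept).mean() > threshold:
        kept_frames.append(frame)         # only frames that differ enough get encoded later
        last_kept = small
cap.release()
```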
Also, could this be used to identify event types in videos? I'd love to run my 25 years of home videos through this and have it annotate: "Christmas, birthday, park, camping...".
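Presumably something like this zero-shot sketch would do it (reusing the normalized video_features and timestamps from the sketch above; the label list is just my example):

```python
# Score each sampled frame against a hand-written label list and keep the best match.
labels = ["Christmas", "a birthday party", "a park", "camping"]
with torch.no_grad():
    label_feats = model.encode_text(clip.tokenize(labels).to(device))
    label_feats /= label_feats.norm(dim=-1, keepdim=True)

label_scores = video_features @ label_feats.T   # frames x labels
best = label_scores.argmax(dim=1)
for t, idx in zip(timestamps, best.tolist()):
    print(f"{t:.0f}s: {labels[idx]}")
```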
No, that wouldn't work too well. It's for YouTubers who stream their desktop screens and I need to extract some information to automatically process it. The desktop streams always look very similar so I don't need advanced AI/neural nets to extract that.
There is one caveat to be aware of - the image is cropped to a square in the center and scaled down to 224x224. So small details will be lost, for example if you want to run it on scanned documents. Photos work great though.
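For illustration, the bundled preprocess is roughly equivalent to these torchvision transforms (an approximation written for a recent torchvision, not the exact pipeline shipped with CLIP):

```python
from torchvision import transforms

clip_like_preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])
# Anything outside the central 224x224 square (after resizing the short side)
# is simply cropped away before the model ever sees the image.
```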
For info, the same tool works well with 2 million images found in the Unsplash dataset [1]. Features only have to be computed once for the dataset, and only the feature vector for the user query has to be computed on the fly. Then matching features can be done in a manner that scales well.
So, the present tool does not scale because the videos are part of the user query, but a company with easy access to the videos and the computational power to pre-encode the frames as features could create a search engine based on CLIP.
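As a sketch of what the query-time matching can look like once the features are precomputed (the features.npy file, the 512-dimensional ViT-B/32 features and the re-loading of the CLIP model are assumptions for illustration):

```python
import numpy as np
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Computed offline, once: one L2-normalized 512-d vector per photo (ViT-B/32).
photo_features = np.load("features.npy").astype(np.float32)   # shape (N, 512)

# Only the text query is encoded at query time.
with torch.no_grad():
    q = model.encode_text(clip.tokenize(["two dogs playing in the snow"]).to(device))
    q = (q / q.norm(dim=-1, keepdim=True)).cpu().numpy().astype(np.float32)

scores = photo_features @ q.T                 # one dot product per photo
top = np.argsort(-scores[:, 0])[:5]           # indices of the five best matches
```

A single dot product per photo is cheap even for a couple of million vectors; for much larger collections an approximate-nearest-neighbour index could be swapped in.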
Seems like a key part of ‘organizing all the world’s information’, right? Making YouTube searchable would open up their massive content library to discovery, and enable people to find content without having to rely on The Algorithm guessing what they might want to see...
In general the switch from users going to the internet and asking it to find them certain things, to going to the internet and having things selected for them and offered up as ‘recommendations’ seems like a key shift in how the internet functions to disseminate ideas...
CLIP is quite a new model published by OpenAI in January. Their work is novel and pushed the state-of-the-art in this area by a lot. I'm sure that Google is also working on similar applications.
And I think they already have something similar. Recently, I've seen search results on Google that point you to a specific time of a YouTube video...
Years ago, when I prototyped an orders-of-magnitude physical-properties playground web app, I found the development bottleneck was searching for video clips and images. For example, find clips showing the heartbeat of a {goldfish,mouse,cat,dog,child,adult,horse,elephant,whale}, for a <kg - body mass - metabolic/heart rate> association (metabolic rate scales with body mass). Jiggling an oom kg scale, maybe you're shown a cat or mouse, and maybe a mouse heart patter, and a whale's slow swish. Providing a massive (sorry) hidden curriculum. They all exist on YouTube, and might be fair use for OER content, but finding them was not plausible. And still isn't, even with commercial use of stock videos.
In the 1950s, the first Powers of Ten zoom book was hand drawn from books over years. Around 1980, a PoT film and book could use photos, but even good people still made mistakes. Now creating a PoT zoom book can be homework. A video, a school project. An XR, a professional project. Technology, media search, acquisition, and handling costs throttle science education content.
Necessary but not sufficient, of course. The first book might have been imagined and created earlier, but wasn't. 1950s astronomy textbooks needn't have had the color of the Sun wrong, and the same is true now in 2020. Though, anticipated difficulty of creation does throttle imagination...
If OP's search were deployed on YouTube, and fair use in its current form were allowed to survive, providing a historic step-change in the abundance and accessibility of reusable content, how might you imagine using that?
I think we will get there soon! CLIP is a new model that OpenAI published in January and I'm sure Google is working on similar technology, which can be used for both video and image search.
Permissions and Restrictions
You may access and use the Service as made available to you, as long as you comply with this Agreement and applicable law. You may view or listen to Content for your personal, non-commercial use. You may also show YouTube videos through the embeddable YouTube player.
The following restrictions apply to your use of the Service. You are not allowed to:
access, reproduce, download, distribute, transmit, broadcast, display, sell, license, alter, modify or otherwise use any part of the Service or any Content except: (a) as expressly authorized by the Service; or (b) with prior written permission from YouTube and, if applicable, the respective rights holders;
access the Service using any automated means (such as robots, botnets or scrapers) except (a) in the case of public search engines, in accordance with YouTube’s robots.txt file; or (b) with YouTube’s prior written permission.
I haven't read the Terms of Use in detail, but I guess it depends on what you do with the data. I actually don't store, distribute or sell the videos, so I hope it is OK in the scope of such a personal project :)
The problem still exists that you have to provide the YouTube video to search within; it would be nice if there were a tool to search across all of YouTube.
Hmm, this is strange... The Colab notebook should load even if you are not logged in with your Google account (you will need to log in if you want to run it, though).
If you want to experiment with it yourself, I prepared a Colab notebook that can easily be run: https://colab.research.google.com/github/haltakov/natural-la...