Activity recognition method using video tubes
Patent abstract:
An activity recognition device comprises a port configured to receive, from a video source, a video stream of a first object and a second object; a memory configured to store instructions and image frames of the video stream; and one or more processors that execute the instructions stored in the memory, the one or more processors configured to: select parts of the image frames based on the presence of the first object; determine areas within the parts of the image frames, where locations of the first object in the video frames are bounded by the determined areas; determine movement of the first object and locations of the second object within the areas of the image frames; and identify an activity according to the determined movement and locations of the second object, and generate an alert according to the identified activity.

Publication number: BR112020014184A2
Application number: R112020014184-4
Filing date: 2018-12-11
Publication date: 2020-12-01
Inventors: Fatih Porikli; Qijie Xu; Luis Bill; Huang Wei
Applicant: Huawei Technologies Co., Ltd.
Patent description:
[001] This application claims priority to U.S. Application 15/867,932, filed on January 11, 2018 and entitled "Activity Recognition Method Using Videotubes", which is incorporated into this document in its entirety by reference.

TECHNICAL FIELD

[002] The present disclosure relates to automated activity recognition, and in particular to an automated driver assistance system.

BACKGROUND

[003] Vehicle perception refers to detecting information about the surroundings of a vehicle that is relevant to the operation of the vehicle. Vehicle perception acts as the vehicle's eyes, feeding it with knowledge of what is going on around it. In-cabin perception is an important aspect of vehicle perception because the status and activity of the driver and passengers provide crucial knowledge for helping the driver to drive safely and for providing an improved human-machine interface (HMI). With knowledge of the driver's activities, the vehicle can determine whether the driver is distracted, fatigued, distressed, enraged or inattentive, and can thus provide alerts or support mechanisms to keep the driver safe from accidents and to improve the driver's comfort. Automated activity recognition is an emerging technology. Current activity recognition methods rely heavily on powerful computing resources that can consume a large amount of energy while taking up a large amount of vehicle space. The present inventors recognized a need for improved activity detection for vehicle perception.

SUMMARY

[004] Several examples are now described to introduce a selection of concepts in a simplified form, which are further described below in the Detailed Description. The Summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

[005] In accordance with one aspect of the present disclosure, a computer-implemented method of machine recognition of an activity is provided. The method comprises obtaining a video stream of a first object and a second object using a video source; selecting parts of image frames of the video stream based on the presence of the first object in the parts; determining areas within the parts of the image frames that bound locations of the first object; determining a movement of the first object and locations of the second object within the determined areas; identifying an activity using the determined movement of the first object and the locations of the second object; and generating one or both of an audible alert and a visual alert according to the identified activity.

[006] Optionally, in the previous aspect, another implementation of the aspect provides obtaining a video stream of an image using a video source, and generating a video tube from the video stream using one or more processors. The video tube includes rearranged parts of the image frames that include an image of a human hand. The video tube can be reconstructed from a given video stream around the active areas of activity. An active area of activity can include a combination of hands, objects and the pixels of interest that enable detection of the type of activity. The video tube can include multiple windowed, processed, rearranged video frame regions and corresponding features such as motion, gradients and heat maps of objects.
The combination of all these regions and computed feature images can be normalized, scaled and rearranged into a scalable tensor video structure and a temporal structure. The method additionally includes determining a hand movement, or gesture, and a heat map using the hand image, identifying an activity using the determined hand movement and heat map, and generating one or both of an audible alert and a visual alert according to the identified activity.

[007] Optionally, in any of the previous aspects, another implementation of the aspect provides that generating a video tube includes: receiving a first image frame and a second, subsequent image frame of the video stream; determining a similarity score between a first windowed part of the first image frame and the first windowed part of the second image frame, where the video tube is positioned in the first windowed part of the image frames; omitting processing of the first windowed part of the second image frame when the similarity score is greater than a specified similarity threshold; and, when the similarity score is less than the specified similarity threshold, performing hand detection on the second image frame to generate a second windowed part of the image frames of the video stream that is more likely to include the hand image than other parts of the image frames, and including the second windowed part of the image in the video tube.

[008] Optionally, in any of the previous aspects, another implementation of the aspect provides that generating a video tube includes recursively determining a window size of the video tube, in which the window size is minimized to fully include the hand image.

[009] Optionally, in any of the previous aspects, another implementation of the aspect provides that determining a hand movement in the hand area of the video tube includes identifying pixels that include the hand image, and tracking the change in the pixels that include the hand image between the image frames of the video stream; and that the video tube includes hand movement information.

[010] Optionally, in any of the previous aspects, another implementation of the aspect provides that generating the video tube includes generating a video tube that includes a collection of rearranged parts of the image frames of the video stream that include hands, objects of interest, and corresponding feature maps.

[011] Optionally, in any of the previous aspects, another implementation of the aspect provides a method additionally including determining object information in the video tube, in which the object information includes a heat map of the object; and in which associating the activity includes determining the activity using the object information and the determined hand movement.

[012] Optionally, in any of the previous aspects, another implementation of the aspect provides that identifying an activity using the determined hand movement or gesture includes applying the object information and the hand movement information obtained from the video tube as input to a machine learning process carried out by the processing unit to identify the activity.

[013] Optionally, in any of the previous aspects, another implementation of the aspect provides that obtaining a video stream of an image includes obtaining a video stream of an image of a vehicle compartment using a vehicle imaging assembly; and that generating a video tube includes generating a video tube by a vehicle processing unit using the vehicle compartment image video stream.

[014] In accordance with another aspect of the present disclosure, an activity recognition device comprises a port configured to receive a video stream from a video source; a memory configured to store image frames of the video stream; and one or more processors. The one or more processors execute instructions stored in the memory.
The instructions configure the one or more processors to select parts of the image frames based on the presence of the first object; determine areas within the parts of the image frames, where locations of the first object in the video frames are bounded by the determined areas; determine movement of the first object and locations of a second object within the areas of the image frames; and identify an activity according to the determined movement and locations of the second object, and generate an alert according to the identified activity.

[015] Optionally, in any of the previous aspects, another implementation of the aspect provides one or more processors that include a global region of interest (ROI) detector component configured to generate a video tube using the image frames; a dynamic active area of activity (AAA) generator component configured to detect parts of the image frames that include a person's hand, where the video tube includes rearranged AAAs; a key feature generator component configured to determine a hand movement and a heat map using the hand area; and an activity recognition classifier component configured to identify an activity according to the determined hand movement, and generate an alert according to the identified activity. The key feature generator component can use heat maps of identified objects to determine the hand movement.

[016] Optionally, in any of the previous aspects, another implementation of the aspect provides a global ROI detector component configured to: determine a similarity score between a first windowed part of a first image frame and the same first windowed part of a second image frame, in which the video tube is included in the first windowed part of the first and second image frames; omit processing of the first windowed part of the second image frame when the similarity score is greater than a specified similarity threshold; and, when the similarity score is less than the specified similarity threshold, perform hand detection on the second image frame to generate a second windowed part of the image frames of the video stream that is more likely to include the hand image than other parts of the image frames, and include the second windowed part of the image in the video tube.

[017] Optionally, in any of the previous aspects, another implementation of the aspect provides a dynamic active area of activity (AAA) generator component configured to recursively establish a window size of the video tube, in which the window size is minimized to fully include the hand image.

[018] Optionally, in any of the previous aspects, another implementation of the aspect provides a dynamic AAA generator component configured to: determine a center of a hand area that includes the hand image; identify a search area by scaling a hand area boundary relative to the determined center; perform hand detection in the identified search area; and establish the window size according to a hand detection result.

[019] Optionally, in any of the previous aspects, another implementation of the aspect provides a dynamic AAA generator component configured to: determine a center of a hand area that includes the hand image; identify a search area by scaling a boundary of the hand area relative to the determined center; perform hand detection in the identified search area; and establish the window size according to a hand detection result.
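As an illustrative sketch only (not the claimed implementation) of the recursive window sizing in [017]-[019]: the center of the hand area is computed, a scaled search area is proposed around it, detection is re-run, and the window shrinks to the tightest box that still contains the detected hand. The function `detect_hand` below is a hypothetical placeholder for any hand detector that returns a tight (x, y, w, h) box or None.

```python
# Illustrative sketch: recursive minimization of an AAA window around a detected
# hand. `detect_hand` is a hypothetical detector returning a tight box in
# search-area coordinates, or None when no hand is found.

def minimize_window(frame, window, detect_hand, scale=1.5, max_iters=5):
    """Shrink `window` (x, y, w, h) to the smallest box fully containing the hand."""
    x, y, w, h = window
    for _ in range(max_iters):
        # Center of the current hand area.
        cx, cy = x + w / 2.0, y + h / 2.0
        # Scale the boundary relative to the center to form the search area.
        sw, sh = w * scale, h * scale
        sx, sy = max(0.0, cx - sw / 2.0), max(0.0, cy - sh / 2.0)
        search_area = frame[int(sy):int(sy + sh), int(sx):int(sx + sw)]
        det = detect_hand(search_area)
        if det is None:
            break                                  # keep the previous window
        dx, dy, dw, dh = det
        new_window = (int(sx + dx), int(sy + dy), dw, dh)
        if new_window == (x, y, w, h):
            break                                  # converged: already minimal
        x, y, w, h = new_window
    return x, y, w, h
```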
[020] Optionally, in any of the previous aspects, another implementation of the aspect provides a dynamic AAA generator component configured to: use the determined hand movement to predict a next window; perform hand image detection using the next window; replace the current window with the next window when the next window contains the boundaries of a detected hand image; and, when the boundaries of the detected hand image extend beyond the next window, merge the current window and the next window, identify a hand image in the merged windows, and determine a new minimized window size that contains the identified hand image.

[021] Optionally, in any of the previous aspects, another implementation of the aspect provides a key feature generator component configured to identify pixels in the hand area that include a hand image, and track the change in the pixels that include the hand image between the windowed parts of the image frames to determine the hand movement.

[022] Optionally, in any of the previous aspects, another implementation of the aspect provides a key feature generator component configured to determine locations of fingertips and joint points in the image frames, and track the change in the fingertips and joint points between the windowed parts of the image frames to determine the hand movement.

[023] Optionally, in any of the previous aspects, another implementation of the aspect provides a video tube that includes a collection of rearranged parts of the image frames of the video stream that include hands, objects of interest, and corresponding feature maps.

[024] Optionally, in any of the previous aspects, another implementation of the aspect provides a key feature generator component configured to identify an object in the video tube, and identify the activity using the identified object and the determined hand movement.

[025] Optionally, in any of the previous aspects, another implementation of the aspect provides an activity recognition classifier component configured to compare a combination of the identified object and the determined hand movement to one or more combinations of objects and hand movements stored in memory, and identify the activity based on a comparison result.

[026] Optionally, in any of the previous aspects, another implementation of the aspect provides an activity recognition classifier component configured to: detect a sequence of hand movements using the video tube image frames; compare the detected sequence of hand movements to a specified sequence of hand movements of one or more specified activities; and select an activity from the one or more specified activities according to a comparison result.

[027] Optionally, in any of the previous aspects, another implementation of the aspect provides a key feature generator component configured to store video tube information in memory as a scalable tensor video tube; and in which the activity recognition classifier component is configured to apply the scalable tensor video tube as input to a deep learning algorithm performed by the activity recognition classifier component to identify the activity.
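A minimal sketch of the window prediction and merging described in [020], assuming axis-aligned (x, y, width, height) boxes and a hypothetical detector `detect_hand_in` that returns the hand bounding box found inside a given window (or None):

```python
# Sketch only: predict the next AAA window from the determined hand motion,
# detect within it, and merge windows when the detected hand extends beyond
# the predicted window. Box format is (x, y, w, h); motion is (dx, dy).

def union_box(a, b):
    """Smallest axis-aligned box containing boxes a and b."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x1, y1 = min(ax, bx), min(ay, by)
    x2, y2 = max(ax + aw, bx + bw), max(ay + ah, by + bh)
    return x1, y1, x2 - x1, y2 - y1

def contains(outer, inner):
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh

def update_window(frame, current, motion, detect_hand_in):
    x, y, w, h = current
    nxt = (x + motion[0], y + motion[1], w, h)        # predicted next window
    hand = detect_hand_in(frame, nxt)
    if hand is None:
        return current
    if contains(nxt, hand):
        return nxt                                    # prediction was sufficient
    merged = union_box(current, nxt)                  # hand spills outside: merge
    hand = detect_hand_in(frame, merged) or hand
    return hand                                       # new minimized window
```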
[028] Optionally, in any of the previous aspects, another implementation of the aspect provides an activity recognition classifier component configured to select a row-wise configuration of AAAs within the scalable tensor video tube according to the identity of a person and apply the selected row-wise configuration of AAAs as the input to the deep learning algorithm to identify the person's activity.

[029] Optionally, in any of the previous aspects, another implementation of the aspect provides an activity recognition classifier component configured to select a column-wise configuration of AAAs within the scalable tensor video tube according to the identities of multiple people and apply the selected column-wise configuration of AAAs as the input to the deep learning algorithm to identify interactivity between the multiple people.

[030] Optionally, in any of the previous aspects, another implementation of the aspect provides an activity recognition classifier component configured to select a multi-column configuration of AAAs within the scalable tensor video tube according to the identities of multiple groups of people and apply the selected multi-column configuration of AAAs as the input to the deep learning algorithm to identify multiple interactivities between the multiple groups of people.

[031] Optionally, in any of the previous aspects, another implementation of the aspect provides a video source that includes an imaging assembly configured to provide a video stream of an image of a vehicle compartment; and in which the processing unit is a vehicle processing unit configured to generate the video tube using the vehicle compartment image video stream.

[032] In accordance with another aspect of the present disclosure, there is a computer-readable storage medium including instructions that, when executed by one or more processors of an activity recognition device, cause the activity recognition device to perform procedures comprising: obtaining a video stream of an image using a video source; selecting parts of image frames of the video stream based on the presence of a first object in the parts; determining areas within the parts of the image frames that bound locations of the first object; determining a movement of the first object and locations of a second object within the determined areas; identifying an activity using the determined movement of the first object and the locations of the second object; and generating one or both of an audible alert and a visual alert according to the identified activity.

[033] Optionally, in any of the previous aspects, another implementation of the aspect includes a computer-readable storage medium including instructions that cause the activity recognition device to perform procedures including: generating a video tube using the video stream, wherein the video tube includes rearranged portions of image frames of the video stream that include a hand image; determining a hand movement and a heat map using the hand image; associating an activity with the determined hand movement and heat map; and generating one or both of an audible alert and a visual alert according to the activity.

[034] Optionally, in any of the previous aspects, another implementation of the aspect includes a computer-readable storage medium including instructions that cause the activity recognition device to perform procedures including: recursively determining a window size of the video tube, in which the window size is minimized to fully include the hand image.
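As an illustration of the row-wise and column-wise selections described in [028]-[030], under a purely assumed layout in which the tensor is indexed by occupant and time (the actual organization of the scalable tensor video tube may differ):

```python
# Illustrative layout assumption only: a scalable tensor video tube organized as
# tube[occupant, time, channel, height, width]. A row-wise slice feeds one
# occupant's AAAs to the classifier; a column-wise slice feeds all occupants at
# a given time step to model interactivity between people.
import numpy as np

num_occupants, num_frames, channels, H, W = 3, 16, 5, 64, 64
tube = np.zeros((num_occupants, num_frames, channels, H, W), dtype=np.float32)

driver_row = tube[0]            # row-wise: activity of a single person
frame_column = tube[:, 7]       # column-wise: all occupants at time step 7
group_columns = tube[:, 4:12]   # multiple columns: a group's interactions over time
```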
[035] Optionally, in any of the previous aspects, another implementation of the aspect includes a computer-readable storage medium including instructions that cause the activity recognition device to perform procedures including: predicting a next window using the determined hand movement; performing hand image detection using the next window; replacing the current window with the next window when the next window contains the boundaries of a detected hand image; and, when the boundaries of the detected hand image extend beyond the next window, merging the current window and the next window, identifying a hand image in the merged windows, and determining a new minimized window size that contains the identified hand image.

BRIEF DESCRIPTION OF THE DRAWINGS

[036] Figure 1 is an illustration of an occupant in a vehicle cabin according to example embodiments; figure 2 is a flow chart of a method of machine recognition of an activity according to example embodiments; figure 3 is a block diagram of a system for activity recognition according to example embodiments; figure 4 is a flow chart of a machine- or computer-implemented method for detecting a global region of interest in image data according to example embodiments; figure 5 is an illustration of intersection over union for image processing windows according to example embodiments; figure 6 is a flow chart of a computer-implemented method for detecting a hand in image data according to example embodiments; figures 7A-7D illustrate the establishment of search windows for hand detection according to example embodiments; figure 8 is an illustration of more detailed image detection according to example embodiments; figure 9 is a block diagram of a dynamic window component according to example embodiments; figure 10 is an illustration of the dynamic window process according to example embodiments; figure 11 is a block diagram of parts of the system for automated activity recognition according to example embodiments; figure 12 shows a result of using optical flow to determine motion flow information according to example embodiments; figure 13 is an illustration of heat map generation according to example embodiments; figure 14 is an illustration showing key features for a video tube according to example embodiments; figure 15 is an illustration showing normalization of image frames to a spatial dimension according to example embodiments; figure 16 is an illustration of normalization of video tubes according to example embodiments; figure 17 is a flow chart illustrating the rearrangement of key features into two different video tube structures according to example embodiments.

[037] In the following description, reference is made to the accompanying drawings that form part of this application, and in which specific embodiments that can be practiced are shown by way of illustration. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments can be used and that structural, logical and electrical changes can be made without departing from the scope of the present invention. The following description of example embodiments, therefore, is not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims.

[038] The functions or algorithms described in this document can be implemented in software in an embodiment.
The software may consist of computer-executable instructions stored on computer-readable media or a computer-readable storage device such as one or more non-transitory memories or other types of hardware-based, local or networked storage devices. Additionally, these functions correspond to components, which can be software, hardware, firmware or any combination thereof. Multiple functions can be performed in one or more components as desired, and the described embodiments are merely examples. The software can be executed on a digital signal processor, application-specific integrated circuit (ASIC), field-programmable gate array (FPGA), microprocessor or another type of processor operating on a computer system, such as a personal computer, server or other computer system, turning that computer system into a specifically programmed machine.

[039] As explained earlier in this document, automated activity recognition is desirable for applications such as vehicle perception to improve the safety of vehicle operation. Current approaches to activity recognition require complex computations by computing devices that consume an excessively large amount of energy and space in a vehicle.

[040] Figure 1 is an illustration of an occupant in a vehicle cabin. As shown in the figure, occupants of a vehicle cabin often perform different activities using their hands, such as driving or operating the vehicle radio. Hand areas can be used by an activity recognition system as areas to focus on in order to recognize occupant activities. The vehicle cabin or compartment includes an imaging device 105 (e.g., a camera) or set of devices that provide video of an interior view of the vehicle. The hand areas 103 and 107 can be seen clearly from the point of view of the imaging device 105.

[041] The image sensor assembly 105 in figure 1 is connected to a vehicle processing unit (not shown). The vehicle processing unit can include one or more video processors and memory, and the vehicle processing unit performs the activity recognition processes. The image sensor assembly 105 can capture video of the entire vehicle cabin. A region of interest (ROI) component of the vehicle processing unit receives the video stream captured by the image sensor assembly and searches the entire image to locate the rough hand area. This global ROI detector uses a classifier/detector that is faster and cheaper in terms of energy and complexity to roughly identify, using the corresponding detection confidence, whether a detected object is actually a human hand.

[042] Video tubes contain image patches and local video features generated from the raw video images returned by the imaging device 105. Video tubes can include rearranged parts of the image frames that include the image of a human hand. The video tube can be reconstructed from a given video stream around the active areas of activity. An active area of activity can include a combination of hands, objects and the pixels of interest that enable detection of the type of activity. Video tubes can include multiple windowed, processed, rearranged video frame regions and corresponding features such as motion, gradients and heat maps of objects. The combination of all these regions and computed feature images can be normalized, scaled and rearranged into a scalable tensor video structure and a temporal video structure.
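The following is a sketch of how the windowed regions and their computed feature maps might be bundled per frame and stacked into a tensor; the class and field names are illustrative assumptions and not the patent's data layout.

```python
# Illustrative sketch of a per-frame active area of activity (AAA) entry and the
# video tube that collects them; field names are assumptions for clarity only.
from dataclasses import dataclass, field
from typing import List
import numpy as np
import cv2

@dataclass
class ActiveArea:
    window: tuple                 # (x, y, w, h) of the AAA in the full frame
    patch: np.ndarray             # cropped image region containing the hand/object
    motion_flow: np.ndarray       # per-pixel motion (e.g., optical flow) in the patch
    object_heatmap: np.ndarray    # heat map of nearby object locations

@dataclass
class VideoTube:
    occupant_id: int
    areas: List[ActiveArea] = field(default_factory=list)   # one entry per kept frame

    def as_tensor(self, size=(64, 64)) -> np.ndarray:
        """Normalize, scale and stack patches and feature maps into one tensor.

        Multi-channel maps are collapsed by averaging, which is a simplification.
        """
        frames = []
        for a in self.areas:
            maps = [cv2.resize(m.astype(np.float32), size)
                    for m in (a.patch, a.motion_flow, a.object_heatmap)]
            frames.append(np.stack([m if m.ndim == 2 else m.mean(axis=2) for m in maps]))
        return np.stack(frames)   # shape: (time, channels, H, W)
```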
[043] For use in automated activity recognition, a video tube is produced from the raw images by removing areas that are not related to the driver (or passenger) activity. The video tube can contain several pieces of information that describe an activity taking place inside a vehicle. Activities inside a vehicle typically involve the hands of the driver and the passengers. A video tube can be generated that contains a windowed portion of the original image that contains a human hand and the object with which the hand is interacting. A video tube can also contain the hand movement profile (for example, hand motion flow or hand movement). In some embodiments, the video tube may include a heat map. A heat map can be defined as the location of the first object (for example, a hand) within the determined active area of activity. The location information can be represented with respect to a coordinate system centered on the active area or on the image frame. Using the image frame coordinates enables capturing the relative positions of multiple objects of the first type (for example, multiple hands visible in a given image).

[044] In some embodiments, a video tube is the collection of rearranged parts of the image frames of the video streams, and can include images of hands, other objects of interest and corresponding feature maps. In some embodiments, a video tube contains a data structure that can be referred to as a scalable tensor video tube, which is used to organize information regarding the parts of the original image that include hands and objects, the hand movement profile, and the heat map of the objects being used by each occupant inside the vehicle cabin.

[045] To generate a video tube, hand area detection is performed first on the raw video streams to locate approximate hand areas in the image data. A fine-grained hand detector and an activity-oriented object detector are then applied within the approximate hand areas. Bounding boxes for these hand and object locations are determined and fused to generate the video tubes. A complete picture of hands and objects is contained in the video tubes while the scale of the video tubes is kept as small as possible. For example, a video tube can be generated for figure 1 exactly in the area of the hand operating the radio. Keeping the video tubes as small as possible reduces the amount of image processing that needs to be performed to identify activities.

[046] In some embodiments, hand movement (for example, a hand gesture) can be detected using optical flow processing that is performed only on the video tubes. The optical flow produces temporal information regarding a hand or hands of the occupant. Hand motion detection information and detected object information can be provided to a recurrent neural network (or other automated decision technology) to detect and identify the occupant's activities. In other embodiments, each hand section of a video tube can be provided to a feature extractor. Temporal information related to the extracted features can then be provided to a deep-learning-based classifier to identify the activity.
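As a hedged illustration of the optical-flow processing mentioned in [046], dense flow can be computed only inside the AAA patches of consecutive frames; the sketch below uses OpenCV's Farneback flow and assumes equally sized BGR patches.

```python
# Sketch: dense optical flow computed only inside the AAA patch of two
# consecutive frames, producing the temporal hand-motion features that go into
# the video tube. Assumes both patches are BGR crops of identical size.
import cv2
import numpy as np

def hand_motion_flow(prev_patch: np.ndarray, curr_patch: np.ndarray) -> np.ndarray:
    prev_gray = cv2.cvtColor(prev_patch, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_patch, cv2.COLOR_BGR2GRAY)
    # flow has shape (H, W, 2): per-pixel displacement between the two patches.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)
    return flow
```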
[047] Figure 2 is a high-level flow chart of a method of machine recognition of an activity. Method 200 can be performed in a vehicle using a vehicle processing unit that can include one or more processors. In operation 205, a raw image video stream is obtained or read using a video source. The video source can be an imaging device in a vehicle cabin (for example, of a car, truck, tractor, plane, etc.). The video stream includes images of a first object and a second object. The first object can include a vehicle occupant's hand, and the second object can include an object with which the hand is interacting (for example, a smartphone, a drink container, etc.).

[048] In operation 210, a global region of interest (ROI) is detected based on the presence of the first object in the images. An ROI detector receives the raw image as input and produces the coarse region of interest. The image frames are processed to detect the first object in the image frames. For activity detection, the first object can be a hand. Machine learning can be used to recognize features in an image that represent a human hand. The ROI can contain a detected hand area and nearby objects within a certain range of the hand area.

[049] In operation 215, the active area of activity (AAA) within a part of the image frames is determined. Locations of the first object in the video frames are bounded by the determined areas. The areas are scaled and minimized recursively to reduce the necessary image processing while still including the entire image of the first object. The vehicle processing unit includes an active area generator. The active area generator attempts to achieve minimal window dimensions for generating a video tube while retaining information regarding objects such as hands and activity-related objects. The image processing used to establish the active areas enclosing the image is more extensive than the image processing used to identify the ROI. When the first object is a hand, the AAA is generated and updated using the locations of hands and of objects near the hands, by proposing search boxes at different scales and different aspect ratios. The AAA is used to generate a video tube. The video tube is a specific organization of the image data that is optimized for further processing to identify activity.

[050] The generation of the video tube 217 is performed using operations 220 to 245 of method 200 of figure 2. In operation 220, key features of the video tube (or video tubes) are determined. The vehicle processing unit includes a feature generator. The feature generator determines movement of the first object and locations of a second object within the active areas of the image frames. Motion determination can involve tracking the location of the image of the first object in a previous frame versus the current frame. If the first object is a human hand, the feature generator receives the AAA as input and can produce key features such as hand motion flow information and a heat map of the second object or objects, which can be objects with which the detected hand is interacting. Since the active areas are optimized by recursive window minimization, the image processing required to determine motion is reduced.

[051] In operation 225, spatial normalization is performed. In spatial normalization, a video tube for a specific time "T" is determined using the key feature information obtained at that time "T". This information is then concatenated together and normalized to a dimension in which each piece of information can be used as an image frame and feature data.

[052] In operation 230, rearrangement of key features is performed. In the rearrangement of key features, the frames of key features are organized into two structures.
The first structure stores the key feature information for multiple occupants of the vehicle. In operation 235, the method may include identity assignment to assign key features to different occupants. The second structure organizes the frames of key features into a scalable tensor video tube, which is described below. Key feature information obtained for a specific time "T" can be a part of the scalable tensor video tube. In operation 240, image information of the first and second objects (for example, hand-object information) can be used to optimize the AAA again; this is referred to as AAA tracking.

[053] In operation 245, temporal normalization is performed. In some aspects, hand-object pairs and movement information can be concatenated together, and video tubes can be optimized for objects such as hands in the image frames. However, before the generated video tubes can be supplied to the activity recognition process, the video tubes must be scaled to the same dimension. The video tubes can be scaled (up or down) in order to obtain a stream of multiple video tubes with the same dimensions (temporal normalization).

[054] In operation 250, activity recognition is performed using the video tubes. The video tubes can be provided as input to an activity classifier. The activity classifier can be a deep-learning-based classifier that identifies an activity according to the determined movement of the first object and the locations of the second object or objects. For example, hand area video tubes can be provided as input to the activity classifier, and hand-object information can be used to identify an activity of a vehicle cabin occupant. Since the video tube is a small processing area, less power and computing time is required for activity recognition.

[055] The identified activities of the vehicle cabin occupant can be monitored. An alert can be generated by the vehicle processing unit according to the identified activity. For example, the machine recognition can be included in an automated driver assistance system, and the identified activity can indicate that the driver is inattentive to operating the vehicle. The alert can be an audible alert generated using a loudspeaker, or it can be a visual alert generated using a screen in the vehicle cabin. The driver can then take corrective action.

[056] The method of figure 2 can be performed by modules of the vehicle processing unit. The modules can include or be included in one or more processors such as a microprocessor, a video processor, a digital signal processor, an ASIC, an FPGA or another type of processor. The modules can include software, hardware, firmware or any combination thereof to perform the described operations.

[057] Figure 3 is a block diagram of an example of an automated activity recognition system. In the example of figure 3, system 300 includes an activity recognition device 310 and a video source 305 operationally coupled to the activity recognition device 310. The video source 305 can include a near-infrared (NIR) camera or set of NIR cameras, and generates a video stream that includes frames of image data.

[058] System 300 in the example of figure 3 is included in a vehicle 301, and the activity recognition device 310 can be a vehicle processing unit. In some embodiments, the activity recognition device 310 may include one or more video processors. The activity recognition device 310 includes a port 315 that receives the video stream and a memory 320 for storing the image frames of the video stream.
The activity recognition device 310 can include one or more video processors to process the image frames to perform machine recognition of activity using the video stream.

[059] The activity recognition device 310 includes a global ROI detector component 325, a dynamic AAA detector component 330, a key feature generator component 335, a spatial normalization component 340, a key feature rearrangement component 345, a temporal normalization component 350 and an activity recognition classifier component 355. The components can include or be included in one or more processors such as a microprocessor, a video processor, a digital signal processor, an ASIC, an FPGA or another type of processor. The components can include software, hardware, firmware or any combination of software, hardware and firmware.

[060] Figure 4 is a flow chart of an example of a machine- or computer-implemented method for detecting a global region of interest (ROI). The global ROI is a coarse or approximate hand area of the video stream image data. Method 400 can be performed using the global ROI detector component 325 of the activity recognition device 310 of figure 3. The global ROI detector component 325 selects parts of the image frames based on the presence of the first object, such as a hand, for example. Detection by the global ROI detector component is coarse or approximate detection of the first object. The coarse detection of the presence of the first object is applied to a large area of the image frames. The coarse image detection can be a fast objectness detection that uses an image-level similarity method to detect the presence of the first object. The global ROI detector component 325 receives raw image data as input and produces the global ROI. The global ROI can contain a hand area and nearby objects within a certain range of the hand area.

[061] In operation 405, raw image data is received from the video source or retrieved from memory. Raw images can be color, gray-level, near-infrared, thermal infrared, etc., and are acquired from an image sensor assembly. The image data includes a first image frame and a subsequent image frame of the video stream.

[062] These images are masked with a global region of interest (ROI) that can be determined offline for specific camera configurations using 3D information. The global ROI defines the part or parts of the image frames where there are prominent and important objects for recognizing action and activity, such as hands, human bodies and objects. In a vehicle, the global ROI refers to areas in the vehicle cabin where the hands of the occupants (including the driver and passengers) are potentially visible in the video images. In other words, the global ROI contains all possible hand areas and nearby objects within a certain range of the hand areas. This allows hands that are outside the vehicle, for example beyond the front and rear windshields or away from the side windows, to be excluded from the processing to identify activity.

[063] The global ROI is used to select the areas where subsequent processing will be applied. By determining whether the global ROIs of consecutive images have a high similarity score (this score can be obtained, for example, by using a change detection technique or a logistic regression method), it is also possible to use the global ROI to skip many similar images in order to speed up the activity recognition process by focusing only on video images that are different. Such a similarity threshold is either set manually or learned automatically from the data. This threshold can be used to control the number of frames to be skipped (which can help make better use of available computational resources).
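A minimal sketch of such a similarity gate is shown below, using mean absolute pixel difference inside the global ROI as a simple stand-in for the change detection or logistic regression methods mentioned above; the threshold value is a placeholder.

```python
# Sketch: skip processing of a frame when its global ROI is too similar to the
# previously processed one. Mean absolute difference stands in for the change
# detection / logistic regression scoring mentioned in the text.
import numpy as np

def similarity_score(roi_prev: np.ndarray, roi_curr: np.ndarray) -> float:
    diff = np.abs(roi_curr.astype(np.float32) - roi_prev.astype(np.float32)).mean()
    return 1.0 / (1.0 + diff)        # 1.0 means identical ROIs

def should_skip(roi_prev, roi_curr, threshold=0.9):
    # The first frame is never skipped, which triggers an initial detection.
    return roi_prev is not None and similarity_score(roi_prev, roi_curr) > threshold
```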
The global ROI can also be extracted using deep-learning-based objectness detectors that also extract features to represent salient and important objects of different shapes, colors, scales and positions. Using objectness detectors and a given set of training data, objectness scores for all image pixels in the training images and corresponding bounding boxes are obtained and aggregated on a spatial map, and the spatial map is used to establish the global ROI.

[064] In operation 410, the similarity between image frames is determined. In some embodiments, the global ROI detector component 325 includes a spatial constraint component (not shown). The raw image frames are provided to the spatial constraint component, which can use a similarity estimation algorithm to determine a similarity score for the images. In the similarity estimation, similar images receive a higher similarity score. The similarity score may reflect overall similarity between the first image and the second image, or it may reflect similarity between a first windowed part of the first image frame and the first windowed part of the second image frame. In certain embodiments, logistic regression is used to determine the similarity score for the images. In variations, the output of the logistic regression is binary, and the images are either considered similar or not similar.

[065] In operation 415 of figure 4, the similarity estimation skips images that are indicated to be similar according to the similarity score, in order to speed up hand detection in the images. The similarity score threshold that determines whether to skip an image can be specified manually (for example, programmed) or learned by the activity recognition device from training data. The number of skipped frames in the hand detection processing is determined by the similarity score threshold. The initial similarity score can be cleared or set to zero as soon as the first image is received, in order to trigger an initial detection of the first object.

[066] In operation 420, if the similarity score between the two image frames is below the specified similarity threshold, hand detection is triggered and performed on the second image frame. The global ROI detector can include a machine learning component to perform object detection. The machine learning component can use deep learning technology (for example, convolutional neural networks (CNN), recurrent neural networks (RNN) or long short-term memory (LSTM)) to learn to recognize features in an image that represent the first object of interest. In some aspects, these image features may include different shapes, colors, scales and movements that indicate a hand. The output of the hand detection can be a detection confidence with respect to one or both of category and type. The detection confidence can be a probability of correct detection. The detection output for the first object also includes a bounding box that defines the object detection boundary. The bounding box can be used to calculate intersection over union in the image area. Figure 5 is an illustration of intersection over union (IoU) for image processing windows, where IoU = (Intersection Area) / (Union Area).
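A straightforward implementation of the IoU measure defined above, for axis-aligned boxes given as (x, y, width, height), is:

```python
# Intersection over Union (IoU) for two axis-aligned bounding boxes (x, y, w, h),
# as defined above: IoU = intersection area / union area.
def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```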
[067] Returning to figure 4, in operation 425 the IoU and confidence are used to determine whether the results of the image detection are reliable. Thresholds for the IoU and confidence can be specified manually or determined using machine training, as for the similarity score threshold. If either the IoU or the confidence does not satisfy its threshold, the image is omitted and the method returns to operation 405 to acquire the next image for analysis. In operation 435, the global ROI is established for the image data frames. The bounding box output is treated as the initial global ROI for the video tube.

[068] Returning to figure 3, the activity recognition device 310 includes a dynamic active area detector component 330 to determine an active area of activity (AAA) using the global ROI determined by the global ROI detector component 325.

[069] Figure 6 is a flow chart of a method of determining an AAA. In operation 605, the previous global ROI and Active Areas of Activity (AAAs) are combined in order to find a local region of interest for each AAA. The local ROI is used to determine the search area and is derived from the previous AAA after tracking it within the global ROI. The local ROI is larger than its AAA to ensure complete detection of nearby hands and objects. Search boxes or areas can be used to locate objects of interest in the global ROI. In some embodiments, search boxes of different scales and ratios are proposed as search areas. Based on the approximate object area of the global ROI, search areas at different scales and aspect ratios (or length-to-width ratios) are generated. In some aspects, a predefined set of scales and length-to-width ratios can be used to multiply the bounding box of a hand area to generate search boxes. This predefined set can be established manually based on experiment or learned automatically from original data, such as by using a clustering method, for example. The detection of hands and objects can be performed in these generated search areas.

[070] Figures 7A-7D illustrate the determination of search windows or search boxes for hand and object detection. Figure 7A represents the approximate area initially identified for the first object. In some aspects, the center of the initial hand area is determined and the size of the hand area window is scaled relative to the center to identify a search area. Figure 7B is an example of scaling down the initial search area while changing the length and width of the scaled windows. The window scale is reduced by a factor of 1 to n, with n being a positive integer. Figure 7C is an example of maintaining the scale of the initial search area, and figure 7D is an example of expanding the scale of the initial search area. The scaling used can be predefined and specified manually, or the scaling can be determined through machine training. For example, the scaling can be learned from initial data using a clustering method.
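A minimal sketch of proposing search boxes at several scales and aspect (length-to-width) ratios around the current hand area, as described in [069] and [070] and illustrated in figures 7A-7D, is shown below; the scale and ratio sets are illustrative placeholders for values that would be set manually or learned by clustering.

```python
# Sketch: generate candidate search boxes around a hand area at multiple scales
# and aspect ratios, centered on the hand area. SCALES and RATIOS are placeholders.
SCALES = (0.75, 1.0, 1.5, 2.0)
RATIOS = (0.5, 1.0, 2.0)          # width : height

def propose_search_boxes(hand_box, frame_w, frame_h):
    x, y, w, h = hand_box
    cx, cy = x + w / 2.0, y + h / 2.0      # keep boxes centered on the hand area
    boxes = []
    for s in SCALES:
        area = (w * s) * (h * s)
        for r in RATIOS:
            bw = (area * r) ** 0.5          # preserve the scaled area at ratio r
            bh = area / bw
            bx = min(max(0.0, cx - bw / 2.0), max(0.0, frame_w - bw))
            by = min(max(0.0, cy - bh / 2.0), max(0.0, frame_h - bh))
            boxes.append((bx, by, bw, bh))
    return boxes
```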
[071] Returning to figure 6, detection in operation 610 can be performed in the proposed search area to identify an image of the first object. In operation 612, the search region can be updated based on the result of the hand detection, and object detection can be performed in the search area, such as to identify objects with which a detected hand may be interacting. This iteration of image detection of the first object is more detailed than the coarse search area detection described earlier in this document. The window size of the video tube can be minimized by reducing the window size based on the result of the image detection. In some aspects, a hand detector is applied to the local ROI to find the hand's locations in the current video frame (image). Each AAA can correspond to a single hand (for example, an AAA is based on one hand region).

[072] Figure 8 is an illustration of an example of more detailed image detection. In some aspects, a coarse or approximate area of a hand 805 and the resized search area window are introduced into a deep convolutional neural network 810. The deep convolutional neural network 810 can be a deep-learning-based image detector that is trained to detect hands and objects 815 within the resized windows. In certain embodiments, the deep-learning-based image detector is trained to detect hands and objects related to the activities that take place in a vehicle cabin.

[073] This detailed version of hand and object detection can be computationally intensive. However, the detailed hand detection operates within the extent of the search area windows, which reduces the area of the image to be processed to identify a hand or object. This can speed up detection, and it can also narrow the focus of detection to an area that is more likely to contain a hand, which reduces the possibility of wrong detections. Additionally, the spatial constraint component of the global ROI detector can determine (for example, using logistic regression) when the processing of an image can be omitted based on the similarity of images.

[074] The coarse or approximate image detection used by the global ROI detector component identifies and selects parts of an image that may contain an object of the first type (for example, a human hand for the task of recognizing activity in a vehicle). This is followed by more detailed and accurate, but potentially more computationally intensive, object detection by the AAA detector component. The two-stage hand detection process decreases the overall computational load and improves detection accuracy by allowing the detector to focus on the areas that are most likely to include hands. The coarse image detection is fast, computationally inexpensive and applied to a large part of the image, and preferably has a low false negative rate (that is, it does not miss any of the real hand areas, although it may incorrectly identify or select non-hand areas as hands).

[075] Considering these trade-offs, the coarse image detection can be a fast objectness detection (for example, of a human hand) that uses an image-level similarity method such as one or more of logistic regression, a tracking algorithm, a conventional classifier (using simple region descriptors and traditional classification methods such as support vector machines, boosting, random trees, etc.), and a deep-learning-based classifier. The detailed object detection can be a conventional classifier as well as a deep-learning-based classifier. In the event that both the coarse image detection and the detailed object detection use deep learning models, the coarse image detection can operate at a lower spatial resolution, use only early layers of the deep architecture (for example, a few early convolutional layers connected to a fully connected layer), be trained as a binary classifier without estimating the object window size, or use a combination of all of these.
The detailed object detection can also use the feature maps generated by the coarse image detection and use much deeper processing layers.

[076] Detailed hand detection cannot always guarantee correct results. In some embodiments, the dynamic AAA detector component 330 can apply the results of the detailed image detection to a false detection filter designed to remove false positive and false negative detections, or to determine erroneous detections. This can result in consistent categories for the same hand and detected objects and provide reliable information for activity recognition. Newly detected hand and object locations are updated based on valid hand and object categories.

[077] Video tube windows are dynamically sized, and search regions are updated based on the detection of the first object and the second object. The detected location of the first object can be used to update the search region to detect nearby objects. For different applications, the AAA can be based on different objects, including a human body, face, legs, animals, etc. As part of generating a video tube, the video processor tries to minimize the resolution of the video tube in order to minimize the amount of image processing required to identify an activity. The window size of the video tube is determined recursively to find a window size that is minimized but still includes the entire identified area of the first object. In some embodiments, the window size is determined recursively according to the example method shown in figures 7A-7D. The window size is updated recursively based on the approximate determined areas, and search areas or windows of different scales and length-to-width ratios are generated.

[078] Returning to figure 6, in operation 617 overlap scores between the corresponding hand box and each nearby object can be computed for each detected hand. The IoU (Intersection over Union) as shown in figure 5 can be used to measure the overlap area between two bounding boxes. However, nearby objects are usually occluded by the hand, and thus only partial objects can be detected in many cases, which generates low IoU scores and can cause relevant objects to be treated as irrelevant. The distance between two bounding boxes is another measure for computing the score. However, distance is subject to the sizes of the bounding boxes. For example, bounding boxes A and B may have the same distance to C, yet only bounding box A should be considered a nearby object. Therefore, another method is to compute an overlap score that takes into account both the distance and the size of the bounding boxes while being able to measure overlap between occluded objects and an image of the hand.

[079] An example formula for computing the overlap score, which combines the intersection area with the distance and sizes of the two bounding boxes, is given as Equation (1).
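As an illustration only, and not necessarily the Equation (1) referenced above, one possible overlap score that accounts for occlusion as well as bounding-box distance and size is sketched below; the specific normalization choices are assumptions.

```python
# Illustrative overlap score only (not necessarily Equation (1) above): the
# intersection is normalized by the smaller box's area so partially occluded
# objects still score highly, and a center-distance term is normalized by the
# hand-box size so "nearby" depends on scale rather than absolute pixels.
import math

def overlap_score(hand_box, obj_box):
    hx, hy, hw, hh = hand_box
    ox, oy, ow, oh = obj_box
    ix1, iy1 = max(hx, ox), max(hy, oy)
    ix2, iy2 = min(hx + hw, ox + ow), min(hy + hh, oy + oh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    inter_over_min = inter / min(hw * hh, ow * oh)
    # Center distance in units of the hand-box diagonal.
    dist = math.hypot((hx + hw / 2) - (ox + ow / 2), (hy + hh / 2) - (oy + oh / 2))
    dist_norm = dist / math.hypot(hw, hh)
    return inter_over_min / (1.0 + dist_norm)
```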
Claims (20)

1. Activity recognition device, characterized in that the device comprises: a port configured to receive, from a video source, a video stream of a first object and a second object; a memory configured to store instructions and image frames of the video stream; and one or more processors, where the one or more processors execute the instructions stored in the memory, the one or more processors configured to: select parts of the image frames based on the presence of the first object; determine areas within the parts of the image frames, where locations of the first object in the video frames are bounded by the determined areas; determine movement of the first object and locations of a second object within the areas of the image frames; and identify an activity according to the determined movement and the locations of the second object, and generate an alert according to the identified activity.

2. Activity recognition device according to claim 1, characterized in that the one or more processors are configured to: determine a similarity score between a first windowed part of a first image frame and the same first windowed part of a second image frame, wherein the active area is included in the first windowed part of the first and second image frames; omit processing of the first windowed part of the second image frame when the similarity score is greater than a specified similarity threshold; and, when the similarity score is less than the specified similarity threshold, perform detection of the first object in the second image frame to generate a second windowed part of the image frames of the video stream that is more likely to include an image of the first object than other parts of the image frames, and include the second windowed part of the image in a video tube that includes a collection of rearranged parts of the image frames of the video stream.

3. Activity recognition device according to claim 1, characterized in that the one or more processors are configured to recursively establish a window size of the active area, in which the window size is established to include the hand image.

4. Activity recognition device according to claim 3, characterized in that the first object is a hand and in that the one or more processors are configured to: determine a center of an active hand area; identify a search area by scaling a boundary of the active hand area relative to the determined center; perform hand image detection in the identified search area; and establish the window size according to a result of the hand image detection.

5. Activity recognition device according to claim 3, characterized in that the one or more processors are configured to: use the determined movement of the first object to predict a next window; perform image detection of the first object using the next window; replace a current window with the next window when the next window contains the boundaries of a detected image of the first object; and, when the boundaries of the detected image of the first object extend beyond the next window: merge the current window and the next window; identify an image of the first object in the merged windows; and determine a new minimized window size that contains the identified image of the first object.
6. Activity recognition device according to claim 1, characterized in that the first object is a hand, and in that the one or more processors are configured to: identify pixels of the determined areas that include a hand image; and track a change in the pixels that include the hand image between windowed parts of the image frames to determine a hand movement.

7. Activity recognition device according to claim 1, characterized in that the first object is a hand, and in that the one or more processors are configured to: determine locations of fingertips and joint points in the image frames; and track the change in the fingertips and joint points between windowed parts of the image frames to determine a hand movement.

8. Activity recognition device according to claim 1, characterized in that the first object is a hand, and in that the one or more processors are configured to: determine a hand movement; and identify the activity using the determined hand movement and the second object.

9. Activity recognition device according to claim 8, characterized in that the one or more processors are additionally configured to: compare a combination of the determined hand movement and the second object to one or more combinations of hand movements and objects stored in the memory; and identify the activity based on a comparison result.

10. Activity recognition device according to claim 8, characterized in that the one or more processors are additionally configured to: detect a sequence of hand movements using the determined areas of the image frames; compare the detected sequence of hand movements to a specified sequence of hand movements of one or more specified activities; and select an activity from the one or more specified activities according to a comparison result.

11. Activity recognition device according to claim 1, characterized in that the one or more processors are additionally configured to generate a video tube that includes a collection of rearranged parts of the image frames of the video stream that include the first and second objects, and corresponding feature maps.

12. Activity recognition device according to claim 11, characterized in that the one or more processors are configured to store video tube information in the memory as a scalable tensor video tube; and in that the activity classifier component is configured to apply the scalable tensor video tube as input to a deep learning algorithm performed by the activity classifier component to identify the person's activity.

13. Activity recognition device according to claim 12, characterized in that the one or more processors are configured to select a row-wise configuration of parts of the image frames within the scalable tensor video tube according to the identity of the person and apply the selected row-wise configuration as the input to the deep learning algorithm to identify the person's activity.

14. Activity recognition device according to claim 12, characterized in that the one or more processors are configured to select a column-wise configuration of parts of the image frames within the scalable tensor video tube according to the identities of multiple people and apply the selected column-wise configuration as the input to the deep learning algorithm to identify interactivity between the multiple people.
[15] 15. Activity recognition device according to claim 12, characterized in that the one or more processors are configured to select a multi-column configuration of parts of the image frames within the scalable tensor video tube according to identities of multiple groups of people, and to apply the selected multi-column configuration as the input to the deep learning algorithm to identify multiple interactivities between the multiple groups of people.

[16] 16. Activity recognition device according to claim 1, characterized in that the video source includes an imaging assembly configured to provide a video stream of an image of a vehicle compartment; and wherein the one or more processors are included in a vehicle processing unit configured to identify an activity using the video stream of the vehicle compartment image.

[17] 17. Computer-implemented method of machine recognition of an activity, the method characterized in that it comprises: obtaining a video stream of a first object and a second object using a video source; selecting parts of image frames of the video stream based on the presence of the first object in the parts; determining areas within the parts of the image frames that bound locations of the first object; determining a movement of the first object and locations of the second object within the determined areas; identifying an activity using the determined movement of the first object and the locations of the second object; and generating one or both of an audible alert and a visual alert according to the identified activity.

[18] 18. Method according to claim 17, characterized in that determining the areas within the parts of the image frames that bound locations of the first object includes: receiving a first image frame and a subsequent second image frame from the video stream; determining a similarity score between a first windowed part of the first image frame and the first windowed part of the second image frame, wherein the location of the first object is positioned in the first windowed part of the image frames; omitting processing of the first windowed part of the second image frame when the similarity score is greater than a specified similarity threshold; and when the similarity score is less than the specified similarity threshold, enabling detection of the first object in the second image frame to generate a second windowed part of the image frames more likely to include the first object than other parts of the image frames, and including the second windowed part in the determined areas.

[19] 19. Non-transitory computer-readable storage medium, characterized in that it includes instructions that, when executed by one or more processors of an activity recognition device, cause the activity recognition device to perform operations comprising: obtaining a video stream of a first object and a second object using a video source; selecting parts of image frames of the video stream based on the presence of the first object in the parts; determining areas within the parts of the image frames that bound locations of the first object; determining a movement of the first object and locations of the second object within the determined areas; identifying an activity using the determined movement of the first object and the locations of the second object; and generating one or both of an audible alert and a visual alert according to the identified activity.
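Claims 12 to 15 recite a scalable tensor video tube from which row-wise, column-wise, or multi-column configurations are selected as input to a deep learning algorithm. One way to picture this, assuming a (person, time, height, width, channel) tensor layout that the claims themselves do not mandate, is sketched below.

```python
import numpy as np

def make_tensor_videotube(num_people: int, num_frames: int,
                          height: int = 64, width: int = 64,
                          channels: int = 3) -> np.ndarray:
    """Allocate a scalable tensor video tube.

    Assumed layout: one row per person, one column per time step, each
    cell a windowed image patch -> shape (P, T, H, W, C).
    """
    return np.zeros((num_people, num_frames, height, width, channels),
                    dtype=np.float32)

def row_wise_input(tube: np.ndarray, person_idx: int) -> np.ndarray:
    """Row-wise configuration (claim 13): all patches of one person,
    used to classify that person's own activity."""
    return tube[person_idx]                      # (T, H, W, C)

def column_wise_input(tube: np.ndarray, frame_idx: int,
                      person_idxs: list) -> np.ndarray:
    """Column-wise configuration (claim 14): patches of several people
    at the same time step, used to classify their interactivity."""
    return tube[person_idxs, frame_idx]          # (len(person_idxs), H, W, C)

def multi_column_input(tube: np.ndarray, frame_idxs: list,
                       group_idxs: list) -> np.ndarray:
    """Multi-column configuration (claim 15): several people over
    several time steps, e.g. for interactions between groups."""
    return tube[np.ix_(group_idxs, frame_idxs)]  # (G, F, H, W, C)
```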
[20] 20. Non-transitory computer-readable storage medium according to claim 19, characterized in that it includes instructions that cause the activity recognition device to perform operations including: predicting a next window using the determined movement of the first object; performing image detection of the first object using the next window; replacing a current window with the next window when the next window contains the boundaries of a detected image of the first object; and when the boundaries of the detected image of the first object extend beyond the next window: merging the current window and the next window; identifying an image of the first object in the merged windows; and determining a new minimized window size that contains the identified image of the first object.
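Claims 5 and 20 recite predicting a next window from the determined movement, replacing the current window when the detected object fits inside the predicted one, and otherwise merging windows and re-detecting. A sketch of that control flow, with `detect_in_region` as a hypothetical detector callable supplied by the caller, follows.

```python
def predict_next_window(window, motion):
    """Shift the current (x, y, w, h) window by the motion vector (dx, dy)."""
    x, y, w, h = window
    dx, dy = motion
    return (x + int(dx), y + int(dy), w, h)

def contains(window, box) -> bool:
    """True when `box` lies entirely inside `window`."""
    x, y, w, h = window
    bx, by, bw, bh = box
    return bx >= x and by >= y and bx + bw <= x + w and by + bh <= y + h

def merge_windows(a, b):
    """Smallest window that covers both input windows."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x0, y0 = min(ax, bx), min(ay, by)
    x1, y1 = max(ax + aw, bx + bw), max(ay + ah, by + bh)
    return (x0, y0, x1 - x0, y1 - y0)

def update_window(curr_window, motion, detect_in_region):
    """One tracking step in the spirit of claims 5 and 20.

    `detect_in_region(window)` is assumed to return the bounding box of
    the first object found inside `window`, or None when nothing is found.
    """
    nxt = predict_next_window(curr_window, motion)
    box = detect_in_region(nxt)
    if box is not None and contains(nxt, box):
        return nxt                               # next window replaces the current one
    merged = merge_windows(curr_window, nxt)
    box = detect_in_region(merged)               # re-identify the object in the merged area
    if box is None:
        return merged
    return box                                   # new minimized window around the object
```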
Patent family:
Publication number | Publication date
WO2019137137A1 | 2019-07-18
EP3732617A1 | 2020-11-04
KR20200106526A | 2020-09-14
US20190213406A1 | 2019-07-11
US10628667B2 | 2020-04-21
JP2021510225A | 2021-04-15
EP3732617A4 | 2021-03-10
US20200320287A1 | 2020-10-08
CN111587437A | 2020-08-25
US11100316B2 | 2021-08-24
Legal status:
Date | Code | Event
2021-12-07 | B350 | Update of information on the portal [chapter 15.35 patent gazette]
Priority:
Application number | Filing date | Patent title
US 15/867,932 (US10628667B2) | 2018-01-11 | Activity recognition method using videotubes
PCT/CN2018/120397 (WO2019137137A1) | 2018-12-11 | Activity recognition method using videotubes
Sulfonates, polymers, resist compositions and patterning process
Washing machine
Washing machine
Device for fixture finishing and tension adjusting of membrane
Structure for Equipping Band in a Plane Cathode Ray Tube
Process for preparation of 7 alpha-carboxyl 9, 11-epoxy steroids and intermediates useful therein an
国家/地区
|