Patent abstract:
A method for obtaining depth information from a scene is disclosed wherein the method comprises the steps of: a) acquiring a plurality of images of the scene by means of at least one camera during a time of shot wherein the plurality of images offer at least two different views of the scene; b) for each of the images of step a), simultaneously acquiring data about the position of the images referred to a six-axis reference system; c) selecting from the images of step b) at least two images; d) rectifying the images selected on step c) thereby generating a set of rectified images; and e) generating a depth map from the rectified images. Additionally devices for carrying out the method are disclosed.
Publication number: ES2747387A2
Application number: ES201990066
Filing date: 2017-02-06
Publication date: 2020-03-10
Inventors: Claret Jorge Vicente Blasco;Alvaro Carles Montoliu;Perino Ivan Virgilio;Uso Adolfo Martinez
Applicant: Photonic Sensors and Algorithms SL;
Main IPC class:
Patent description:

[0001] DEVICE AND METHOD FOR OBTAINING DEPTH INFORMATION FROM A SCENE
[0002]
[0003] Technical field
[0004]
[0005] The present invention falls within the field of digital image processing and relates more particularly to methods and systems for estimating distances and generating depth maps from images.
[0006]
[0007] Prior art
[0008]
[0009] The retrieval of 3D information from images is an extensively investigated problem in machine vision, with important applications in robotics, scene understanding and 3D reconstruction. Depth map estimation is obtained, for the most part, by processing more than one view (usually two views) of a scene, either by taking several images of the scene with one device or by taking several images using several devices (usually two cameras in a stereo camera setup). This is known as multiview (or stereo vision in the case of two cameras or two views) and is based on triangulation techniques. A general approach to extracting depth information for an object point is to measure the displacement of the image of this point across the various images captured from the scene. This displacement, or disparity, is directly related to the actual depth of the object. In order to obtain the disparity of a point, it is necessary to identify the position of the same point in the rest of the views (or at least in two views). This problem is usually solved using matching algorithms, a procedure that is well known in the field of image processing research. However, stereo vision techniques have two relevant weaknesses compared to the invention proposed in this document: first, the need to have (at least) two cameras is an important limitation in many cases, and second, stereoscopic approaches are much more computationally expensive since they usually require computationally intensive matching algorithms (that match patterns from two or more images).
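As an illustration of the triangulation relationship mentioned above, the following minimal sketch (not taken from the patent text; the numeric values are assumptions chosen only for the example) converts a measured disparity into a depth for a rectified camera pair:

```python
import numpy as np

def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Classic stereo triangulation Z = f * B / d for a rectified pair.

    disparity_px    : shift of a point between the two views (pixels)
    focal_length_px : focal length expressed in pixels
    baseline_m      : distance between the two camera centres (metres)
    """
    disparity_px = np.asarray(disparity_px, dtype=float)
    with np.errstate(divide="ignore"):
        return focal_length_px * baseline_m / disparity_px

# Example: f = 3000 px, B = 5 cm; smaller disparities mean larger depths
print(depth_from_disparity([30.0, 3.0, 0.3], 3000.0, 0.05))  # [5. 50. 500.] metres
```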
[0010] An alternative to having multiple devices or taking multiple photographs of a scene would be to use a plenoptic camera. Plenoptic cameras are imaging devices that can capture not only spatial information but also angular information of a scene, in a structure called a light field. Typically, plenoptic cameras are composed of a main lens (or a set of lenses equivalent to said main lens), a microlens array (MLA) and a sensor.
[0011] Time of Flight (ToF) cameras produce a depth map that can be used directly to estimate the 3D structure of the object world without the help of traditional machine vision algorithms. ToF cameras work by measuring the phase delay of reflected infrared (IR) light previously transmitted by the camera itself. Although already present in some mobile devices, this technology is still far from being accepted as a regular capability because it involves much higher volume and power dissipation (the imaging camera, the IR transmitter and the IR camera, plus the procedure for pairing images between both cameras). In addition, the distance that can be discriminated with feasible IR transmitters in the art is quite limited, and outdoor conditions on sunny days limit its use even further, since the large light power from daylight masks the IR sensors.
[0012] Usually, mobile devices incorporate at least one camera to take images and videos. Integrated cameras on mobile devices provide many capabilities to the user; however, among these capabilities manufacturers cannot offer a realistic depth map of a scene when only one camera is available.
[0013] There are approaches that address the depth estimation task using only a single still image as input, most of them based on perspective heuristics and on the size reduction of objects known to be of constant size. However, these approaches formulate hypotheses that often fail to generalize to all possible image situations, such as assuming a particular perspective of the scene. They also rely on prior knowledge about the scene, which is generally a highly unrealistic assumption. Depth maps obtained in this way, while useful for other tasks, will always be inherently incomplete and not accurate enough to produce visually satisfying 3D images.
[0014] Another methodology for obtaining 3D information from images is synthetic aperture integral imaging (SAII). This method requires an array of cameras (or a mechanical movement of a camera that takes pictures sequentially, simulating an array of cameras), which obtains multiple high-resolution perspectives with the camera at different points of the array.
[0015] The present invention uses some of the concepts derived from methods used by the prior art in stereo photography in a novel way: a first step in stereo photography is the "calibration" of the cameras (a step that can be avoided in the present invention because the camera is assumed to be already calibrated); a second stage is called "rectification" (in which the images from the two cameras in the stereoscopic pair are processed appropriately to deduce the images that would have been recorded if the two cameras of the stereoscopic pair were fully aligned and coplanar); the "camera rectification" in the present invention is very different from what is performed in stereo imaging and is described in detail below. The third stage in stereo photography is the "correspondence", a procedure to identify patterns in the two already "rectified" images of the stereoscopic pair, followed by triangulations to calculate the distances to the object world and to compose the 3D images. The three steps described, "camera calibration", "image rectification" and "correspondence between views" (usually two views), are commonly referred to together as "registration". The invention uses the same terminology, but the "matching" and "rectification" (and therefore "registration") procedures are different from those of the prior art, i.e. different with respect to stereo cameras or multi-view cameras.
[0016] The proposed invention assumes a situation where the user wants to obtain a high resolution depth map from a conventional camera with a single shot acquisition in real time. The invention benefits from the movement that the camera experiences during the shooting time, this movement being recorded from the data provided by, for example, the accelerometer and gyroscope devices (devices that are present in almost any mobile phone at the time this patent is drawn up). The image processing proposed herein improves on prior art approaches to 3D vision in terms of the number of images (and hence the number of cameras) required, computational efficiency and power requirements. On the other hand, the invention improves on approaches based on plenoptic cameras in terms of the spatial resolution and the reliability at great depths of the resulting depth map.
[0017]
[0018] Summary of the invention
[0019] The processing method described herein implements an extremely simplified matching algorithm between multiple images captured by a mobile device with a single conventional camera; the multiple images are captured sequentially, and the position at which each image is captured can be calculated using the accelerometer, gyroscope, or any other such capability built into the mobile device, car or any moving object. Once the image matching is done, the images are used to create a dense depth map of the scene. The images are taken in a single shot by a portable mobile device, and the movement of the mobile device can be detected and processed during the time span in which the shot takes place. This movement can be caused by the inherent movement of the hands (hand tremors), by vibrations from incoming calls (conveniently programmed to vibrate while shooting a photograph or video), by the camera being mounted on a moving object (for example, a vehicle or car), or because the user is moving. The methods described in this document can be efficiently adapted for implementation on parallel processors and/or GPUs (increasingly widespread), as well as on the specific parallel processors of battery-powered mobile devices. The invention provides real-time processing for video recording.
[0020] For the description of the present invention the following definitions will be taken into account hereinafter:
[0021] - Plenoptic camera: a device that can capture not only the spatial position but also the direction of arrival of the incident light rays.
[0022] - Light field: four-dimensional structure LF(px, py, lx, ly) that contains the information from the light captured by the pixels (px, py) under the microlenses (lx, ly) in a plenoptic camera or synthetic aperture integral imaging system.
[0023] - Depth: distance between the plane of an object point of a scene and the main plane of the camera, both planes being perpendicular to the optical axis.
[0024] - Epipolar image: two-dimensional slice of the light field structure obtained by choosing a certain value of (px, lx) (vertical epipolar) or (py, ly) (horizontal epipolar), as described in Figure 3 (see the sketch after this list).
[0025] - Epipolar line: set of connected pixels within an epipolar image corresponding to image edges in the object world.
[0026] - Plenoptic view: two-dimensional image formed by taking a slice of the light field structure choosing a certain value (px, py), the same (px, py) for each of the microlenses (lx, ly).
[0027]
[0028] - Depth map: two-dimensional image in which the calculated depth values of the object world (dz) are added as an additional value to each position (dx, dy) of the two-dimensional image, composing (dx, dy, dz). Each pixel in the depth map encodes the distance to the corresponding point in the scene.
[0029] - Microlens array (MLA): array of small lenses (microlenses).
[0030] - Microimage: image of the main aperture produced on the sensor by a specific microlens.
[0031] - Reference line (baseline): distance between the centers of the apertures with which two images are taken (by plenoptic cameras, conventional cameras or any camera).
[0032] - Stereo matching (also called matching algorithms): this term refers to the procedure of, given two images of the same scene, determining which pixels of one image represent the same points of the scene as which pixels of the second image. An analogy can be made with the human eyes; the problem is then which points observed by the left eye correspond to which points observed by the right eye.
[0033] - Shot: act of pressing the button in order to take a picture. Many frames may ultimately be acquired during this action.
[0034] - Shooting: the state of having pressed the button in order to take a picture.
[0035] - Exposure: a camera sensor is exposed to incoming light if its aperture is open, allowing light to enter the camera.
[0036] - Accelerometer: device that records the linear acceleration of movements of the structure to which it is attached (usually in the x, y and z directions).
[0037] - Gyroscope: device that records angular rotation (as opposed to the linear acceleration of the accelerometer), usually with respect to three rotation axes (pitch, roll and yaw; as opposed to x, y and z in accelerometers).
[0038] - IMU and AHRS: Inertial Measurement Units (IMU) and Attitude and Heading Reference Systems (AHRS) are electronic devices that monitor and report an object's specific force, angular velocity, and sometimes the magnetic field surrounding the body, using a combination of accelerometers and gyroscopes, and sometimes also magnetometers. IMUs and AHRSs are typically used in aircraft, including unmanned aerial vehicles (UAVs), and in vessels, including submarines and unmanned underwater vehicles (UUVs). The main difference between an inertial measurement unit (IMU) and an AHRS is the addition, in an AHRS, of an integrated processing system (which may include microprocessors and memories, for example) that provides attitude and heading information, compared to IMUs, which only deliver sensor data to an additional device that calculates attitude and heading.
[0039] - Speedometer: an instrument that measures and indicates the change in position of an object over time (speed).
[0040] - GPS: the Global Positioning System (GPS) is a global navigation system based on satellites that provide geolocation and time information to a GPS receiver.
[0041] - Image rectification: in the context of this invention, the procedure of applying two-dimensional homographies to images acquired at different moments in time by moving cameras whose three-dimensional geometry is known, so that lines and patterns in the original images (referred to a six-axis reference system [x', y', z', pitch', roll' and yaw'] in which the moving camera shoots after a certain amount of time t1) are mapped onto aligned lines and patterns in the transformed images (referred to a six-axis reference system [x, y, z, pitch, roll and yaw] in which the camera was at time zero), resulting in two images (initially acquired at times t1 and zero) that are comparable as if they had been acquired by coplanar cameras with the same z, pitch, roll and yaw, and with "rectified" values of x and y that depend on the movement along those two axes (reference lines in x and y between time 0 and time t1). After image rectification, the shots at time 0 and at time t1 can be used to compose different views of "virtual stereo cameras", and/or different views of "virtual multi-view cameras" and/or different views of "virtual plenoptic cameras".
[0042] - Mobile device: small computing device, generally small enough to be held and operated by hand. Mobile devices also have built-in cameras and other capabilities such as GPS, accelerometer, gyroscope, etc. They can be mobile phones, tablets, laptops, cameras and other devices.
[0043] - Conventional camera: device that only captures the spatial position of the light rays incident on the image sensor, so that each pixel of the sensor integrates all the light arriving from any direction across the entire aperture of the device.
[0044] - Synthetic aperture integral imaging (SAII): an array of image sensors (cameras) distributed in a homogeneous grid or (alternatively) randomly.
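The following minimal sketch (not part of the patent; the array sizes are arbitrary assumptions) illustrates the light field, plenoptic view and epipolar image definitions given above by slicing a 4-D array LF(px, py, lx, ly):

```python
import numpy as np

# Toy light field: 9x9 pixels per microlens, 64x48 microlenses (assumed sizes)
LF = np.random.rand(9, 9, 64, 48)

def plenoptic_view(LF, px, py):
    """2-D view: the same (px, py) pixel taken under every microlens (lx, ly)."""
    return LF[px, py, :, :]

def horizontal_epipolar_image(LF, py, ly):
    """2-D slice obtained by fixing (py, ly) and varying (px, lx)."""
    return LF[:, py, :, ly]

def vertical_epipolar_image(LF, px, lx):
    """2-D slice obtained by fixing (px, lx) and varying (py, ly)."""
    return LF[px, :, lx, :]

print(plenoptic_view(LF, 4, 4).shape)              # (64, 48)
print(horizontal_epipolar_image(LF, 4, 24).shape)  # (9, 64)
print(vertical_epipolar_image(LF, 4, 32).shape)    # (9, 48)
```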
[0045] In essence, the present invention discloses a method of obtaining depth information from a scene, comprising the steps of:
[0046] a) acquiring a plurality of images of the scene by means of at least one camera during a shooting time in which the plurality of images offers at least two different views of the scene;
[0047] b) for each of the images in step a), simultaneously acquiring data on the position of the images with respect to a six-axis reference system; c) selecting from the images of step b) at least two images;
[0048] d) rectifying the images selected in step c), thereby generating a set of rectified images; and
[0049] e) generating a depth map from the rectified images.
[0050] The position of the images during the shooting time can be measured from a set of positioning data acquired by means of at least one positioning device, for example a device selected from the group of: an accelerometer, an inertial measurement unit (IMU), an attitude and heading reference system (AHRS), a GPS, a speedometer and/or a gyroscope.
[0051] Inertial measurement units (IMUs) and attitude and heading reference systems (AHRSs) are electronic devices that monitor and report an object's specific force, angular velocity, and sometimes the magnetic field surrounding the body, using a combination of accelerometers and gyroscopes, and sometimes also magnetometers. Typically, IMUs and AHRSs are used in aircraft, including unmanned aerial vehicles (UAVs), and in vessels, including submarines and unmanned underwater vehicles (UUVs). The main difference between an inertial measurement unit (IMU) and an AHRS is the addition of an integrated processing system (which, for example, can include microprocessors and memories) in an AHRS that provides attitude and heading information, compared to IMUs, which only provide sensor data to an additional device that calculates attitude and heading.
[0052] In order to achieve better precision, the positioning device can be rigidly attached to at least one of the cameras.
[0053] In one embodiment, at least one camera is associated with a mobile device. Such a mobile device can be, for example, a smartphone, a tablet, a laptop or a compact camera.
[0054] In a more preferred embodiment, in step c), images are selected based on their positions in the six-axis reference system.
[0055] In a first preferred embodiment, the images are selected so that their relative distances are small enough to cause a maximum disparity of at most one pixel. In this case, in step e) a virtual synthetic aperture integral imaging system can be generated with the rectified images, thereby generating a set of epipolar images. In a second preferred embodiment, the images are selected so that their relative distances are large enough to cause a disparity of more than one pixel. In this case, in step e) a virtual stereo-plenoptic system is generated with the rectified images, thereby generating a set of extended epipolar images.
[0056] Once the epipolar images are generated by, for example, the first preferred embodiment or the second preferred embodiment, step e) may further comprise calculating epipolar line slopes from the epipolar images. With these epipolar images, a depth map of the scene can be generated by converting the slopes of the epipolar lines into depths. Additionally, the slopes can be obtained by analyzing the combined horizontal and vertical epipolar lines in a multidimensional matrix.
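As an illustration of the slope-to-depth conversion of step e), the sketch below (an assumption-laden simplification, not the patent's exact formula) treats the slope of an epipolar line, expressed as pixels of lateral shift per view step, as the per-view disparity of a virtual SAII built from the rectified images, and applies the usual triangulation relation:

```python
import numpy as np

def slopes_to_depths(slopes_px_per_view, focal_length_px, view_separation_m):
    """Convert epipolar-line slopes into metric depths.

    slopes_px_per_view : slope of each epipolar line, taken here as the lateral
                         shift (pixels) of the pattern between adjacent views,
                         i.e. the per-view disparity (illustrative convention)
    focal_length_px    : focal length of the camera in pixels
    view_separation_m  : separation between adjacent rectified views (metres)
    """
    slopes = np.asarray(slopes_px_per_view, dtype=float)
    with np.errstate(divide="ignore"):
        return focal_length_px * view_separation_m / slopes

# Example with assumed values: f = 3000 px, views separated by 0.2 mm
print(slopes_to_depths([2.0, 0.2, 0.02], 3000.0, 0.0002))  # [0.3, 3.0, 30.0] m
```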
[0057] The method of the present invention may comprise a step of generating a three-dimensional image of the scene from the depth map. In particular, depths/slopes can be calculated from the combined horizontal and/or vertical epipolar lines directly onto a two-dimensional sparse depth/slope map. Additionally, the sparse depth/slope map can be filled by applying image filling techniques to obtain depth/slope values for every pixel. Preferably, for depth estimation, calculations are performed only for those sensor pixels at which edges of the object world have been detected.
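A minimal sketch of the sparse-map filling idea just described, assuming depths were only computed on edge pixels and using a simple nearest-neighbour fill (the patent leaves the concrete filling technique open; SciPy's griddata is used here purely for illustration):

```python
import numpy as np
from scipy.interpolate import griddata

def fill_sparse_depth(sparse_depth):
    """sparse_depth: 2-D array with NaN where no depth/slope was estimated."""
    h, w = sparse_depth.shape
    yy, xx = np.mgrid[0:h, 0:w]
    known = ~np.isnan(sparse_depth)
    return griddata(
        points=np.column_stack([yy[known], xx[known]]),
        values=sparse_depth[known],
        xi=(yy, xx),
        method="nearest",
    )

# Example: a 5x5 map with depth known only at two "edge" pixels
sparse = np.full((5, 5), np.nan)
sparse[1, 1], sparse[3, 4] = 2.0, 10.0
print(fill_sparse_depth(sparse))
```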
[0058] In step a), at least one camera preferably moves during the shooting time, for example due to undetermined random movements caused by human hand tremors, or because at least one camera is attached to a structure that moves relative to the scene (for example, the camera is mounted or placed at an automobile location with wide visibility towards areas of interest outside the automobile, or is used to measure distances inside the automobile for applications such as gesture recognition).
[0059] Also, the plurality of images from step a) are preferably acquired by at least two cameras. In this case, the at least two cameras can be aligned with their known relative positions.
[0060] In a preferred embodiment, a video sequence is composed of at least two levels of two-dimensional images, foregrounds, optional mid-planes and backgrounds (located at different depths in the object world), and in said sequence the combination of the different levels of two-dimensional images in successive frames, and/or the change of occlusions in the two-dimensional images closer to the background, and/or the change of perspective and size in the two-dimensional images closer to the foreground, produces a 3D perception for the user. Furthermore, in an exemplary embodiment, only some or all of the epipolar images distributed along the vertical/horizontal dimension are considered in order to reduce statistical noise.
[0061] Brief description of the drawings
[0062] Next, a series of drawings that help to better understand the invention and that are expressly related to embodiments of said invention are briefly described, presented as non-limiting examples thereof.
[0063] Figure 1 shows a schematic of the plenoptic camera 100, including a sensor, an MLA (microlens array) and a main camera lens. It also shows two microimages.
[0064] Figure 2 shows an embodiment (2A) of a plenoptic camera with the pattern produced on the sensor (2B) for a point in the object world located farther from the camera than the conjugate plane of the MLA.
[0065] Figure 3 shows the procedure of forming the central horizontal and vertical epipolar images from a light field for radiating points in the object world. Figure 4 shows a possible embodiment of a multiview synthetic aperture integral imaging (SAII) system: a two-dimensional array of MxN cameras.
[0066] Figure 5A illustrates a comparison between the reference line of a plenoptic camera ("narrow reference line") and the reference line between two cameras in a stereo configuration ("wide reference line"). The camera at the top is a plenoptic camera and the camera at the bottom is a conventional camera, both arranged in a stereo-plenoptic camera setup.
[0067] Figure 5B shows a mobile device with a plenoptic camera and two additional conventional cameras (however, either of the two additional cameras can be either a conventional camera or a plenoptic camera).
[0068] Figures 6A and 6B illustrate the procedure of extending an epipolar image captured with a plenoptic camera using a two-dimensional image of the same scene captured by a conventional camera, with both cameras in a stereo configuration as in Figure 5A.
[0069] Figure 7A shows a six-axis reference system (x, y, z, pitch, roll, and yaw) that includes all possible movements that can be recorded by accelerometers and gyroscopes on a mobile phone (or any mobile device) that includes a camera.
[0070] Figure 7B shows an example of data acquired from the accelerometer of a mobile device (accelerations in the x, y and z directions).
[0071] Figure 8 shows the "rectification" and "matching" procedures in a stereoscopic pair system consisting of two cameras.
[0072] Figure 9 illustrates the "rectification" procedure for a 4-camera array. Figure 10 shows how the six-axis reference system associated with a given camera changes if the camera movement involves positive deltas in the x, y and z positions, as well as a negative yaw rotation.
[0073] Figure 11 shows a change in the camera reference system for a negative translation in x and y, a positive translation in z, as well as a positive roll rotation.
[0074] Figure 12 shows a change in the camera reference system for a positive translation in the x and y directions, a negative translation in z, as well as a positive pitch rotation.
[0075] Figure 13 illustrates a multiview system with an example of a single camera path moving through positions A, B, C, and D along a two-dimensional area the same size as an array of MxN cameras.
[0076] Figure 14 shows a 2-second recording of spatial movements (in millimeters) detected by accelerometers from mass-produced smartphones in the x and y directions as the phone is held by a human being to take a picture.
[0077] Figure 15A shows an electronic mobile device that includes a multiview system acquiring images that are processed through a processor that includes a multi-core processor.
[0078] Figure 15B is just like Figure 15A, but with two CPUs (central processing units) instead of a multi-core processor.
[0079] Figure 15C is just like Figure 15B, but the CPUs are replaced by a GPU (Graphics Processing Unit) that includes a large number of parallel processors. Figure 16A shows a stereo camera matching and image rectification procedure.
[0080] Figure 16B shows a method for calculating a depth map according to the invention in this disclosure.
[0081]
[0082] Detailed description
[0083] The present invention relates to a device and method for generating a depth map from a light field. A light field can be captured by multiple types of devices. For the sake of simplicity, a first example will be considered in which a conventional camera is moving while taking multiple images. The method described in this document creates an equivalent imaging system from those images captured by a moving device and applies plenoptic algorithms to generate a depth map of a scene.
[0084] In a further example, the method is described by applying it to systems consisting of multiple moving cameras, with the possibility of including one or more plenoptic cameras and one or more conventional cameras. However, the method described in this document can be applied to light fields captured by any other device, including other integral imaging devices.
[0085] Figure 1 illustrates an embodiment of a plenoptic camera 100: a sensor 1, microlenses 22, and the upper cylinder of optical components (or main lens 3 of the camera). Figure 1 shows two sets of rays that cross the main aperture of the plenoptic system and reach the central microlens and a microlens close to the central one. The microimages 11, 12 do not overlap if the optical system is properly designed.
[0086] Figure 2 shows an object point 210 that is farther from the camera than the plane conjugated with the microlens array 22 through the main lens 3. Its focus point is therefore closer to the main lens 3 than the position of the microlens array 22, so that it illuminates more than one microlens, and the pattern captured by the image sensor 206 is the one shown in Figure 2B. The grey levels in some of the microimages 212 correspond to partially illuminated pixels, while in the white pixels the entire pixel area has been hit by light coming from the object point 210 in the object world.
[0087] The basis of plenoptic imaging is that objects in the world at different depths or distances from the camera will produce different illumination patterns on the sensor of a plenoptic camera. The various patterns captured by the sensor can be represented in epipolar images, which provide implicit depth information of objects in the world.
[0088]
[0089] Figure 3 shows the procedure of forming the central horizontal 300 and vertical 302 epipolar images from the light field captured by the sensor 206 for radiating points 210 in the object world located at different distances from a plenoptic camera 100: at the distance conjugated with the microlenses 22 (Figure 3A), closer than the conjugate distance (Figure 3B), and farther than the conjugate distance (Figure 3C), thereby showing the inherent ability of plenoptic cameras to calculate distances in the object world. The case of Figure 3C is visualized in Figures 2A and 2B, which show how light from the radiating point 210 in the object world propagates inside the camera 100, crossing the microlenses 22 and imprinting a light pattern on the sensor 206.
[0090] The procedure to transform the patterns found in epipolar images into depth information requires the application of some image processing techniques that are well known in the prior art. Epipolar images contain epipolar lines; these are connected pixels that form a line (several sensor pixels corresponding to the same point in the object world), as clearly shown in Figures 2B and 3C for world radiation sources farther away than the focus point of the microlenses (epipolar line tilted to the left in Figure 3C), for world radiation sources closer than the focus of the microlenses (epipolar line tilted to the right in Figure 3B), and for world radiation sources focused exactly on the surface of the microlenses (vertical epipolar line in Figure 3A). The slopes of these epipolar lines are directly related to the shape of the illuminated pattern over the microlenses and to the corresponding depth of that point in the object world. In short, the patterns found in epipolar images, the epipolar lines, provide information about the depth of objects in the real object world. These lines can be detected using edge detection algorithms and their slopes can be measured. Therefore, the slope of each epipolar line provides a value that, when processed conveniently, yields the actual depth of the point in the object world that produced such a pattern.
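One possible (and deliberately simplified) realisation of the slope measurement just described is sketched below; it is not the patent's exact algorithm, but it shows the idea of locating the pattern in each row of an epipolar image and fitting a straight line through those positions:

```python
import numpy as np

def epipolar_line_slope(epi):
    """epi: 2-D epipolar image (rows = views, columns = spatial positions).
    Returns the slope as pixels of lateral shift per view (row) step."""
    rows = np.arange(epi.shape[0])
    # Sub-pixel position of the intensity peak in each row (centroid);
    # assumes exactly one bright line crosses this epipolar image.
    weights = epi - epi.min(axis=1, keepdims=True)
    cols = (weights * np.arange(epi.shape[1])).sum(axis=1) / weights.sum(axis=1)
    slope, _intercept = np.polyfit(rows, cols, deg=1)
    return slope

# Toy epipolar image: an edge shifting by 0.5 px per view
epi = np.zeros((9, 32))
for r in range(9):
    epi[r, int(round(10 + 0.5 * r))] = 1.0
print(epipolar_line_slope(epi))  # close to 0.5
```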
[0091]
[0092] Although it is a very promising technology, plenoptic imaging also comes at a cost, since the performance of a plenoptic camera is limited by the resolution of the microlens array, resulting in a much lower image resolution than that of traditional imaging devices. Additionally, plenoptic cameras are a fairly new technology that is still difficult to find in mobile devices.
[0093] Figure 4 shows a possible SAII (synthetic aperture integral imaging) configuration of a camera array. This array can feature MxN cameras, or a single camera that moves along the array (for example, starting at position 1, then 2, 3, etc., up to position MN) and takes a still image at each position of the array. The parallelism with a plenoptic camera is evident, and the same epipolar images previously described for a plenoptic camera can be obtained with an SAII. As is well known, a plenoptic camera such as in Figure 2A, with OxP pixels per microlens and TxS microlenses, is functionally equivalent to OxP conventional cameras with TxS pixels each, with the cameras uniformly spaced over the entrance pupil of the plenoptic camera. Similarly, the array of MxN cameras in Figure 4 (with QxR pixels per camera) is equivalent to a plenoptic system such as in Figure 1 with MxN pixels per microlens 22 and a number of pixels per equivalent camera 51 equal to the total number of microlenses in the equivalent plenoptic camera. The only practical difference is that the size of this number (QxR) in a SAII system, due to technology and implementation limitations, is much larger than the number of microlenses that can be designed in a plenoptic camera. Depth maps calculated from an SAII can benefit from a wider reference line than plenoptic cameras, since the distance between nodes in Figure 4 (which can be as high as several centimetres or even more) is greater than the distance between the OxP equivalent cameras of a plenoptic camera (several millimetres, and in small cameras down to a tenth of a millimetre). Figures 5A and 5B (two-dimensional side views of cameras that can obviously be extrapolated to a three-dimensional configuration, in which the third dimension would be perpendicular to the paper, without losing the generality of the subsequent discussion) compare the reference line of a plenoptic camera (the "narrow reference line" d, showing the separation d between the OxP equivalent cameras of a plenoptic camera with OxP pixels per microlens, each equivalent camera 51 having as many pixels as there are microlenses in the plenoptic camera) and the "wide reference line" B between the two cameras of a stereo camera, or an even wider SAII reference line: in a practical example of a stereo camera or an SAII the "wide reference line" B may be a few centimetres, whereas in a typical plenoptic camera the "narrow reference line" d can reach values as small as millimetres or even a tenth of a millimetre. Obviously, SAII systems offer a higher resolution than a plenoptic camera, and the wider reference line makes it more accurate to calculate depth at long distances from the camera.
[0094] The proposed invention obtains a high resolution depth map from a conventional camera with a single-shot acquisition and, in the case of video recording, the depth map is obtained in real time. The invention uses the movement and vibrations experienced by the camera during the time a shot is taken to obtain a sequence of frames, thereby simulating the various images of a SAII (or the equivalent cameras of a plenoptic camera) with the sequence of frames acquired by the moving camera. The present invention uses the camera distances between the chosen acquisitions as reference lines (distances between views) of a multi-view system that can be used to estimate the depth of the scene. The primary goal of these methods is to provide the ability to create a high-resolution depth map when only a conventional camera is available and in just one shot (the triggering of which involves multiple frame acquisitions). The present invention is very computationally efficient, so efficient that it can be used to obtain real-time depth maps in video streams even on inexpensive mobile devices (most of the time with inexpensive battery-powered processors, where efficient calculations are necessary to avoid draining the batteries quickly).
[0095] The proposed invention has two main stages after recording several consecutive frames. The first is a stage to "rectify" the images acquired during the shooting time (each image acquired with the camera in slightly different positions in x, y, z, yaw, pitch and roll) to obtain a set of "rectified images" that are related to each other as if they had been taken by a single plenoptic camera or a single SAII imaging system (an "image rectification" procedure such as in Figure 9, or one producing a series of images A, B, C and D as in Figure 13). This first stage performs inter-image rectification (such as in Figure 9) using the recordings from the accelerometer and gyroscope, or any other capabilities that can be found in any current smartphone, car or moving object. A second stage is applied to create a depth map using plenoptic algorithms. This consists of calculating the depth value of each point in the scene by detecting the slope of the epipolar lines of an epipolar image. In one embodiment, this calculation can be performed only for the detected edges in the scene, rather than for all the pixels in the scene. The method of the present invention can process video images in real time (approximately 15 frames per second and more), while previous implementations use from hundreds of milliseconds to minutes just to process a single frame.
[0096] A normal hand tremor (or physiological tremor) is a small, almost imperceptible tremor that is difficult to perceive by the human eye and does not interfere with activities. The frequency of the vibrations is between 8 and 13 cycles per second and it is a normal tremor in any person (it is not considered to be associated with any disease process). Even these little tremors can be used as a source to generate camera shake that can create a reference line for depth detection.
[0097] The most common sensors for determining the position and orientation of an object are the gyroscope and the accelerometer. Both are present in most current mobile devices (smartphones and others), and when the information from both devices is recorded simultaneously with the image acquisition procedure, it is possible to know, for each recorded frame, the exact FOV (field of view) in terms of the x, y, z position of the camera in the three-dimensional world and the direction the camera is facing at the time of shooting, defined by the three fundamental angles pitch, roll and yaw, as described in Figure 7A. To record movements, the normal prior art sampling rate of accelerometers and gyroscopes is approximately 500 Hz, which means that the accelerometer and gyroscope are sensitive enough to record hand tremor movements (between 8 and 13 cycles per second). Figure 7B shows a sequence of movements recorded by the accelerometer of a mobile phone in the X, Y and Z directions. It starts with the mobile in hand, in a position as if we were going to take a picture. At a certain time, the button is "pressed" (the activation element is triggered) to take a photo and, after this, the mobile device is left on the table. The entire sequence takes 10 seconds with a sampling rate of 100 Hz (resulting in approximately 1000 samples). These data can also be obtained from the gyroscope device. Although accelerometers and gyroscopes show some delay in the information they provide, their measurements have different characteristics: accelerometers measure triaxial linear accelerations (XYZ) while gyroscopes measure triaxial angular motion (PRY) about each axis of rotation, and the combination of both devices provides six-axis motion detection, capturing any possible movement of the mobile device for fast and accurate determination of the camera's relative position and orientation. These relative position and orientation parameters are used in the formation of "virtual captures" from "virtual SAII systems" (or "virtual plenoptic cameras") and to compose epipolar images as will be explained below. Figure 7A shows the six-axis coordinate system associated with a camera on a mobile phone that will be used to describe the movements recorded by the accelerometer and gyroscope.
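As a rough illustration of how the recorded accelerometer and gyroscope samples can be turned into a per-frame six-axis position, the sketch below naively integrates the sensor streams (gravity compensation, sensor fusion and drift correction, which any real implementation would need, are deliberately omitted; all values are synthetic):

```python
import numpy as np

def integrate_imu(accel, gyro, sample_rate_hz):
    """accel: (N, 3) gravity-compensated accelerations in m/s^2 (x, y, z).
    gyro:  (N, 3) angular rates in rad/s (pitch, roll, yaw axes).
    Returns positions (N, 3) and orientations (N, 3) relative to the first sample."""
    dt = 1.0 / sample_rate_hz
    velocity = np.cumsum(accel * dt, axis=0)       # first integration
    position = np.cumsum(velocity * dt, axis=0)    # second integration
    orientation = np.cumsum(gyro * dt, axis=0)     # pitch, roll, yaw angles
    return position, orientation

# 2 s of synthetic tremor-like motion sampled at 500 Hz
t = np.arange(0, 2, 1 / 500)
accel = np.column_stack([0.02 * np.sin(2 * np.pi * 10 * t),
                         0.01 * np.sin(2 * np.pi * 8 * t),
                         np.zeros_like(t)])
gyro = np.zeros((t.size, 3))
pos, ang = integrate_imu(accel, gyro, 500)
print(pos[-1], ang[-1])   # final displacement (m) and rotation (rad)
```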
[0098] Assume a certain starting position of a mobile device and a period of image acquisition that begins when the user presses the button to take a picture. As explained, using the data from the accelerometer and gyroscope, the relative position of the mobile device with respect to that initial position can be determined at any time during the exposure time of the sequence of image acquisitions that occur after pressing the shutter button. Figure 13 shows an example of a path followed by the mobile device during a certain time interval. During this time, the mobile device has completed the trajectory indicated by the dashed line, and has also taken pictures when it was at positions A, B, C and D. The example in the figure also shows an MxN matrix, as above, in order to compare the described sequential image acquisition procedure with a virtual SAII system located on a plane close to the locations where the shots A, B, C and D occurred. Therefore, if the movement of the mobile device is properly recorded and processed, both systems (a SAII and the proposed invention) are functionally equivalent. The time interval in which the invention acquires the images and records the movement of the mobile device will now be described in detail. Most of today's mobile devices can acquire images at a frame rate of approximately 120 frames per second (fps), which is significantly higher than what is considered real time (a subjective value set by some at 15 fps to 30 fps or a higher number of frames per second). Assume that a mobile device of this nature includes a conventional camera and takes a picture while held in a position given by a human hand (these assumptions are not intended as limiting factors but as examples). If images are recorded for 1 second at 120 fps, for example, four images can be chosen within this period with given reference lines between them. Suppose, also, that the trajectory shown in Figure 13 has been drawn in front of a matrix of MxN positions to maintain a better parallelism between the proposed method and an MxN-camera SAII system or a plenoptic camera with MxN pixels per microlens. From this path, unintentionally caused by hand tremors, it is possible to select, for example, those points that maximize the total distance (both horizontally and vertically) within the path. The resolution of depth maps at long distances improves with wider reference lines, and therefore selecting those images that are as far apart from each other as possible is the best solution to discriminate distances in the object world as finely as possible. Note that the path example in Figure 13 is a 2D simplification. To make the proposed invention work as a SAII system, the different images taken along the path must be "rectified" according to the movement parameters recorded by the accelerometer, gyroscope or any other device of this type, taking into account the 6 degrees of freedom (the six coordinates x, y, z, P, R and Y).
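A minimal sketch of one such selection criterion (a greedy illustration only, with synthetic positions; the patent does not prescribe this exact rule): from the per-frame camera positions, pick the pair of frames with the widest separation in the x-y plane, i.e. the widest reference line:

```python
import numpy as np

def widest_baseline_pair(xy_positions_mm):
    """xy_positions_mm: (N, 2) camera positions, one per captured frame.
    Returns the indices of the two farthest-apart frames and their distance."""
    xy = np.asarray(xy_positions_mm, dtype=float)
    dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    return i, j, dist[i, j]

# Example: 24 frames recorded during a 200 ms shot (synthetic tremor path, mm)
rng = np.random.default_rng(0)
positions = np.cumsum(rng.normal(scale=0.15, size=(24, 2)), axis=0)
print(widest_baseline_pair(positions))
```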
[0099] We now describe how image rectification is performed in stereo imaging and the differences with respect to the present invention. Figure 8 shows how a pattern 81 is recorded by two different cameras in a stereo configuration. Pattern 81 is captured from two different points of view, recording two flat images 82 and 83. These two stereo images are "rectified" to obtain the images that would have been obtained if the two cameras had been completely aligned, that is, located at the same y and z positions in space, with a known fixed distance x between them, both cameras lying in the same plane (the usually known coplanar condition, which means that their roll and pitch differences are zero, or that their optical axes are parallel), and with a yaw difference of zero between them (equivalent to stating that both images 84 and 85 must present the same degree of horizontality). Figure 9 shows how a camera shaken by a human hand records four different shots (91 to 94) at four different times with four different camera positions in a five-axis reference system (x, y, pitch, roll and yaw), which is different from what a SAII system with four cameras located at the dotted positions (95 to 98) would have recorded. The "rectification" procedure for this system involves calculating a set of rectified images 95-98 from the set of acquired images 91-94. This is a simplified view, since it does not involve z-movements and assumes a good overlap between the acquired images 91-94 and the places where the rectified images 95-98 are wanted. Note, however, that z-rectification is also very important when the camera is placed on a mobile structure such as an automobile, this value being directly proportional to its speed. A more realistic embodiment of the present invention performs sequential recording of multiple video frames (which may be at 120 frames per second, for example) and simultaneous recording of the camera position within a six-axis reference system (x, y, z, pitch, roll and yaw).
[0100] This is exemplified in Figure 10: at a given time the camera captures a frame, at which point the camera is positioned with its six associated axes at a given location in space (x, y, z, pitch, roll and yaw); when the camera captures the next frame, the six-axis reference system has moved to a new location that is known because its new position (x', y', z', pitch', roll' and yaw') has been recorded by the accelerometer and gyroscope associated with the camera. In this particular example in Figure 10, there were three positive movements in x, y and z, as well as a negative yaw rotation. Figure 11 shows another example in which, between the first frame and the second frame, there was a negative movement in x and y and a positive movement in z, as well as a positive yaw rotation. Figure 12 is yet another example in which, between the first and second frames, the movements in x and y were positive and the movement in z negative, as well as a positive pitch rotation.
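The sketch below illustrates, under strong simplifying assumptions, the rotational part of such a per-frame "rectification": a pure-rotation homography built from the recorded pitch, roll and yaw warps a frame back to the orientation of the reference frame, leaving the x-y translation as the reference line. Sign and composition conventions for the angles, z translation and lens distortion are all glossed over, and the intrinsic matrix values are assumptions for the example:

```python
import numpy as np

def rotation_matrix(pitch, roll, yaw):
    """Rotation built from pitch (about x), yaw (about y) and roll (about the optical z axis)."""
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cr, sr = np.cos(roll), np.sin(roll)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cr, -sr, 0], [sr, cr, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx          # one possible composition convention

def rectifying_homography(K, pitch, roll, yaw):
    """Homography mapping pixels of the rotated frame back to the reference orientation."""
    R = rotation_matrix(pitch, roll, yaw)
    return K @ R.T @ np.linalg.inv(K)

# Assumed intrinsics (focal length and principal point in pixels)
K = np.array([[3000.0, 0.0, 2000.0],
              [0.0, 3000.0, 1500.0],
              [0.0, 0.0, 1.0]])
H = rectifying_homography(K, pitch=0.002, roll=0.001, yaw=-0.003)
u, v = 1000.0, 800.0                       # a pixel in the rotated frame
x = H @ np.array([u, v, 1.0])
print(x[:2] / x[2])                        # its rectified position
```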
[0101] Let us now compare the times required with what is technologically feasible in order to achieve the objective. Hand tremors show low-frequency movements of 8 to 13 cycles per second; during one second, 120 shots can be taken by prior art camera and photosensor systems, and 500 readings can be sampled using prior art accelerometers and gyroscopes. Figure 14 is a 2-second recording of the spatial movements detected by the accelerometers of mass-produced smartphones in the x and y directions (the z direction and yaw, roll and pitch can also be recorded and used in the calculations). In the particular case of Figure 14 the phone is held by an amateur rifle shooter (for a normal person the movements are slightly greater; for a person suffering from Parkinson's the movements are much greater); the figure shows a range of movements of almost 4 millimetres on the x-axis (vertical in Figure 14) and almost 2 millimetres on the y-axis (horizontal in Figure 14). These displacements are greater than the usual "narrow reference line" d separating the equivalent cameras of a typical plenoptic camera (an entrance pupil of 2 mm and 10 pixels per microlens produce a minimum reference line, the "narrow reference line" d in Figure 5A, of 0.2 mm); or, comparing with a typical reference line d of 0.1 to 0.3 mm for a plenoptic camera, Figure 14 shows that hand tremors are likely to produce the same reference line every 100 to 200 milliseconds. This is why the proposed invention takes approximately 200 ms to acquire enough images and data to create a depth map of the scene. In one embodiment, capturing images at a usual frame rate of 120 fps within a 200 ms time interval, the invention acquires 24 frames. These frames are taken while the mobile device is in movement due to hand tremors or any other vibration. From these 24 frames, the 2 frames with the largest reference line between them can be chosen, this reference line being long enough to improve the quality of the depth map of a multiview camera in terms of the precision obtained for longer distances.
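The numbers quoted in the preceding paragraph can be checked with a few lines (a worked-example sketch; the entrance pupil, pixels per microlens, frame rate and shot duration are the values given in the text):

```python
# Worked numbers from the paragraph above
entrance_pupil_mm = 2.0
pixels_per_microlens = 10
d_mm = entrance_pupil_mm / pixels_per_microlens     # narrow reference line d
frames_per_second = 120
shot_duration_s = 0.200
frames = int(frames_per_second * shot_duration_s)   # frames captured during the shot
print(d_mm, frames)                                  # 0.2 mm, 24 frames
```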
[0102] Once several images and their corresponding motion parameters (x, y, z, P, R, Y positions) have been captured with a conventional camera as the camera moves in 3D space for a certain period of time, the equivalent SAII (or equivalent plenoptic camera) is created by rectifying all these images according to the movement parameters (the new positions in the six-axis reference system). Then, epipolar images 300 and 302 are formed and plenoptic algorithms are applied to generate a depth map of the scene. A characteristic of plenoptic cameras is that the maximum disparity between consecutive equivalent cameras is ±1 pixel, which implies that the pixels that form an epipolar line are always connected to each other. Therefore, in order to appropriately apply plenoptic algorithms to the created equivalent SAII (or equivalent plenoptic camera), the reference line between consecutive images must ensure that no gaps are created when the epipolar images are formed. However, this is not always possible to guarantee, since the hand tremor movements in Figure 14 are sometimes contaminated by abnormally large movements that cannot be modeled as SAII systems (or plenoptic cameras) but are extremely beneficial in increasing the reference line and therefore beneficial for calculating large distances in the object world with very high reliability. These abnormally large movements can be produced artificially, for example by recording the frames that occur when the user is starting to move the phone away, or by someone accidentally hitting the arm of the person taking the picture, or by the large vibrations of a selfie stick (which obviously produces greater movements than in Figure 14); and they are best modeled by a novel device that is also part of this disclosure: a stereo-plenoptic device 5200 (Figures 5A and 5B) including at least one plenoptic camera 100 and at least one conventional camera; in a preferred embodiment shown in Figure 5B, two conventional or plenoptic cameras 1304 (or one conventional camera and one plenoptic camera) are added to the plenoptic camera 100. Prototypes of this device have shown evidence that the device has utility on its own (for example, in a mobile phone such as in Figure 5B) and also for modeling abnormally large camera movements that cannot be modeled by a plenoptic camera or SAII system, movements that are especially advantageous for calculating long distances to very distant objects in the world. It is also worth noting that hand tremors as in Figure 14 are common when the user is trying to hold the camera as steady as possible; however, the movements in the moments after pressing the shutter are much greater. Even so, they are beneficial because the camera still faces the same FOV (field of view), but the reference line can be increased by several centimetres, yielding much better distance estimates. Also, the statistical distribution of movements in the x and y directions in Figure 14 usually shows a large peak-to-average ratio (most of the time the movement is millimetres, but every once in a while there are one or more samples moving up to several centimetres), which is beneficial to improve the reference line and is better modeled through a stereo-plenoptic device such as in Figure 5B, since in this case the vertical and/or horizontal epipolar images have large gaps between the various rows (captured images), as in Figures 6A and 6B.
[0103] The embodiment in Figure 5B is a novel combination of two of the previously mentioned technologies (plenoptic and stereo) to create a depth map, which goes much further than the prior art (since it includes plenoptic cameras mixed with conventional cameras or with other plenoptic cameras in a multiview configuration: a superset that may include more cameras than in Figure 5B). Figure 5A shows a basic configuration of a stereo-plenoptic device, a multi-view system that significantly improves the depth estimation precision of plenoptic cameras for long distances thanks to the addition of a conventional camera facing the same FOV (field of view) as the plenoptic camera. This invention and its methods for estimating depth in real time comprise at least one light field plenoptic camera and include additional conventional or plenoptic cameras. Such a multiview system, with the appropriate image processing methods, can create a depth map of the scene with very high resolution and quality, overcoming the disadvantages of plenoptic cameras (limited by unreliable depth measurements for great depths) and of multi-camera systems (which require much more processing power). This multi-perspective invention is, at the same time, extremely efficient in terms of computational requirements. Figure 6A shows a recording of a plenoptic device (on the left) in which an epipolar line 62 within an epipolar image of the plenoptic device is combined with the resulting image from a conventional camera (right side) that has much more resolution. Figure 6A also shows how a point 61 of the conventional camera (such as, for example, the lower camera of Figure 5A, or the right or upper cameras of Figure 5B) is used to extend the reference line of the plenoptic camera with an image from a conventional camera (such as, for example, the lower camera in Figure 5A or camera 1304 in Figure 5B), resulting in better distance estimation capabilities and performance for the combination of both cameras than for the plenoptic camera by itself. One of the main advantages of this embodiment is the use of plenoptic algorithms for depth estimation (much more computationally efficient than stereo matching), which are also used in the present disclosure as described below. An additional advantage of this approach is that the lateral resolution of the multiview system can be the lateral resolution of the conventional camera (usually much higher than the lateral resolution of plenoptic cameras), and that it is possible to calculate light fields with as many points as there are points in the conventional camera(s).
[0104] Figure 6B illustrates an embodiment of a method of "rectifying" the conventional camera(s) 1304 to match their images with the plenoptic camera 100: an epipolar line 1404 is detected within an epipolar image 400 from the plenoptic camera 100; the distance B between the central view 1516 of the plenoptic camera 100 and the conventional camera(s) 1304, evident in Figures 5A and 5B, is obtained based on the relationship between the "wide reference line" B between the plenoptic camera 100 and the conventional camera(s) 1304 and the "narrow reference lines" d of the plenoptic camera 100 in Figures 5A, 5B and 6B; the distance H is chosen to coincide with the common part of the FOVs (fields of view) of the plenoptic camera 100 and the conventional camera(s) 1304; the epipolar line 1404 of the plenoptic camera (a set of connected pixels in the epipolar image 400 of the plenoptic camera 100, which by definition marks an edge of the object world) is extended linearly (1506) to reach the intersection with the row of pixels 1406 of the conventional camera(s), intersecting at pixel 1504; however, in many cases pixel 1504 (sampled by the conventional camera(s)) does not match the "edge patterns" sampled by the plenoptic camera 100, which is why the search area 1512 in the conventional camera(s) is defined, to finally find the pixel 61 of the conventional camera(s) 1304 matching the edges detected by the plenoptic camera 100. Through this method, the views 1510 of the plenoptic camera 100 captured by its equivalent cameras 51 are extended with additional view(s) from conventional camera(s) located at a distance from the plenoptic camera much greater (centimetres or even more) than the usual separation between views of the plenoptic camera (approximately tenths of millimetres), which greatly enlarges the reference line (from d to B) and therefore the precision of depth measurements for long distances from the camera(s). This can be summed up with the aid of Figure 6B as follows: the narrow gaps d between the views 1510 of a plenoptic camera 100 would require large increases in the depth of a pattern in the object world to produce very small variations in the slope of the epipolar line 1404; however, by adding additional view(s) 1406 from conventional camera(s) or additional plenoptic cameras 1304 it is possible to fit very accurate "extended epipolar line slopes" 1508, offering a higher depth measurement accuracy for long distances.
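A small numeric sketch of why the extended view helps (synthetic values only; not the patent's algorithm): fitting one straight line through the edge positions seen in the closely spaced plenoptic views (offsets that are multiples of d) plus the same edge seen by a distant view at offset B makes the fitted slope, and hence the depth, far less sensitive to pixel noise:

```python
import numpy as np

d, B = 0.2, 50.0                         # view separations in millimetres (assumed)
offsets = np.array([0.0, d, 2 * d, 3 * d, 4 * d, B])
true_slope = 0.8                         # pixels of shift per millimetre of offset
rng = np.random.default_rng(1)
# pixel position of the same edge in each view, with +/-0.25 px measurement noise
positions = 100.0 + true_slope * offsets + rng.uniform(-0.25, 0.25, offsets.size)

slope_narrow, _ = np.polyfit(offsets[:5], positions[:5], 1)   # plenoptic views only
slope_extended, _ = np.polyfit(offsets, positions, 1)         # with the distant view
print(slope_narrow, slope_extended)      # the extended fit is much closer to 0.8
```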
[0105] Figure 5B shows an embodiment of a device of this invention within a mobile device: a plenoptic camera 100 associated with two conventional cameras (or associated with one conventional camera and one plenoptic camera, or associated with two additional plenoptic cameras 1304), one horizontally aligned and the other vertically aligned in order to improve the reference lines in both directions (x and y), while avoiding the high computational requirements of stereo and multiview image matching by using a small search area 1512 (which can be one- or two-dimensional). It is obvious to a person skilled in the art how to modify/expand this device into several different options: only one plenoptic camera and one conventional camera, only two plenoptic cameras, three plenoptic cameras, any camera array including at least one plenoptic camera, etc.
[0106] The situation illustrated in Figures 6A and 6B (image(s) 63 captured by a plenoptic camera, and image(s) 64 of the same scene captured by a conventional camera) is equivalent to a single conventional camera that has captured several images at slightly different positions a small distance from each other, plus an additional image captured by the same conventional camera at a position quite distant from the rest. As shown in Figures 6A and 6B, the epipolar image formed has gaps d between the captured images (where d, in a plenoptic camera, is the size of the entrance pupil divided by the number of pixels per microlens in one dimension [x and y]). If the gap B (between the central view of a virtual plenoptic camera and the equivalent view of the conventional camera 1304 simulated by a moving camera) is greater than the distance D (between the central view of the plenoptic camera, or of a virtual plenoptic camera simulated by a camera in motion, and the end view of said plenoptic camera, i.e. four times d in the examples of Figure 6B), it is possible to create an equivalent virtual stereo-plenoptic system. The main criterion for creating either an equivalent SAII (or an equivalent plenoptic system) or an equivalent stereo-plenoptic system with a wider reference line is to have at least one large reference line (i.e. among the distances between adjacent images) that is greater than d; if no reference line exceeds the distance d, an equivalent SAII system is recommended. Likewise, an equivalent SAII system will be selected if the reference line B is smaller than the distance D. What should be observed is whether in the epipolar images of Figure 6B there is at least one large gap (B-D), greater than the small gaps d, which requires defining a search region 1512 and finding the corresponding edge point 61. On the other hand, in the event that all the reference lines are equal to or smaller than d, the rows of the epipolar images are in contact, so that matching algorithms (between the different rows of the epipolar images) are avoided and common plenoptic algorithms are applied.
[0107] Note that in a device such as that of Figure 5B the number of microlenses in the plenoptic camera is usually smaller than the number of pixels in the associated conventional camera; however, in the embodiment of the invention in which the plenoptic views 1510 are different views extracted from a moving camera, the number of pixels in the views 1510 is equal to the number of pixels of the equivalent camera at the reference line B.
[0108] In one embodiment, the way to determine whether the equivalent system to be created is a virtual plenoptic system (which can also be modeled as a virtual SAII system) or a virtual plenoptic-stereoscopic system depends directly on the greatest distance between consecutive captured images (consecutive in the spatial domain, i.e. adjacent images), namely on whether this greatest distance is greater than d, where d is the maximum distance between "chosen captured images" of a virtual plenoptic camera that guarantees that the maximum disparity between said "chosen captured images" is one pixel.
[0109] The captured images are classified by connecting each of these images in the x and y dimensions with its adjacent images, forming a network. If all the distances between connected images are equal to or smaller than d (disparity smaller than one pixel), these images can be used to compose a virtual SAII system (or, equivalently, a virtual plenoptic camera). On the other hand, if one or more images are captured at distances in the x and y directions greater than d, those images 64 can be used to compose additional views 1406 of a virtual stereo-plenoptic system such as in Figures 6A and 6B.
[0110] In one embodiment, in order to determine which images among all those captured are consecutive to each other, the x and y coordinates are used to create a network as in Figure 13. Then, the "chosen consecutive image" (in the spatial domain) of a given image is the one located at the minimum distance (in the x and y directions) from said image, and always at a distance shorter than d.
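The following Python sketch (an illustrative assumption, with hypothetical function and variable names) builds such a network: each shot is linked to its nearest neighbor in the x-y plane when that neighbor is closer than d, and shots with no neighbor closer than d are kept as candidate additional views 1406.

```python
import numpy as np

def build_image_network(positions_xy, d):
    """Link each shot to its "chosen consecutive image"; isolate wide-gap shots."""
    p = np.asarray(positions_xy, dtype=float)
    connected_links = {}       # shot index -> index of its chosen consecutive image
    additional_views = []      # shots farther than d from every other shot
    for i in range(len(p)):
        dists = np.hypot(p[:, 0] - p[i, 0], p[:, 1] - p[i, 1])
        dists[i] = np.inf      # a shot is not its own neighbor
        j = int(np.argmin(dists))
        if dists[j] < d:
            connected_links[i] = j
        else:
            additional_views.append(i)
    return connected_links, additional_views
```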
[0111] The "rectification" procedure described above for conventional cameras versus plenoptic cameras, even if it makes sense for the device of Figure 5B and similar devices, is an oversimplification of what happens when the cameras 1304 are not physical cameras but "virtual cameras" that take different exposures from different points of view with a real moving camera. In Figure 6B an image "rectification" has been performed for the reference line (B) and a "rectification" H to match the common part of the FOV of both cameras (100 and 1304); if 1304 were a virtual camera that has been displaced several centimeters, with movements larger than millimeters caused by human hand tremor while the user deliberately tries to hold the camera as still as possible (as in Figure 14), the "rectification" procedure, instead of considering only the reference line B and the field of view H, must consider random movements along the 6 axes (x, y, z, yaw, pitch and roll), which can be determined taking into account that the accelerometer, the gyroscope, or any other positioning device associated with the camera recorded the new position (x', y', z', yaw', pitch' and roll') of the virtual camera 1304 that captured the image a certain amount of time after the plenoptic camera 100 captured the first image. In a different embodiment, the camera 100 is not a physical camera but a "virtual plenoptic camera" (or virtual SAII system) that captures multiple shots (as in Figure 13: shots A, B, C, D) due to handheld tremor as in Figure 14.
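As a non-authoritative sketch of this 6-axis rectification (the Euler convention, the intrinsic matrix K and the function names are assumptions, not taken from the disclosure), the rotation recorded by the positioning device can be compensated with a rotation-only homography, while the residual translation provides the reference line between the rectified views:

```python
import numpy as np

def rotation_matrix(yaw, pitch, roll):
    # Z-Y-X Euler rotation, angles in radians (assumed convention).
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return Rz @ Ry @ Rx

def rectifying_homography(K, pose_ref, pose_new):
    """pose_* = (x, y, z, yaw, pitch, roll) as recorded by the positioning device.
    Returns the pixel homography that makes the new shot coplanar with the
    reference shot, plus the residual translation (the reference line)."""
    R_ref = rotation_matrix(*pose_ref[3:])
    R_new = rotation_matrix(*pose_new[3:])
    R_rel = R_ref @ R_new.T                  # rotation from the new view to the reference view
    H = K @ R_rel @ np.linalg.inv(K)         # rotation-only homography in pixel coordinates
    baseline = np.asarray(pose_ref[:3]) - np.asarray(pose_new[:3])
    return H, baseline
```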
[0112] Figure 16A shows a first procedure (1600) related to obtaining images with stereo cameras. This figure shows a simplified procedure 1600 that assumes a fixed (known) position of the stereo cameras (known from the calibration procedure of the two stereo cameras). This procedure comprises an image rectification stage (1604), which is simple given the known position (and orientation) of the two cameras, and a second, matching stage (1606), which involves matching the patterns that are common to the two acquired images. Obviously, the matching procedure between pixels of the two cameras differs depending on the distance from the cameras of the light sources in the object world that produce the patterns in both cameras; in other words, an object point far from both cameras will produce practically zero disparity between its two images in the two cameras, while an object point very close to the cameras will produce a very large disparity on the sensors of the two cameras.
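The inverse relationship between disparity and distance mentioned here is the standard pinhole-stereo relation (stated as background, not as the specific formula of this disclosure); a minimal numeric sketch:

```python
def depth_from_disparity(disparity_px, baseline_m, focal_px):
    """Standard relation: depth = focal length (pixels) * baseline / disparity."""
    if disparity_px <= 0:
        return float("inf")          # zero disparity: point at (effectively) infinite distance
    return focal_px * baseline_m / disparity_px

# Illustrative numbers: with a 5 cm baseline and a 1000-pixel focal length,
# 1 px of disparity corresponds to 50 m, while 100 px corresponds to 0.5 m.
print(depth_from_disparity(1, 0.05, 1000), depth_from_disparity(100, 0.05, 1000))
```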
[0113] A second procedure (1610) according to the present invention is described in Figure 16B. This procedure comprises: a step 1614 that records consecutive frames (for example at 120 fps, frames per second) and simultaneously records the 6-axis position of the camera (x, y, z, P, R, Y) for each of the "recorded frames" (at 120 frames per second and, for example, with the 6-axis position sampled at approximately 5 positions per frame, or 600 samples per second); a next step 1616 that appropriately selects the positions with large reference lines d (such as positions A, B, C and D in Figure 13, positions that may be different for an "Olympic pistol shooter" than for a person with Parkinson's disease) to compose a "virtual SAII system" (or a virtual plenoptic camera) and, if they exist, also the positions with suitably "larger reference lines" D to compose a "virtual stereo-plenoptic system"; a third step 1618 that rectifies the chosen shots or frames as in Figures 8 and 9, where the rectification depends on the 6-axis positions of the camera (different values of x, y, z, pitch, roll and yaw for each of the shots chosen in step 1616); and a fourth step 1620 that creates the equivalent SAII system (or the equivalent plenoptic camera) for the shots that have been chosen and/or, if some of the displacements in the x and/or y directions are abnormally large, an equivalent plenoptic-stereoscopic system such as in Figures 5A-5B (but most likely with quite different values of z, pitch, roll and yaw for the equivalent camera with "wide reference line" B, since the cameras in Figures 5A and 5B are aligned and coplanar, which is not usually the case with a moving camera). Once the equivalent system is created (in step 1620 of Fig. 16B) it is possible to perform an additional fifth step (1622) intended to calculate distances to objects in the world through the traditional epipolar line slope analysis (as in Figures 3A-3C), or through the enlarged epipolar line analysis (as in Figures 6A and 6B) if the reference lines are large enough (at least one of the images is at a distance in the x and/or y directions greater than d with respect to the "connected image set" [where each image within the "connected set" is at a distance equal to or smaller than d from its closest images within the "connected image set"]), obtaining a slope map for the common FOV (field of view) of the images from all the cameras of the "equivalent system" of step 1620. The slopes of the previously obtained epipolar lines can also be used to obtain a depth map through traditional slope-to-depth conversions (step 1624), obtaining a depth map for the common FOV (field of view) of the images of all the cameras of the "equivalent system" of step 1620. It is possible to create 3D images (step 1626) from the previously calculated slope and depth maps, 3D images that comply with any 3D format (stereo images, integral images, etc.). The robustness of the proposed procedure has been experimentally demonstrated with different users and devices and at different times of the day. Furthermore, all the experiments have been repeated several times to rule out random effects in the procedure.
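Since the positioning device of step 1614 samples faster than the camera (in the example above, roughly 600 pose samples per second against 120 frames per second), each frame must be paired with a pose. The sketch below (hypothetical names, nearest-sample pairing; interpolation would be a straightforward refinement) illustrates one simple way to do this:

```python
import numpy as np

def pose_per_frame(frame_times, pose_times, poses):
    """Pair every recorded frame with the 6-axis pose sample closest in time.

    poses: (M, 6) array of (x, y, z, yaw, pitch, roll) samples at times pose_times.
    """
    frame_times = np.asarray(frame_times, dtype=float)
    pose_times = np.asarray(pose_times, dtype=float)
    poses = np.asarray(poses, dtype=float)
    idx = np.clip(np.searchsorted(pose_times, frame_times), 1, len(pose_times) - 1)
    left, right = idx - 1, idx
    pick_right = (pose_times[right] - frame_times) < (frame_times - pose_times[left])
    return poses[np.where(pick_right, right, left)]
```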
[0114] In a particular embodiment, the input to the invention can be a video sequence: assume a video sequence captured at 120 fps and that the invention uses 4 frames (4 images) to calculate the depth values of the scene. This means that the system will produce depth maps (or 3D images) at approximately 30 fps (considered by many to be real time). The frames selected to calculate the depth map (or to compose a 3D image) are those that exhibit a sufficiently wide reference line, and not necessarily consecutive frames.
[0115] So far, the procedure of "registering" two or more images taken by a mobile device using data from the accelerometer, gyroscope or any other positioning device has been described. It should be remembered that the registration procedure involves image "rectification" (to ensure that the 2 or more acquired images are "recalculated" to become comparable coplanar images, as in Figures 8 and 9) and "matching" or "pattern matching" (shown by way of example in Figure 8, looking for the common pattern 86). The "matching" or "pattern matching" in SAII systems, in plenoptic cameras and in one embodiment of this invention is performed by identifying the epipolar lines in the epipolar images.
[0116] In another embodiment, the procedure can be performed within a time interval short enough for the described procedure to be considered a real-time procedure.
[0117] The movements recorded by the mobile device are good enough to obtain a robust depth map. To show this, the reference line obtained by means of the sub-apertures of a plenoptic camera is compared below with the reference line obtained by means of the proposed invention.
[0118] The reference line of a plenoptic camera is the distance between the centers of two consecutive sub-apertures (the distance d between the centers of two equivalent cameras 51 in Figure 5B), and the size of the reference line (as well as the maximum 2D diameter) is directly related to the maximum distance in the object world that the device can estimate with acceptable accuracy; the larger the reference line and the diameter (d and D), the better the depth map (obtaining better estimates of large distances in the object world). As discussed above, one tenth of a millimeter can be considered a normal reference line in a plenoptic camera (a typical entrance-pupil aperture of 1 or 2 mm and a typical number of 10-20 pixels per microlens). The proposed invention can operate similarly to a SAII system (or to a plenoptic camera) but with only one conventional camera taking sequential views. The proposed invention can use the same algorithms based on calculating slopes from epipolar images as a plenoptic camera (or as a SAII system) to estimate a depth map. However, the invention can work with reference lines greater than the reference line of a plenoptic camera (approximately 0.1 mm), since hand tremors are normally larger than this; therefore, the proposed invention can obtain depth maps of higher quality in terms of precision for longer distances. In addition to this important advantage, it is even more important to point out that the proposed invention can obtain depth maps with much greater spatial resolution than those obtained using a plenoptic camera, since the system has the entire resolution of the conventional camera sensor, solving the main disadvantage of plenoptic cameras (which have the same small spatial resolution as the number of microlenses; at approximately 100 pixels per square microlens, their resolution is approximately 100 times lower).
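A small worked example (illustrative values consistent with the figures quoted above, with an assumed helper name) makes the comparison concrete:

```python
def plenoptic_reference_line_mm(pupil_diameter_mm, pixels_per_microlens_1d):
    """Narrow reference line d between two consecutive equivalent cameras 51."""
    return pupil_diameter_mm / pixels_per_microlens_1d

# A 2 mm entrance pupil and ~20 pixels per microlens give d = 0.1 mm, whereas
# involuntary hand tremor easily displaces a handheld camera by several
# millimeters, i.e. a reference line tens of times wider.
print(plenoptic_reference_line_mm(2.0, 20))   # -> 0.1
```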
[0119] In one embodiment, the movement of the mobile device due to hand tremors can be enhanced or replaced by vibrations produced by a small vibration motor included in the mobile device (which may be the vibrations used as a substitute or complement for ringtones), or by placing the camera on a moving object during the exposure time (for example, the camera is mounted or placed at a location in a car with wide visibility towards the areas of interest outside the car).
[0120] In another embodiment, the plenoptic and stereo-plenoptic methods described herein to solve the matching problem using the accelerometer, gyroscope or any other positioning device can be replaced by algorithms that match different images (stereo matching or multiview matching). In yet another embodiment, the objects in the foreground can be identified, while in a composite video sequence the background moves in front of the still objects in the foreground (or the objects in the foreground move at a slower speed), creating 3D effects by combining two-dimensional images at different distances from the camera: when in a video sequence the occlusions of the background image change with time, when the foreground objects move at a slower speed than the faster movements in the background, when the perspective or size of the foreground objects changes slowly (for example a shark swimming towards the camera, occluding more and more of the background in successive frames; or a shark swimming at a constant distance from the camera plane across the FOV, changing occlusions in successive video frames); or just the opposite, when the foreground moves faster than the background and changes the occlusions. As an example, but not exclusively, in the cases mentioned above a combination of video sequences from several different two-dimensional foreground levels, mid-shot levels and background levels located at several different distances from the camera (levels that can be related to their distances calculated with the techniques mentioned in this disclosure, which allow the computation of real-time depth maps from video images) allows a combination of two or more two-dimensional images to produce a 3D perception for the observer.
[0121] A bright field can be created in many ways, for example with SAII systems that include a camera array or, equivalently, a camera that automatically moves to take images of the scene from well-defined locations. A bright field can also be created using a plenoptic camera. The invention proposed herein is implemented in a mobile device that acquires multiple images within a time interval and then rectifies these images using data from the accelerometer, gyroscope or any other such capability integrated in the device, as previously described. This procedure also composes a bright field of the scene. Various embodiments of processing procedures to produce a depth map of a scene from this bright field are described in detail below.
[0122] One way to obtain depth information of a scene from a bright field is to analyze the patterns captured by the sensor in epipolar images. In the proposed invention each of the acquired images (conveniently rectified) is treated as a plenoptic view, and each plenoptic view is used to create the epipolar images. Figures 3A-3B-3C show how horizontal and vertical epipolar images 300 and 302 are composed from a bright field, and within these images it is possible to identify connected pixels that form lines, the so-called epipolar lines. All the illuminated pixels of an epipolar line 62 correspond to the same point in the object world. Additionally, the slopes of these lines are directly related to the size of the pattern illuminated over the microlenses and to the corresponding depth of the point in the object world. Therefore, by knowing this pattern it is possible to trace back the patterns sampled by the pixels through the camera and obtain the exact depth of the point in the object world that produced such a pattern. It is well known that in a plenoptic camera the relationship between depth and slope depends on the physical dimensions and the design of the device used to capture the bright field. In this invention, the formation of patterns in epipolar images depends on the displacement (reference line) between the different acquired images (different views). This displacement can also be calculated using matching algorithms (stereo matching algorithms). These algorithms look for patterns that appear in two or more images in order to establish a one-to-one relationship between the pixels of said two or more images. These are computationally intensive algorithms that can be avoided using the present invention. In the present invention, the displacement between images is calculated using the data from the accelerometer, gyroscope or any other such capability integrated in the device. This involves calculations of continuous rotational and translational movements that, after "the image rectification procedure", end with a one-to-one relationship between the pixels of both images.
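As a minimal sketch (assumed function name and equally spaced, already rectified views), an epipolar image of the kind analyzed here can be assembled by stacking the same pixel row taken from each view; scene edges then appear as slanted epipolar lines whose slope encodes depth:

```python
import numpy as np

def horizontal_epipolar_image(views, row):
    """views: list of 2-D grayscale images (rectified views displaced along x);
    row: index of the image row to slice. Returns an array of shape (n_views, width)."""
    return np.stack([v[row, :] for v in views], axis=0)
```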
[0123] Objects at different depths or distances from the camera will produce different illumination patterns on the sensor of a plenoptic camera, as well as on the proposed composition of images taken by a moving camera of a mobile device. As already mentioned, in the same way that in a plenoptic camera the so-called plenoptic views (which make up a bright field) can be represented in epipolar images, in the present invention the various "rectified views" that can be obtained sequentially from a single moving camera (and which also compose a bright field) can also be represented by epipolar images; in both cases, the epipolar images are composed by taking two-dimensional slices of the bright field, as explained in Figure 3.
[0124] In one embodiment, the plenoptic algorithms used in this invention for depth estimation can apply a linear regression technique to the points that form an epipolar line to obtain the slope of said epipolar line. When analyzing an epipolar line in a horizontal / vertical epipolar image, all images (as with plenoptic views) distributed along the vertical / horizontal dimension are taken into account since the same object point has been captured by several of these views and the epipolar lines produced by the same point in the world can appear in several epipolar images. Therefore, this linear regression technique and the use of different epipolar images to calculate distances to the same point in the object world reduce statistical noise by benefiting from redundant information along one dimension.
[0125] In yet another embodiment, all the lines formed in the horizontal and vertical epipolar images are identified and their corresponding slopes are calculated. Then, the corresponding depth of the object is calculated from the slope. In another embodiment, only one slope (and/or depth) value is calculated per epipolar line, since an epipolar line is formed by the same object point captured from various points of view. Therefore, the amount of data is drastically reduced due to two factors: (i) epipolar lines are detected only at the positions corresponding to edges in the object world (since completely uniform areas of the object world, without edges, do not produce any epipolar line) and (ii) it is possible to calculate/store only one slope value per line instead of calculating/storing a value for each pixel that forms the epipolar line, as traditionally performed in the prior art. In at least one embodiment, the result of this calculation procedure may simply be the corresponding depth values of these detected slopes.
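A hedged sketch of this "one slope per epipolar line" idea (the gradient-based edge detector, the parabolic subpixel refinement and the names are assumptions; the disclosure only requires edge detection and linear regression with subpixel precision):

```python
import numpy as np

def epipolar_line_slope(epipolar_image, min_gradient=10.0):
    """Estimate a single slope for the epipolar line crossing a small epipolar image."""
    rows, cols = [], []
    for r, line in enumerate(np.asarray(epipolar_image, dtype=float)):
        g = np.abs(np.gradient(line))
        c = int(np.argmax(g))
        if g[c] < min_gradient:
            continue                           # no edge in this row of the epipolar image
        if 0 < c < len(line) - 1:              # parabolic subpixel refinement of the edge
            denom = g[c - 1] - 2.0 * g[c] + g[c + 1]
            if denom != 0:
                c = c + 0.5 * (g[c - 1] - g[c + 1]) / denom
        rows.append(r)
        cols.append(c)
    if len(rows) < 2:
        return None                            # not enough edge points to fit a line
    slope, _ = np.polyfit(rows, cols, 1)       # pixels of displacement per view
    return slope
```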
[0126] In another possible embodiment, the slopes obtained by analyzing the horizontal and vertical epipolar lines are combined in a multi-dimensional matrix to reduce statistical noise. This redundancy improves the result of the invention since the same sensor pixel is taken into account when analyzing both the vertical and horizontal epipolar images and, therefore, several slope values are produced by the same point of the object world.
[0127] The slopes calculated for the epipolar lines are transformed into the corresponding object depths. In another embodiment, this transformation step can be performed after combining all the redundant slopes, dramatically reducing the number of slope-to-depth transformations. In another embodiment, the depths/slopes calculated on the horizontal and vertical epipolar lines are combined directly into a two-dimensional sparse depth/slope map (sparse because it includes depth/slope calculations only for the points on the epipolar lines, and not for every point of the image as in the prior art), thereby performing a single combination step, which increases the computational efficiency.
[0128] In another embodiment, the sparse depth/slope map can be filled by applying image filling techniques to obtain depth/slope values for each pixel. As a result, the invention provides a dense depth map in which each point is associated with the depth estimate of that point in the scene.
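A minimal sketch of such a filling step (nearest-neighbor propagation via SciPy is only one simple choice among the image filling techniques mentioned; edge-aware interpolation would normally be preferred):

```python
import numpy as np
from scipy import ndimage

def fill_sparse_depth(sparse_depth):
    """sparse_depth: 2-D array with depth values at detected edges and NaN elsewhere.
    Returns a dense map where every unknown pixel copies its nearest known value."""
    unknown = np.isnan(sparse_depth)
    nearest_idx = ndimage.distance_transform_edt(
        unknown, return_distances=False, return_indices=True)
    return sparse_depth[tuple(nearest_idx)]
```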
[0129] In another embodiment, the methods described herein for estimating a depth map can be combined with or replaced by stereo matching algorithms or multiview matching algorithms to improve the final result. In at least one embodiment, the methods described herein can be implemented on mobile devices equipped with a plenoptic camera.
[0130] In one embodiment, the epipolar lines can be detected using edge detection algorithms and their slopes can be measured by linear regression techniques (both methodologies, edge detection and linear regression can be used with subpixel precision).
[0131] In one embodiment, for depth estimation, all calculations can be performed only for those sensor pixels where the edges of the object world have been detected, avoiding performing calculations on a large number of sensor pixels.
[0132]
[0133] The dissipation of energy in mobile terminals (which depend on batteries) is extremely important, which is why the computational efficiency of the algorithms is of paramount importance. It is public knowledge that some 3D phones (using 2 cameras) disable the second camera (and the 3D function) under low-battery conditions. These examples make clear that, in order to obtain depth maps in real time on mobile devices, it is convenient to implement the algorithms in an extremely efficient way. The present invention will allow conventional cameras to provide 3D images on mobile devices (mobile phones, tablets, etc.) using extremely efficient algorithms that calculate depth only for the identified edges.
[0134] To this end, it is possible to take advantage of the multiple cores included nowadays in processors (even in processors of mobile devices). The essential idea is to create several execution threads of the algorithm in such a way that each of them is in charge of performing different operations. For example, FIG. 15A shows an electronic mobile device 1000 that includes the present multiview system 1001, which captures images 1002 that are processed by a processor 1004, which may be a multi-core processor 1006. The processor 1004 may be composed of two or more CPUs (Central Processing Units) 1008a and 1008b (Figure 15B).
[0135] More advanced computational techniques can be used to increase computational efficiency. For example, current processors 1004 can include a graphics processing unit (GPU) 1010, including GPUs designed for mobile devices, which contain several hundreds or thousands of cores capable of executing operations simultaneously. Consequently, in at least one embodiment, each epipolar image is processed simultaneously in a different core of a GPU in order to accelerate the execution of the algorithm.
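A non-authoritative CPU-side sketch of this parallelization (process-pool parallelism standing in for the GPU cores described above; the per-image routine is passed in, for example the slope estimator sketched earlier):

```python
from concurrent.futures import ProcessPoolExecutor

def analyze_epipolar_images(epipolar_images, analyze_one, workers=4):
    """Distribute a per-epipolar-image analysis routine over several CPU cores."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze_one, epipolar_images))
```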
Claims (39)
[1]
1. Method to obtain depth information from a scene, comprising the stages of:
a) acquiring a plurality of images of the scene by means of at least one camera during a shooting time in which the plurality of images offers at least two different views of the scene;
b) for each of the images of step a), simultaneously acquiring data about the position of the images with respect to a six-axis reference system;
c) selecting from the images of step b) at least two images;
d) rectifying the images selected in step c), thereby generating a set of rectified images; and
e) generating a depth map from the rectified images.
[2]
2. The method according to claim 1, wherein the position of the images during the shooting time is measured from a set of positioning data acquired by means of at least one positioning device selected from the group of: an accelerometer, an IMU, an AHRS, a GPS, a speedometer and/or a gyroscope.
[3]
3. The method according to claim 2, wherein the positioning device is rigidly attached to at least one camera.
[4]
4. Method according to any of the preceding claims, in which at least one camera is associated with a mobile device.
[5]
5. The method of claim 4, wherein the mobile device is a smartphone, tablet, laptop or compact camera.
[6]
6. Method according to any of the preceding claims, in which, in step c) the images are selected based on their positions in the six-axis reference system.
[7]
7. The method according to claim 6, wherein the images are selected such that their relative distances from adjacent images cause a disparity between adjacent images of at most one pixel.
[8]
8. The method according to claim 7, wherein step e) comprises generating a virtual synthetic aperture integral imaging system (16200) as a network of rectified images, thereby generating a set of epipolar images.
[9]
9. The method of claim 6 wherein the images are selected so that at least one image is such that its relative distance from its adjacent images causes a disparity of more than one pixel.
[10]
10. The method according to claim 9, wherein step e) comprises generating a virtual plenoptic-stereoscopic system with a network of rectified images and another, more distant rectified image, thereby generating a set of epipolar images.
[11]
11. The method according to claim 8 or 10, wherein step e) further comprises calculating at least one slope of at least one epipolar line from the set of epipolar images.
[12]
12. The method of claim 11, wherein the epipolar lines are calculated using edge detection algorithms at the subpixel level.
[13]
13. The method according to claim 11, wherein the epipolar line slopes are calculated using linear regression algorithms at the subpixel level.
[14]
14. The method according to claim 11, wherein step e) further comprises obtaining a depth map of the scene by converting the slopes of the epipolar lines into depths.
[15]
15. The method according to any of the preceding claims, wherein the method further comprises a step of generating a three-dimensional image of the scene from the depth map.
[16]
16. Method according to any of the preceding claims, in which in step a) at least one camera moves during the shooting time.
[17]
17. The method of claim 16, wherein the movements of the at least one camera are indeterminate random movements produced by human hand tremors.
[18]
18. The method according to claim 16, wherein the at least one camera is attached to a structure that moves relative to the scene, wherein the moving structure is selected from at least one of: an automobile, a smartphone, a tablet, a laptop or a compact camera.
[19]
19. Method according to any of the preceding claims, wherein the plurality of images from step a) are acquired by at least two cameras.
[20]
20. The method according to claim 19, wherein the at least two cameras are aligned and their relative positions are known.
[21]
21. The method according to claim 19 or 20, wherein at least one of the cameras is a plenoptic camera.
[22]
22. Method according to any of the preceding claims, in which a video sequence is composed of at least two levels of two-dimensional images located at different depths in the object world (foreground images, optional mid-plane images and background images), and in which said combination of different levels of two-dimensional images in successive frames, and/or the change of occlusions in the two-dimensional images closer to the background, and/or the change of perspective and size in the two-dimensional images closer to the foreground, produces a 3D perception for the user.
[23]
23. Method according to any of the preceding claims, in which only some or all of the epipolar images distributed along the vertical / horizontal dimensions are considered in order to reduce statistical noise.
[24]
24. Method according to any of the preceding claims, in which the slopes obtained by analyzing the horizontal and vertical epipolar lines are combined in a multi-dimensional matrix.
[25]
25. Method according to any of the preceding claims, wherein the calculated depth/slope values on the horizontal and/or vertical epipolar lines are combined directly into a two-dimensional sparse depth/slope map.
[26]
26. The method of any of the preceding claims, wherein the sparse depth / slope map is filled by applying image fill techniques to obtain depth / slope values for each pixel.
[27]
27. Method according to any of the preceding claims, in which, for depth estimation, calculations are performed only for those pixels of the sensors in which edges of the object world have been detected.
[28]
28. Device for obtaining depth information from a scene comprising at least one camera, at least one positioning device and processing means configured to execute a method according to any of claims 1-27.
[29]
29. Device according to claim 28, wherein the device comprises at least two cameras.
[30]
30. Device according to claim 29, in which at least one of the cameras is a plenoptic camera.
[31]
31. Device according to claim 28, in which the cameras are aligned and their relative positions are known.
[32]
32. Device according to any one of claims 28 to 31, wherein the device comprises at least a third camera and wherein one of the cameras is horizontally aligned with the plenoptic camera, and at least one of the cameras is vertically aligned with said plenoptic camera.
[33]
33. Device according to any of claims 28 to 32, wherein the method for obtaining depth information comprises the steps of:
a) acquiring a plurality of images of the scene during a shooting time in which the plurality of images offers at least two different views of the scene from at least two cameras;
b) rectifying the images of step a), thereby generating a set of rectified images; and
c) generating a depth map from the rectified images.
[34]
34. Device according to any of claims 28-33, in which a video sequence is composed of at least two levels of two-dimensional images located at different depths in the object world (foreground images, optional mid-plane images and background images), and in which said combination of different levels of two-dimensional images in successive frames, and/or the change of occlusions in the two-dimensional images closer to the background, and/or the change of perspective and size in the two-dimensional images closer to the foreground, produces a 3D perception for the user.
[35]
35. Device according to any of claims 28-34, in which only some or all of the epipolar images distributed along the vertical / horizontal dimensions are taken into consideration in order to reduce statistical noise.
[36]
36. Device according to any of claims 28-35, in which the slopes obtained by analyzing the horizontal and vertical epipolar lines are combined in a multi-dimensional matrix.
[37]
37. Device according to any of claims 28-36, wherein the calculated depth/slope values on the horizontal and/or vertical epipolar lines are combined directly into a two-dimensional sparse depth/slope map.
[38]
38. Device according to any of claims 28-37, wherein the sparse depth/slope map is filled by applying image filling techniques to obtain depth/slope values for each pixel.
[39]
39. Device according to any of claims 28-38, wherein for depth estimation, calculations are performed only for those pixels of the sensors in which the edges of the object world have been detected.