
  • Number of slides: 88

Computational Geometry and Spatial Data Mining
Marc van Kreveld
Department of Information and Computing Sciences, Utrecht University

Clustering?
  • Are the people clustered in this room? How do we define a cluster?
  • In spatial data mining we have objects/entities with a location given by coordinates
  • Cluster definitions involve distance between locations

Clustering - options
  • Determine whether clustering occurs
  • Determine the degree of clustering
  • Determine the clusters
  • Determine the largest cluster
  • Determine the outliers

Co-location
  • Are the men clustered?
  • Are the women clustered?
  • Is there a co-location of men and women?

Co-location
  • Like before, we may be interested in
      – is there co-location?
      – the degree of co-location
      – the largest co-location
      – the co-locations themselves
      – the objects not involved in co-location

Spatio-temporal data
  • Locations have a time stamp
  • Interesting patterns involve space and time

Trajectory data
  • Entities with a trajectory (time-stamped motion path)
  • Interesting patterns involve subgroups with similar heading, expected arrival, joint motion, ...
  • n entities = trajectories; n = 10 – 100,000
  • t time steps; t = 10 – 100,000, so the input size is nt
  • m size subgroup (unknown); m = 10 – 100,000

Examples of trajectory data
  • Tracked animals (buffalo, birds, ...)
  • Tracked people (potential terrorists)
  • Tracked GSMs (e.g. for traffic purposes)
  • Trajectories of tornadoes
  • Sports scene analysis (players on a soccer field)

Example pattern in trajectories
  • What is the location visited by most entities?
  • location = circular region of specified radius
[Figure labels: 4 entities; 3 entities]

Example pattern in trajectories
  • Compute the buffer of each trajectory
  • Compute the arrangement of the buffers and the cover count of each cell
[Figure: arrangement cells labeled with cover counts 0, 1, and 2]

Example pattern in trajectories
  • One trajectory has t time stamps; its buffer can be computed in O(t log t) time
  • All buffers can be computed in O(nt log t) time
  • The arrangement can be computed in O(nt log(nt) + k) time, where k = O((nt)^2) is the complexity of the arrangement
  • Cell cover counts are determined in O(k) time

Example pattern in trajectories
  • Total: O(nt log(nt) + k) time
  • If the most visited location is visited by m entities, this is O(nt log(nt) + ntm)
  • Note: input size is nt; n entities, each with a location at t moments
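The "most visited location" question can be illustrated with a much simpler method than the buffer arrangement. The sketch below is not the arrangement algorithm from the slides: it only samples candidate circle centers on a regular grid, so it may miss the true optimum between grid points. The function names and the `step` parameter are assumptions made for this illustration.

```python
import math

def dist_point_segment(p, a, b):
    """Euclidean distance from point p to the segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the supporting line and clamp to the segment.
    u = ((px - ax) * dx + (py - ay) * dy) / seg_len2
    u = max(0.0, min(1.0, u))
    return math.hypot(px - (ax + u * dx), py - (ay + u * dy))

def visits(trajectory, center, radius):
    """True if the polygonal trajectory comes within `radius` of `center`."""
    # A one-vertex trajectory is treated as a degenerate segment.
    pairs = list(zip(trajectory, trajectory[1:])) or [(trajectory[0], trajectory[0])]
    return any(dist_point_segment(center, a, b) <= radius for a, b in pairs)

def most_visited_on_grid(trajectories, radius, step):
    """Best grid-sampled circle center: returns (visit count, center)."""
    xs = [x for tr in trajectories for x, _ in tr]
    ys = [y for tr in trajectories for _, y in tr]
    nx = int((max(xs) - min(xs)) / step) + 1
    ny = int((max(ys) - min(ys)) / step) + 1
    best_count, best_center = 0, None
    for i in range(nx):
        for j in range(ny):
            center = (min(xs) + i * step, min(ys) + j * step)
            count = sum(visits(tr, center, radius) for tr in trajectories)
            if count > best_count:
                best_count, best_center = count, center
    return best_count, best_center
```

Each candidate center costs O(nt) time, so the sampling density trades accuracy against running time; the arrangement-based method on the slides is exact.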

Patterns in entity data
Spatial data
  • n points (locations)
  • Distance is important
      – clustering pattern
  • Presence of attributes (e.g. man/woman):
      – co-location patterns
Spatio-temporal data
  • n trajectories, each has t time steps
  • Distance is time-dependent
      – flock pattern
      – meet pattern
  • Heading and speed are important and are also time-dependent

Entities in subdivisions
  • Also co-location pattern
  • Discovered simply by overlay
  • E.g., occurrences of oaks on different soil types

Clustering entities in subdivisions
  • What if it is known that the entities only occur in regions of a certain type?
[Figure: situation without subdivision; bird nests, radius of cluster]

Clustering entities in subdivisions
  • What if it is known that the entities only occur in regions of a certain type?
[Figure: situation with land-water subdivision; bird nests, radius of cluster]

Clustering entities in subdivisions
[Figure labels: house car burglary]

Region-restricted clustering
Joint research with Joachim Gudmundsson (NICTA, Sydney) and Giri Narasimhan (U of F, Miami), 2006
  • Determine clusters in point sets that are sensitive to the geographic context (at least, for the relevant aspects)
  • Assume we are given a set of regions and points can lie only inside them. How should we define clusters?

Region-restricted clustering
  • Given a set P of points, a set F of regions, a radius r and a subset size m, a region-restricted cluster is a subset P' ⊆ P inside a circle C where
      – P' has size at least m
      – C has radius at most 2r
      – C contains at most πr^2 area of regions of F
[Figure labels: r; ≤ 2r; sum area ≤ πr^2]

Region-restricted clustering
  • Given a set P of n points, a set F of polygons with n_f edges in total, and values for r and m, report all region-restricted clusters of exactly m points
  • Exactly m points?
  • "Real" clustering (partition)?
  • Outliers?

Region-restricted clustering
  • Exactly m points? Every cluster with > m points consists of clusters with m points with smaller circles
  • "Real" clustering (partition)?
  • Outliers?
[Figure: example with m = 5]

Region-restricted clustering
  1. Determine all smallest circles with m points of P inside
  2. Test if the radius is ≤ r (report) or > 2r (discard)
  3. If the radius is in between, determine the area of regions of F inside
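Steps 2 and 3 amount to a three-way test per candidate circle. A minimal sketch follows, assuming the area threshold is πr^2 and that the area of F inside the circle has already been computed by some other routine; the function names and the string labels are hypothetical.

```python
import math

def classify_circle(radius, r):
    """Step 2: decide what to do with a smallest circle around m points."""
    if radius <= r:
        # The whole circle has area at most pi*r^2, so the area of F
        # inside it cannot exceed the threshold either.
        return "report"
    if radius > 2 * r:
        return "discard"
    return "check_area"

def is_region_restricted_cluster(radius, r, region_area_inside):
    """Steps 2 and 3 combined; `region_area_inside` is the area of F inside
    the circle (assumed to be supplied by the caller)."""
    verdict = classify_circle(radius, r)
    if verdict == "check_area":
        return region_area_inside <= math.pi * r * r
    return verdict == "report"
```

Only circles with radius strictly between r and 2r need the expensive area computation, which is why step 2 comes first.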

Region-restricted clustering
  1. Determine all smallest circles with m points of P inside
      • Use the (m-2)-th order Voronoi diagram: cells where the same (m-2) points are closest
      • Its vertices are centers of smallest circles around exactly m points

[Figures: ordinary (order-1) Voronoi diagram; order-2 VD; order-3 VD]

Region-restricted clustering
  • The m-th order Voronoi diagram (or (m-2)) has O(nm) cells, edges, and vertices
  • It can be constructed in O(nm log n) time
  • Hence we get O(nm) smallest circles with m points inside; for each we also know the radius

Region-restricted clustering
  2. Test if the radius is ≤ r (report) or > 2r (discard)
     Trivial in O(1) time per circle, so O(nm) time overall

Region-restricted clustering
  3. Determine the area of regions of F inside
     Brute force: O(n_f) time per circle, so O(n m n_f) time overall

Region-restricted clustering
  • Complication: this need not give all region-restricted clusters!
      – Need to compute the area of F inside a circle with moving center
      – Requires solving high-degree polynomials

Region-restricted clusters
  • The anti-climax: we cannot give an exact algorithm!
  • If we take squares instead of circles, we can deal with the problem...

Region-restricted clustering
  3. Determine the area of regions of F inside
     Brute force: O(n_f) time per square, so O(n m n_f) time overall
The total time for steps 1, 2, and 3 is O(nm log n) + O(n m n_f) = O(nm log n + n m n_f)
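The brute-force step for squares can be sketched by clipping every polygon of F against the square and summing the clipped areas. The code below uses Sutherland-Hodgman clipping and the shoelace formula; it assumes axis-aligned squares, simple polygons given as vertex lists, and pairwise disjoint regions (so that areas add up). All names are chosen for this illustration.

```python
def clip_axis(poly, axis, bound, keep_greater):
    """Clip a polygon to the half-plane coord[axis] >= bound (or <= bound)."""
    def inside(p):
        return p[axis] >= bound if keep_greater else p[axis] <= bound

    def crossing(p, q):
        # Point where segment pq meets the line coord[axis] == bound.
        s = (bound - p[axis]) / (q[axis] - p[axis])
        return (p[0] + s * (q[0] - p[0]), p[1] + s * (q[1] - p[1]))

    out = []
    for i in range(len(poly)):
        cur, prev = poly[i], poly[i - 1]
        if inside(cur):
            if not inside(prev):
                out.append(crossing(prev, cur))
            out.append(cur)
        elif inside(prev):
            out.append(crossing(prev, cur))
    return out

def polygon_area(poly):
    """Shoelace formula; an empty vertex list has area 0."""
    twice = sum(poly[i - 1][0] * poly[i][1] - poly[i][0] * poly[i - 1][1]
                for i in range(len(poly)))
    return abs(twice) / 2.0

def area_inside_square(polygons, cx, cy, half_side):
    """Total area of the (assumed disjoint) polygons inside the square."""
    total = 0.0
    for poly in polygons:
        clipped = clip_axis(poly, 0, cx - half_side, True)
        clipped = clip_axis(clipped, 0, cx + half_side, False)
        clipped = clip_axis(clipped, 1, cy - half_side, True)
        clipped = clip_axis(clipped, 1, cy + half_side, False)
        total += polygon_area(clipped)
    return total
```

Every polygon edge is touched a constant number of times per square, which matches the O(n_f) cost per square stated above.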

Region-restricted clustering
  3. Determine the area of regions of F inside
     Using a suitable data structure (only possible for squares): O(log^2 n_f) time per square, so O(nm log^2 n_f) time overall
The total time becomes O(nm log n + n_f log^2 n_f + nm log^2 n_f):
  • O(nm log n): order-(m-2) VD construction
  • O(n_f log^2 n_f): preprocessing of the data structure
  • O(nm log^2 n_f): total query time in the data structure

Region-restricted clustering
  • The squares solution generalizes to regular polygons (e.g. 20-gons)
  • An approximation of the radius within (1+ε)r gives an O(n/ε^2 + n_f log^2 n_f + n log n_f / (m ε^2)) time algorithm
[Figure: 16-gon]

Region-restricted clustering
  • Open problems:
      – Develop a region-restricted version of k-means clustering, single link clustering, ...
      – Region-restricted co-location?
      – Replace region-restricted by gradual model
[Figure labels: typical: 0/unit, 2/unit, 5/unit; clusters: 8/unit]

Patterns in trajectories
  • n trajectories, each with t time steps: n polygonal lines with t vertices
  • Already looked at most visited location

Patterns in trajectories
  • Flock: near positions of (sub)trajectories for some subset of the entities during some time
  • Convergence: same destination region for some subset of the entities
  • Encounter: same destination region with same arrival time for some subset of the entities
  • Similarity of trajectories
  • Same direction of movement, leadership, ...
[Figure labels: flock; convergence]

Patterns in trajectories
  • Flocking, convergence, encounter patterns
      – Laube, van Kreveld, Imfeld (SDH 2004)
      – Gudmundsson, van Kreveld, Speckmann (ACM GIS 2004)
      – Benkert, Gudmundsson, Huebner, Wolle (ESA 2006)
      – ...
  • Similarity of trajectories
      – Vlachos, Kollios, Gunopulos (ICDE 2002)
      – Shim, Chang (WAIM 2003)
      – ...
  • Lifelines, motion mining, modeling motion
      – Mountain, Raper (GeoComputation 2001)
      – Kollios, Sclaroff, Betke (DM&KD 2001)
      – Frank (GISDATA 8, 2001)
      – ...

Patterns in trajectories
  • Flock: near positions of (sub)trajectories for some subset of the entities during some time
      – clustering-type pattern
      – different definitions are used
  • Given: radius r, subset size m, and duration T, a flock is a subset of size m that is inside a (moving) circle of radius r for a duration T

Patterns in trajectories
  • Longest flock: given a radius r and subset size m, determine the longest time interval for which m entities were within each other's proximity (circle radius r)
[Figure: time axis from 0 to 8; m = 3; longest flock in [1.8, 6.4]]

Patterns in trajectories
  • Meet: near some position of (sub)trajectories for some subset of the entities
      – clustering-type pattern
  • Given: radius r, subset size m, and duration T, a meet is a subset of size m that is inside a (stationary) circle of radius r for a duration T (for a flock the circle was "moving")

Patterns in trajectories
  • Is the same subset required for a flock or meet?
  • Example: meet with m = 4; duration is 3+ time steps or 4+ time steps?

Patterns in trajectories
[Figure: examples for m = 3 of flock and meet, each with fixed subset and variable subset]

Patterns in trajectories
Exact results (input size is nτ: n entities, τ time steps each)
  • Flock, fixed subset: NP-hard
  • Flock, variable subset: O(τ n^3 log n)
  • Meet, fixed subset: O(n^4 τ^2 log n + n^2 τ^3)
  • Meet, variable subset: O(n^4 τ^2 log n + n^2 τ^3)

Patterns in trajectories
  • A radius-2 approximation of the longest flock can be computed in time O(τ n^2 log n)
  • Meaning: if the longest flock of size m for radius r has duration T, then we surely find a flock of size m and duration at least T for radius 2r
[Figure labels: longest flock for r; at least as long a flock for 2r]

Patterns in trajectories
Approximate radius results (input size is nτ)
  • Flock, fixed subset: exact is NP-hard; approximation in O(τ n^2 log n) time with radius factor 2
  • Flock, variable subset: exact in O(τ n^3 log n) time; approximation in O((τ n^2 log n) / ε^2) time with radius factor 2+ε
  • Meet: exact in O(n^4 τ^2 log n + n^2 τ^3) time; approximation in O((τ n^2 log n) / (m ε^2)) time with radius factor 1+ε

Fixed subset flock
  • It is NP-complete to decide if a graph has a subgraph with m nodes that is a clique
  • For every node of the graph, make an entity with a trajectory
[Figure: graph on nodes v1 to v7 and the trajectories built from it; distance r; "all nodes not adjacent to v1 go here"; v1 is not adjacent to v4, v5, and v7]

Fixed subset flock
[Figure: graph on nodes v1 to v7 and trajectories; labels "v4 in flock" and "v4 not in flock"]

Fixed subset flock
[Figure: graph on nodes v1 to v7 and trajectories; flock {v4, v5, v7} of (full) duration 23 (3·7+2) and size 3]
The trajectories have a fixed flock of size m and full duration if and only if the graph has a clique of size m

Fixed subset flock
  • Longest fixed flock is NP-hard
  • Max clique is hard to approximate, so neither the duration nor the flock size can be approximated
  • The reduction applies for all radii < 2r
[Figure: trajectories of v1 to v7; labels "v4 in flock" and "v4 not in flock"]

Flock and meet algorithms
  • Go into 3D (space-time) for algorithms
[Figure: time axis from 0 to 4; flock and meet in space-time]

Fixed subset flock, approximation
  • An efficient radius-2 approximation algorithm of longest fixed flock exists
  • Idea: if some vi is in the longest flock, then all other entities are within distance 2r from vi
[Figure: flock with vi; column of radius 2r, centered at vi]

Fixed subset flock, approximation
  • For each vj, we can determine the O(τ) time intervals where vj is in the column of vi (τ = number of time steps)
  • Maintain the intersections for all entities in an augmented tree in O(τ n log n) time
  • Do this for all columns (role of vi) and report longest overall pattern
Total: O(τ n^2 log n) time
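The column idea can be tried out on sampled data. The sketch below is a simplification, not the algorithm with the augmented tree: it checks distances only at the time steps (no interpolation between them), measures duration in time steps, and searches the intervals by brute force. The function names are assumptions for this illustration.

```python
import math

def runs_within(traj_a, traj_b, limit):
    """Maximal runs (first_step, last_step) of consecutive time steps at
    which the two entities are at most `limit` apart."""
    runs, start = [], None
    steps = min(len(traj_a), len(traj_b))
    for step in range(steps):
        (ax, ay), (bx, by) = traj_a[step], traj_b[step]
        near = math.hypot(ax - bx, ay - by) <= limit
        if near and start is None:
            start = step
        elif not near and start is not None:
            runs.append((start, step - 1))
            start = None
    if start is not None:
        runs.append((start, steps - 1))
    return runs

def longest_fixed_flock_radius2(trajectories, r, m):
    """Longest interval in which m entities stay within 2r of one of them.

    Returns (duration, (first_step, last_step), members) or None."""
    if m < 2:
        raise ValueError("m must be at least 2")
    best = None
    n = len(trajectories)
    for i in range(n):  # entity i plays the role of the column center
        runs = [(s, e, j)
                for j in range(n) if j != i
                for s, e in runs_within(trajectories[i], trajectories[j], 2 * r)]
        for a, _, _ in runs:  # an optimal interval starts where some run starts
            # Runs containing step a, longest-lasting first. Runs of one
            # entity are disjoint, so these belong to distinct entities.
            alive = sorted(((e, j) for s, e, j in runs if s <= a <= e),
                           reverse=True)
            if len(alive) < m - 1:
                continue
            end = alive[m - 2][0]  # the (m-1)-th longest run limits the interval
            members = sorted([i] + [j for _, j in alive[:m - 1]])
            if best is None or end - a > best[0]:
                best = (end - a, (a, end), members)
    return best
```

For example, with three entities that stay close during steps 1 to 4 and a fourth far away, calling `longest_fixed_flock_radius2(trajs, 1.0, 3)` reports the interval (1, 4) for the three close entities.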

Variable subset flock, exact
  • The subset that forms the flock may change entities, but must stay of size m
  • Any flock subset at any instant has a disk D of radius r with at least 2 entities on the boundary
[Figure labels: defining entities; r]

Variable subset flock, exact
  • Two entities define two cylinders through time by tracing the two possible radius r disks
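At a single instant, the two possible radius r disks through two entities follow from elementary geometry: their centers lie on the perpendicular bisector, at distance sqrt(r^2 - (d/2)^2) from the midpoint, where d is the distance between the entities. A small sketch (the function name is an assumption):

```python
import math

def disks_through(p, q, r):
    """Centers of the radius-r disks with both p and q on the boundary.

    Returns [] if p == q or if the points are more than 2r apart."""
    d = math.hypot(q[0] - p[0], q[1] - p[1])
    if d == 0.0 or d > 2 * r:
        return []
    mx, my = (p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0
    # Offset from the midpoint along the perpendicular bisector.
    h = math.sqrt(max(0.0, r * r - (d / 2.0) ** 2))
    ux, uy = (q[0] - p[0]) / d, (q[1] - p[1]) / d  # unit vector from p to q
    return [(mx - uy * h, my + ux * h), (mx + uy * h, my - ux * h)]
```

Evaluating this at successive moments traces the two cylinders in space-time; when d equals 2r both centers coincide at the midpoint.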


Variable subset flock, exact
  • A critical moment is where another entity is on the boundary of the disk; it may go outside or inside

Variable subset flock, exact
  • At a critical moment:
      – a variable subset flock may start (m entities)
      – a variable subset flock may stop (

Variable subset flock, exact
  • Let the O(τ n^3) critical moments be the nodes in a directed acyclic graph G (τ = number of time steps)
  • Edges of G are between two consecutive critical moments of the same two defining entities
      – directed from earlier to later
      – weight is time between critical moments
      – only if at least m entities are inside the disk
[Figure label: time]
A longest variable subset flock is a maximum weight path in G

Variable subset flock, exact
  • The graph G can be built in O(τ n^3 log n) time
  • A maximum weight path can be found in O(τ n^3 log n) time
A longest variable subset flock is a maximum weight path in G
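Finding a maximum weight path in a directed acyclic graph is a standard dynamic program over a topological order. The sketch below is generic and does not construct G from trajectories; it assumes nodes are numbered 0 to num_nodes - 1, edge weights are non-negative (as time spans are), and the function name is hypothetical.

```python
from collections import defaultdict, deque

def max_weight_path(num_nodes, edges):
    """Maximum weight path in a DAG given as (u, v, weight) triples.

    Returns (total weight, list of nodes on the path)."""
    out = defaultdict(list)
    indegree = [0] * num_nodes
    for u, v, w in edges:
        out[u].append((v, w))
        indegree[v] += 1
    best = [0.0] * num_nodes   # best[v]: heaviest path ending at v
    pred = [None] * num_nodes  # predecessor of v on that path
    queue = deque(v for v in range(num_nodes) if indegree[v] == 0)
    while queue:               # Kahn's algorithm: process in topological order
        u = queue.popleft()
        for v, w in out[u]:
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                pred[v] = u
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    end = max(range(num_nodes), key=best.__getitem__)
    path = [end]
    while pred[path[-1]] is not None:
        path.append(pred[path[-1]])
    path.reverse()
    return best[end], path
```

The dynamic program itself is linear in the number of nodes and edges, so with critical moments as nodes the total cost is dominated by building and sorting them.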

Patterns in trajectories, summary
  • Flock and meet patterns require algorithms in 3-dimensional space (space-time)
  • Exact algorithms are inefficient, so they are only suitable for smaller data sets
  • Approximation can reduce running time by one or two orders of magnitude

Patterns in trajectories, summary
  • Flock, fixed subset: approximation in O(τ n^2 log n) time with radius factor 2; exact is NP-hard
  • Flock, variable subset: approximation in O((τ n^2 log n) / ε^2) time with radius factor 2+ε; exact in O(τ n^3 log n) time
  • Meet: approximation in O((τ n^2 log n) / (m ε^2)) time with radius factor 1+ε; exact in O(n^4 τ^2 log n + n^2 τ^3) time

Future research on longest trajectories
  • Faster exact and approximation algorithms
  • Better approximation factors
  • Remove restriction of fixed shape of flocking region (compact or elongated both possible during same flock)
  • Longest duration convergence
[Figure label: longest convergence]

Patterns in trajectories
  • Flock and meet patterns require algorithms in 3-dimensional space (space-time)
  • Exact algorithms are inefficient, so they are only suitable for smaller data sets
  • Approximation can reduce running time by an order of magnitude

To conclude
  • With an exact definition of a spatial or spatio-temporal pattern, geometric algorithms can be used to compute all patterns
  • Many known structures from computational geometry are useful (Voronoi diagrams, arrangements, ...)
  • Since the (exact) algorithms may be inefficient, approximation may be a solution

To discuss
  • What patterns must be detected in practice (both spatial and spatio-temporal)?
  • What is the most appropriate definition (formalization) of these?
  • Spatial association rules, auto-correlation, irregularities, classification, ... and other computable things in spatial/spatio-temporal data mining