SNA(Social Network Analysis)

개발자 끄적끄적

Data Mining

햏치 2023. 5. 6. 21:12

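<SNA(Social Network Analysis)>
- Social network
  - A social structure that represents the relationships among entities (individuals, organizations, or webpages)
    - Entities are represented as nodes, and the relationships between them as edges (links)

- SNA analyzes the structure with a focus on the relationships among nodes
  - Visualization, node characteristics, network characteristics, link prediction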

<A hypothetical LinkedIn network>
- Consists of 6 nodes and the lines that connect them
- Undirected network
# Build a DataFrame that defines the edges and use it to build the graph
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt

df = pd.DataFrame([
  ("Dave", "Jenny"), ("Peter", "Jenny"), ("John", "Jenny"),
  ("Dave", "Peter"), ("Dave", "John"), ("Peter", "Sam"),
  ("Sam", "Albert"), ("Peter", "John")
], columns=['from', 'to'])
G = nx.from_pandas_edgelist(df, 'from', 'to')

# Plot it
nx.draw(G, with_labels=True, node_color='lightblue', node_size=1600)
plt.show()

# Parameters of nx.from_pandas_edgelist
# df : Pandas DataFrame
# source : str or int
# target : str or int
# edge_attr : str or int, iterable, True, or None
# create_using : NetworkX graph constructor, optional (default=nx.Graph)

<Directed network>
G = nx.from_pandas_edgelist(df, 'from', 'to', create_using=nx.DiGraph())

nx.draw(G, with_labels=True, node_color='lightblue', node_size=1600)
plt.show()
# DiGraph : directed graph
# from : source, to : target

- Edge weight (thickness)
  - strength of relationship
  - number of communications, value or number of transactions, etc.
- Edges in the graph can be given weights to express the strength of the relationship
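
A minimal sketch of how edge weights might be attached to a graph and mapped to line thickness. The weights here are made-up message counts, not data from this post, and the names `weighted_edges`, `G_weighted`, and `widths` are only illustrative.

```python
import networkx as nx

# Hypothetical weights: number of messages exchanged between each pair
weighted_edges = [
    ("Dave", "Jenny", 5),
    ("Peter", "Jenny", 1),
    ("Peter", "Sam", 3),
    ("Sam", "Albert", 2),
]

G_weighted = nx.Graph()
# Each third value is stored under the edge attribute 'weight'
G_weighted.add_weighted_edges_from(weighted_edges)

# One line width per edge, in the same order that G_weighted.edges() yields them
widths = [w for _, _, w in G_weighted.edges(data="weight")]

# To draw the graph with thicker lines for stronger ties (requires matplotlib):
# nx.draw(G_weighted, with_labels=True, node_color='lightblue',
#         node_size=1600, width=widths)
```

Raw counts can make lines too thick on real data, so scaling the widths before drawing is usually worth considering.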

<Network analysis and visualization>
- x-y coordinates are not meaningful - these two graphs convey the same information

G = nx.from_pandas_edgelist(df, 'from', 'to')

plt.subplots(nrows=1, ncols=2, figsize=(10,4))

plt.subplot(121)
nx.draw_circular(G, with_labels=True, node_color='lightblue', node_size=1600)
plt.subplot(122)
nx.draw_kamada_kawai(G, with_labels=True, node_color='lightblue', node_size=1600)
plt.tight_layout()
plt.show()
*plt.show() displays the figure on screen
*plt.subplot(), like add_subplot, takes the number of rows, the number of columns, and the index of the subplot as arguments
*tight_layout() : a function that adjusts the layout automatically

<Principles of graph layout>
1. Every node should be visible
2. The degree of every node should be countable
3. Every edge should have a clear start point and end point
4. Clusters of nodes and outlier nodes should be identifiable

- draw_circular : Draw the graph with a circular layout
- draw_kamada_kawai : Draw the graph G with a Kamada-Kawai force-directed layout
- draw_planar : Draw a planar networkx graph G with planar layout
  - Draws the graph so that the edges do not cross
- draw_random : Draw a graph G with a random layout
- draw_spectral : Draw the graph G with a spectral 2D layout
- draw_spring : Draw the graph G with a spring layout
  - A force-directed layout in which the edges behave like springs
- draw_shell : Draw networkx graph G with shell layout
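
The point that coordinates carry no meaning can be checked directly: a layout function only assigns an (x, y) position to each node and leaves the edges untouched. The sketch below assumes the same eight edges as the LinkedIn example; `G_layout`, `pos_circular`, and `pos_spring` are illustrative names, and the seed value is arbitrary.

```python
import networkx as nx

edges = [
    ("Dave", "Jenny"), ("Peter", "Jenny"), ("John", "Jenny"),
    ("Dave", "Peter"), ("Dave", "John"), ("Peter", "Sam"),
    ("Sam", "Albert"), ("Peter", "John"),
]
G_layout = nx.Graph(edges)

# Each layout returns a dict mapping node -> array([x, y])
pos_circular = nx.circular_layout(G_layout)
pos_spring = nx.spring_layout(G_layout, seed=42)  # seed fixes the random start

# The coordinates differ between layouts, but the graph itself is unchanged
for node in G_layout.nodes():
    print(node, pos_circular[node].round(2), pos_spring[node].round(2))

# A precomputed layout can be passed to nx.draw (requires matplotlib):
# nx.draw(G_layout, pos=pos_circular, with_labels=True)
```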

<Graph layout>
drug_df = pd.read_csv('drug.csv')

# Columns used: Entity, Related Entity
G = nx.from_pandas_edgelist(drug_df, 'Entity', 'Related Entity')

# A metric of node importance: eigenvector_centrality(G) computes the importance of each node in the graph
centrality = nx.eigenvector_centrality(G)

# node_size=[400*centrality[n] ...] : size each node by its importance, so the most important node is drawn largest
nx.draw(G, with_labels=False, node_color='skyblue', node_size=[400*centrality[n] for n in G.nodes()])
plt.show()

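<Adjacency list and adjacency matrix>
- Adjacency list : the entities in the two columns are nodes, and each row is a connection between two nodes
- Adjacency matrix : represents the relationships among entities in matrix form
  - Each cell indicates whether there is a link from the header in the leftmost column to the header in the top row
  - Converts network data into structured matrix data that can be used for classification and prediction
- ex)
G = nx.from_pandas_edgelist(df, 'from', 'to', create_using=nx.DiGraph())
print(nx.to_numpy_array(G))  # to_numpy_matrix was removed in NetworkX 3.0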
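
To make the row-to-column reading concrete, here is a small sketch that builds both representations for a three-edge directed subset of the example network. The subset and the names `D`, `nodes`, `adj_list`, and `adj_matrix` are chosen only for illustration.

```python
import networkx as nx

# A small directed subset of the example edges, for readability
D = nx.DiGraph([("Dave", "Jenny"), ("Peter", "Jenny"), ("Peter", "Sam")])
nodes = ["Dave", "Peter", "Jenny", "Sam"]  # fixes the row/column order

# Adjacency list: each node mapped to the nodes it points to
adj_list = {n: sorted(D.successors(n)) for n in nodes}

# Adjacency matrix: row = source node, column = target node
adj_matrix = nx.to_numpy_array(D, nodelist=nodes, dtype=int)

print(adj_list)
print(adj_matrix)
```

Because the graph is directed, the cell for Dave to Jenny is 1 while the cell for Jenny to Dave is 0. For an undirected graph the matrix would be symmetric.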

<Measures for social data>
Terms
- Edge weight : strength of relationship
  - ex) In an email network, the number of emails exchanged between two people is the edge weight

- Path and path length
  - Path : route from one node to another (the way from node A to node B)
  - Path length : the number of edges on the path

- Connected network : each node in the network has a path to all others

- Clique : each node directly connected by a single edge to each other

- Singleton : unconnected node
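
These terms map onto NetworkX helpers fairly directly. The sketch below uses the example network; the extra node "Zoe" is hypothetical and is added only to produce a singleton.

```python
import networkx as nx

edges = [
    ("Dave", "Jenny"), ("Peter", "Jenny"), ("John", "Jenny"),
    ("Dave", "Peter"), ("Dave", "John"), ("Peter", "Sam"),
    ("Sam", "Albert"), ("Peter", "John"),
]
G_terms = nx.Graph(edges)

# Path and path length (number of edges on the shortest path)
path = nx.shortest_path(G_terms, "Albert", "Jenny")
path_length = nx.shortest_path_length(G_terms, "Albert", "Jenny")

# Connected network: every node can reach every other node
connected_before = nx.is_connected(G_terms)

# Clique: the largest group in which every pair is directly linked
largest_clique = max(nx.find_cliques(G_terms), key=len)

# Singleton: a node with no edges ("Zoe" is a hypothetical addition)
G_terms.add_node("Zoe")
singletons = list(nx.isolates(G_terms))
connected_after = nx.is_connected(G_terms)

print(path, path_length, connected_before)
print(sorted(largest_clique), singletons, connected_after)
```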

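<Centrality measures from the node perspective>
Node Metrics : How central/important is a node?
- Degree : number of connections to a node
  - A node with many edges is considered important

- Closeness : how close a node is to the other nodes in the network
  - inverse of the average path length to other nodes
  - Unnormalized form: $closeness(v) = 1 / \sum_{i \neq v} d(v, i)$
  - Normalized form (used in the example below): $closeness(v) = n_v / \sum_{i \neq v} d(v, i)$, the inverse of the average shortest distance between v and every node it can reach. Longer paths lower closeness and shorter paths raise it
    - $d(v, i)$ : shortest distance from v to i
    - $n_v$ : number of shortest distances counted from v (the number of other nodes reachable from v)

- If there is no (directed) path between vertex v and i, then the total number of vertices is used in the formula instead of the path length
- ex) Closeness(Albert)
          = 5/(1+2+3+3+3) = 5/12 ≈ 0.417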
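
The Albert example can be reproduced with a short breadth-first search. This is a plain-Python sketch of the normalized formula on the example network, not the implementation any particular library uses. It assumes every node can reach at least one other node.

```python
from collections import deque

edges = [
    ("Dave", "Jenny"), ("Peter", "Jenny"), ("John", "Jenny"),
    ("Dave", "Peter"), ("Dave", "John"), ("Peter", "Sam"),
    ("Sam", "Albert"), ("Peter", "John"),
]

# Undirected adjacency sets
neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

def shortest_distances(start):
    """Breadth-first search: edge-count distance from start to each reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    del dist[start]  # exclude the node itself (i != v)
    return dist

def closeness(v):
    """n_v divided by the sum of shortest distances from v."""
    dist = shortest_distances(v)
    return len(dist) / sum(dist.values())

print(round(closeness("Albert"), 3))  # 0.417
```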

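- Betweenness
  - The extent to which a node acts as an intermediary on the shortest paths between other nodes
  - Given A-B-C, B plays the intermediary role
  - Expressed either as the number of shortest paths the node sits on or as a share of all paths
  - ex) Betweenness(Peter) = 6, or 6/10 as a share, where the total number of paths among the other 5 nodes is $\binom{5}{2} = 10$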
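
Both forms of the Peter example can be checked with `nx.betweenness_centrality`, using the `normalized` flag to switch between the raw count and the share. The variable names below are illustrative.

```python
import networkx as nx

edges = [
    ("Dave", "Jenny"), ("Peter", "Jenny"), ("John", "Jenny"),
    ("Dave", "Peter"), ("Dave", "John"), ("Peter", "Sam"),
    ("Sam", "Albert"), ("Peter", "John"),
]
G_btw = nx.Graph(edges)

# Raw count of node pairs whose shortest path runs through each node
raw = nx.betweenness_centrality(G_btw, normalized=False)

# The same count divided by the number of pairs among the other nodes
share = nx.betweenness_centrality(G_btw, normalized=True)

print(raw["Peter"], round(share["Peter"], 2))
```

Peter scores highest because every shortest path from Sam or Albert to the Dave, Jenny, John group passes through him.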

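- Eigenvector centrality
  - A centrality measure that gives more importance to links to well-connected entities than to links to entities with few connections
  - Based on the number of links leaving a node together with the number of connections those linked nodes have in turn, scaled to a value between 0 and 1
  - 0 : no centrality, 1 : maximum centrality
  - Centrality is shown in the network plot as the size of the node
  - The higher the centrality, the more important the node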
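
As a rough check of this description, the sketch below computes eigenvector centrality on the example network with `nx.eigenvector_centrality`. The names `G_eig`, `centrality`, `most_central`, and `least_central` are illustrative.

```python
import networkx as nx

edges = [
    ("Dave", "Jenny"), ("Peter", "Jenny"), ("John", "Jenny"),
    ("Dave", "Peter"), ("Dave", "John"), ("Peter", "Sam"),
    ("Sam", "Albert"), ("Peter", "John"),
]
G_eig = nx.Graph(edges)

# Dict mapping node -> centrality score
centrality = nx.eigenvector_centrality(G_eig)

most_central = max(centrality, key=centrality.get)
least_central = min(centrality, key=centrality.get)

for node, score in sorted(centrality.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{node}: {score:.3f}")
```

Sam has a lower score than Dave even though both sit next to Peter, because Sam's only other neighbor is the poorly connected Albert.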

<Egocentric network>
- A network gathered around an individual node
- A degree-1 egocentric network consists of an individual node and all nodes directly connected to it
- A network made up of the connections between one central node, called the ego, and the other nodes, called alters

nx.ego_graph(G, n, radius=1, center=True, undirected=False, distance=None)
- G : graph
  - The whole graph (A NetworkX Graph or DiGraph)

- n : node
  - A single node (the central node)

- radius (degree of separation, default=1) : number, optional
  - Include all neighbors of distance <= radius from n
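
A short sketch of how `radius` changes the result, using Albert as the ego in the example network. The variable names are illustrative.

```python
import networkx as nx

edges = [
    ("Dave", "Jenny"), ("Peter", "Jenny"), ("John", "Jenny"),
    ("Dave", "Peter"), ("Dave", "John"), ("Peter", "Sam"),
    ("Sam", "Albert"), ("Peter", "John"),
]
G_ego = nx.Graph(edges)

# radius=1: the ego plus its direct neighbors
ego1 = nx.ego_graph(G_ego, "Albert", radius=1)

# radius=2: also includes the neighbors of those neighbors
ego2 = nx.ego_graph(G_ego, "Albert", radius=2)

print(sorted(ego1.nodes()))
print(sorted(ego2.nodes()))
```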

<Network-level measures>
Network metrics : Describing the network as a whole

- Degree distribution
  - Distribution of number of connections per node

- Density
  - Ratio of # of edges to maximum possible # of edges
  - n is the number of nodes, e is the number of edges
  - density(directed) = e/(n(n-1))
  - density(undirected) = e/(n(n-1)/2)
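
As a worked instance of the two density formulas, the sketch below plugs in the example network's 6 nodes and 8 edges. The helper `density` is a hypothetical function written for this note, not a library call.

```python
def density(n, e, directed=False):
    """Ratio of existing edges e to the maximum possible edges among n nodes."""
    max_edges = n * (n - 1) if directed else n * (n - 1) / 2
    return e / max_edges

# Example network: 6 nodes, 8 edges
undirected_density = density(6, 8)               # 8 / 15
directed_density = density(6, 8, directed=True)  # 8 / 30

print(round(undirected_density, 3), round(directed_density, 3))
```

The directed value is half the undirected one because each pair of nodes can hold two directed edges.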

- ex)
  import collections

  # Count how many nodes have each degree (key: degree, value: number of nodes with that degree)
  degreeCount = collections.Counter(d for node, d in G.degree())
  degreeDistribution = [0] * (1 + max(degreeCount))
  for degree, count in degreeCount.items():  # items() yields (key, value) pairs; the key is the degree
    degreeDistribution[degree] = count  # index = degree, value = number of nodes with that degree
  print(degreeDistribution)

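- ex)
Counter is a dict subclass in the collections module that is specialized for counting.
It takes an iterable such as a list or string (or a mapping of counts), groups equal values together,
and returns a dictionary-like object whose keys are the elements and whose values are their counts.
  from collections import Counter
  print(Counter())
  print(Counter('gallahad'))
  print(Counter({'red': 4, 'blue': 2}))
  print(Counter(cats=4, dogs=8))

Output
Counter()
Counter({'a': 3, 'l': 2, 'g': 1, 'h': 1, 'd': 1})
Counter({'red': 4, 'blue': 2})
Counter({'dogs': 8, 'cats': 4})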

<Prediction and classification using network metrics>
- Network attributes are used together with predictor variables for classification and prediction
  - Matchmaking site : uses members' personal information + information on the connections between members

- Link Prediction
  - Given a network, predict where the next edge will form
  1. For each node, score similarity to all other nodes
    - Traditional predictive model variables could be used to calculate similarity [see nearest-neighbor methods]
    - Network metrics (shortest path, number of common neighbors, or edge weight) can also be used for similarity

  2. The unlinked pair with the highest similarity score is predicted as the next link
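
The two steps above can be sketched in plain Python using the number of common neighbors as the similarity score. This is one possible choice among the metrics listed, applied to the example network. The names `scores` and `ranked` are illustrative.

```python
from itertools import combinations

edges = [
    ("Dave", "Jenny"), ("Peter", "Jenny"), ("John", "Jenny"),
    ("Dave", "Peter"), ("Dave", "John"), ("Peter", "Sam"),
    ("Sam", "Albert"), ("Peter", "John"),
]

# Undirected adjacency sets
neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

linked = {frozenset(e) for e in edges}

# Step 1: score every unlinked pair by its number of common neighbors
scores = {}
for a, b in combinations(sorted(neighbors), 2):
    if frozenset((a, b)) not in linked:
        scores[(a, b)] = len(neighbors[a] & neighbors[b])

# Step 2: rank the pairs; the top-scoring pair is the predicted next link
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for pair, score in ranked:
    print(pair, score)
```

In a network this small several pairs tie on the top score, so in practice the score would likely be combined with other variables such as edge weights or member attributes.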
